Video Generation

Generate AI videos from text, images, or reference media — with image-to-video, video editing, and PixVerse lip sync.

Generate AI videos from a text prompt or reference materials using the listenhub video CLI. Animate a still image, edit an existing clip, or drive a character's lips with audio or text-to-speech. Three model families cover different jobs: HappyHorse, SeeDance, and PixVerse.

Trigger

Invoke this skill with /video-gen, or use any of these phrases:

Phrase	Language
`video generation` / `text to video` / `create video`	English
`video edit` / `lipsync` / `pixverse`	English
`生成视频` / `做视频` / `视频生成`	Chinese
`视频编辑` / `口型` / `对口型`	Chinese

Requires ListenHub Skills to be installed — see Getting Started.

For narrated explainer videos with AI visuals, use /explainer instead.

Quick Example

Generate a video: cyberpunk city at night, 16:9, 5 seconds

The AI walks you through the mode and parameters one question at a time, shows a cost estimate, and asks you to confirm before generating. Generation takes minutes — the job runs in the background and the AI notifies you when the video is ready, with a URL, duration, resolution, and credit cost.

Video generation always runs with --no-wait, so the CLI returns a task id immediately and the AI polls in the background (10s interval). If you only have a task id, check progress with listenhub video get <taskId> --json.
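
The background polling described above can be sketched as a small shell loop. This is an illustration under stated assumptions, not the skill's actual implementation: `fetch_status` and `poll_task` are hypothetical helper names, the `status` JSON field name is assumed, and the attempt limit is arbitrary. The status values are the ones listed under Output; any other value is treated as an error.

```shell
# Hypothetical helper: print the task status from `listenhub video get`.
# Assumes the JSON output contains a "status" string field.
fetch_status() {
  listenhub video get "$1" --json |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1
}

# Poll until the task reaches "success".
# Usage: poll_task <taskId> [interval_seconds] [max_attempts]
poll_task() {
  task_id=$1
  interval=${2:-10}      # the skill polls at a 10s interval
  max_attempts=${3:-90}  # assumption: give up after about 15 minutes
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(fetch_status "$task_id")
    case "$status" in
      success) echo "success"; return 0 ;;
      pending|generating|uploading) ;;   # still running, keep waiting
      *) echo "unexpected status: $status" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "timed out waiting for $task_id" >&2
  return 2
}
```

Call it as `poll_task <taskId>` with a task id returned by a `--no-wait` run. Keeping the status lookup in its own function makes it easy to swap in a proper JSON parser once the real field names are confirmed.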

Models

Pick the model by the job. HappyHorse is the default and the only family that edits existing video; SeeDance adds last-frame and reference-audio support; PixVerse is the only family with lip sync, plus a set of atomic capabilities (mimic, restyle, fusion, transition, marketing agent).

Capability	HappyHorse (default)	SeeDance	PixVerse
Text-to-video	Yes	Yes	Yes (`text_to_video`)
Image-to-video (first-frame)	Yes	Yes (+ last-frame)	Yes (`image_to_video`)
Reference image	Yes (1–9, `[Image N]` syntax)	Yes	Yes (`fusion`, `@refName`)
Video edit	Yes	No	No
Lip sync	No	No	Yes (`lip_sync`, audio or TTS)
Motion transfer / mimic	No	No	Yes (`mimic`, locked 720p)
Restyle	No	No	Yes (`restyle`)
Transition (first → last)	No	Yes (frame mode)	Yes (`transition` / `multi_transition`)
Reference video	No (use video edit)	Yes	Yes (mimic / lip_sync source)
Reference audio	No	Yes	Yes (lip_sync)
Max resolution	1080p	1080p	1080p
Resolution options	720p, 1080p	480p, 720p, 1080p	360p, 540p, 720p, 1080p
Duration range	3–15s	4–15s	1–60s (agent: 20/30/60)
Aspect ratios	16:9, 9:16, 1:1, 4:3, 3:4, 4:5, 5:4	16:9, 9:16, 1:1, 4:3, 3:4, 21:9	9:16, 16:9, 1:1, 4:3, 3:4

Lip sync, mimic, restyle, fusion, transition, and the marketing agent are PixVerse-only. HappyHorse and SeeDance do not support them.
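
To catch an unsupported duration or aspect ratio before asking for an estimate, the table can be mirrored in a pre-flight check. A minimal sketch: the function names and the family keys (`happyhorse`, `seedance`, `pixverse`) are hypothetical labels for this example, not `--model` values, and the PixVerse agent's 20/30/60 restriction is not covered here.

```shell
# Print the allowed duration range "min max" (seconds) for a model family.
duration_range() {
  case "$1" in
    happyhorse) echo "3 15" ;;
    seedance)   echo "4 15" ;;
    pixverse)   echo "1 60" ;;
    *) return 1 ;;
  esac
}

# Print the aspect ratios a family accepts, space-separated.
allowed_ratios() {
  case "$1" in
    happyhorse) echo "16:9 9:16 1:1 4:3 3:4 4:5 5:4" ;;
    seedance)   echo "16:9 9:16 1:1 4:3 3:4 21:9" ;;
    pixverse)   echo "9:16 16:9 1:1 4:3 3:4" ;;
    *) return 1 ;;
  esac
}

# Return 0 when <family> supports both <duration_s> and <ratio>.
# Usage: check_duration_ratio <family> <duration_s> <ratio>
check_duration_ratio() {
  family=$1 duration=$2 ratio=$3
  range=$(duration_range "$family") || return 1
  min=${range% *}
  max=${range#* }
  [ "$duration" -ge "$min" ] && [ "$duration" -le "$max" ] || return 1
  case " $(allowed_ratios "$family") " in
    *" $ratio "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `check_duration_ratio seedance 3 16:9` fails because SeeDance starts at 4 seconds, and `check_duration_ratio happyhorse 5 21:9` fails because 21:9 is a SeeDance-only ratio.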

SeeDance model variants

When you pick SeeDance, choose between two variants:

Model	Notes
`doubao-seedance-2-pro`	Higher quality; required for 1080p; supports last-frame and reference-audio
`doubao-seedance-2-fast`	Faster; selecting 1080p auto-upgrades to `pro`

PixVerse is OpenAPI-only — it lives under listenhub openapi video pixverse and uses public-network URLs for all media (no local file upload). If you want lip sync, mimic, restyle, fusion, transition, or the marketing agent but only have internal-auth login configured, set up an API key first with listenhub openapi config set-key.

Modes

The AI routes to a mode based on what reference material you have. HappyHorse and SeeDance share the listenhub video create command; PixVerse uses listenhub openapi video pixverse generate with an explicit --capability.

Text-to-video

Generate a video from a text prompt only, no reference media. Available on all three model families.

listenhub video create \
  --prompt "cyberpunk city at night, neon reflections on wet streets" \
  --model "happyhorse" \
  --resolution "1080p" \
  --ratio "16:9" \
  --duration 5 \
  --no-wait --json

Image-to-video

Animate a still image as the first frame. SeeDance can also take a last-frame image to interpolate a transition between two stills.

Image requirements: jpg, jpeg, png, or webp; local files up to 20 MB; width and height ≥ 300px; aspect ratio between 1:2.5 and 2.5:1.

listenhub video create \
  --prompt "bring the scene to life with smooth motion" \
  --model "happyhorse" \
  --resolution "1080p" \
  --duration 5 \
  --first-frame "/path/to/scene.png" \
  --no-wait --json

For SeeDance frame mode, add --last-frame and use a doubao-seedance-2-* model.

HappyHorse image-to-video has no --ratio — the output ratio is determined by the input image. SeeDance still accepts --ratio.
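
The first-frame image requirements can be checked locally before uploading. The sketch below takes the file name, size, and pixel dimensions as arguments rather than reading the image, so it needs no image tooling. `check_first_frame` is a hypothetical helper, the extension match is case-sensitive, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```shell
# Check first-frame image requirements from values you already know.
# Usage: check_first_frame <file_name> <size_bytes> <width_px> <height_px>
check_first_frame() {
  name=$1 size=$2 width=$3 height=$4
  case "$name" in
    *.jpg|*.jpeg|*.png|*.webp) ;;
    *) echo "unsupported format" >&2; return 1 ;;
  esac
  # 20 MB limit (assumes 1 MB = 1024 * 1024 bytes)
  [ "$size" -le $((20 * 1024 * 1024)) ] ||
    { echo "file too large" >&2; return 1; }
  [ "$width" -ge 300 ] && [ "$height" -ge 300 ] ||
    { echo "image too small" >&2; return 1; }
  # Aspect ratio between 1:2.5 and 2.5:1, in integer math:
  # width/height <= 2.5 is the same as 2*width <= 5*height (and vice versa).
  [ $((2 * width)) -le $((5 * height)) ] && [ $((2 * height)) -le $((5 * width)) ] ||
    { echo "aspect ratio out of range" >&2; return 1; }
  return 0
}
```

A 1920×1080 PNG passes; a 3000×1000 banner fails because its 3:1 ratio is wider than 2.5:1.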

Reference image

Supply 1–9 reference images to guide style or characters. With HappyHorse, refer to specific images in the prompt using [Image 1], [Image 2], and so on.

Image requirements: jpg, jpeg, png, or webp; up to 20 MB each; HappyHorse recommends short edge ≥ 400px.

listenhub video create \
  --prompt "[Image 1]'s character walking through [Image 2]'s street" \
  --model "happyhorse" \
  --resolution "1080p" \
  --ratio "16:9" \
  --duration 5 \
  --reference-image "/path/to/character.png" \
  --reference-image "/path/to/scene.png" \
  --no-wait --json

SeeDance reference mode additionally accepts up to 3 reference videos (mp4/mov, ≤ 50 MB) and up to 3 reference audios (mp3/wav, ≤ 20 MB, paired with an image or video).
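
The SeeDance reference limits can be expressed as a count check. This is a sketch with a hypothetical helper name, and it reads "paired with an image or video" as "at least one reference image or video is also supplied", which is an interpretation rather than a documented rule. File size and format checks are left out.

```shell
# Usage: check_seedance_refs <image_count> <video_count> <audio_count>
check_seedance_refs() {
  images=$1 videos=$2 audios=$3
  [ "$videos" -le 3 ] || { echo "at most 3 reference videos" >&2; return 1; }
  [ "$audios" -le 3 ] || { echo "at most 3 reference audios" >&2; return 1; }
  # Assumption: audio needs at least one image or video alongside it.
  if [ "$audios" -gt 0 ] && [ $((images + videos)) -eq 0 ]; then
    echo "reference audio must be paired with an image or video" >&2
    return 1
  fi
  return 0
}
```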

Lip sync (PixVerse)

PixVerse only. Drive a character's lips with either an audio file or text-to-speech. The source video must already exist on PixVerse: reference it with --source-video-id, or point to a prior succeeded task with --source-task-id.

Drive with an audio file (one public audio URL, 5–60s):

listenhub openapi video pixverse generate \
  --capability lip_sync \
  --source-video-id "abc123" \
  --audio "https://example.com/voice.mp3" \
  --quality 720p \
  --no-wait --json

Drive with text-to-speech (nested tts, no --audio):

listenhub openapi video pixverse generate \
  --capability lip_sync \
  --source-task-id "task_xyz" \
  --pixverse-json '{"tts":{"speakerId":"speaker_01","content":"Welcome to this episode"}}' \
  --quality 720p \
  --no-wait --json

Provide either an audio file or TTS, never both — passing both is rejected. For TTS, use the nested --pixverse-json '{"tts":{...}}'; do not use --lip-sync-tts / --lip-sync-speaker-id / --lip-sync-content, which the contract does not accept.
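
The either-or rule and the nested TTS payload can be wrapped in two small helpers. This is a sketch: `lip_sync_driver` and `tts_payload` are hypothetical names, and the JSON escaping only handles backslashes and double quotes, so the content must be a single line.

```shell
# Decide how to drive the lips: prints "audio" or "tts", fails otherwise.
# Usage: lip_sync_driver <audio_url_or_empty> <tts_content_or_empty>
lip_sync_driver() {
  audio=$1 content=$2
  if [ -n "$audio" ] && [ -n "$content" ]; then
    echo "pass either audio or TTS, not both" >&2
    return 1
  elif [ -n "$audio" ]; then
    echo "audio"
  elif [ -n "$content" ]; then
    echo "tts"
  else
    echo "pass an audio URL or TTS content" >&2
    return 1
  fi
}

# Build the nested TTS payload for --pixverse-json.
# Usage: tts_payload <speaker_id> <content>
tts_payload() {
  speaker=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  content=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"tts":{"speakerId":"%s","content":"%s"}}\n' "$speaker" "$content"
}

# Example (not run here):
#   listenhub openapi video pixverse generate \
#     --capability lip_sync \
#     --source-task-id "task_xyz" \
#     --pixverse-json "$(tts_payload speaker_01 'Welcome to this episode')" \
#     --quality 720p \
#     --no-wait --json
```

Generating the payload from a function avoids hand-quoting JSON inside a shell string, which is where most malformed `--pixverse-json` values tend to come from.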

Video edit (HappyHorse)

Edit an existing clip — change style, replace the background, restyle motion. HappyHorse only; if you ask for this on SeeDance, the AI switches you to HappyHorse.

Video requirements: mp4/mov (H.264 recommended); input length 3–60s (output capped at 15s); ≤ 100 MB; short edge ≥ 360px, long edge ≤ 4096px. You can also pass up to 5 reference images.

listenhub video create \
  --prompt "replace the background with a deep starry sky, keep the subject's motion" \
  --model "happyhorse" \
  --resolution "1080p" \
  --reference-video "/path/to/input.mp4" \
  --audio-setting "origin" \
  --no-wait --json

--audio-setting controls audio: auto lets the model decide, origin keeps the original audio. Video edit has no --ratio or --duration — the output matches the input video.
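
The input requirements and the 15-second output cap can be checked before submitting an edit. A minimal sketch with hypothetical helper names: it takes the file name, size, duration in whole seconds, and pixel dimensions as arguments, and assumes 1 MB is 1024 × 1024 bytes.

```shell
# Usage: check_edit_input <file_name> <size_bytes> <duration_s> <width_px> <height_px>
check_edit_input() {
  name=$1 size=$2 duration=$3 width=$4 height=$5
  case "$name" in
    *.mp4|*.mov) ;;
    *) echo "unsupported format" >&2; return 1 ;;
  esac
  [ "$size" -le $((100 * 1024 * 1024)) ] ||
    { echo "file too large" >&2; return 1; }
  [ "$duration" -ge 3 ] && [ "$duration" -le 60 ] ||
    { echo "input must be 3 to 60 seconds" >&2; return 1; }
  # Work out which side is the short edge and which is the long edge.
  if [ "$width" -lt "$height" ]; then
    short=$width long=$height
  else
    short=$height long=$width
  fi
  [ "$short" -ge 360 ] && [ "$long" -le 4096 ] ||
    { echo "dimensions out of range" >&2; return 1; }
  return 0
}

# Output length follows the input but is capped at 15 seconds.
# Usage: edit_output_duration <input_duration_s>
edit_output_duration() {
  if [ "$1" -gt 15 ]; then echo 15; else echo "$1"; fi
}
```

So a 40-second 1080×1920 clip is accepted as input, and `edit_output_duration 40` reports the 15 seconds you should expect back.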

Other PixVerse capabilities

PixVerse exposes additional atomic capabilities through --capability, all OpenAPI-only with URL inputs:

Capability	Inputs	Constraints
`mimic` (motion transfer)	1 image + 1 video	Quality locked to 720p; motion source 5–30s
`restyle`	`--source-video-id` (or `--source-task-id`) + `--restyle-id`	—
`fusion`	Nested `imageReferences` (1–8), prompt uses `@refName`	Top-level `--image` must be empty
`transition` / `multi_transition`	Nested `multiTransition` keyframes (2–7)	Default quality 360p
`agent` (ad_master / promo_mix)	Prompt + images	Quality 720p/1080p only; duration 20/30/60 only; `promo_mix` needs ≥ 4 images
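
The constraints in this table are easy to trip over, so a pre-flight check can help. The sketch below covers the agent row plus a generic range check for the fusion (1–8) and transition (2–7) counts. The function names are hypothetical, and no minimum image count is enforced for `ad_master` because the table does not state one.

```shell
# Usage: check_agent <agent_type> <quality> <duration_s> <image_count>
check_agent() {
  agent_type=$1 quality=$2 duration=$3 images=$4
  case "$agent_type" in
    ad_master|promo_mix) ;;
    *) echo "unknown agent type" >&2; return 1 ;;
  esac
  case "$quality" in
    720p|1080p) ;;
    *) echo "agent quality must be 720p or 1080p" >&2; return 1 ;;
  esac
  case "$duration" in
    20|30|60) ;;
    *) echo "agent duration must be 20, 30, or 60" >&2; return 1 ;;
  esac
  if [ "$agent_type" = "promo_mix" ] && [ "$images" -lt 4 ]; then
    echo "promo_mix needs at least 4 images" >&2
    return 1
  fi
  return 0
}

# Generic inclusive range check.
# Usage: in_range <value> <min> <max>
in_range() {
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Examples: fusion allows 1-8 image references, transitions 2-7 keyframes.
#   in_range "$reference_count" 1 8
#   in_range "$keyframe_count" 2 7
```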

Parameters

The AI asks for these one at a time and applies sensible session defaults. For HappyHorse and SeeDance, resolution, ratio, and duration use --resolution / --ratio / --duration; PixVerse replaces --resolution / --ratio with --quality / --aspect-ratio.

Parameter	Flag	Notes
Prompt	`--prompt`	Free text. Maximum length in characters: HappyHorse ≤ 2500 (Chinese) / ≤ 5000 (non-Chinese); SeeDance ≤ 500; PixVerse ≤ 2048
Model	`--model`	`happyhorse` (default), `doubao-seedance-2-pro`, `doubao-seedance-2-fast`, `pixverse`
Resolution	`--resolution`	HappyHorse: 720p/1080p; SeeDance: 480p/720p/1080p (480p is SeeDance-only)
Aspect ratio	`--ratio`	Not used for HappyHorse image-to-video or video edit (ratio follows input); SeeDance image-to-video still accepts it
Duration	`--duration`	Seconds. HappyHorse 3–15, SeeDance 4–15
First frame	`--first-frame`	Image-to-video source image
Last frame	`--last-frame`	SeeDance frame mode only
Reference image	`--reference-image`	Repeatable. 1–9 in reference mode (HappyHorse); 0–5 as optional video-edit references
Reference video	`--reference-video`	Video-edit input (HappyHorse) or SeeDance reference
Audio setting	`--audio-setting`	Video edit only: `auto` or `origin`
Seed	`--seed`	Optional; for reproducing a result

PixVerse-specific flags: --capability, --quality (360p/540p/720p/1080p), --aspect-ratio (9:16/16:9/1:1/4:3/3:4), --source-video-id / --source-task-id, --audio, --agent-type, --restyle-id, and --pixverse-json for nested payloads (tts, imageReferences, multiTransition).

Some choices are auto-corrected: 480p on HappyHorse falls back to 720p; 1080p on doubao-seedance-2-fast upgrades to doubao-seedance-2-pro. The AI tells you when it adjusts.
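
The two corrections above amount to a small lookup. A sketch, with `normalize_choice` as a hypothetical helper that prints the model and resolution that would actually be used:

```shell
# Print the "<model> <resolution>" pair after auto-correction.
# Usage: normalize_choice <model> <resolution>
normalize_choice() {
  model=$1 resolution=$2
  case "$model:$resolution" in
    # HappyHorse has no 480p, so it falls back to 720p.
    happyhorse:480p) resolution=720p ;;
    # The fast SeeDance variant cannot render 1080p, so it upgrades to pro.
    doubao-seedance-2-fast:1080p) model=doubao-seedance-2-pro ;;
  esac
  echo "$model $resolution"
}
```

Any other combination passes through unchanged, for example `normalize_choice doubao-seedance-2-fast 720p` keeps the fast variant.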

Estimating credits

Before generating, the AI runs an estimate. To check the cost yourself, pass the same parameters you would give the create command to the estimate command:

# HappyHorse / SeeDance
listenhub video estimate --model "happyhorse" --resolution "1080p" --ratio "16:9" --duration 5 --json

# PixVerse — mirror the capability + quality + duration
listenhub openapi video pixverse estimate --capability text_to_video --model pixverse --quality 720p --duration 5 --json

For video edit, add --has-video-input and --input-video-duration <seconds>.
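
Put together, a video-edit estimate might look like the wrapper below. This is a sketch under assumptions: it omits --ratio and --duration on the grounds that video edit does not take them, which may not hold for the estimate command, and `LISTENHUB_BIN` is a hypothetical override added here for dry runs, not a ListenHub setting.

```shell
# Estimate credits for a HappyHorse video edit.
# Usage: estimate_edit <resolution> <input_video_seconds>
# Set LISTENHUB_BIN=echo to print the arguments instead of calling the CLI.
estimate_edit() {
  resolution=$1 input_seconds=$2
  "${LISTENHUB_BIN:-listenhub}" video estimate \
    --model "happyhorse" \
    --resolution "$resolution" \
    --has-video-input \
    --input-video-duration "$input_seconds" \
    --json
}
```

For instance, `estimate_edit 1080p 12` asks for the cost of editing a 12-second clip at 1080p.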

Output

Status flows pending → generating → uploading → success. On success the AI reports the video URL, duration, resolution, ratio, seed, and credits charged.

Output behavior follows the outputMode set during config:

inline (default) or both — the video URL and metadata are shown directly in the conversation.
download or both — the file is also saved to the current working directory with a topic-based name (e.g. cyberpunk-city.mp4). Names are de-duplicated automatically.

To review past work, listenhub video get <taskId> --json returns a single task and listenhub video list --json lists recent tasks. Global flags apply: --json / -j, --no-wait, and --timeout <s>.

API Reference

See the AI Video API reference for endpoint paths, request parameters, and response fields.
