Video Generation
Generate AI videos from text, images, or reference media — with image-to-video, video editing, and PixVerse lip sync.
Generate AI videos from a text prompt or reference materials using the listenhub video CLI. Animate a still image, edit an existing clip, or drive a character's lips with audio or text-to-speech. Three model families cover different jobs: HappyHorse, SeeDance, and PixVerse.
Trigger
Invoke this skill with /video-gen, or use any of these phrases:
| Phrase | Language |
|---|---|
| video generation / text to video / create video | English |
| video edit / lipsync / pixverse | English |
| 生成视频 / 做视频 / 视频生成 | Chinese |
| 视频编辑 / 口型 / 对口型 | Chinese |
Requires ListenHub Skills to be installed — see Getting Started.
For narrated explainer videos with AI visuals, use /explainer instead.
Quick Example
Generate a video: cyberpunk city at night, 16:9, 5 seconds
The AI walks you through the mode and parameters one question at a time, shows a cost estimate, and asks you to confirm before generating. Generation takes minutes, so the job runs in the background and the AI notifies you when the video is ready, with a URL, duration, resolution, and credit cost.
Video generation always runs with --no-wait, so the CLI returns a task id immediately and the AI polls in the background (10s interval). If you only have a task id, check progress with listenhub video get <taskId> --json.
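The background polling described above can be sketched as a small shell loop. This is a minimal sketch, not the skill's actual implementation: it assumes the JSON printed by listenhub video get contains a string field named "status" (the field name is an assumption), and it treats any status other than the documented pending, generating, uploading, and success as a failure. The function names extract_status and poll_task are hypothetical.

```shell
# Sketch: poll a video task until it leaves the in-progress states.
# ASSUMPTION: the JSON from `listenhub video get <taskId> --json` carries a
# string field named "status"; the docs do not confirm the field name.

POLL_INTERVAL="${POLL_INTERVAL:-10}"   # seconds between checks (10s per the text)

# Crude extraction of a "status" string value from JSON on stdin.
# A real script would likely use a JSON tool such as jq instead.
extract_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll until the status is no longer pending/generating/uploading.
# Prints the final status; returns 0 only when it is "success".
poll_task() {
  local task_id="$1"
  local status
  while :; do
    status="$(listenhub video get "$task_id" --json | extract_status)"
    case "$status" in
      pending|generating|uploading) sleep "$POLL_INTERVAL" ;;
      success) echo "$status"; return 0 ;;
      *) echo "${status:-unknown}"; return 1 ;;
    esac
  done
}
```

Calling poll_task with a task id blocks until the task finishes, so it suits scripts rather than interactive use, where the AI already polls for you.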
Models
Pick the model by the job. HappyHorse is the default and the only family that edits existing video; SeeDance adds last-frame and reference-audio support; PixVerse is the only family with lip sync, plus a set of atomic capabilities (mimic, restyle, fusion, transition, marketing agent).
| Capability | HappyHorse (default) | SeeDance | PixVerse |
|---|---|---|---|
| Text-to-video | Yes | Yes | Yes (text_to_video) |
| Image-to-video (first-frame) | Yes | Yes (+ last-frame) | Yes (image_to_video) |
| Reference image | Yes (1–9, [Image N] syntax) | Yes | Yes (fusion, @refName) |
| Video edit | Yes | No | No |
| Lip sync | No | No | Yes (lip_sync, audio or TTS) |
| Motion transfer / mimic | No | No | Yes (mimic, locked 720p) |
| Restyle | No | No | Yes (restyle) |
| Transition (first → last) | No | Yes (frame mode) | Yes (transition / multi_transition) |
| Reference video | No (use video edit) | Yes | Yes (mimic / lip_sync source) |
| Reference audio | No | Yes | Yes (lip_sync) |
| Max resolution | 1080p | 1080p | 1080p |
| Resolution options | 720p, 1080p | 480p, 720p, 1080p | 360p, 540p, 720p, 1080p |
| Duration range | 3–15s | 4–15s | 1–60s (agent: 20/30/60) |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4, 4:5, 5:4 | 16:9, 9:16, 1:1, 4:3, 3:4, 21:9 | 9:16, 16:9, 1:1, 4:3, 3:4 |
Lip sync, mimic, restyle, fusion, transition, and the marketing agent are PixVerse-only. HappyHorse and SeeDance do not support them.
SeeDance model variants
When you pick SeeDance, choose between two variants:
| Model | Notes |
|---|---|
| doubao-seedance-2-pro | Higher quality; required for 1080p; supports last-frame and reference-audio |
| doubao-seedance-2-fast | Faster; selecting 1080p auto-upgrades to pro |
PixVerse is OpenAPI-only — it lives under listenhub openapi video pixverse and uses public-network URLs for all media (no local file upload). If you want lip sync, mimic, restyle, fusion, transition, or the marketing agent but only have internal-auth login configured, set up an API key first with listenhub openapi config set-key.
Modes
The AI routes to a mode based on what reference material you have. HappyHorse and SeeDance share the listenhub video create command; PixVerse uses listenhub openapi video pixverse generate with an explicit --capability.
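The command split described above can be captured in a small helper. This is an illustrative sketch only: command_for_model is a hypothetical function name, the model values are the ones listed on this page, and the capability name is passed through without validation.

```shell
# Sketch: print the CLI command prefix for a given model.
# HappyHorse and SeeDance share `listenhub video create`; PixVerse needs the
# OpenAPI command plus an explicit --capability.
# Usage: command_for_model <model> [capability]
command_for_model() {
  local model="$1"
  local capability="$2"
  case "$model" in
    happyhorse|doubao-seedance-2-pro|doubao-seedance-2-fast)
      echo "listenhub video create --model $model" ;;
    pixverse)
      if [ -z "$capability" ]; then
        echo "pixverse requires an explicit capability" >&2
        return 1
      fi
      echo "listenhub openapi video pixverse generate --capability $capability" ;;
    *)
      echo "unknown model: $model" >&2
      return 1 ;;
  esac
}
```

For example, command_for_model pixverse lip_sync prints the PixVerse prefix, while command_for_model pixverse with no capability fails early instead of producing an incomplete command.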
Generate a video from a text prompt only, no reference media. Available on all three model families.
listenhub video create \
--prompt "cyberpunk city at night, neon reflections on wet streets" \
--model "happyhorse" \
--resolution "1080p" \
--ratio "16:9" \
--duration 5 \
--no-wait --json
Animate a still image as the first frame. SeeDance can also take a last-frame image to interpolate a transition between two stills.
Image requirements: jpg, jpeg, png, or webp; local files up to 20 MB; width and height ≥ 300px; aspect ratio between 1:2.5 and 2.5:1.
listenhub video create \
--prompt "bring the scene to life with smooth motion" \
--model "happyhorse" \
--resolution "1080p" \
--duration 5 \
--first-frame "/path/to/scene.png" \
--no-wait --json
For SeeDance frame mode, add --last-frame and use a doubao-seedance-2-* model.
HappyHorse image-to-video has no --ratio — the output ratio is determined by the input image. SeeDance still accepts --ratio.
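The first-frame image requirements listed above can be checked locally before uploading. The sketch below is a hedged example, not part of the CLI: check_first_frame is a hypothetical helper, the caller supplies width and height (plain shell cannot read image dimensions), and "20 MB" is assumed to mean 20 * 1024 * 1024 bytes.

```shell
# Sketch: validate a first-frame image against the documented limits.
# Usage: check_first_frame <path> <width_px> <height_px>
# ASSUMPTION: 20 MB is interpreted as 20 * 1024 * 1024 bytes.
check_first_frame() {
  local path="$1"
  local width="$2"
  local height="$3"
  local ext size

  [ -f "$path" ] || { echo "file not found: $path" >&2; return 1; }

  # Extension check, case-insensitive.
  ext="$(printf '%s' "${path##*.}" | tr '[:upper:]' '[:lower:]')"
  case "$ext" in
    jpg|jpeg|png|webp) ;;
    *) echo "unsupported format: $ext" >&2; return 1 ;;
  esac

  size="$(wc -c < "$path" | tr -d ' ')"
  if [ "$size" -gt $((20 * 1024 * 1024)) ]; then
    echo "file larger than 20 MB" >&2; return 1
  fi

  if [ "$width" -lt 300 ] || [ "$height" -lt 300 ]; then
    echo "width and height must be at least 300px" >&2; return 1
  fi

  # Ratio w/h must lie between 1:2.5 and 2.5:1.
  # In integers: 2*h <= 5*w and 2*w <= 5*h.
  if [ $((2 * height)) -gt $((5 * width)) ] || [ $((2 * width)) -gt $((5 * height)) ]; then
    echo "aspect ratio outside 1:2.5 to 2.5:1" >&2; return 1
  fi

  echo "ok"
}
```

The integer comparison avoids floating-point arithmetic, which POSIX shell does not provide.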
Supply 1–9 reference images to guide style or characters. With HappyHorse, refer to specific images in the prompt using [Image 1], [Image 2], and so on.
Image requirements: jpg, jpeg, png, or webp; up to 20 MB each; HappyHorse recommends short edge ≥ 400px.
listenhub video create \
--prompt "[Image 1]'s character walking through [Image 2]'s street" \
--model "happyhorse" \
--resolution "1080p" \
--ratio "16:9" \
--duration 5 \
--reference-image "/path/to/character.png" \
--reference-image "/path/to/scene.png" \
--no-wait --json
SeeDance reference mode additionally accepts up to 3 reference videos (mp4/mov, ≤ 50 MB) and up to 3 reference audios (mp3/wav, ≤ 20 MB, paired with an image or video).
PixVerse only. Drive a character's lips with either an audio file or text-to-speech. The source video must already exist on PixVerse — reference it by --source-video-id or by a prior succeeded task with --source-task-id.
Drive with an audio file (one public audio URL, 5–60s):
listenhub openapi video pixverse generate \
--capability lip_sync \
--source-video-id "abc123" \
--audio "https://example.com/voice.mp3" \
--quality 720p \
--no-wait --json
Drive with text-to-speech (nested tts, no --audio):
listenhub openapi video pixverse generate \
--capability lip_sync \
--source-task-id "task_xyz" \
--pixverse-json '{"tts":{"speakerId":"speaker_01","content":"Welcome to this episode"}}' \
--quality 720p \
--no-wait --json
Provide either an audio file or TTS, never both; passing both is rejected. For TTS, use the nested --pixverse-json '{"tts":{...}}'; do not use --lip-sync-tts / --lip-sync-speaker-id / --lip-sync-content, which the contract does not accept.
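The either-or rule above can be enforced before the CLI is ever called. The following is a sketch under stated assumptions: lip_sync_args is a hypothetical helper, it only prints the arguments (one per line) rather than running anything, it reuses the 720p quality from the examples, and it does no JSON escaping, so the speaker id and content are assumed to contain no double quotes or backslashes.

```shell
# Sketch: build lip_sync arguments from exactly one driver (audio OR tts).
# Usage: lip_sync_args <source_video_id> <audio_url> <speaker_id> <content>
# Pass "" for the fields you are not using.
# ASSUMPTION: speaker_id and content need no JSON escaping.
lip_sync_args() {
  local source_id="$1"
  local audio_url="$2"
  local speaker_id="$3"
  local content="$4"

  if [ -n "$audio_url" ] && { [ -n "$speaker_id" ] || [ -n "$content" ]; }; then
    echo "provide audio or TTS, not both" >&2
    return 1
  fi
  if [ -z "$audio_url" ] && { [ -z "$speaker_id" ] || [ -z "$content" ]; }; then
    echo "provide an audio URL, or both a speaker id and content" >&2
    return 1
  fi

  # Rebuild the positional parameters as the command line to print.
  set -- listenhub openapi video pixverse generate \
    --capability lip_sync --source-video-id "$source_id"
  if [ -n "$audio_url" ]; then
    set -- "$@" --audio "$audio_url"
  else
    set -- "$@" --pixverse-json \
      "$(printf '{"tts":{"speakerId":"%s","content":"%s"}}' "$speaker_id" "$content")"
  fi
  set -- "$@" --quality 720p --no-wait --json

  printf '%s\n' "$@"
}
```

Printing one argument per line keeps values that contain spaces, such as the TTS JSON, unambiguous when you review them.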
Video edit (HappyHorse)
Edit an existing clip — change style, replace the background, restyle motion. HappyHorse only; if you ask for this on SeeDance, the AI switches you to HappyHorse.
Video requirements: mp4/mov (H.264 recommended); 3–60s input (output capped at 15s); ≤ 100 MB; short edge ≥ 360px, long edge ≤ 4096px. Optionally pass 0–5 reference images.
listenhub video create \
--prompt "replace the background with a deep starry sky, keep the subject's motion" \
--model "happyhorse" \
--resolution "1080p" \
--reference-video "/path/to/input.mp4" \
--audio-setting "origin" \
--no-wait --json
--audio-setting controls audio: auto lets the model decide, origin keeps the original audio. Video edit has no --ratio or --duration; the output matches the input video.
Other PixVerse capabilities
PixVerse exposes additional atomic capabilities through --capability, all OpenAPI-only with URL inputs:
| Capability | Inputs | Constraints |
|---|---|---|
| mimic (motion transfer) | 1 image + 1 video | Quality locked to 720p; motion source 5–30s |
| restyle | --source-video-id (or --source-task-id) + --restyle-id | None |
| fusion | Nested imageReferences (1–8), prompt uses @refName | Top-level --image must be empty |
| transition / multi_transition | Nested multiTransition keyframes (2–7) | Default quality 360p |
| agent (ad_master / promo_mix) | Prompt + images | Quality 720p/1080p only; duration 20/30/60 only; promo_mix needs ≥ 4 images |
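The marketing agent constraints in the table above are easy to pre-check in a script. This is a minimal sketch: check_agent is a hypothetical helper, and it assumes ad_master and promo_mix are the values passed to --agent-type, which the table suggests but does not state outright.

```shell
# Sketch: validate marketing-agent parameters against the table.
# Usage: check_agent <agent_type> <quality> <duration_seconds> <image_count>
# ASSUMPTION: ad_master / promo_mix are the --agent-type values.
check_agent() {
  local agent_type="$1"
  local quality="$2"
  local duration="$3"
  local images="$4"

  case "$agent_type" in
    ad_master|promo_mix) ;;
    *) echo "unknown agent type: $agent_type" >&2; return 1 ;;
  esac
  case "$quality" in
    720p|1080p) ;;
    *) echo "agent quality must be 720p or 1080p" >&2; return 1 ;;
  esac
  case "$duration" in
    20|30|60) ;;
    *) echo "agent duration must be 20, 30, or 60 seconds" >&2; return 1 ;;
  esac
  if [ "$agent_type" = "promo_mix" ] && [ "$images" -lt 4 ]; then
    echo "promo_mix needs at least 4 images" >&2; return 1
  fi

  echo "ok"
}
```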
Parameters
The AI asks for these one at a time and applies sensible session defaults. For HappyHorse and SeeDance, ratio and duration use --ratio / --duration; PixVerse uses --quality and --aspect-ratio instead of --resolution / --ratio.
| Parameter | Flag | Notes |
|---|---|---|
| Prompt | --prompt | Free text. HappyHorse ≤ 2500 (Chinese) / ≤ 5000 (non-Chinese); SeeDance ≤ 500; PixVerse ≤ 2048 |
| Model | --model | happyhorse (default), doubao-seedance-2-pro, doubao-seedance-2-fast, pixverse |
| Resolution | --resolution | HappyHorse: 720p/1080p; SeeDance: 480p/720p/1080p (480p is SeeDance-only) |
| Aspect ratio | --ratio | Not used for HappyHorse image-to-video or video edit (ratio follows the input); SeeDance image-to-video still accepts it |
| Duration | --duration | Seconds. HappyHorse 3–15, SeeDance 4–15 |
| First frame | --first-frame | Image-to-video source image |
| Last frame | --last-frame | SeeDance frame mode only |
| Reference image | --reference-image | Repeatable; 1–9 (HappyHorse), or video-edit references (0–5) |
| Reference video | --reference-video | Video-edit input (HappyHorse) or SeeDance reference |
| Audio setting | --audio-setting | Video edit only: auto or origin |
| Seed | --seed | Optional; for reproducing a result |
PixVerse-specific flags: --capability, --quality (360p/540p/720p/1080p), --aspect-ratio (9:16/16:9/1:1/4:3/3:4), --source-video-id / --source-task-id, --audio, --agent-type, --restyle-id, and --pixverse-json for nested payloads (tts, imageReferences, multiTransition).
Some choices are auto-corrected: 480p on HappyHorse falls back to 720p; 1080p on doubao-seedance-2-fast upgrades to doubao-seedance-2-pro. The AI tells you when it adjusts.
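The two auto-corrections above amount to a small normalization step. The sketch below shows that logic with a hypothetical function name, normalize_choice; it covers only the two rules stated here, and leaves every other combination untouched.

```shell
# Sketch: apply the two documented auto-corrections.
# Usage: normalize_choice <model> <resolution>
# Prints "<model> <resolution>" after correction.
normalize_choice() {
  local model="$1"
  local resolution="$2"

  if [ "$model" = "happyhorse" ] && [ "$resolution" = "480p" ]; then
    # 480p is SeeDance-only; HappyHorse falls back to 720p.
    resolution="720p"
  elif [ "$model" = "doubao-seedance-2-fast" ] && [ "$resolution" = "1080p" ]; then
    # 1080p requires the pro variant.
    model="doubao-seedance-2-pro"
  fi

  echo "$model $resolution"
}
```

For instance, normalize_choice doubao-seedance-2-fast 1080p prints doubao-seedance-2-pro 1080p, while normalize_choice happyhorse 1080p is returned unchanged.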
Estimating credits
Before generating, the AI runs an estimate. To check cost yourself, mirror the create parameters against the estimate command:
# HappyHorse / SeeDance
listenhub video estimate --model "happyhorse" --resolution "1080p" --ratio "16:9" --duration 5 --json
# PixVerse — mirror the capability + quality + duration
listenhub openapi video pixverse estimate --capability text_to_video --model pixverse --quality 720p --duration 5 --json
For video edit, add --has-video-input and --input-video-duration <seconds>.
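One way to keep the estimate and the create call in step for HappyHorse and SeeDance is to derive both from the same parameter list. The helpers below are hypothetical (print_cmd, estimate_cmd, create_cmd) and only print the commands for review; they do not run listenhub, and the printed prompt quoting is for display rather than for eval.

```shell
# Sketch: derive matching estimate and create commands from shared parameters.
# Nothing is executed; the commands are printed so they can be reviewed first.

# Join all arguments with single spaces.
print_cmd() { echo "$*"; }

# Usage: estimate_cmd <shared flags...>
estimate_cmd() {
  print_cmd listenhub video estimate "$@" --json
}

# Usage: create_cmd <prompt> <shared flags...>
create_cmd() {
  local prompt="$1"
  shift
  print_cmd listenhub video create --prompt "\"$prompt\"" "$@" --no-wait --json
}

# Print the estimate first, then the create command with identical parameters.
plan() {
  local prompt="$1"
  shift
  estimate_cmd "$@"
  create_cmd "$prompt" "$@"
}
```

Calling plan "cyberpunk city" --model happyhorse --resolution 1080p --ratio 16:9 --duration 5 prints both lines with the same model, resolution, ratio, and duration, so the estimate cannot drift from what is actually generated.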
Output
Status flows pending → generating → uploading → success. On success the AI reports the video URL, duration, resolution, ratio, seed, and credits charged.
Output behavior follows the outputMode set during config:
- inline (default) or both: the video URL and metadata are shown directly in the conversation.
- download or both: the file is also saved to the current working directory with a topic-based name (e.g. cyberpunk-city.mp4). Names are de-duplicated automatically.
To review past work, listenhub video get <taskId> --json returns a single task and listenhub video list --json lists recent tasks. Global flags apply: --json / -j, --no-wait, and --timeout <s>.
API Reference
See the AI Video API reference for endpoint paths, request parameters, and response fields.