
Video Generation

Generate AI videos from text, images, or reference media — with image-to-video, video editing, and PixVerse lip sync.

Generate AI videos from a text prompt or reference materials using the listenhub video CLI. Animate a still image, edit an existing clip, or drive a character's lips with audio or text-to-speech. Three model families cover different jobs: HappyHorse, SeeDance, and PixVerse.

Trigger

Invoke this skill with /video-gen, or use any of these phrases:

| Phrase | Language |
| --- | --- |
| video generation / text to video / create video | English |
| video edit / lipsync / pixverse | English |
| 生成视频 / 做视频 / 视频生成 | Chinese |
| 视频编辑 / 口型 / 对口型 | Chinese |

Requires ListenHub Skills to be installed — see Getting Started.

For narrated explainer videos with AI visuals, use /explainer instead.

Quick Example

Generate a video: cyberpunk city at night, 16:9, 5 seconds

The AI walks you through the mode and parameters one question at a time, shows a cost estimate, and asks you to confirm before generating. Generation takes minutes — the job runs in the background and the AI notifies you when the video is ready, with a URL, duration, resolution, and credit cost.

Video generation always runs with --no-wait, so the CLI returns a task id immediately and the AI polls in the background (10s interval). If you only have a task id, check progress with listenhub video get <taskId> --json.
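If you want to follow a task by hand, the background polling can be approximated with a small loop. This is a minimal sketch, not the skill's own implementation: it assumes the --json output of listenhub video get contains a top-level "status" string whose values match the status flow listed under Output (pending, generating, uploading, success). The field name, the sed-based extraction, and the helper name poll_video are illustrative assumptions.

```shell
# Hypothetical helper: poll a task until it succeeds.
# Assumes `listenhub video get <taskId> --json` prints JSON containing a
# top-level "status" string (the field name is an assumption).
poll_video() (
  task_id="$1"
  interval="${2:-10}"   # seconds between polls; 10 mirrors the documented interval
  max_tries="${3:-60}"  # give up after this many polls
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    status="$(listenhub video get "$task_id" --json |
      sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
      head -n 1)"
    case "$status" in
      success) echo "success"; return 0 ;;
      pending|generating|uploading) sleep "$interval" ;;
      *) echo "unexpected status: ${status:-none}" >&2; return 1 ;;
    esac
    tries=$((tries + 1))
  done
  echo "timed out" >&2
  return 2
)
```

Call it as poll_video <taskId>. Any status outside the four documented values is treated as an error, because the source does not list failure states.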

Models

Pick the model by the job. HappyHorse is the default and the only family that edits existing video; SeeDance adds last-frame and reference-audio support; PixVerse is the only family with lip sync, plus a set of atomic capabilities (mimic, restyle, fusion, transition, marketing agent).

| Capability | HappyHorse (default) | SeeDance | PixVerse |
| --- | --- | --- | --- |
| Text-to-video | Yes | Yes | Yes (text_to_video) |
| Image-to-video (first-frame) | Yes | Yes (+ last-frame) | Yes (image_to_video) |
| Reference image | Yes (1–9, [Image N] syntax) | Yes | Yes (fusion, @refName) |
| Video edit | Yes | No | No |
| Lip sync | No | No | Yes (lip_sync, audio or TTS) |
| Motion transfer / mimic | No | No | Yes (mimic, locked 720p) |
| Restyle | No | No | Yes (restyle) |
| Transition (first → last) | No | Yes (frame mode) | Yes (transition / multi_transition) |
| Reference video | No (use video edit) | Yes | Yes (mimic / lip_sync source) |
| Reference audio | No | Yes | Yes (lip_sync) |
| Max resolution | 1080p | 1080p | 1080p |
| Resolution options | 720p, 1080p | 480p, 720p, 1080p | 360p, 540p, 720p, 1080p |
| Duration range | 3–15s | 4–15s | 1–60s (agent: 20/30/60) |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4, 4:5, 5:4 | 16:9, 9:16, 1:1, 4:3, 3:4, 21:9 | 9:16, 16:9, 1:1, 4:3, 3:4 |

The lip_sync, mimic, restyle, fusion, transition, and agent capabilities are PixVerse-only; HappyHorse and SeeDance do not expose them. SeeDance can still interpolate between a first and a last frame through its frame mode.
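The resolution and duration limits in the table can be checked before you submit a job. The sketch below is a strict pre-flight check under stated assumptions: the function name check_limits and the family labels happyhorse, seedance, and pixverse are illustrative, and the skill itself may adjust some choices rather than reject them.

```shell
# Hypothetical pre-flight check against the documented per-family limits.
# Usage: check_limits <family> <resolution> <duration-seconds>
check_limits() (
  family="$1"; resolution="$2"; duration="$3"
  case "$family" in
    happyhorse) resolutions="720p 1080p";           min=3; max=15 ;;
    seedance)   resolutions="480p 720p 1080p";      min=4; max=15 ;;
    pixverse)   resolutions="360p 540p 720p 1080p"; min=1; max=60 ;;
    *) echo "unknown family: $family" >&2; return 1 ;;
  esac
  # Membership test: look for the resolution as a whole word in the list.
  case " $resolutions " in
    *" $resolution "*) ;;
    *) echo "$family does not offer $resolution" >&2; return 1 ;;
  esac
  if [ "$duration" -lt "$min" ] || [ "$duration" -gt "$max" ]; then
    echo "$family duration must be ${min}-${max}s" >&2
    return 1
  fi
  echo "ok"
)
```

For example, check_limits seedance 480p 4 passes, while check_limits happyhorse 480p 5 is rejected because 480p is offered only by SeeDance.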

SeeDance model variants

When you pick SeeDance, choose between two variants:

| Model | Notes |
| --- | --- |
| doubao-seedance-2-pro | Higher quality; required for 1080p; supports last-frame and reference-audio |
| doubao-seedance-2-fast | Faster; selecting 1080p auto-upgrades to pro |

PixVerse is OpenAPI-only — it lives under listenhub openapi video pixverse and uses public-network URLs for all media (no local file upload). If you want lip sync, mimic, restyle, fusion, transition, or the marketing agent but only have internal-auth login configured, set up an API key first with listenhub openapi config set-key.

Modes

The AI routes to a mode based on what reference material you have. HappyHorse and SeeDance share the listenhub video create command; PixVerse uses listenhub openapi video pixverse generate with an explicit --capability.

Text-to-video

Generate a video from a text prompt only, no reference media. Available on all three model families.

listenhub video create \
  --prompt "cyberpunk city at night, neon reflections on wet streets" \
  --model "happyhorse" \
  --resolution "1080p" \
  --ratio "16:9" \
  --duration 5 \
  --no-wait --json

Image-to-video

Animate a still image as the first frame. SeeDance can also take a last-frame image to interpolate a transition between two stills.

Image requirements: jpg, jpeg, png, or webp; local files up to 20 MB; width and height ≥ 300px; aspect ratio between 1:2.5 and 2.5:1.

listenhub video create \
  --prompt "bring the scene to life with smooth motion" \
  --model "happyhorse" \
  --resolution "1080p" \
  --duration 5 \
  --first-frame "/path/to/scene.png" \
  --no-wait --json

For SeeDance frame mode, add --last-frame and use a doubao-seedance-2-* model.

HappyHorse image-to-video has no --ratio — the output ratio is determined by the input image. SeeDance still accepts --ratio.
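The first-frame image requirements listed above can be verified locally before uploading. This is a rough sketch: the function name check_first_frame is hypothetical, the caller supplies width, height, and size (the source names no tool for measuring them), "20 MB" is read as 20 × 1024 × 1024 bytes, and the extension match is case-sensitive.

```shell
# Hypothetical pre-flight check for a first-frame image.
# Usage: check_first_frame <file-name> <width-px> <height-px> <size-bytes>
check_first_frame() (
  name="$1"; width="$2"; height="$3"; bytes="$4"
  case "$name" in
    *.jpg|*.jpeg|*.png|*.webp) ;;
    *) echo "unsupported format" >&2; return 1 ;;
  esac
  # Assumption: 20 MB means 20 * 1024 * 1024 bytes.
  [ "$bytes" -le $((20 * 1024 * 1024)) ] ||
    { echo "file too large" >&2; return 1; }
  [ "$width" -ge 300 ] && [ "$height" -ge 300 ] ||
    { echo "image too small" >&2; return 1; }
  # Ratio between 1:2.5 and 2.5:1, in integers: 2w <= 5h and 2h <= 5w.
  [ $((2 * width)) -le $((5 * height)) ] &&
    [ $((2 * height)) -le $((5 * width)) ] ||
    { echo "aspect ratio out of range" >&2; return 1; }
  echo "ok"
)
```

A 1000×400 image sits exactly on the 2.5:1 boundary and passes; a 3000×1000 image (3:1) does not.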

Reference image

Supply 1–9 reference images to guide style or characters. With HappyHorse, refer to specific images in the prompt using [Image 1], [Image 2], and so on.

Image requirements: jpg, jpeg, png, or webp; up to 20 MB each; HappyHorse recommends short edge ≥ 400px.

listenhub video create \
  --prompt "[Image 1]'s character walking through [Image 2]'s street" \
  --model "happyhorse" \
  --resolution "1080p" \
  --ratio "16:9" \
  --duration 5 \
  --reference-image "/path/to/character.png" \
  --reference-image "/path/to/scene.png" \
  --no-wait --json

SeeDance reference mode additionally accepts up to 3 reference videos (mp4/mov, ≤ 50 MB) and up to 3 reference audios (mp3/wav, ≤ 20 MB, paired with an image or video).
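Because --reference-image is repeatable, a script usually has to turn a list of paths into repeated flags. The wrapper below is one possible way to do that for HappyHorse while enforcing the 1–9 limit. The function name create_with_references is hypothetical, and the fixed resolution, ratio, and duration are simply copied from the example above.

```shell
# Hypothetical wrapper: pass 1-9 image paths as repeated --reference-image flags.
# Usage: create_with_references <prompt> <image-path>...
create_with_references() (
  prompt="$1"; shift
  if [ "$#" -lt 1 ] || [ "$#" -gt 9 ]; then
    echo "expected 1-9 reference images, got $#" >&2
    return 1
  fi
  n="$#"
  # Append "--reference-image <path>" for every path, then drop the
  # original n paths so only the flag pairs remain in "$@".
  for image in "$@"; do
    set -- "$@" --reference-image "$image"
  done
  shift "$n"
  listenhub video create --prompt "$prompt" --model "happyhorse" \
    --resolution "1080p" --ratio "16:9" --duration 5 \
    "$@" --no-wait --json
)
```

Rebuilding the positional parameters keeps paths that contain spaces intact, which string concatenation would not.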

Lip sync (PixVerse)

PixVerse only. Drive a character's lips with either an audio file or text-to-speech. The source video must already exist on PixVerse: reference it by --source-video-id, or by a previously succeeded task with --source-task-id.

Drive with an audio file (one public audio URL, 5–60s):

listenhub openapi video pixverse generate \
  --capability lip_sync \
  --source-video-id "abc123" \
  --audio "https://example.com/voice.mp3" \
  --quality 720p \
  --no-wait --json

Drive with text-to-speech (nested tts, no --audio):

listenhub openapi video pixverse generate \
  --capability lip_sync \
  --source-task-id "task_xyz" \
  --pixverse-json '{"tts":{"speakerId":"speaker_01","content":"Welcome to this episode"}}' \
  --quality 720p \
  --no-wait --json

Provide either an audio file or TTS, never both — passing both is rejected. For TTS, use the nested --pixverse-json '{"tts":{...}}'; do not use --lip-sync-tts / --lip-sync-speaker-id / --lip-sync-content, which the contract does not accept.
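The either-or rule is easy to enforce in a script before the request is sent. The helper below is a small illustrative sketch (the name lip_sync_driver is hypothetical): it takes an audio URL, a speaker id, and TTS content, with empty strings for unused inputs, and reports which driver applies or fails when both or neither are supplied.

```shell
# Hypothetical helper: decide which lip-sync driver to use.
# Usage: lip_sync_driver <audio-url> <speaker-id> <content>
# Pass "" for the inputs you are not using.
lip_sync_driver() (
  audio_url="$1"; speaker_id="$2"; content="$3"
  if [ -n "$audio_url" ] && { [ -n "$speaker_id" ] || [ -n "$content" ]; }; then
    echo "pass either audio or TTS, not both" >&2
    return 1
  fi
  if [ -n "$audio_url" ]; then
    echo "audio"   # caller adds: --audio "$audio_url"
  elif [ -n "$speaker_id" ] && [ -n "$content" ]; then
    echo "tts"     # caller adds: --pixverse-json with the nested tts object
  else
    echo "need an audio URL, or both a speaker id and content" >&2
    return 1
  fi
)
```

The helper only selects the driver. Building the nested tts JSON is left to the caller, since content may need escaping.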

Video edit (HappyHorse)

Edit an existing clip — change style, replace the background, restyle motion. HappyHorse only; if you ask for this on SeeDance, the AI switches you to HappyHorse.

Video requirements: mp4/mov (H.264 recommended); 3–60s input (output capped at 15s); ≤ 100 MB; short edge ≥ 360px, long edge ≤ 4096px. Optionally pass 0–5 reference images.

listenhub video create \
  --prompt "replace the background with a deep starry sky, keep the subject's motion" \
  --model "happyhorse" \
  --resolution "1080p" \
  --reference-video "/path/to/input.mp4" \
  --audio-setting "origin" \
  --no-wait --json

--audio-setting controls audio: auto lets the model decide, origin keeps the original audio. Video edit has no --ratio or --duration — the output matches the input video.

Other PixVerse capabilities

PixVerse exposes additional atomic capabilities through --capability, all OpenAPI-only with URL inputs:

| Capability | Inputs | Constraints |
| --- | --- | --- |
| mimic (motion transfer) | 1 image + 1 video | Quality locked to 720p; motion source 5–30s |
| restyle | --source-video-id (or --source-task-id) + --restyle-id | |
| fusion | Nested imageReferences (1–8), prompt uses @refName | Top-level --image must be empty |
| transition / multi_transition | Nested multiTransition keyframes (2–7) | Default quality 360p |
| agent (ad_master / promo_mix) | Prompt + images | Quality 720p/1080p only; duration 20/30/60 only; promo_mix needs ≥ 4 images |
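The agent row carries the most constraints, so a quick local check can save a rejected request. The sketch below encodes only the rules stated in the table; the function name check_agent and its argument order are illustrative assumptions.

```shell
# Hypothetical check for the PixVerse marketing agent constraints.
# Usage: check_agent <agent-type> <quality> <duration-seconds> <image-count>
check_agent() (
  agent_type="$1"; quality="$2"; duration="$3"; image_count="$4"
  case "$agent_type" in
    ad_master|promo_mix) ;;
    *) echo "unknown agent type: $agent_type" >&2; return 1 ;;
  esac
  case "$quality" in
    720p|1080p) ;;
    *) echo "agent quality must be 720p or 1080p" >&2; return 1 ;;
  esac
  case "$duration" in
    20|30|60) ;;
    *) echo "agent duration must be 20, 30, or 60" >&2; return 1 ;;
  esac
  if [ "$agent_type" = "promo_mix" ] && [ "$image_count" -lt 4 ]; then
    echo "promo_mix needs at least 4 images" >&2
    return 1
  fi
  echo "ok"
)
```

For instance, check_agent promo_mix 720p 30 3 fails on the image count, while the same call with 4 images passes.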

Parameters

The AI asks for these one at a time and applies sensible session defaults. HappyHorse and SeeDance take --resolution, --ratio, and --duration; PixVerse takes --quality and --aspect-ratio in place of --resolution and --ratio.

| Parameter | Flag | Notes |
| --- | --- | --- |
| Prompt | --prompt | Free text. HappyHorse ≤ 2500 (Chinese) / ≤ 5000 (non-Chinese); SeeDance ≤ 500; PixVerse ≤ 2048 |
| Model | --model | happyhorse (default), doubao-seedance-2-pro, doubao-seedance-2-fast, pixverse |
| Resolution | --resolution | HappyHorse: 720p/1080p; SeeDance: 480p/720p/1080p (480p is SeeDance-only) |
| Aspect ratio | --ratio | Not used for image-to-video or video edit (ratio follows input) |
| Duration | --duration | Seconds. HappyHorse 3–15, SeeDance 4–15 |
| First frame | --first-frame | Image-to-video source image |
| Last frame | --last-frame | SeeDance frame mode only |
| Reference image | --reference-image | Repeatable; 1–9 (HappyHorse), or video-edit references (0–5) |
| Reference video | --reference-video | Video-edit input (HappyHorse) or SeeDance reference |
| Audio setting | --audio-setting | Video edit only: auto or origin |
| Seed | --seed | Optional; for reproducing a result |

PixVerse-specific flags: --capability, --quality (360p/540p/720p/1080p), --aspect-ratio (9:16/16:9/1:1/4:3/3:4), --source-video-id / --source-task-id, --audio, --agent-type, --restyle-id, and --pixverse-json for nested payloads (tts, imageReferences, multiTransition).

Some choices are auto-corrected: 480p on HappyHorse falls back to 720p; 1080p on doubao-seedance-2-fast upgrades to doubao-seedance-2-pro. The AI tells you when it adjusts.
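The two corrections above amount to a small normalization step. As an illustration only (the function name normalize_choice is hypothetical and this is not the skill's actual code), the same rules can be applied in a script so that the values you log match what is submitted:

```shell
# Hypothetical mirror of the documented auto-corrections.
# Usage: normalize_choice <model> <resolution>
# Prints the corrected "<model> <resolution>" pair.
normalize_choice() (
  model="$1"; resolution="$2"
  # 480p is SeeDance-only, so HappyHorse falls back to 720p.
  if [ "$model" = "happyhorse" ] && [ "$resolution" = "480p" ]; then
    resolution="720p"
  fi
  # 1080p requires the pro variant, so fast is upgraded.
  if [ "$model" = "doubao-seedance-2-fast" ] && [ "$resolution" = "1080p" ]; then
    model="doubao-seedance-2-pro"
  fi
  echo "$model $resolution"
)
```

Calling normalize_choice doubao-seedance-2-fast 1080p prints doubao-seedance-2-pro 1080p; combinations that need no correction are echoed unchanged.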

Estimating credits

Before generating, the AI runs an estimate. To check cost yourself, mirror the create parameters against the estimate command:

# HappyHorse / SeeDance
listenhub video estimate --model "happyhorse" --resolution "1080p" --ratio "16:9" --duration 5 --json

# PixVerse — mirror the capability + quality + duration
listenhub openapi video pixverse estimate --capability text_to_video --model pixverse --quality 720p --duration 5 --json

For video edit, add --has-video-input and --input-video-duration <seconds>.
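Put together, a video edit estimate might look like the sketch below. It assumes the remaining flags mirror the video edit create call (model and resolution only, since video edit has no --ratio or --duration) and that --has-video-input is a plain switch; both are assumptions, as is the helper name estimate_video_edit. The 3–60s bound comes from the video edit input requirements.

```shell
# Hypothetical wrapper: estimate credits for a HappyHorse video edit.
# Usage: estimate_video_edit <resolution> <input-video-seconds>
estimate_video_edit() (
  resolution="$1"; input_seconds="$2"
  # Video edit accepts 3-60s input clips.
  if [ "$input_seconds" -lt 3 ] || [ "$input_seconds" -gt 60 ]; then
    echo "input video must be 3-60s" >&2
    return 1
  fi
  listenhub video estimate --model "happyhorse" --resolution "$resolution" \
    --has-video-input --input-video-duration "$input_seconds" --json
)
```

For example, estimate_video_edit 1080p 12 requests an estimate for a 12-second input clip.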

Output

Status flows pending → generating → uploading → success. On success the AI reports the video URL, duration, resolution, ratio, seed, and credits charged.

Output behavior follows the outputMode set during config:

  • inline (default) or both — the video URL and metadata are shown directly in the conversation.
  • download or both — the file is also saved to the current working directory with a topic-based name (e.g. cyberpunk-city.mp4). Names are de-duplicated automatically.

To review past work, listenhub video get <taskId> --json returns a single task and listenhub video list --json lists recent tasks. Global flags apply: --json / -j, --no-wait, and --timeout <s>.

API Reference

See the AI Video API reference for endpoint paths, request parameters, and response fields.
