ListenHub Voice
Generate audio end to end with ListenHub-Voice-1.0 — narration, sound effects, multi-voice dialogue, voice cloning, or image-to-audio, from a single text script.
Turn a text script into a finished audio track with the end-to-end ListenHub-Voice-1.0 model. Unlike stitched-together TTS, the model produces one continuous take that can carry sound effects, multiple speakers, cloned voices, or narration derived from a reference image.
ListenHub Voice is currently free for a limited time. Audio generated during this period is for personal, non-commercial use only. Billing will resume when the free period ends; details will follow in a later announcement.
Trigger
Invoke this skill with /listenhub-voice, or use any of these phrases:
| Phrase | Language |
|---|---|
| generate audio / sound effect | English |
| end-to-end audio / image to audio | English |
| 生成音频 / 语音生成 | Chinese |
| 端到端音频 / 图片转音频 | Chinese |
| 多音色对白 / 参考音频克隆 / 音效生成 | Chinese |
Requires ListenHub Skills to be installed — see Getting Started.
Quick Example
Generate a 20-second clip: "Welcome to ListenHub — here is your daily briefing."
The AI collects the script, voice, and any tuning, then submits one async task and polls it until the audio is ready. You get a link to listen and download.
When to Use
Plain text / sound effects
Narrate a script and let the model bake in the sound effects it describes — no voice selection needed.
Single voice
Read a script in one built-in voice or a platform voice_type.
Multi-voice dialogue
Assign 2–3 voices to a conversation, one line at a time.
Voice cloning
Clone a voice from a short reference audio clip and speak your script in it.
Image to audio
Turn a reference image into a short narrated clip.
For plain single-voice narration with an already-registered ListenHub speaker, /tts is lower latency. Use /listenhub-voice when you want sound effects, dialogue, cloning, or image-driven audio in one pass.
Modes
Plain text / sound effects
No voices, no image: the model synthesizes the script and any sound effects described in it.
Generate audio: "Rain patters on the window as a distant train rolls by."
Single voice
One voice reads the whole script: a built-in ListenHub voice or a platform voice_type.
Read this in a warm female voice: "Here are today's headlines."
Multi-voice dialogue
Two or three voices in conversation. Each line is assigned to a voice with an @音频N prefix ("音频" means "audio"), in order.
Make a two-voice dialogue: @音频1 asks a question, @音频2 answers.
Every voice in a multi-voice request must be reference-audio-capable. A built-in voice_type is single-voice only.
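As a rough illustration of the @音频N convention, the sketch below assembles a dialogue script from a list of turns. The function name `build_dialogue_script`, the space after the prefix, the newline between turns, and the choice to count prefixes toward the 1400-character limit are all assumptions made for this example; the page does not specify them.

```python
MAX_SCRIPT_CHARS = 1400  # text limit from the Parameters table


def build_dialogue_script(turns, voice_count):
    """Join (voice_index, text) turns into one script with @音频N prefixes.

    voice_index is 1-based and is assumed to refer to the Nth voice
    supplied with the request.
    """
    if voice_count not in (2, 3):
        raise ValueError("multi-voice dialogue takes 2 or 3 voices")
    lines = []
    for voice_index, text in turns:
        if not 1 <= voice_index <= voice_count:
            raise ValueError(f"voice {voice_index} is outside 1..{voice_count}")
        # Assumed layout: prefix, one space, then the spoken line.
        lines.append(f"@音频{voice_index} {text.strip()}")
    script = "\n".join(lines)
    if len(script) > MAX_SCRIPT_CHARS:
        raise ValueError(
            f"script is {len(script)} characters; the limit is {MAX_SCRIPT_CHARS}"
        )
    return script


script = build_dialogue_script(
    [
        (1, "What is on the agenda today?"),
        (2, "Three items, starting with the launch."),
    ],
    voice_count=2,
)
print(script)
```

Checking the voice index and the length before submitting keeps obvious mistakes from reaching the service, whatever the real request format turns out to be.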
Voice cloning
Clone a voice from a short public reference clip, then speak your script in it.
Clone the voice in https://example.com/host.mp3 and read my intro.
Image to audio
Turn a reference image into a short narrated clip. Image mode is mutually exclusive with voices.
Describe this image as a 15-second narrated clip.
Parameters
| Parameter | Options | Default |
|---|---|---|
| Text | Up to 1400 characters | required |
| Voices | 1–3 built-in voices or reference clips | None (plain text) |
| Image | One reference image (mutually exclusive with voices) | None |
| Speech rate | -50 to 100 | Model default |
| Loudness | -50 to 100 | Model default |
| Pitch | -12 to 12 | Model default |
| Format | mp3, wav, pcm, ogg_opus | mp3 |
| Duration hint | 1 to 110 seconds | None |
| Watermark | On / off | Off |
Voices and an image are mutually exclusive — send at most one. Built-in voices are fetched from the API; ask "what voices are available?" to browse the list with demo audio.
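The limits in the table can be checked before a request is submitted. The following is a minimal sketch of such a check; the field names (`text`, `voices`, `image`, `speech_rate`, `loudness`, `pitch`, `duration_hint`, `format`) and the `validate_request` helper are hypothetical and may not match the actual API fields, so treat it as an illustration of the rules rather than the service's own validation.

```python
ALLOWED_FORMATS = {"mp3", "wav", "pcm", "ogg_opus"}

# Inclusive numeric ranges taken from the Parameters table.
RANGES = {
    "speech_rate": (-50, 100),
    "loudness": (-50, 100),
    "pitch": (-12, 12),
    "duration_hint": (1, 110),  # seconds
}


def validate_request(params):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    text = params.get("text") or ""
    if not text:
        errors.append("text is required")
    elif len(text) > 1400:
        errors.append("text exceeds 1400 characters")

    voices = params.get("voices") or []
    if len(voices) > 3:
        errors.append("at most 3 voices are allowed")
    if voices and params.get("image"):
        errors.append("voices and image are mutually exclusive")

    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not low <= value <= high:
            errors.append(f"{name} must be between {low} and {high}")

    # mp3 is the documented default when no format is given.
    if params.get("format", "mp3") not in ALLOWED_FORMATS:
        errors.append("format must be one of mp3, wav, pcm, ogg_opus")

    return errors


problems = validate_request(
    {"text": "Hello", "voices": ["voice-a"], "image": "cover.png", "pitch": 20}
)
print(problems)
```

Returning every problem at once, instead of stopping at the first, lets a caller fix a request in a single round.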
Output
After the task reaches success, you receive:
- Listen link: stream the finished audio
- Audio download: say "download audio" to save it locally (the file extension follows your chosen format)
- Task detail: status, billed duration, and credits charged
Generation is asynchronous: the task moves through pending → generating → uploading → success. On failure, the reason is reported and any reserved credits are refunded.
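The task lifecycle above suggests a simple polling loop. The sketch below shows one possible shape for it. The status names pending, generating, uploading, and success come from this page; everything else is an assumption for illustration, including the `wait_for_task` helper, the `status` and `reason` keys, the polling interval, and the rule that any other status counts as a failure. The status lookup is passed in as a callable so the example does not depend on a real endpoint.

```python
import time

# In-progress states named on this page; "success" is the terminal success state.
IN_PROGRESS = {"pending", "generating", "uploading"}


def wait_for_task(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_status() until the task succeeds, fails, or times out.

    fetch_status is a caller-supplied function returning a dict with a
    "status" key (hypothetical shape).
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "success":
            return task
        if status not in IN_PROGRESS:
            # Assumption: any status outside the known set is a failure.
            raise RuntimeError(
                f"task ended with status {status!r}: {task.get('reason')}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval)


# Simulated status sequence standing in for real API responses.
fake_responses = iter(
    [
        {"status": "pending"},
        {"status": "generating"},
        {"status": "uploading"},
        {"status": "success", "audio_url": "https://example.com/out.mp3"},
    ]
)
result = wait_for_task(lambda: next(fake_responses), sleep=lambda _seconds: None)
print(result["status"])
```

Injecting `fetch_status` and `sleep` keeps the loop testable without network access or real delays.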
API Reference
See the ListenHub Voice API reference for the underlying endpoints, request fields, and error codes.