ListenHub Voice

Generate audio end to end with ListenHub-Voice-1.0 — narration, sound effects, multi-voice dialogue, voice cloning, or image-to-audio, from a single text script.

Turn a text script into a finished audio track with the end-to-end ListenHub-Voice-1.0 model. Unlike stitched-together TTS, the model produces one continuous take that can carry sound effects, multiple speakers, cloned voices, or narration derived from a reference image.

ListenHub Voice is free for a limited time. Audio generated during the free period is for personal, non-commercial use only. Billing will resume when the free period ends; the date will be announced later.

Trigger

Invoke this skill with /listenhub-voice, or use any of these phrases:

Phrase	Language
`generate audio` / `sound effect`	English
`end-to-end audio` / `image to audio`	English
`生成音频` / `语音生成`	Chinese
`端到端音频` / `图片转音频`	Chinese
`多音色对白` / `参考音频克隆` / `音效生成`	Chinese

Requires ListenHub Skills to be installed — see Getting Started.

Quick Example

Generate a 20-second clip: "Welcome to ListenHub — here is your daily briefing."

The AI collects the script, the voice, and any tuning parameters, then submits one asynchronous task and polls it until the audio is ready. You get a link to listen to and download the result.

When to Use

Plain text / sound effects

Narrate a script and let the model bake in the sound effects it describes — no voice selection needed.

Single voice

Read a script in one built-in voice or a platform voice_type.

Multi-voice dialogue

Assign 2–3 voices to a conversation, one line at a time.

Voice cloning

Clone a voice from a short reference audio clip and speak your script in it.

Image to audio

Turn a reference image into a short narrated clip.

For plain single-voice narration with an already-registered ListenHub speaker, /tts has lower latency. Use /listenhub-voice when you want sound effects, dialogue, cloning, or image-driven audio in one pass.

Modes

Plain text / sound effects

No voices and no image: the model synthesizes the script and any sound effects described in it.

Generate audio: "Rain patters on the window as a distant train rolls by."

Single voice

One voice reads the whole script: either a built-in ListenHub voice or a platform voice_type.

Read this in a warm female voice: "Here are today's headlines."

Multi-voice dialogue

Two or three voices in conversation. Each line of the script is assigned to a voice, in order, with an @音频N prefix (音频 is Chinese for "audio").

Make a two-voice dialogue: @音频1 asks a question, @音频2 answers.

Every voice in a multi-voice request must be reference-audio-capable. A built-in voice_type is single-voice only.
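The prefixing rule above can be sketched as a small helper. This is a minimal illustration, not the documented request format: it assumes N is a 1-based voice number, that the prefix is followed by a space, and that each spoken line sits on its own line of the script. The function name `build_dialogue_script` is hypothetical.

```python
from typing import List, Tuple


def build_dialogue_script(lines: List[Tuple[int, str]], voice_count: int) -> str:
    """Join (voice_number, text) pairs into one script with @音频N prefixes.

    Assumptions (not confirmed by the ListenHub docs): N is 1-based,
    a single space separates the prefix from the text, and lines are
    separated by newlines.
    """
    # The page states that dialogue mode takes two or three voices.
    if not 2 <= voice_count <= 3:
        raise ValueError("multi-voice dialogue uses 2 or 3 voices")

    out = []
    for voice_number, text in lines:
        if not 1 <= voice_number <= voice_count:
            raise ValueError(f"voice {voice_number} is outside 1..{voice_count}")
        out.append(f"@音频{voice_number} {text.strip()}")
    return "\n".join(out)


script = build_dialogue_script(
    [(1, "What is on the agenda today?"), (2, "Three short headlines.")],
    voice_count=2,
)
```

The resulting `script` string would then be sent as the text of the request, alongside the two voices in the same order.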

Voice cloning

Clone a voice from a short, publicly accessible reference clip, then speak your script in it.

Clone the voice in https://example.com/host.mp3 and read my intro.

Image to audio

Turn a reference image into a short narrated clip. Image mode cannot be combined with voices.

Describe this image as a 15-second narrated clip.

Parameters

Parameter	Options	Default
Text	Up to 1400 characters	required
Voices	1–3 built-in voices or reference clips	None (plain text)
Image	One reference image (mutually exclusive with voices)	None
Speech rate	`-50` to `100`	Model default
Loudness	`-50` to `100`	Model default
Pitch	`-12` to `12`	Model default
Format	`mp3`, `wav`, `pcm`, `ogg_opus`	`mp3`
Duration hint	`1` to `110` seconds	None
Watermark	On / off	Off

Voices and an image are mutually exclusive: a request can include voices or an image, but not both. Built-in voices are fetched from the API; ask "what voices are available?" to browse the list with demo audio.
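The limits in the table can be checked on the client before a task is submitted. The sketch below is only an illustration: the `VoiceRequest` class and its field names are hypothetical and do not reflect the actual API request fields, and the character limit is approximated with Python's `len`, which may not match how the service counts characters.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_FORMATS = {"mp3", "wav", "pcm", "ogg_opus"}


@dataclass
class VoiceRequest:
    # Hypothetical field names; see the API reference for the real ones.
    text: str
    voices: List[str] = field(default_factory=list)
    image: Optional[str] = None
    speech_rate: Optional[int] = None
    loudness: Optional[int] = None
    pitch: Optional[int] = None
    audio_format: str = "mp3"
    duration_hint: Optional[int] = None
    watermark: bool = False


def _check_range(name, value, low, high, errors):
    # None means "use the model default", so only explicit values are checked.
    if value is not None and not low <= value <= high:
        errors.append(f"{name} must be between {low} and {high}")


def validate(req: VoiceRequest) -> List[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    errors: List[str] = []
    if not req.text:
        errors.append("text is required")
    elif len(req.text) > 1400:
        errors.append("text must be at most 1400 characters")
    if len(req.voices) > 3:
        errors.append("at most 3 voices are allowed")
    if req.voices and req.image is not None:
        errors.append("voices and image are mutually exclusive")
    _check_range("speech_rate", req.speech_rate, -50, 100, errors)
    _check_range("loudness", req.loudness, -50, 100, errors)
    _check_range("pitch", req.pitch, -12, 12, errors)
    _check_range("duration_hint", req.duration_hint, 1, 110, errors)
    if req.audio_format not in ALLOWED_FORMATS:
        errors.append(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue with a request at once.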

Output

After the task reaches success, you receive:

Listen link — stream the finished audio
Audio download — say "download audio" to save it locally (the file extension follows your chosen format)
Task detail — status, billed duration, and credits charged

Generation is asynchronous: the task moves through pending → generating → uploading → success. On failure, the reason is reported and any reserved credits are refunded.
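The status progression above maps naturally onto a polling loop. The following is a minimal sketch under stated assumptions: `fetch_status` is a hypothetical callable that returns the task as a dict, and the `status` and `reason` keys are illustrative names rather than confirmed response fields. Because the page does not name the failure state, any status outside the three in-progress states and `success` is treated as a failure.

```python
import time
from typing import Callable, Dict

# In-progress states listed in the documentation, in order.
IN_PROGRESS = ("pending", "generating", "uploading")


def wait_for_task(
    fetch_status: Callable[[], Dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the task reaches success, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "success":
            return task
        if status not in IN_PROGRESS:
            # Unknown or failure status: surface whatever reason was reported.
            reason = task.get("reason", "no reason given")
            raise RuntimeError(f"task ended with status {status!r}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout} seconds")
        sleep(interval)
```

In real use, `fetch_status` would wrap the task-detail request described in the API reference. The `sleep` parameter is injectable so the loop can be tested without waiting.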

API Reference

See the ListenHub Voice API reference for the underlying endpoints, request fields, and error codes.