ListenHub Voice

Generate audio end to end with ListenHub-Voice-1.0 — narration, sound effects, multi-voice dialogue, voice cloning, or image-to-audio, from a single text script.

Turn a text script into a finished audio track with the end-to-end ListenHub-Voice-1.0 model. Unlike stitched-together TTS, the model produces one continuous take that can carry sound effects, multiple speakers, cloned voices, or narration derived from a reference image.

ListenHub Voice is currently free for a limited time. Audio generated during this period is for personal, non-commercial use only. Billing will resume when the free period ends; details will follow in a later announcement.

Trigger

Invoke this skill with /listenhub-voice, or use any of these phrases:

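Phrase                                                                                                            | Language
------------------------------------------------------------------------------------------------------------------|---------
generate audio / sound effect                                                                                     | English
end-to-end audio / image to audio                                                                                 | English
生成音频 (generate audio) / 语音生成 (speech generation)                                                          | Chinese
端到端音频 (end-to-end audio) / 图片转音频 (image to audio)                                                       | Chinese
多音色对白 (multi-voice dialogue) / 参考音频克隆 (reference audio cloning) / 音效生成 (sound effect generation)   | Chinese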

Requires ListenHub Skills to be installed — see Getting Started.

Quick Example

Generate a 20-second clip: "Welcome to ListenHub — here is your daily briefing."

The AI collects the script, voice, and any tuning, then submits one async task and polls it until the audio is ready. You get a link to listen and download.

When to Use

Plain text / sound effects

Narrate a script and let the model bake in the sound effects it describes — no voice selection needed.

Single voice

Read a script in one built-in voice or a platform voice_type.

Multi-voice dialogue

Assign 2–3 voices to a conversation, one line at a time.

Voice cloning

Clone a voice from a short reference audio clip and speak your script in it.

Image to audio

Turn a reference image into a short narrated clip.

For plain single-voice narration with an already-registered ListenHub speaker, /tts is lower latency. Use /listenhub-voice when you want sound effects, dialogue, cloning, or image-driven audio in one pass.

Modes

Plain text / sound effects: no voices and no image. The model synthesizes the script and any sound effects described in it.

Generate audio: "Rain patters on the window as a distant train rolls by."

Single voice: one voice reads the whole script, either a built-in ListenHub voice or a platform voice_type.

Read this in a warm female voice: "Here are today's headlines."

Multi-voice dialogue: two or three voices in conversation. Each line is assigned to a voice, in order, with an @音频N prefix (音频 means "audio").

Make a two-voice dialogue: @音频1 asks a question, @音频2 answers.

Every voice in a multi-voice request must be reference-audio-capable. A built-in voice_type is single-voice only.
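As a rough illustration of the prefix convention, the sketch below assembles a dialogue script and rejects voices that are single-voice only. Everything in it is an assumption made for illustration: the `Voice` class, its `reference_capable` flag, the reading of N as the 1-based position of the voice, and the use of a space and a newline as separators are not taken from the ListenHub API.

```python
from dataclasses import dataclass


@dataclass
class Voice:
    name: str
    # Hypothetical flag: a built-in voice_type would be False here.
    reference_capable: bool


def build_dialogue_script(voices, lines):
    """Return script text with an @音频N prefix on every line.

    voices: 2-3 Voice objects, in the order they would be sent.
    lines:  (voice_index, text) pairs; voice_index is 0-based.
    Assumes N in @音频N is the 1-based position of the voice.
    """
    if not 2 <= len(voices) <= 3:
        raise ValueError("multi-voice dialogue takes 2 or 3 voices")
    for voice in voices:
        if not voice.reference_capable:
            raise ValueError(f"{voice.name} is single-voice only")
    out = []
    for voice_index, text in lines:
        if not 0 <= voice_index < len(voices):
            raise ValueError(f"no voice at index {voice_index}")
        out.append(f"@音频{voice_index + 1} {text}")
    return "\n".join(out)
```

Calling `build_dialogue_script` with two reference-capable voices and the pairs `(0, "How are you?")` and `(1, "Fine.")` yields one prefixed line per speaker turn.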

Voice cloning: clone a voice from a short, publicly accessible reference clip, then speak your script in it.

Clone the voice in https://example.com/host.mp3 and read my intro.

Image to audio: turn a reference image into a short narrated clip. Image mode is mutually exclusive with voices.

Describe this image as a 15-second narrated clip.
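One way to picture how the inputs select a mode is the small sketch below. It is only an illustration: the function name, the mode strings, and the `"kind"` key used to tell a built-in voice from a reference clip are hypothetical and do not come from the ListenHub API.

```python
def infer_mode(voices=(), image=None):
    """Map the supplied inputs to one of the five modes.

    voices is a sequence of dicts with a hypothetical "kind" key:
    "builtin" for a built-in voice, "reference" for a reference clip.
    """
    if image is not None:
        if voices:
            raise ValueError("image mode is mutually exclusive with voices")
        return "image to audio"
    if not voices:
        return "plain text / sound effects"
    if len(voices) == 1:
        if voices[0]["kind"] == "reference":
            return "voice cloning"
        return "single voice"
    if len(voices) <= 3:
        return "multi-voice dialogue"
    raise ValueError("at most three voices are supported")
```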

Parameters

Parameter     | Options                                              | Default
--------------|------------------------------------------------------|------------------
Text          | Up to 1400 characters                                | required
Voices        | 1–3 built-in voices or reference clips               | None (plain text)
Image         | One reference image (mutually exclusive with voices) | None
Speech rate   | -50 to 100                                           | Model default
Loudness      | -50 to 100                                           | Model default
Pitch         | -12 to 12                                            | Model default
Format        | mp3, wav, pcm, ogg_opus                              | mp3
Duration hint | 1 to 110 seconds                                     | None
Watermark     | On / off                                             | Off

Voices and an image are mutually exclusive: send voices or an image, never both. Built-in voices are fetched from the API; ask "what voices are available?" to browse the list with demo audio.
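The limits above can be checked before a task is submitted. The following is a minimal sketch of such a check, not the skill's actual validation: the function and argument names are hypothetical, and it assumes the 1400-character limit counts Unicode code points (what Python's `len` returns).

```python
ALLOWED_FORMATS = {"mp3", "wav", "pcm", "ogg_opus"}


def validate_request(text, voices=(), image=None, speech_rate=None,
                     loudness=None, pitch=None, fmt="mp3",
                     duration_hint=None):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if not text:
        problems.append("text is required")
    elif len(text) > 1400:
        problems.append("text exceeds 1400 characters")
    if len(voices) > 3:
        problems.append("at most 3 voices")
    if voices and image is not None:
        problems.append("voices and image are mutually exclusive")
    # Numeric tuning values are optional; None means "use the model default".
    for name, value, low, high in (
        ("speech rate", speech_rate, -50, 100),
        ("loudness", loudness, -50, 100),
        ("pitch", pitch, -12, 12),
        ("duration hint", duration_hint, 1, 110),
    ):
        if value is not None and not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}")
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every issue in one pass.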

Output

After the task reaches success, you receive:

  • Listen link — stream the finished audio
  • Audio download — say "download audio" to save it locally (the file extension follows your chosen format)
  • Task detail — status, billed duration, and credits charged

Generation is asynchronous: the task moves through pending → generating → uploading → success. On failure, the reason is reported and any reserved credits are refunded.
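The polling behaviour can be sketched as a loop over those statuses. This is an illustration under stated assumptions, not the skill's implementation: `fetch_status` is a hypothetical callable standing in for whatever client call returns the task, the `"status"` and `"reason"` field names are guesses, and any status outside the in-progress set is treated as a failure because the name of the failure status is not given here.

```python
import time

IN_PROGRESS = ("pending", "generating", "uploading")


def wait_for_task(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_status() until the task succeeds, fails, or times out.

    fetch_status is a hypothetical callable returning a dict such as
    {"status": "generating"}; real field names may differ.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status()
        status = task["status"]
        if status == "success":
            return task
        if status not in IN_PROGRESS:
            # Unknown or terminal non-success status: surface the reason.
            raise RuntimeError(
                f"task ended with status {status!r}: {task.get('reason')}"
            )
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.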

API Reference

See the ListenHub Voice API reference for the underlying endpoints, request fields, and error codes.
