
MOSS-TTSD: Open-Source AI Delivers Industry-Leading Natural Dialogue and Voice Cloning
Feiyu Shen
MOSS-TTSD is an open-source, high-quality Text-to-Spoken Dialogue (TTSD) model designed to overcome the limitations of existing TTS systems in generating natural dialogue speech with proper prosody and context. Built upon Qwen3-1.7B-base and trained on massive, carefully processed multi-speaker dialogue data, it achieves industry-leading performance in naturalness, expressiveness, and zero-shot voice cloning.
MOSS-TTSD: A Dialogue Speech Synthesis Solution
- Problem Addressed: Current TTS models struggle to synthesize high-quality dialogue speech due to a lack of overall dialogue context, leading to unnatural prosody and style shifts in complex real-world scenarios like podcasts or interviews.
- Core Solution: MOSS-TTSD (Text to Spoken Dialogue) directly generates high-quality dialogue speech from multi-speaker text inputs, accurately modeling dialogue-specific prosody and intonation.
- Key Features: Supports both Chinese and English speech synthesis, dual-speaker voice cloning, and seamless long-form speech generation, achieving industry-leading naturalness and expressiveness.
- Accessibility: The model weights, inference code, and API are fully open-source and available for commercial use.
Core Model Architecture and XY-Tokenizer
- Base Architecture: MOSS-TTSD is built upon the Qwen3-1.7B-base model and uses a fully discretized speech sequence modeling approach.
- Speech Discretization (XY-Tokenizer): It employs XY-Tokenizer, an 8-layer residual vector quantization (RVQ) audio codec that encodes semantic and acoustic information simultaneously at a low bitrate (1 kbps) with a 12.5 Hz frame rate.
- XY-Tokenizer Training: XY-Tokenizer is trained with a two-stage multi-task learning process built around a dual Whisper encoder, and outperforms other low-bitrate codecs on both semantic (lower WER) and acoustic metrics.
- Sequence Generation: Speech tokens are generated autoregressively with a multi-head delay pattern, following models like MusicGen and VoiceCraft; a sketch of this token layout appears after this list.
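
The tokenizer numbers and the delay-pattern layout can be made concrete with a short sketch. The snippet below is illustrative only: the 1024-entry codebook size is an assumption used to show why 8 RVQ layers at 12.5 Hz land at roughly 1 kbps, and the `apply_delay_pattern` helper is a generic MusicGen-style implementation, not code from the MOSS-TTSD repository.

```python
import numpy as np

# Bitrate sanity check for the reported XY-Tokenizer figures
# (8 RVQ layers, 12.5 Hz frame rate, ~1 kbps target).
# The 1024-entry codebook is an assumption, not a figure from the text.
layers, frame_rate, codebook_size = 8, 12.5, 1024
bits_per_second = layers * frame_rate * np.log2(codebook_size)
print(bits_per_second)  # 1000.0 bits/s, i.e. ~1 kbps

def apply_delay_pattern(codes: np.ndarray, pad_id: int) -> np.ndarray:
    """Shift RVQ layer k right by k steps (MusicGen-style delay pattern).

    codes: (num_layers, T) integer token matrix.
    Returns a (num_layers, T + num_layers - 1) matrix in which position t of
    layer k holds original frame t - k, so one autoregressive step can emit
    one token per head without seeing same-frame codes from deeper layers.
    """
    num_layers, T = codes.shape
    out = np.full((num_layers, T + num_layers - 1), pad_id, dtype=codes.dtype)
    for k in range(num_layers):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay_pattern, recovering the (num_layers, T) frame grid."""
    num_layers, total = delayed.shape
    T = total - (num_layers - 1)
    return np.stack([delayed[k, k:k + T] for k in range(num_layers)])

# Toy example: 8 layers, 6 frames of random codes.
codes = np.random.randint(0, codebook_size, size=(8, 6))
delayed = apply_delay_pattern(codes, pad_id=-1)
assert np.array_equal(undo_delay_pattern(delayed), codes)
```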
Advanced Data Processing and Training
- Extensive Data Training: The model is trained on approximately one million hours of single-speaker speech data and 400,000 hours of dialogue speech data.
- Efficient Data Pipeline: An efficient data processing pipeline is utilized to accurately filter and label high-quality single-speaker and multi-speaker dialogue audio from vast raw audio sources.
- Superior Diarization: An internal speaker diarization model, which outperforms open-source alternatives like pyannote-speaker-diarization-3.1, is used for accurate speech segmentation and speaker labeling; a sketch of this segmentation step with the open-source baseline follows this list.
- Multi-Stage Training Strategy: Includes a large TTS pre-training phase on 1.1 million hours of Chinese and English data to enhance prosody and generalization, followed by TTSD post-training on 100,000 hours of Chinese and 270,000 hours of English dialogue data, supplemented with synthesized dialogue data to improve speaker-switch accuracy.
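
The internal diarization model is not part of the release, but the segmentation step it performs can be illustrated with the open-source pyannote baseline named above. A minimal sketch, assuming pyannote.audio is installed and a Hugging Face access token is available; the file path and token are placeholders:

```python
from pyannote.audio import Pipeline

# Open-source baseline mentioned above; MOSS-TTSD's internal diarization
# model reportedly outperforms it, but produces the same kind of output:
# speaker-labeled time segments.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder: your Hugging Face token
)

# Run diarization on a raw recording (placeholder path).
diarization = pipeline("raw_podcast_episode.wav")

# Each track is a (start, end) segment with a speaker label, the form needed
# to cut long recordings into speaker-attributed dialogue turns for training.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```

In MOSS-TTSD's actual pipeline this step is handled by the stronger internal model; the sketch only shows the shape of the segmentation output being produced.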
Leading Performance and Practical Applications
- Benchmarked Quality: The TTS pre-trained model achieves performance comparable to top-tier closed-source models like Seed-TTS in quality evaluations.
- Enhanced Dialogue Generation: MOSS-TTSD demonstrates more natural prosody, stronger expressiveness, and greater stability than open-source models (e.g., MoonCast), and offers zero-shot voice cloning and greater text customization than commercial alternatives (e.g., Doubao Podcast TTS).
- Versatile Capabilities: Supports diverse generation forms, including dialogue generation, audio cloning (from dialogue segments or single-speaker audio), AI podcast generation, and the incorporation of sound events (e.g., cough, laughter) within dialogues.
- Long-form Audio Synthesis: The low-bitrate codec allows training on audio segments up to 960 seconds long, enabling seamless generation of ultra-long speech without unnatural transitions between concatenated segments; the sequence-length arithmetic behind this is sketched below.
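
The effect of the low frame rate on sequence length is easy to quantify. In the small check below, only the 12.5 Hz rate and the 960-second segment length come from the text; the 25 Hz and 50 Hz rates are typical values for other codecs, included purely for comparison:

```python
# Token count per RVQ layer for a 960-second training segment at
# different codec frame rates.
segment_seconds = 960
for frame_rate in (12.5, 25.0, 50.0):
    frames = int(segment_seconds * frame_rate)
    print(f"{frame_rate:>5} Hz -> {frames:,} frames per layer")

# Output:
#  12.5 Hz -> 12,000 frames per layer
#  25.0 Hz -> 24,000 frames per layer
#  50.0 Hz -> 48,000 frames per layer
```

At 12.5 Hz, a 960-second segment stays at 12,000 frames per layer, which is what makes single-pass training and generation of very long dialogue audio practical.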