
MOSS-TTSD: Open-Source AI Delivers Industry-Leading Natural Dialogue and Voice Cloning
Feiyu Shen
MOSS-TTSD is an open-source, high-quality Text-to-Spoken Dialogue (TTSD) model designed to overcome the limitations of existing TTS systems in generating natural dialogue speech with proper prosody and context. Built upon Qwen3-1.7B-base and trained on massive, carefully processed multi-speaker dialogue data, it achieves industry-leading performance in naturalness, expressiveness, and zero-shot voice cloning.
MOSS-TTSD: A Dialogue Speech Synthesis Solution
- Problem Addressed: Current TTS models struggle to synthesize high-quality dialogue speech due to a lack of overall dialogue context, leading to unnatural prosody and style shifts in complex real-world scenarios like podcasts or interviews.
- Core Solution: MOSS-TTSD (Text to Spoken Dialogue) directly generates high-quality dialogue speech from multi-speaker text inputs, accurately modeling dialogue-specific prosody and intonation.
- Key Features: Supports both Chinese and English speech synthesis, dual-speaker voice cloning, and seamless long-form speech generation, achieving industry-leading naturalness and expressiveness.
- Accessibility: The model weights, inference code, and API are fully open-source and available for commercial use.
Core Model Architecture and XY-Tokenizer
- Base Architecture: MOSS-TTSD is built upon the Qwen3-1.7B-base model and uses a fully discretized speech sequence modeling approach.
- Speech Discretization (XY-Tokenizer): It employs XY-Tokenizer, an 8-layer residual vector quantization (RVQ) audio codec that encodes semantic and acoustic information simultaneously at a low bitrate (1 kbps) with a 12.5 Hz frame rate.
- XY-Tokenizer Training: XY-Tokenizer is trained with a two-stage multi-task learning process built around a dual Whisper encoder, and outperforms other low-bitrate codecs on both semantic (lower WER) and acoustic metrics.
- Sequence Generation: Speech tokens are generated autoregressively with a multi-head delay pattern, following models like MusicGen and VoiceCraft; a sketch of this token layout appears after this list.
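
The tokenizer numbers and the delay-pattern layout can be made concrete with a short sketch. The snippet below is illustrative only: the 1024-entry codebook size is an assumption used to show why 8 RVQ layers at 12.5 Hz land at roughly 1 kbps, and the `apply_delay_pattern` helper is a generic MusicGen-style implementation, not code from the MOSS-TTSD repository.

```python
import numpy as np

# Bitrate sanity check for the reported XY-Tokenizer figures
# (8 RVQ layers, 12.5 Hz frame rate, ~1 kbps target).
# The 1024-entry codebook is an assumption, not a figure from the text.
layers, frame_rate, codebook_size = 8, 12.5, 1024
bits_per_second = layers * frame_rate * np.log2(codebook_size)
print(bits_per_second)  # 1000.0 bits/s, i.e. ~1 kbps

def apply_delay_pattern(codes: np.ndarray, pad_id: int) -> np.ndarray:
    """Shift RVQ layer k right by k steps (MusicGen-style delay pattern).

    codes: (num_layers, T) integer token matrix.
    Returns a (num_layers, T + num_layers - 1) matrix in which position t of
    layer k holds original frame t - k, so one autoregressive step can emit
    one token per head without seeing same-frame codes from deeper layers.
    """
    num_layers, T = codes.shape
    out = np.full((num_layers, T + num_layers - 1), pad_id, dtype=codes.dtype)
    for k in range(num_layers):
        out[k, k:k + T] = codes[k]
    return out

def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Invert apply_delay_pattern, recovering the (num_layers, T) frame grid."""
    num_layers, total = delayed.shape
    T = total - (num_layers - 1)
    return np.stack([delayed[k, k:k + T] for k in range(num_layers)])

# Toy example: 8 layers, 6 frames of random codes.
codes = np.random.randint(0, codebook_size, size=(8, 6))
delayed = apply_delay_pattern(codes, pad_id=-1)
assert np.array_equal(undo_delay_pattern(delayed), codes)
```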
Advanced Data Processing and Training
- Extensive Data Training: The model is trained on approximately one million hours of single-speaker speech data and 400,000 hours of dialogue speech data.
- Efficient Data Pipeline: An efficient data processing pipeline is utilized to accurately filter and label high-quality single-speaker and multi-speaker dialogue audio from vast raw audio sources.
- Superior Diarization: An internal speaker diarization model, which outperforms open-source alternatives like pyannote-speaker-diarization-3.1, is used for accurate speech segmentation and speaker labeling; a sketch of this segmentation step with the open-source baseline follows this list.
- Multi-Stage Training Strategy: Includes a large TTS pre-training phase on 1.1 million hours of Chinese and English data to enhance prosody and generalization, followed by TTSD post-training on 100,000 hours of Chinese and 270,000 hours of English dialogue data, supplemented with synthesized dialogue data to improve speaker-switch accuracy.
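
The internal diarization model is not part of the release, but the segmentation step it performs can be illustrated with the open-source pyannote baseline named above. A minimal sketch, assuming pyannote.audio is installed and a Hugging Face access token is available; the file path and token are placeholders:

```python
from pyannote.audio import Pipeline

# Open-source baseline mentioned above; MOSS-TTSD's internal diarization
# model reportedly outperforms it, but produces the same kind of output:
# speaker-labeled time segments.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HF_TOKEN",  # placeholder: your Hugging Face token
)

# Run diarization on a raw recording (placeholder path).
diarization = pipeline("raw_podcast_episode.wav")

# Each track is a (start, end) segment with a speaker label, the form needed
# to cut long recordings into speaker-attributed dialogue turns for training.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```

In MOSS-TTSD's actual pipeline this step is handled by the stronger internal model; the sketch only shows the shape of the segmentation output being produced.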
Leading Performance and Practical Applications
- Benchmarked Quality: The TTS pre-trained model achieves performance comparable to top-tier closed-source models like Seed-TTS in quality evaluations.
- Enhanced Dialogue Generation: MOSS-TTSD demonstrates more natural prosody, stronger expressiveness, and greater stability than open-source models (e.g., MoonCast), and offers zero-shot voice cloning and greater text customization than commercial alternatives (e.g., Doubao Podcast TTS).
- Versatile Capabilities: Supports diverse generation forms, including dialogue generation, audio cloning (from dialogue segments or single-speaker audio), AI podcast generation, and the incorporation of sound events (e.g., cough, laughter) within dialogues.
- Long-form Audio Synthesis: The low-bitrate codec allows training on audio segments up to 960 seconds long, enabling seamless generation of ultra-long speech without unnatural transitions between concatenated segments; the sequence-length arithmetic behind this is sketched below.
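
The effect of the low frame rate on sequence length is easy to quantify. In the small check below, only the 12.5 Hz rate and the 960-second segment length come from the text; the 25 Hz and 50 Hz rates are typical values for other codecs, included purely for comparison:

```python
# Token count per RVQ layer for a 960-second training segment at
# different codec frame rates.
segment_seconds = 960
for frame_rate in (12.5, 25.0, 50.0):
    frames = int(segment_seconds * frame_rate)
    print(f"{frame_rate:>5} Hz -> {frames:,} frames per layer")

# Output:
#  12.5 Hz -> 12,000 frames per layer
#  25.0 Hz -> 24,000 frames per layer
#  50.0 Hz -> 48,000 frames per layer
```

At 12.5 Hz, a 960-second segment stays at 12,000 frames per layer, which is what makes single-pass training and generation of very long dialogue audio practical.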