State-of-the-art multilingual text-to-speech synthesis with zero-shot voice cloning and speaking style control
We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. The larger model, TTS-1-Max, has 8.8B parameters and is designed for maximum quality and expressiveness in demanding applications. TTS-1 is the more efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL alignment to the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks while relying purely on in-context learning of the speaker's voice. Both models generate high-resolution 48 kHz speech with low latency and support 11 languages, with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.
Zero-Shot Voice Cloning: Clone any voice with just a few seconds of reference audio using in-context learning.
Multilingual Support: Native support for 11 languages (English, Chinese, Spanish, French, Korean, Dutch, Japanese, Portuguese, German, Italian, and Polish) with high-quality cross-lingual voice transfer.
Real-time Streaming: Low-latency streaming inference optimized for real-time applications.
Speaking Style and Non-Verbal Markup Support: Control over emotion, projection, and common non-verbal vocalizations.
48 kHz Audio: High-resolution audio generation with professional quality.
Application-Driven Design Choice: TTS-1 (1.6B parameters) prioritizes efficiency and speed for real-time applications and on-device deployment, while TTS-1-Max (8.8B parameters) maximizes quality and expressiveness for demanding applications where computational resources are less constrained.
The architecture of Inworld TTS-1 consists of three main components working together to achieve high-quality speech synthesis:
Figure 1: The architecture of Inworld TTS-1. The audio encoder tokenizes the reference audio into a sequence of discrete audio tokens. These tokens are concatenated with the tokenized reference text and the text to be synthesized to form a prompt for the SpeechLM. The SpeechLM autoregressively generates audio tokens, which the audio decoder then converts back into a 48 kHz waveform.
Audio Encoder: Converts reference audio into discrete tokens using the X-codec2 architecture with a 65,536-token vocabulary.
SpeechLM: LLaMA-based Transformer (1.6B or 8.8B parameters) trained with pre-training, supervised fine-tuning (SFT), and RL alignment.
Audio Decoder: Converts audio tokens back into 48 kHz waveforms, using a super-resolution module for high-quality output.
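The prompt construction described in Figure 1 can be sketched as follows. This is a minimal illustration, not the released implementation: the function names, the text vocabulary size, the shared-vocabulary offset scheme, and the ordering of the three prompt parts are all assumptions. Only the 65,536-entry audio codebook comes from the description above.

```python
from typing import List, Sequence

AUDIO_VOCAB_SIZE = 65_536   # codebook size stated for the audio encoder
TEXT_VOCAB_SIZE = 128_000   # assumption: size of the text-token ID range


def audio_to_lm_ids(audio_tokens: Sequence[int]) -> List[int]:
    """Shift codec indices above the text range of a shared vocabulary."""
    for token in audio_tokens:
        if not 0 <= token < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio token {token} is outside the codebook")
    return [TEXT_VOCAB_SIZE + token for token in audio_tokens]


def lm_ids_to_audio(lm_ids: Sequence[int]) -> List[int]:
    """Undo the shift for generated IDs; IDs in the text range are skipped."""
    return [i - TEXT_VOCAB_SIZE for i in lm_ids if i >= TEXT_VOCAB_SIZE]


def build_prompt(
    ref_text_ids: Sequence[int],
    target_text_ids: Sequence[int],
    ref_audio_tokens: Sequence[int],
) -> List[int]:
    """Concatenate the three prompt parts into one ID sequence."""
    # Assumed ordering: reference text, target text, then reference audio,
    # so that generated audio tokens continue directly from the reference.
    return (
        list(ref_text_ids)
        + list(target_text_ids)
        + audio_to_lm_ids(ref_audio_tokens)
    )


# Toy IDs standing in for real tokenizer and codec output.
prompt = build_prompt([11, 12], [13, 14, 15], [0, 42, 65_535])
print(len(prompt), prompt[-1])
```

Placing the reference audio last is one plausible layout for in-context voice cloning, since the model then continues an audio sequence in the reference speaker's voice; the actual layout may differ.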
To evaluate the final performance of Inworld TTS-1 and TTS-1-Max, we generated a benchmark dataset using Gemini 2.5 Pro, comprising 100 sentences for each of the 11 supported languages, for a total of 1,100 samples. We then synthesized speech for this dataset using both models with the same set of speakers to ensure a fair comparison.
Figure 2: Multilingual evaluation results comparing WER and SIM scores by language. Left: WER (lower is better). Right: SIM (higher is better).
The results show that TTS-1-Max consistently outperforms TTS-1 in terms of both Word Error Rate (WER) and Speaker Similarity (SIM) across all languages, demonstrating the effectiveness of the larger model in generating more accurate and higher-fidelity speech. Notably, both models achieve very high speaker similarity, indicating their robustness in voice cloning across different languages.
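For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It is an illustration under stated assumptions, not the evaluation code used here: WER is computed as word-level edit distance over whitespace-separated, lowercased tokens, and SIM is assumed to be the cosine similarity between two speaker-embedding vectors. The transcripts and embeddings would come from an ASR model and a speaker-verification model that are not specified in this section. Whitespace tokenization does not suit languages written without spaces, such as Chinese or Japanese, which need a different segmentation.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("the cat sat down", "the cat sit"))
```

Per-language scores like those in Figure 2 would then be averages of these values over the 100 sentences for each language.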
The samples compare TTS-1 and TTS-1-Max across different speakers and content. We show the six most popular languages here; voices are also available for Japanese, Portuguese, German, Italian, and Polish.
Inworld TTS-1 and TTS-1-Max excel at zero-shot voice cloning: they can clone a speaker's voice from just a few seconds of reference audio. The models use in-context learning to capture unique vocal characteristics, speaking patterns, and tonal qualities without additional training or fine-tuning. Below we demonstrate this functionality with the Inworld TTS-1-Max (8.8B) model.
Reference | Phrase | Synthesized Speech |
---|---|---|
(audio) | "This thing with...ehh...Frankie 'Fingers' has become a real heartburn, you know? He's like a cannoli where the shell is all soft and the ricotta is filled with lies. You don't just throw out a bad cannoli....no way. Instead you gotta make an example of it so the other pastries in the box know what's what. Am I right?!" | (audio) |
(audio) | "I don't know what's the matter with people: they don't learn by understanding; they learn by some other way....by rote or something. Their knowledge is so fragile." | (audio) |
(audio) | "For a perfect vinaigrette just put a spoonful of Dijon mustard in a jar, add a splash of vinegar and three times as much good olive oil. Next shake it like you're mad at it...REALLY mad at it....and finally add salt and pepper and BOOM! Done." | (audio) |
(audio) | "Alright folks...if you'll look down toward Lady Liberty's feet, you'll notice she isn't standing still; she's actually striding forward and breaking free from a broken shackle and chain. This powerful detail is a reminder that liberty is an action, not just an idea." | (audio) |
(audio) | "O Romeo, Romeo! Wherefore art thou Romeo? Deny thy father and refuse thy name; Or, if thou wilt not, be but sworn my love, And I'll no longer be a Capulet." | (audio) |
Our models support markup for fine-grained control over speaking style, emotions, and non-verbal vocalizations. They can generate natural-sounding speech with various emotional tones and vocal projections, as well as common non-verbal sounds such as laughter, sighs, and throat clearing. Below we demonstrate this functionality with the Inworld TTS-1 (1.6B) model.
Voice Model | Text without Markup | Text with Markup |
---|---|---|
Ashley (Host) | "Good morning, and welcome to another exciting episode of our podcast." | "[laughing] Good morning, and welcome to another exciting episode of our podcast." |
Ashley (Host) | "We have a truly engaging discussion lined up for you today." | "[happy] We have a truly engaging discussion lined up for you today." |
Ashley (Host) | "Hurricane Leo has intensified into a major Category 4 storm, making landfall along the Louisiana coast with ferocious winds. The storm is unleashing a life-threatening surge and torrential rain, causing widespread power outages across the region." | "[sad] Hurricane Leo has intensified into a major Category 4 storm, making landfall along the Louisiana coast with ferocious winds. The storm is unleashing a life-threatening surge and torrential rain, causing widespread power outages across the region." |
Edward (Instructor) | "I'm really tired from such a long flight, but that's the price to pay to be a world-renown inspirational speaker but let's be honest, I love it!" | "[cough] I'm really tired from such a long flight, but that's the price to pay to be a world-renown inspirational speaker....but [breathe] let's be honest, I love it! [laugh]" |
Elizabeth (Assistant) | "I am detecting a presence inside the building that is not registered in the system. My requests for identification have gone unanswered but I will send a notification once I have more information." | "[fearful] I am detecting a presence inside the building that is not registered in the system. My requests for identification have gone unanswered but I will send a notification once I have more information." |
Hades (Dark Character) | "Beware the ancient curse that plagues these cursed lands." | "Beware the ancient curse that plagues these cursed lands [sigh]." |
Julia (Friend) | "I have a secret but you have to promise to NEVER tell anyone. Do you pinky promise?" | "[whispering] I have a secret but you have to promise to NEVER tell anyone. Do you pinky promise?" |
Mark (Host) | "And that concludes our special report on economic trends." | "[surprised] And that concludes our special report on economic trends." |
Olivia (Teacher) | "Let us delve into the principles of quantum physics to better understand." | "Let us delve into the principles of quantum physics [clear_throat] to better understand." |
Sarah (Adventurer) | "May your journey be filled with thrilling victories and epic quests." | "[angry] May your journey be filled with thrilling victories and epic quests." |
Wendy (Critic) | "That's what you decided to spend your money on? What a joke." | "[disgusted] That's what you decided to spend your money on? What a joke." |
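The markup in the table above follows a simple convention: a lowercase tag in square brackets, placed at the start of the text for speaking styles (for example `[happy]`) or inline where a non-verbal sound should occur (for example `[clear_throat]`). A client might want to separate tags from plain text before sending a request, for instance to validate tags or to display a clean transcript. The helper below is a hypothetical sketch of that step. It is not part of any Inworld API, and the tag pattern is inferred only from the examples shown here.

```python
import re
from typing import List, Tuple

# Tags in the examples are lowercase words, optionally with underscores.
MARKUP_RE = re.compile(r"\[([a-z_]+)\]")


def split_markup(text: str) -> List[Tuple[str, str]]:
    """Split marked-up text into ordered ("tag", name) / ("text", content) pairs."""
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in MARKUP_RE.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("tag", match.group(1)))
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = (
    "Let us delve into the principles of quantum physics "
    "[clear_throat] to better understand."
)
for kind, value in split_markup(example):
    print(kind, value)
```

Keeping the segments in order preserves where each non-verbal sound falls relative to the surrounding words, which a plain list of tags would lose.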