Inworld TTS-1 Demo

Abstract

We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker's voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.

Zero-Shot Voice Cloning: Clone any voice with just a few seconds of reference audio using in-context learning.

Multilingual Support: Native support for 11 languages (English, Chinese, Spanish, French, Korean, Dutch, Japanese, Portuguese, German, Italian, and Polish) with high-quality cross-lingual voice transfer.

Real-time Streaming: Low-latency streaming inference optimized for real-time applications.

Speaking Style and Non-Verbal Markup Support: Control over emotion, projection, and common non-verbal vocalizations.

48 kHz Audio: High-resolution, professional-quality audio generation.

Application-Driven Design Choice: TTS-1 (1.6B parameters) prioritizes efficiency and speed for real-time applications and on-device deployment, while TTS-1-Max (8.8B parameters) maximizes quality and expressiveness for demanding applications where computational resources are less constrained.

System Architecture

The architecture of Inworld TTS-1 consists of three main components working together to achieve high-quality speech synthesis:

Figure 1: The architecture of Inworld TTS-1. The audio encoder tokenizes the reference audio into a sequence of discrete audio tokens. These tokens are concatenated with the tokenized reference text and the text to be synthesized to form a prompt for the SpeechLM. The SpeechLM autoregressively generates audio tokens, which the audio decoder then converts back into a 48 kHz waveform.

Audio Encoder

Converts reference audio into discrete tokens using the X-codec2 architecture with a 65,536-token vocabulary.

SpeechLM

A LLaMA-based Transformer (1.6B or 8.8B parameters) trained through pre-training, supervised fine-tuning (SFT), and RL alignment.

Audio Decoder

Converts audio tokens back into 48 kHz waveforms, using a super-resolution module for high-quality output.
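The prompt construction described in Figure 1 can be sketched roughly as follows. This is an illustration only, not the released implementation: the `PromptLayout` class, the `build_prompt` function, the idea of offsetting audio tokens past the text vocabulary, and the ordering of the three segments are all assumptions made for the example. Only the 65,536-entry audio vocabulary comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Sequence

AUDIO_VOCAB_SIZE = 65_536  # audio codebook size stated for the audio encoder


@dataclass(frozen=True)
class PromptLayout:
    """Hypothetical shared-vocabulary layout: text ids first, audio ids after."""

    text_vocab_size: int  # assumed size of the text tokenizer vocabulary

    def audio_to_lm_id(self, audio_token: int) -> int:
        # Shift an audio codebook index into the (assumed) shared id space.
        if not 0 <= audio_token < AUDIO_VOCAB_SIZE:
            raise ValueError(f"audio token {audio_token} is outside the codebook")
        return self.text_vocab_size + audio_token

    def lm_to_audio_id(self, lm_id: int) -> int:
        # Inverse mapping, applied to generated ids before audio decoding.
        audio_token = lm_id - self.text_vocab_size
        if not 0 <= audio_token < AUDIO_VOCAB_SIZE:
            raise ValueError(f"id {lm_id} is not an audio token")
        return audio_token


def build_prompt(
    layout: PromptLayout,
    ref_text_ids: Sequence[int],
    target_text_ids: Sequence[int],
    ref_audio_tokens: Sequence[int],
) -> List[int]:
    """Concatenate reference text, target text, and reference audio tokens.

    The segment order is an assumption; the source only says the three
    pieces are concatenated into one prompt.
    """
    audio_ids = [layout.audio_to_lm_id(t) for t in ref_audio_tokens]
    return [*ref_text_ids, *target_text_ids, *audio_ids]


layout = PromptLayout(text_vocab_size=1000)
prompt = build_prompt(layout, [1, 2], [3], [0, 65_535])
```

Under these assumptions, the SpeechLM would continue the prompt with further audio ids, which `lm_to_audio_id` maps back to codebook indices for the audio decoder.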

Multilingual Evaluation

To evaluate the final performance of Inworld TTS-1 and TTS-1-Max, we generated a benchmark dataset using Gemini 2.5 Pro, comprising 100 sentences for each of the 11 supported languages, for a total of 1,100 samples. We then synthesized speech for this dataset using both models with the same set of speakers to ensure a fair comparison.

Figure 2: Multilingual evaluation results comparing WER and SIM scores by language. Left: WER (lower is better). Right: SIM (higher is better).

The results show that TTS-1-Max consistently outperforms TTS-1 in terms of both Word Error Rate (WER) and Speaker Similarity (SIM) across all languages, demonstrating the effectiveness of the larger model in generating more accurate and higher-fidelity speech. Notably, both models achieve very high speaker similarity, indicating their robustness in voice cloning across different languages.
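For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and SIM are commonly computed. The page does not say which ASR system, text normalization, or speaker-embedding model was used, so both functions are assumptions: `word_error_rate` uses plain whitespace tokenization (unsuitable as-is for languages such as Chinese or Japanese, where a character-level rate is typical), and `cosine_similarity` assumes SIM is the cosine between two speaker-embedding vectors.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In an evaluation of this kind, the hypothesis would be an ASR transcript of the synthesized audio, and the two embeddings would come from the reference and synthesized clips.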

Multilingual Support

Compare TTS-1 and TTS-1-Max across different speakers and content. The six most popular languages (English, Chinese, Spanish, French, Korean, and Dutch) are shown below; voices are also available for Japanese, Portuguese, German, Italian, and Polish.


Speaker	Text	TTS-1	TTS-1-Max
Ashley (TV Host)	"Breaking news this morning! Scientists have made an incredible discovery about renewable energy. This breakthrough could change everything we know about solar power technology. Here are the details."
Mark (TV Host)	"Welcome back to Tech talk Tuesday! I'm so excited to share today's topic with you. Have you ever wondered how smartphones actually work? Well, buckle up because we're diving deep into the fascinating world of mobile technology."
Deborah (News Anchor)	"Tune in tomorrow for more exciting updates and exclusive interviews."
Alex (Teacher)	"Welcome to today's chemistry lesson! Did you know that water molecules are constantly moving? That's fascinating, isn't it? Now let's explore how temperature affects this molecular motion in our next experiment."
Olivia (Teacher)	"We just learned how to do long multiplication. Now let's try another example. What is 354 times 23?"
Edward (Instructor)	"Amazing! Your heart beats constantly every single day without you even thinking about it. That's incredible, right? But here's what's even more fascinating: each heartbeat sends blood flowing through your entire body instantly."
Sarah (Adventurer)	"oh no! The evil wizard has cast a spell on the kingdom. Quick! We need to find the magic crystal before midnight. Are you ready for this epic challenge? Let's go save everyone!"
Hades (Dark Character)	"Only those with true courage can face the monstrous beasts within."
Theodore (Detective)	"Welcome to Midnight Tales. Tonight's story will give you chills. Picture this...a dark empty house on a stormy night. Sarah thought she was alone, but was she really? Let's find out together."
Elizabeth (Assistant)	"Hey there! I noticed you've been working for 3 hours straight. How about taking a quick break? I can play some relaxing music or walk you through a guided meditation! What sounds good to you?"
Timothy (Customer Service)	"Your request has been processed successfully."

Speaker	Text	TTS-1	TTS-1-Max
Jing (Assistant)	"我很乐意为您提供更多帮助。" (Eng: I would be happy to provide more help.)
Xinyi (News Anchor)	"国际局势风云变幻，本台记者将为您带来详细报道。" (Eng: The international situation is changing rapidly, and our correspondent will bring you a detailed report.)
Yichen (Storyteller)	"每当夜幕降临，村庄里就会传来神秘的歌声。" (Eng: Whenever night falls, mysterious singing comes from the village.)

Speaker	Text	TTS-1	TTS-1-Max
Diego (Customer Service)	"Estamos trabajando para resolver su solicitud lo antes posible." (Eng: We are working to resolve your request as soon as possible.)
Lupita (Friend)	"¿Quieres que vayamos por un café después?" (Eng: Do you want to go for coffee later?)
Miguel (Host)	"Hoy exploraremos las profundidades de la creatividad humana." (Eng: Today we will explore the depths of human creativity.)

Speaker	Text	TTS-1	TTS-1-Max
Hélène (Friend)	"On devrait se voir bientôt pour prendre un café et discuter." (Eng: We should meet soon for a coffee and chat.)
Mathieu (Host)	"Bienvenue à notre podcast, où nous explorons les mystères de l'univers." (Eng: Welcome to our podcast, where we explore the mysteries of the universe.)

Speaker	Text	TTS-1	TTS-1-Max
Hyunwoo (Host)	"궁금한 점이 있으시면 언제든지 말씀해주세요." (Eng: If you have any questions, please feel free to ask at any time.)
Yoona (Customer Service)	"불편을 드려 죄송합니다. 바로 처리해 드릴게요." (Eng: I apologize for the inconvenience. I will take care of it immediately.)

Speaker	Text	TTS-1	TTS-1-Max
Lore (Customer Service)	"Ik begrijp uw situatie volledig, we vinden een oplossing." (Eng: I fully understand your situation, we will find a solution.)

Zero-Shot Voice Cloning

Inworld TTS-1 and TTS-1-Max excel at zero-shot voice cloning, allowing you to clone any speaker's voice using just a few seconds of reference audio. Our models leverage in-context learning to capture unique vocal characteristics, speaking patterns, and tonal qualities without requiring additional training or fine-tuning. Below we demonstrate this functionality with the TTS-1-Max (8.8B) model.

Reference	Phrase	Synthesized Speech
	"This thing with...ehh...Frankie 'Fingers' has become a real heartburn, you know? He's like a cannoli where the shell is all soft and the ricotta is filled with lies. You don't just throw out a bad cannoli....no way. Instead you gotta make an example of it so the other pastries in the box know what's what. Am I right?!"
	"I don't know what's the matter with people: they don't learn by understanding; they learn by some other way....by rote or something. Their knowledge is so fragile."
	"For a perfect vinaigrette just put a spoonful of Dijon mustard in a jar, add a splash of vinegar and three times as much good olive oil. Next shake it like you're mad at it...REALLY mad at it....and finally add salt and pepper and BOOM! Done."
	"Alright folks...if you'll look down toward Lady Liberty's feet, you'll notice she isn't standing still; she's actually striding forward and breaking free from a broken shackle and chain. This powerful detail is a reminder that liberty is an action, not just an idea."
	"O Romeo, Romeo! Wherefore art thou Romeo? Deny thy father and refuse thy name; Or, if thou wilt not, be but sworn my love, And I'll no longer be a Capulet."

Speaking Style and Non-Verbal Markup

Our models support markup for fine-grained control over speaking style, emotion, and non-verbal vocalizations. They can generate natural-sounding speech with various emotional tones and vocal projections, as well as common non-verbal sounds such as laughter, sighs, and throat clearing. Below we demonstrate this functionality with the TTS-1 (1.6B) model.

Voice Model	Text without Markup	Text with Markup
Ashley (Host)	"Good morning, and welcome to another exciting episode of our podcast."	"[laughing] Good morning, and welcome to another exciting episode of our podcast."
Ashley (Host)	"We have a truly engaging discussion lined up for you today."	"[happy] We have a truly engaging discussion lined up for you today."
Ashley (Host)	"Hurricane Leo has intensified into a major Category 4 storm, making landfall along the Louisiana coast with ferocious winds. The storm is unleashing a life-threatening surge and torrential rain, causing widespread power outages across the region."	"[sad] Hurricane Leo has intensified into a major Category 4 storm, making landfall along the Louisiana coast with ferocious winds. The storm is unleashing a life-threatening surge and torrential rain, causing widespread power outages across the region."
Edward (Instructor)	"I'm really tired from such a long flight, but that's the price to pay to be a world-renown inspirational speaker but let's be honest, I love it!"	"[cough] I'm really tired from such a long flight, but that's the price to pay to be a world-renown inspirational speaker....but [breathe] let's be honest, I love it! [laugh]"
Elizabeth (Assistant)	"I am detecting a presence inside the building that is not registered in the system. My requests for identification have gone unanswered but I will send a notification once I have more information."	"[fearful] I am detecting a presence inside the building that is not registered in the system. My requests for identification have gone unanswered but I will send a notification once I have more information."
Hades (Dark Character)	"Beware the ancient curse that plagues these cursed lands."	"Beware the ancient curse that plagues these cursed lands [sigh]."
Julia (Friend)	"I have a secret but you have to promise to NEVER tell anyone. Do you pinky promise?"	"[whispering] I have a secret but you have to promise to NEVER tell anyone. Do you pinky promise?"
Mark (Host)	"And that concludes our special report on economic trends."	"[surprised] And that concludes our special report on economic trends."
Olivia (Teacher)	"Let us delve into the principles of quantum physics to better understand."	"Let us delve into the principles of quantum physics [clear_throat] to better understand."
Sarah (Adventurer)	"May your journey be filled with thrilling victories and epic quests."	"[angry] May your journey be filled with thrilling victories and epic quests."
Wendy (Critic)	"That's what you decided to spend your money on? What a joke."	"[disgusted] That's what you decided to spend your money on? What a joke."
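The bracketed markup in the table above can be separated from the spoken text with a few lines of code. The sketch below is illustrative only: `split_markup`, `unknown_tags`, and the `DEMO_TAGS` set are hypothetical helpers, and the tag list contains just the tags that appear in these demo rows, not necessarily the full set the models accept.

```python
import re
from typing import List, Tuple

# Tags that appear in the demo table above (not an official or complete list).
DEMO_TAGS = {
    "laughing", "happy", "sad", "fearful", "whispering", "surprised",
    "angry", "disgusted", "cough", "breathe", "laugh", "sigh", "clear_throat",
}

# A tag is a lowercase word (underscores allowed) inside square brackets.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def split_markup(text: str) -> List[Tuple[str, str]]:
    """Split marked-up text into ordered ("tag", name) and ("text", words) pairs."""
    segments: List[Tuple[str, str]] = []
    position = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[position:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("tag", match.group(1)))
        position = match.end()
    tail = text[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def unknown_tags(text: str) -> List[str]:
    """Return tags in the text that do not appear in the demo table."""
    return [name for kind, name in split_markup(text)
            if kind == "tag" and name not in DEMO_TAGS]


segments = split_markup(
    "Let us delve into the principles of quantum physics "
    "[clear_throat] to better understand."
)
```

A check like `unknown_tags` could be run before synthesis to catch misspelled tags, under the assumption that unsupported tags should be flagged rather than passed to the model.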
