---
license: apache-2.0
pipeline_tag: text-to-speech
---

# Qwen3-TTS

## Overview
### Introduction

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/qwen3_tts_introduction.png" width="90%"/>
</p>
Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles to meet global application needs. In addition, the models feature strong contextual understanding, enabling adaptive control of tone, speaking rate, and emotional expression based on instructions and text semantics, and they show markedly improved robustness to noisy input text. Key features:

* **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling of speech signals. It fully preserves paralinguistic information and acoustic environmental features, enabling high-speed, high-fidelity speech reconstruction through a lightweight non-DiT architecture.
* **Universal End-to-End Architecture**: Utilizing a discrete multi-codebook LM architecture, it realizes full-information end-to-end speech modeling. This completely bypasses the information bottlenecks and cascading errors inherent in traditional LM+DiT schemes, significantly enhancing the model's versatility, generation efficiency, and performance ceiling.
* **Extreme Low-Latency Streaming Generation**: Based on the innovative Dual-Track hybrid streaming generation architecture, a single model supports both streaming and non-streaming generation. It can output the first audio packet immediately after a single character is input, with end-to-end synthesis latency as low as 97ms, meeting the rigorous demands of real-time interactive scenarios.
* **Intelligent Text Understanding and Voice Control**: Supports speech generation driven by natural language instructions, allowing for flexible control over multi-dimensional acoustic attributes such as timbre, emotion, and prosody. By deeply integrating text semantic understanding, the model adaptively adjusts tone, rhythm, and emotional expression, achieving lifelike "what you imagine is what you hear" output.


### Model Architecture

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/overview.png" width="80%"/>
</p>
### Released Models Description and Download

Below is an introduction to, and download information for, the Qwen3-TTS models released so far. Other models mentioned in the technical report will be released in the near future. Please select and download the model that fits your needs.

| Tokenizer Name | Description |
|---------------------------------|-------------|
| Qwen3-TTS-Tokenizer-12Hz | Encodes input speech into discrete codes and decodes those codes back into speech. |


| Model | Features | Language Support | Streaming | Instruction Control |
|---|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ | ✅ |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of rapid voice cloning from as little as 3 seconds of user audio; can also be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ | |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | Supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ | |
| Qwen3-TTS-12Hz-0.6B-Base | Base model capable of rapid voice cloning from as little as 3 seconds of user audio; can also be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian | ✅ | |

During model loading in the qwen-tts package or vLLM, model weights are downloaded automatically based on the model name. However, if your runtime environment cannot download weights during execution, you can use the following commands to download the model weights manually to a local directory:

```bash
# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-TTS-Tokenizer-12Hz --local_dir ./Qwen3-TTS-Tokenizer-12Hz
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local_dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local_dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign
modelscope download --model Qwen/Qwen3-TTS-12Hz-1.7B-Base --local_dir ./Qwen3-TTS-12Hz-1.7B-Base
modelscope download --model Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local_dir ./Qwen3-TTS-12Hz-0.6B-CustomVoice
modelscope download --model Qwen/Qwen3-TTS-12Hz-0.6B-Base --local_dir ./Qwen3-TTS-12Hz-0.6B-Base

# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-TTS-Tokenizer-12Hz --local-dir ./Qwen3-TTS-Tokenizer-12Hz
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-1.7B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --local-dir ./Qwen3-TTS-12Hz-1.7B-VoiceDesign
huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B-Base --local-dir ./Qwen3-TTS-12Hz-1.7B-Base
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice --local-dir ./Qwen3-TTS-12Hz-0.6B-CustomVoice
huggingface-cli download Qwen/Qwen3-TTS-12Hz-0.6B-Base --local-dir ./Qwen3-TTS-12Hz-0.6B-Base
```


## Quickstart

### Environment Setup

The easiest way to quickly use Qwen3-TTS is to install the `qwen-tts` Python package from PyPI. This pulls in the required runtime dependencies and lets you load any released Qwen3-TTS model. We recommend using a **fresh, isolated environment** to avoid dependency conflicts with existing packages. You can create a clean Python 3.12 environment like this:

```bash
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
```

then run:

```bash
pip install -U qwen-tts
```

If you want to develop or modify the code locally, install from source in editable mode:

```bash
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS
pip install -e .
```

Additionally, we recommend installing FlashAttention 2 to reduce GPU memory usage:

```bash
pip install -U flash-attn --no-build-isolation
```

If your machine has less than 96GB of RAM and many CPU cores, limit the number of parallel compilation jobs:

```bash
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
```

Also, your hardware must be compatible with FlashAttention 2; read more in the official documentation of the [FlashAttention repository](https://github.com/Dao-AILab/flash-attention). Note that FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
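Because FlashAttention 2 is an optional dependency, scripts meant to run on mixed hardware can choose the attention backend at load time. The helper below is a sketch (not part of the qwen-tts API); it falls back to PyTorch's built-in scaled-dot-product attention (`"sdpa"`) when the `flash_attn` package is not importable:

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash-attn package is installed,
    otherwise fall back to PyTorch's scaled-dot-product attention."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation())
```

The returned string can then be passed as the `attn_implementation` argument to the `from_pretrained` calls shown below.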


### Python Package Usage

After installation, you can import `Qwen3TTSModel` to run custom voice TTS, voice design, and voice cloning. The model weights can be specified either as a Hugging Face model ID (recommended) or as the path to a local directory you downloaded. For all of the `generate_*` functions below, in addition to the parameters shown and explicitly documented, you can also pass generation kwargs supported by Hugging Face Transformers' `model.generate`, e.g., `max_new_tokens`, `top_p`, etc.

#### Custom Voice Generation

For custom voice models (`Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice`), simply call `generate_custom_voice` with a single string or a list of strings, along with `language`, `speaker`, and an optional `instruct`. You can also call `model.get_supported_speakers()` and `model.get_supported_languages()` to see which speakers and languages the current model supports.

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# single inference
wavs, sr = model.generate_custom_voice(
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
    language="Chinese",  # Pass "Auto" (or omit) for automatic language adaptation; if the target language is known, set it explicitly.
    speaker="Vivian",
    instruct="用特别愤怒的语气说",  # Omit if not needed.
)
sf.write("output_custom_voice.wav", wavs[0], sr)

# batch inference
wavs, sr = model.generate_custom_voice(
    text=[
        "其实我真的有发现,我是一个特别善于观察别人情绪的人。",
        "She said she would be here by noon.",
    ],
    language=["Chinese", "English"],
    speaker=["Vivian", "Ryan"],
    instruct=["", "Very happy."],
)
sf.write("output_custom_voice_1.wav", wavs[0], sr)
sf.write("output_custom_voice_2.wav", wavs[1], sr)
```

For the `Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice` models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for the best quality, although each speaker can speak any language supported by the model.

| Speaker | Voice Description | Native Language |
| --- | --- | --- |
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |
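Since the recommendation above pairs each speaker with its native language, the table can be captured as a small lookup in application code. The dict and helper below are illustrative only (not part of the qwen-tts package):

```python
# Speaker metadata from the table above: speaker name -> native language.
SPEAKER_NATIVE_LANGUAGE = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese",    # Beijing dialect
    "Eric": "Chinese",     # Sichuan dialect
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}

def native_speakers(language: str) -> list[str]:
    """Return the speakers whose native language matches `language`."""
    return [s for s, lang in SPEAKER_NATIVE_LANGUAGE.items() if lang == language]

print(native_speakers("English"))  # ['Ryan', 'Aiden']
```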

#### Voice Design

For the voice design model (`Qwen3-TTS-12Hz-1.7B-VoiceDesign`), use `generate_voice_design` to provide the target text and a natural-language `instruct` description.

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# single inference
wavs, sr = model.generate_voice_design(
    text="哥哥,你回来啦,人家等了你好久好久了,要抱抱!",
    language="Chinese",
    instruct="体现撒娇稚嫩的萝莉女声,音调偏高且起伏明显,营造出黏人、做作又刻意卖萌的听觉效果。",
)
sf.write("output_voice_design.wav", wavs[0], sr)

# batch inference
wavs, sr = model.generate_voice_design(
    text=[
        "哥哥,你回来啦,人家等了你好久好久了,要抱抱!",
        "It's in the top drawer... wait, it's empty? No way, that's impossible! I'm sure I put it there!",
    ],
    language=["Chinese", "English"],
    instruct=[
        "体现撒娇稚嫩的萝莉女声,音调偏高且起伏明显,营造出黏人、做作又刻意卖萌的听觉效果。",
        "Speak in an incredulous tone, but with a hint of panic beginning to creep into your voice.",
    ],
)
sf.write("output_voice_design_1.wav", wavs[0], sr)
sf.write("output_voice_design_2.wav", wavs[1], sr)
```

#### Voice Clone

For the voice clone models (`Qwen3-TTS-12Hz-1.7B/0.6B-Base`), to clone a voice and synthesize new content you just need to provide a reference audio clip (`ref_audio`) along with its transcript (`ref_text`). `ref_audio` can be a local file path, a URL, a base64 string, or a `(numpy_array, sample_rate)` tuple. If you set `x_vector_only_mode=True`, only the speaker embedding is used, so `ref_text` is not required, but cloning quality may be reduced.
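The accepted `ref_audio` formats can be illustrated with a small normalization sketch. This is not qwen-tts's internal loader, just one plausible way to turn any of the four input kinds into a `(waveform, sample_rate)` pair; the base64 branch assumes the string encodes raw audio-file bytes:

```python
import base64
import io
import os
from urllib.parse import urlparse

import numpy as np

def load_ref_audio(ref_audio):
    """Normalize a reference-audio input -- local path, URL, base64 string,
    or (numpy_array, sample_rate) tuple -- into (waveform, sample_rate)."""
    if isinstance(ref_audio, tuple):  # (numpy_array, sample_rate)
        wav, sr = ref_audio
        return np.asarray(wav, dtype=np.float32), int(sr)
    if isinstance(ref_audio, str):
        import soundfile as sf  # only needed for file-like inputs
        if urlparse(ref_audio).scheme in ("http", "https"):  # URL
            from urllib.request import urlopen
            wav, sr = sf.read(io.BytesIO(urlopen(ref_audio).read()), dtype="float32")
        elif os.path.exists(ref_audio):  # local file path
            wav, sr = sf.read(ref_audio, dtype="float32")
        else:  # assume base64-encoded audio bytes
            wav, sr = sf.read(io.BytesIO(base64.b64decode(ref_audio)), dtype="float32")
        return wav, sr
    raise TypeError(f"unsupported ref_audio type: {type(ref_audio)!r}")
```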

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

wavs, sr = model.generate_voice_clone(
    text="I am solving the equation: x = [-b ± √(b²-4ac)] / 2a? Nobody can — it's a disaster (◍•͈⌔•͈◍), very sad!",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)
sf.write("output_voice_clone.wav", wavs[0], sr)
```

If you need to reuse the same reference prompt across multiple generations (to avoid recomputing prompt features), build it once with `create_voice_clone_prompt` and pass it via `voice_clone_prompt`.

```python
prompt_items = model.create_voice_clone_prompt(
    ref_audio=ref_audio,
    ref_text=ref_text,
    x_vector_only_mode=False,
)
wavs, sr = model.generate_voice_clone(
    text=["Sentence A.", "Sentence B."],
    language=["English", "English"],
    voice_clone_prompt=prompt_items,
)
sf.write("output_voice_clone_1.wav", wavs[0], sr)
sf.write("output_voice_clone_2.wav", wavs[1], sr)
```

For more examples of reusable voice clone prompts, batch cloning, and batch inference, please refer to the [example code](https://github.com/QwenLM/Qwen3-TTS/blob/main/examples/test_model_12hz_base.py). Together with the `generate_voice_clone` function documentation, those examples cover more advanced usage patterns.

#### Voice Design then Clone

If you want a designed voice that you can reuse like a cloned speaker, a practical workflow is: (1) use the **VoiceDesign** model to synthesize a short reference clip that matches your target persona; (2) feed that clip into `create_voice_clone_prompt` to build a reusable prompt; and (3) call `generate_voice_clone` with `voice_clone_prompt` to generate new content without re-extracting features every time. This is especially useful when you want a consistent character voice across many lines.

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# create a reference audio in the target style using the VoiceDesign model
design_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

ref_text = "H-hey! You dropped your... uh... calculus notebook? I mean, I think it's yours? Maybe?"
ref_instruct = "Male, 17 years old, tenor range, gaining confidence - deeper breath support now, though vowels still tighten when nervous"
ref_wavs, sr = design_model.generate_voice_design(
    text=ref_text,
    language="English",
    instruct=ref_instruct,
)
sf.write("voice_design_reference.wav", ref_wavs[0], sr)

# build a reusable clone prompt from the voice design reference
clone_model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

voice_clone_prompt = clone_model.create_voice_clone_prompt(
    ref_audio=(ref_wavs[0], sr),  # or "voice_design_reference.wav"
    ref_text=ref_text,
)

sentences = [
    "No problem! I actually... kinda finished those already? If you want to compare answers or something...",
    "What? No! I mean yes but not like... I just think you're... your titration technique is really precise!",
]

# reuse it for multiple single calls
wavs, sr = clone_model.generate_voice_clone(
    text=sentences[0],
    language="English",
    voice_clone_prompt=voice_clone_prompt,
)
sf.write("clone_single_1.wav", wavs[0], sr)

wavs, sr = clone_model.generate_voice_clone(
    text=sentences[1],
    language="English",
    voice_clone_prompt=voice_clone_prompt,
)
sf.write("clone_single_2.wav", wavs[0], sr)

# or batch generate in one call
wavs, sr = clone_model.generate_voice_clone(
    text=sentences,
    language=["English", "English"],
    voice_clone_prompt=voice_clone_prompt,
)
for i, w in enumerate(wavs):
    sf.write(f"clone_batch_{i}.wav", w, sr)
```

#### Tokenizer Encode and Decode

If you only want to encode and decode audio (e.g., for transport or training), `Qwen3TTSTokenizer` supports encoding and decoding with paths, URLs, numpy waveforms, and dict/list payloads, for example:

```python
import soundfile as sf
from qwen_tts import Qwen3TTSTokenizer

tokenizer = Qwen3TTSTokenizer.from_pretrained(
    "Qwen/Qwen3-TTS-Tokenizer-12Hz",
    device_map="cuda:0",
)

enc = tokenizer.encode("https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/tokenizer_demo_1.wav")
wavs, sr = tokenizer.decode(enc)
sf.write("decode_output.wav", wavs[0], sr)
```

For more tokenizer examples (including different input formats and batch usage), please refer to the [example code](https://github.com/QwenLM/Qwen3-TTS/blob/main/examples/test_tokenizer_12hz.py). Together with the `Qwen3TTSTokenizer` documentation, those examples cover more advanced usage patterns.

### Launch Local Web UI Demo

To launch the Qwen3-TTS Web UI demo, simply install the `qwen-tts` package and run `qwen-tts-demo`. Use the command below for help:

```bash
qwen-tts-demo --help
```

To launch the demo, you can use the following commands:

```bash
# CustomVoice model
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000
# VoiceDesign model
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign --ip 0.0.0.0 --port 8000
# Base model
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 0.0.0.0 --port 8000
```

Then open `http://<your-ip>:8000`, or access it via port forwarding in tools like VS Code.

#### Base Model HTTPS Notes

To avoid browser microphone-permission issues after deploying the server, Base model deployments should run the Gradio service over **HTTPS**, especially when accessed remotely or behind modern browsers/gateways. Use `--ssl-certfile` and `--ssl-keyfile` to enable HTTPS. First, generate a private key and a self-signed certificate (valid for 365 days):

```bash
openssl req -x509 -newkey rsa:2048 \
  -keyout key.pem -out cert.pem \
  -days 365 -nodes \
  -subj "/CN=localhost"
```

Then run the demo with HTTPS:

```bash
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify
```

Then open `https://<your-ip>:8000`. If your browser shows a warning, that is expected for self-signed certificates. For production, use a real certificate.

### DashScope API Usage

To further explore Qwen3-TTS, we encourage you to try our DashScope API for a faster and more efficient experience. For detailed API information and documentation, please refer to the following:

| API Description | API Documentation (Mainland China) | API Documentation (International) |
|------------------|-----------------------------------|------------------------------------|
| Real-time API for the Qwen3-TTS custom voice model. | [https://help.aliyun.com/zh/model-studio/qwen-tts-realtime](https://help.aliyun.com/zh/model-studio/qwen-tts-realtime) | [https://www.alibabacloud.com/help/en/model-studio/qwen-tts-realtime](https://www.alibabacloud.com/help/en/model-studio/qwen-tts-realtime) |
| Real-time API for the Qwen3-TTS voice clone model. | [https://help.aliyun.com/zh/model-studio/qwen-tts-voice-cloning](https://help.aliyun.com/zh/model-studio/qwen-tts-voice-cloning) | [https://www.alibabacloud.com/help/en/model-studio/qwen-tts-voice-cloning](https://www.alibabacloud.com/help/en/model-studio/qwen-tts-voice-cloning) |
| Real-time API for the Qwen3-TTS voice design model. | [https://help.aliyun.com/zh/model-studio/qwen-tts-voice-design](https://help.aliyun.com/zh/model-studio/qwen-tts-voice-design) | [https://www.alibabacloud.com/help/en/model-studio/qwen-tts-voice-design](https://www.alibabacloud.com/help/en/model-studio/qwen-tts-voice-design) |


## vLLM Usage

vLLM officially provides day-0 support for Qwen3-TTS! You are welcome to use vLLM-Omni for Qwen3-TTS deployment and inference. For installation and more details, please check the [vLLM-Omni official documentation](https://docs.vllm.ai/projects/vllm-omni/en/latest/getting_started/quickstart/#installation). Currently, only offline inference is supported; online serving will be supported later, and vLLM-Omni will continue to offer support and optimization for Qwen3-TTS in areas such as inference speed and streaming capabilities.

### Offline Inference

You can use vLLM-Omni to run Qwen3-TTS inference locally. We provide examples in the [vLLM-Omni repo](https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/qwen3_tts) which generate audio output:

```bash
# git clone https://github.com/vllm-project/vllm-omni.git
# cd vllm-omni/examples/offline_inference/qwen3_tts

# Run a single sample with the CustomVoice task
python end2end.py --query-type CustomVoice

# Batch sample (multiple prompts in one run) with the CustomVoice task
python end2end.py --query-type CustomVoice --use-batch-sample

# Run a single sample with the VoiceDesign task
python end2end.py --query-type VoiceDesign

# Batch sample (multiple prompts in one run) with the VoiceDesign task
python end2end.py --query-type VoiceDesign --use-batch-sample

# Run a single sample with the Base task in icl mode
python end2end.py --query-type Base --mode-tag icl
```

## Evaluation

During evaluation, we ran inference for all models with `dtype=torch.bfloat16` and `max_new_tokens=2048`. All other sampling parameters used the defaults from each checkpoint's `generate_config.json`. For the Seed-TTS and InstructTTS-Eval test sets we set `language="auto"`, while for all other test sets we explicitly passed the corresponding `language`. The detailed results are shown below.
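For reference, the WER metric reported throughout these tables is the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. A minimal implementation, shown for intuition only (production evaluation pipelines typically also normalize text before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance divided by the
    number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 4))  # 0.1667
```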


<details>
<summary>Speech Generation Benchmarks</summary>

*Zero-shot speech generation on the Seed-TTS test set. Performance is measured by Word Error Rate (WER, ↓), where lower is better.*

<table>
<thead>
<tr>
  <th style="text-align: center;">Datasets</th>
  <th style="text-align: left;">Model</th>
  <th colspan="2" style="text-align: center;">Performance</th>
</tr>
<tr style="border-bottom: 1px solid #ddd; border-top: 1px solid #ddd;">
  <td colspan="4" style="text-align: center;"><em>Content Consistency</em></td>
</tr>
</thead>
<tbody>
<tr>
  <td rowspan="14" style="text-align: center; vertical-align: middle;">SEED<br><em>test-zh</em> | <em>test-en</em></td>
  <td style="text-align: left;">Seed-TTS (Anastassiou et al., 2024)</td>
  <td style="text-align: center;">1.12</td>
  <td style="text-align: center;">2.25</td>
</tr>
<tr>
  <td style="text-align: left;">MaskGCT (Wang et al., 2024)</td>
  <td style="text-align: center;">2.27</td>
  <td style="text-align: center;">2.62</td>
</tr>
<tr>
  <td style="text-align: left;">E2 TTS (Eskimez et al., 2024)</td>
  <td style="text-align: center;">1.97</td>
  <td style="text-align: center;">2.19</td>
</tr>
<tr>
  <td style="text-align: left;">F5-TTS (Chen et al., 2024)</td>
  <td style="text-align: center;">1.56</td>
  <td style="text-align: center;">1.83</td>
</tr>
<tr>
  <td style="text-align: left;">Spark TTS (Wang et al., 2025)</td>
  <td style="text-align: center;">1.20</td>
  <td style="text-align: center;">1.98</td>
</tr>
<tr>
  <td style="text-align: left;">Llasa-8B (Ye et al., 2025b)</td>
  <td style="text-align: center;">1.59</td>
  <td style="text-align: center;">2.97</td>
</tr>
<tr>
  <td style="text-align: left;">KALL-E (Xia et al., 2024)</td>
  <td style="text-align: center;">0.96</td>
  <td style="text-align: center;">1.94</td>
</tr>
<tr>
  <td style="text-align: left;">FireRedTTS 2 (Xie et al., 2025)</td>
  <td style="text-align: center;">1.14</td>
  <td style="text-align: center;">1.95</td>
</tr>
<tr>
  <td style="text-align: left;">CosyVoice 3 (Du et al., 2025)</td>
  <td style="text-align: center;"><strong>0.71</strong></td>
  <td style="text-align: center;">1.45</td>
</tr>
<tr>
  <td style="text-align: left;">MiniMax-Speech (Zhang et al., 2025a)</td>
  <td style="text-align: center;">0.83</td>
  <td style="text-align: center;">1.65</td>
</tr>
<tr>
  <td style="text-align: left;">Qwen3-TTS-25Hz-0.6B-Base</td>
  <td style="text-align: center;">1.18</td>
  <td style="text-align: center;">1.64</td>
</tr>
<tr>
  <td style="text-align: left;">Qwen3-TTS-25Hz-1.7B-Base</td>
  <td style="text-align: center;">1.10</td>
  <td style="text-align: center;">1.49</td>
</tr>
<tr>
  <td style="text-align: left;">Qwen3-TTS-12Hz-0.6B-Base</td>
  <td style="text-align: center;">0.92</td>
  <td style="text-align: center;">1.32</td>
</tr>
<tr>
  <td style="text-align: left;">Qwen3-TTS-12Hz-1.7B-Base</td>
  <td style="text-align: center;">0.77</td>
  <td style="text-align: center;"><strong>1.24</strong></td>
</tr>
</tbody>
</table>

<br>

*Multilingual speech generation on the TTS multilingual test set. Performance is measured by Word Error Rate (WER, ↓) for content consistency and Cosine Similarity (SIM, ↑) for speaker similarity.*

<table>
<thead>
<tr>
  <th rowspan="2" style="text-align: left; vertical-align: bottom;">Language</th>
  <th colspan="2" style="text-align: center;">Qwen3-TTS-25Hz</th>
  <th colspan="2" style="text-align: center;">Qwen3-TTS-12Hz</th>
  <th rowspan="2" style="text-align: center; vertical-align: bottom;">MiniMax</th>
  <th rowspan="2" style="text-align: center; vertical-align: bottom;">ElevenLabs</th>
</tr>
<tr>
  <th style="text-align: center;">0.6B-Base</th>
  <th style="text-align: center;">1.7B-Base</th>
  <th style="text-align: center;">0.6B-Base</th>
  <th style="text-align: center;">1.7B-Base</th>
</tr>
</thead>
<tbody>
<tr>
  <td colspan="7" style="text-align: center; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;"><em>Content Consistency</em></td>
</tr>
<tr>
  <td style="text-align: left;">Chinese</td>
  <td style="text-align: center;">1.108</td>
  <td style="text-align: center;"><strong>0.777</strong></td>
  <td style="text-align: center;">1.145</td>
  <td style="text-align: center;">0.928</td>
  <td style="text-align: center;">2.252</td>
  <td style="text-align: center;">16.026</td>
</tr>
<tr>
  <td style="text-align: left;">English</td>
  <td style="text-align: center;">1.048</td>
  <td style="text-align: center;">1.014</td>
  <td style="text-align: center;"><strong>0.836</strong></td>
  <td style="text-align: center;">0.934</td>
  <td style="text-align: center;">2.164</td>
  <td style="text-align: center;">2.339</td>
</tr>
<tr>
  <td style="text-align: left;">German</td>
  <td style="text-align: center;">1.501</td>
  <td style="text-align: center;">0.960</td>
  <td style="text-align: center;">1.089</td>
  <td style="text-align: center;">1.235</td>
  <td style="text-align: center;">1.906</td>
  <td style="text-align: center;"><strong>0.572</strong></td>
</tr>
<tr>
  <td style="text-align: left;">Italian</td>
  <td style="text-align: center;">1.169</td>
  <td style="text-align: center;">1.105</td>
  <td style="text-align: center;">1.534</td>
  <td style="text-align: center;"><strong>0.948</strong></td>
  <td style="text-align: center;">1.543</td>
  <td style="text-align: center;">1.743</td>
</tr>
<tr>
  <td style="text-align: left;">Portuguese</td>
  <td style="text-align: center;">2.046</td>
  <td style="text-align: center;">1.778</td>
  <td style="text-align: center;">2.254</td>
  <td style="text-align: center;">1.526</td>
  <td style="text-align: center;">1.877</td>
  <td style="text-align: center;"><strong>1.331</strong></td>
</tr>
<tr>
  <td style="text-align: left;">Spanish</td>
  <td style="text-align: center;">2.031</td>
  <td style="text-align: center;">1.491</td>
  <td style="text-align: center;">1.491</td>
  <td style="text-align: center;">1.126</td>
  <td style="text-align: center;"><strong>1.029</strong></td>
  <td style="text-align: center;">1.084</td>
</tr>
<tr>
  <td style="text-align: left;">Japanese</td>
  <td style="text-align: center;">4.189</td>
  <td style="text-align: center;">5.121</td>
  <td style="text-align: center;">6.404</td>
  <td style="text-align: center;">3.823</td>
  <td style="text-align: center;"><strong>3.519</strong></td>
  <td style="text-align: center;">10.646</td>
</tr>
<tr>
  <td style="text-align: left;">Korean</td>
  <td style="text-align: center;">2.852</td>
  <td style="text-align: center;">2.631</td>
  <td style="text-align: center;"><strong>1.741</strong></td>
  <td style="text-align: center;">1.755</td>
  <td style="text-align: center;">1.747</td>
  <td style="text-align: center;">1.865</td>
</tr>
<tr>
  <td style="text-align: left;">French</td>
  <td style="text-align: center;">2.852</td>
  <td style="text-align: center;"><strong>2.631</strong></td>
  <td style="text-align: center;">2.931</td>
  <td style="text-align: center;">2.858</td>
  <td style="text-align: center;">4.099</td>
  <td style="text-align: center;">5.216</td>
</tr>
<tr>
  <td style="text-align: left;">Russian</td>
  <td style="text-align: center;">5.957</td>
  <td style="text-align: center;">4.535</td>
  <td style="text-align: center;">4.458</td>
  <td style="text-align: center;"><strong>3.212</strong></td>
  <td style="text-align: center;">4.281</td>
  <td style="text-align: center;">3.878</td>
</tr>
<tr style="border-top: 1px solid #ddd;">
  <td colspan="7" style="text-align: center; border-bottom: 1px solid #ddd;"><em>Speaker Similarity</em></td>
</tr>
<tr>
  <td style="text-align: left;">Chinese</td>
  <td style="text-align: center;">0.797</td>
  <td style="text-align: center;">0.796</td>
  <td style="text-align: center;"><strong>0.811</strong></td>
  <td style="text-align: center;">0.799</td>
  <td style="text-align: center;">0.780</td>
  <td style="text-align: center;">0.677</td>
</tr>
<tr>
  <td style="text-align: left;">English</td>
  <td style="text-align: center;">0.811</td>
  <td style="text-align: center;">0.815</td>
  <td style="text-align: center;"><strong>0.829</strong></td>
  <td style="text-align: center;">0.775</td>
  <td style="text-align: center;">0.756</td>
  <td style="text-align: center;">0.613</td>
</tr>
<tr>
  <td style="text-align: left;">German</td>
  <td style="text-align: center;">0.749</td>
  <td style="text-align: center;">0.737</td>
  <td style="text-align: center;">0.769</td>
  <td style="text-align: center;"><strong>0.775</strong></td>
  <td style="text-align: center;">0.733</td>
  <td style="text-align: center;">0.614</td>
</tr>
<tr>
  <td style="text-align: left;">Italian</td>
  <td style="text-align: center;">0.722</td>
  <td style="text-align: center;">0.718</td>
  <td style="text-align: center;">0.792</td>
  <td style="text-align: center;"><strong>0.817</strong></td>
  <td style="text-align: center;">0.699</td>
  <td style="text-align: center;">0.579</td>
</tr>
<tr>
  <td style="text-align: left;">Portuguese</td>
  <td style="text-align: center;">0.790</td>
  <td style="text-align: center;">0.783</td>
  <td style="text-align: center;">0.794</td>
  <td style="text-align: center;"><strong>0.817</strong></td>
  <td style="text-align: center;">0.805</td>
  <td style="text-align: center;">0.711</td>
</tr>
<tr>
  <td style="text-align: left;">Spanish</td>
  <td style="text-align: center;">0.732</td>
  <td style="text-align: center;">0.731</td>
  <td style="text-align: center;">0.812</td>
  <td style="text-align: center;"><strong>0.814</strong></td>
  <td style="text-align: center;">0.762</td>
  <td style="text-align: center;">0.615</td>
</tr>
<tr>
  <td style="text-align: left;">Japanese</td>
  <td style="text-align: center;"><strong>0.810</strong></td>
  <td style="text-align: center;">0.807</td>
  <td style="text-align: center;">0.798</td>
  <td style="text-align: center;">0.788</td>
  <td style="text-align: center;">0.776</td>
  <td style="text-align: center;">0.738</td>
</tr>
<tr>
  <td style="text-align: left;">Korean</td>
  <td style="text-align: center;"><strong>0.824</strong></td>
  <td style="text-align: center;">0.814</td>
  <td style="text-align: center;">0.812</td>
  <td style="text-align: center;">0.799</td>
706 <td style="text-align: center;">0.779</td>
707 <td style="text-align: center;">0.700</td>
708 </tr>
709 <tr>
710 <td style="text-align: left;">French</td>
711 <td style="text-align: center;">0.698</td>
712 <td style="text-align: center;">0.703</td>
713 <td style="text-align: center;">0.700</td>
714 <td style="text-align: center;"><strong>0.714</strong></td>
715 <td style="text-align: center;">0.628</td>
716 <td style="text-align: center;">0.535</td>
717 </tr>
718 <tr>
719 <td style="text-align: left;">Russian</td>
720 <td style="text-align: center;">0.734</td>
721 <td style="text-align: center;">0.744</td>
722 <td style="text-align: center;">0.781</td>
723 <td style="text-align: center;"><strong>0.792</strong></td>
724 <td style="text-align: center;">0.761</td>
725 <td style="text-align: center;">0.676</td>
726 </tr>
727 </tbody>
728 </table>
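Speaker Similarity scores like those above are conventionally the cosine similarity between speaker embeddings extracted from the generated and reference audio by a speaker-verification encoder. A minimal sketch of the final scoring step, assuming the embeddings have already been extracted (the encoder itself is not specified here):

```python
import math

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two fixed-size speaker embeddings.

    emb_a / emb_b are plain lists of floats, standing in for vectors
    produced by a speaker-verification encoder (hypothetical here).
    """
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    return dot / (norm_a * norm_b)
```

Identical embeddings score 1.0 and orthogonal ones 0.0, which is why the table values fall in the 0–1 range.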

<br>

*Cross-lingual speech generation on the Cross-Lingual benchmark. Performance is measured by Mixed Error Rate (WER for English, CER for others, ↓).*

<table>
<thead>
<tr>
<th style="text-align: left;">Task</th>
<th style="text-align: center;">Qwen3-TTS-25Hz-1.7B-Base</th>
<th style="text-align: center;">Qwen3-TTS-12Hz-1.7B-Base</th>
<th style="text-align: center;">CosyVoice3</th>
<th style="text-align: center;">CosyVoice2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">en-to-zh</td>
<td style="text-align: center;">5.66</td>
<td style="text-align: center;"><strong>4.77</strong></td>
<td style="text-align: center;">5.09</td>
<td style="text-align: center;">13.5</td>
</tr>
<tr>
<td style="text-align: left;">ja-to-zh</td>
<td style="text-align: center;">3.92</td>
<td style="text-align: center;">3.43</td>
<td style="text-align: center;"><strong>3.05</strong></td>
<td style="text-align: center;">48.1</td>
</tr>
<tr>
<td style="text-align: left;">ko-to-zh</td>
<td style="text-align: center;">1.14</td>
<td style="text-align: center;">1.08</td>
<td style="text-align: center;"><strong>1.06</strong></td>
<td style="text-align: center;">7.70</td>
</tr>
<tr style="border-top: 1px solid #ddd;">
<td style="text-align: left;">zh-to-en</td>
<td style="text-align: center;">2.91</td>
<td style="text-align: center;"><strong>2.77</strong></td>
<td style="text-align: center;">2.98</td>
<td style="text-align: center;">6.47</td>
</tr>
<tr>
<td style="text-align: left;">ja-to-en</td>
<td style="text-align: center;">3.95</td>
<td style="text-align: center;"><strong>3.04</strong></td>
<td style="text-align: center;">4.20</td>
<td style="text-align: center;">17.1</td>
</tr>
<tr>
<td style="text-align: left;">ko-to-en</td>
<td style="text-align: center;">3.48</td>
<td style="text-align: center;"><strong>3.09</strong></td>
<td style="text-align: center;">4.19</td>
<td style="text-align: center;">11.2</td>
</tr>
<tr style="border-top: 1px solid #ddd;">
<td style="text-align: left;">zh-to-ja</td>
<td style="text-align: center;">9.29</td>
<td style="text-align: center;">8.40</td>
<td style="text-align: center;"><strong>7.08</strong></td>
<td style="text-align: center;">13.1</td>
</tr>
<tr>
<td style="text-align: left;">en-to-ja</td>
<td style="text-align: center;">7.74</td>
<td style="text-align: center;">7.21</td>
<td style="text-align: center;"><strong>6.80</strong></td>
<td style="text-align: center;">14.9</td>
</tr>
<tr>
<td style="text-align: left;">ko-to-ja</td>
<td style="text-align: center;">4.17</td>
<td style="text-align: center;"><strong>3.67</strong></td>
<td style="text-align: center;">3.93</td>
<td style="text-align: center;">5.86</td>
</tr>
<tr style="border-top: 1px solid #ddd;">
<td style="text-align: left;">zh-to-ko</td>
<td style="text-align: center;">8.12</td>
<td style="text-align: center;"><strong>4.82</strong></td>
<td style="text-align: center;">14.4</td>
<td style="text-align: center;">24.8</td>
</tr>
<tr>
<td style="text-align: left;">en-to-ko</td>
<td style="text-align: center;">6.83</td>
<td style="text-align: center;"><strong>5.14</strong></td>
<td style="text-align: center;">5.87</td>
<td style="text-align: center;">21.9</td>
</tr>
<tr>
<td style="text-align: left;">ja-to-ko</td>
<td style="text-align: center;">6.86</td>
<td style="text-align: center;"><strong>5.59</strong></td>
<td style="text-align: center;">7.92</td>
<td style="text-align: center;">21.5</td>
</tr>
</tbody>
</table>
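The Mixed Error Rate used here is the standard ASR-based content metric: the synthesized audio is transcribed, and the transcript is aligned to the input text by edit distance, counted over words for English (WER) and over characters for the other languages (CER). A minimal sketch of the scoring arithmetic (tokenization is simplified; real evaluations also normalize punctuation and casing first):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (classic DP)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

def mixed_error_rate(ref_text, hyp_text, lang):
    """WER over words for English, CER over characters otherwise."""
    if lang == "en":
        ref, hyp = ref_text.split(), hyp_text.split()
    else:
        ref = list(ref_text.replace(" ", ""))
        hyp = list(hyp_text.replace(" ", ""))
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

For example, one substituted word out of three reference words gives a WER of 1/3, and one wrong character out of four reference characters gives a CER of 0.25.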

<br>

*Controllable speech generation on InstructTTSEval. Performance is measured by Attribute Perception and Synthesis accuracy (APS), Description-Speech Consistency (DSD), and Response Precision (RP).*

<table>
<thead>
<tr>
<th rowspan="2" style="text-align: left; vertical-align: bottom;">Type</th>
<th rowspan="2" style="text-align: left; vertical-align: bottom;">Model</th>
<th colspan="3" style="text-align: center;">InstructTTSEval-ZH</th>
<th colspan="3" style="text-align: center;">InstructTTSEval-EN</th>
</tr>
<tr>
<th style="text-align: center;">APS (↑)</th>
<th style="text-align: center;">DSD (↑)</th>
<th style="text-align: center;">RP (↑)</th>
<th style="text-align: center;">APS (↑)</th>
<th style="text-align: center;">DSD (↑)</th>
<th style="text-align: center;">RP (↑)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="5" style="text-align: left; vertical-align: middle;"><em>Target<br>Speaker</em></td>
<td style="text-align: left;">Gemini-flash</td>
<td style="text-align: center;">88.2</td>
<td style="text-align: center;"><strong>90.9</strong></td>
<td style="text-align: center;"><strong>77.3</strong></td>
<td style="text-align: center;"><strong>92.3</strong></td>
<td style="text-align: center;"><strong>93.8</strong></td>
<td style="text-align: center;"><strong>80.1</strong></td>
</tr>
<tr>
<td style="text-align: left;">Gemini-pro</td>
<td style="text-align: center;"><strong>89.0</strong></td>
<td style="text-align: center;">90.1</td>
<td style="text-align: center;">75.5</td>
<td style="text-align: center;">87.6</td>
<td style="text-align: center;">86.0</td>
<td style="text-align: center;">67.2</td>
</tr>
<tr>
<td style="text-align: left;">Qwen3-TTS-25Hz-1.7B-CustomVoice</td>
<td style="text-align: center;">83.1</td>
<td style="text-align: center;">75.0</td>
<td style="text-align: center;">63.0</td>
<td style="text-align: center;">79.0</td>
<td style="text-align: center;">82.8</td>
<td style="text-align: center;">69.3</td>
</tr>
<tr>
<td style="text-align: left;">Qwen3-TTS-12Hz-1.7B-CustomVoice</td>
<td style="text-align: center;">83.0</td>
<td style="text-align: center;">77.8</td>
<td style="text-align: center;">61.2</td>
<td style="text-align: center;">77.3</td>
<td style="text-align: center;">77.1</td>
<td style="text-align: center;">63.7</td>
</tr>
<tr>
<td style="text-align: left;">GPT-4o-mini-tts</td>
<td style="text-align: center;">54.9</td>
<td style="text-align: center;">52.3</td>
<td style="text-align: center;">46.0</td>
<td style="text-align: center;">76.4</td>
<td style="text-align: center;">74.3</td>
<td style="text-align: center;">54.8</td>
</tr>
<tr style="border-top: 1px solid #ddd;">
<td rowspan="9" style="text-align: left; vertical-align: middle;"><em>Voice<br>Design</em></td>
<td style="text-align: left;">Qwen3-TTS-12Hz-1.7B-VD</td>
<td style="text-align: center;"><strong>85.2</strong></td>
<td style="text-align: center;"><strong>81.1</strong></td>
<td style="text-align: center;"><strong>65.1</strong></td>
<td style="text-align: center;">82.9</td>
<td style="text-align: center;"><strong>82.4</strong></td>
<td style="text-align: center;"><strong>68.4</strong></td>
</tr>
<tr>
<td style="text-align: left;">Mimo-Audio-7B-Instruct (Zhang et al., 2025b)</td>
<td style="text-align: center;">75.7</td>
<td style="text-align: center;">74.3</td>
<td style="text-align: center;">61.5</td>
<td style="text-align: center;">80.6</td>
<td style="text-align: center;">77.6</td>
<td style="text-align: center;">59.5</td>
</tr>
<tr>
<td style="text-align: left;">VoiceSculptor (Hu et al., 2026)</td>
<td style="text-align: center;">75.7</td>
<td style="text-align: center;">64.7</td>
<td style="text-align: center;">61.5</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
</tr>
<tr>
<td style="text-align: left;">Hume</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>83.0</strong></td>
<td style="text-align: center;">75.3</td>
<td style="text-align: center;">54.3</td>
</tr>
<tr>
<td style="text-align: left;">VoxInstruct (Zhou et al., 2024)</td>
<td style="text-align: center;">47.5</td>
<td style="text-align: center;">52.3</td>
<td style="text-align: center;">42.6</td>
<td style="text-align: center;">54.9</td>
<td style="text-align: center;">57.0</td>
<td style="text-align: center;">39.3</td>
</tr>
<tr>
<td style="text-align: left;">Parler-tts-mini (Lyth & King, 2024)</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">63.4</td>
<td style="text-align: center;">48.7</td>
<td style="text-align: center;">28.6</td>
</tr>
<tr>
<td style="text-align: left;">Parler-tts-large (Lyth & King, 2024)</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">60.0</td>
<td style="text-align: center;">45.9</td>
<td style="text-align: center;">31.2</td>
</tr>
<tr>
<td style="text-align: left;">PromptTTS (Guo et al., 2023)</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">64.3</td>
<td style="text-align: center;">47.2</td>
<td style="text-align: center;">31.4</td>
</tr>
<tr>
<td style="text-align: left;">PromptStyle (Liu et al., 2023)</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">57.4</td>
<td style="text-align: center;">46.4</td>
<td style="text-align: center;">30.9</td>
</tr>
</tbody>
</table>

<br>

*Target-speaker multilingual speech generation on the TTS multilingual test set. Performance is measured by Word Error Rate (WER, ↓).*

<table>
<thead>
<tr>
<th rowspan="2" style="text-align: left; vertical-align: bottom;">Language</th>
<th colspan="2" style="text-align: center;">Qwen3-TTS-25Hz</th>
<th colspan="2" style="text-align: center;">Qwen3-TTS-12Hz</th>
<th rowspan="2" style="text-align: center; vertical-align: bottom;">GPT-4o-Audio<br>Preview</th>
</tr>
<tr>
<th style="text-align: center;">0.6B-CustomVoice</th>
<th style="text-align: center;">1.7B-CustomVoice</th>
<th style="text-align: center;">0.6B-CustomVoice</th>
<th style="text-align: center;">1.7B-CustomVoice</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">Chinese</td>
<td style="text-align: center;">0.874</td>
<td style="text-align: center;"><strong>0.708</strong></td>
<td style="text-align: center;">0.944</td>
<td style="text-align: center;">0.903</td>
<td style="text-align: center;">3.519</td>
</tr>
<tr>
<td style="text-align: left;">English</td>
<td style="text-align: center;">1.332</td>
<td style="text-align: center;">0.936</td>
<td style="text-align: center;">1.188</td>
<td style="text-align: center;"><strong>0.899</strong></td>
<td style="text-align: center;">2.197</td>
</tr>
<tr>
<td style="text-align: left;">German</td>
<td style="text-align: center;">0.990</td>
<td style="text-align: center;"><strong>0.634</strong></td>
<td style="text-align: center;">2.722</td>
<td style="text-align: center;">1.057</td>
<td style="text-align: center;">1.161</td>
</tr>
<tr>
<td style="text-align: left;">Italian</td>
<td style="text-align: center;">1.861</td>
<td style="text-align: center;">1.271</td>
<td style="text-align: center;">2.545</td>
<td style="text-align: center;">1.362</td>
<td style="text-align: center;"><strong>1.194</strong></td>
</tr>
<tr>
<td style="text-align: left;">Portuguese</td>
<td style="text-align: center;">1.728</td>
<td style="text-align: center;">1.854</td>
<td style="text-align: center;">3.219</td>
<td style="text-align: center;">2.681</td>
<td style="text-align: center;"><strong>1.504</strong></td>
</tr>
<tr>
<td style="text-align: left;">Spanish</td>
<td style="text-align: center;">1.309</td>
<td style="text-align: center;">1.284</td>
<td style="text-align: center;"><strong>1.154</strong></td>
<td style="text-align: center;">1.330</td>
<td style="text-align: center;">4.000</td>
</tr>
<tr>
<td style="text-align: left;">Japanese</td>
<td style="text-align: center;"><strong>3.875</strong></td>
<td style="text-align: center;">4.518</td>
<td style="text-align: center;">6.877</td>
<td style="text-align: center;">4.924</td>
<td style="text-align: center;">5.001</td>
</tr>
<tr>
<td style="text-align: left;">Korean</td>
<td style="text-align: center;">2.202</td>
<td style="text-align: center;">2.274</td>
<td style="text-align: center;">3.053</td>
<td style="text-align: center;"><strong>1.741</strong></td>
<td style="text-align: center;">2.763</td>
</tr>
<tr>
<td style="text-align: left;">French</td>
<td style="text-align: center;">3.865</td>
<td style="text-align: center;"><strong>3.080</strong></td>
<td style="text-align: center;">3.841</td>
<td style="text-align: center;">3.781</td>
<td style="text-align: center;">3.605</td>
</tr>
<tr>
<td style="text-align: left;">Russian</td>
<td style="text-align: center;">6.529</td>
<td style="text-align: center;"><strong>4.444</strong></td>
<td style="text-align: center;">5.809</td>
<td style="text-align: center;">4.734</td>
<td style="text-align: center;">5.250</td>
</tr>
</tbody>
</table>

<br>

*Long speech generation results. Performance is measured by Word Error Rate (WER, ↓).*

<table>
<thead>
<tr>
<th style="text-align: center;">Datasets</th>
<th style="text-align: left;">Model</th>
<th colspan="2" style="text-align: center;">Performance</th>
</tr>
<tr style="border-bottom: 1px solid #ddd; border-top: 1px solid #ddd;">
<td colspan="4" style="text-align: center;"><em>Content Consistency</em></td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="5" style="text-align: center; vertical-align: middle;"><em>long-zh</em> | <em>long-en</em></td>
<td style="text-align: left;">Higgs-Audio-v2 (chunk) (Boson AI, 2025)</td>
<td style="text-align: center;">5.505</td>
<td style="text-align: center;">6.917</td>
</tr>
<tr>
<td style="text-align: left;">VibeVoice (Peng et al., 2025)</td>
<td style="text-align: center;">22.619</td>
<td style="text-align: center;">1.780</td>
</tr>
<tr>
<td style="text-align: left;">VoxCPM (Zhou et al., 2025)</td>
<td style="text-align: center;">4.835</td>
<td style="text-align: center;">7.474</td>
</tr>
<tr>
<td style="text-align: left;">Qwen3-TTS-25Hz-1.7B-CustomVoice</td>
<td style="text-align: center;"><strong>1.517</strong></td>
<td style="text-align: center;"><strong>1.225</strong></td>
</tr>
<tr>
<td style="text-align: left;">Qwen3-TTS-12Hz-1.7B-CustomVoice</td>
<td style="text-align: center;">2.356</td>
<td style="text-align: center;">2.812</td>
</tr>
</tbody>
</table>
</details>


<details>
<summary>Speech Tokenizer Benchmarks</summary>

*Comparison between different supervised semantic speech tokenizers on the ASR task.*

<table>
<thead>
<tr>
<th style="text-align: left;">Model</th>
<th style="text-align: center;">Codebook Size</th>
<th style="text-align: center;">FPS</th>
<th style="text-align: center;">C.V. EN</th>
<th style="text-align: center;">C.V. CN</th>
<th style="text-align: center;">Fleurs EN</th>
<th style="text-align: center;">Fleurs CN</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">S3 Tokenizer (VQ) (Du et al., 2024a)</td>
<td style="text-align: center;">4096</td>
<td style="text-align: center;">50</td>
<td style="text-align: center;">12.06</td>
<td style="text-align: center;">15.38</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
</tr>
<tr>
<td style="text-align: left;">S3 Tokenizer (VQ) (Du et al., 2024a)</td>
<td style="text-align: center;">4096</td>
<td style="text-align: center;">25</td>
<td style="text-align: center;">11.56</td>
<td style="text-align: center;">18.26</td>
<td style="text-align: center;">7.65</td>
<td style="text-align: center;">5.03</td>
</tr>
<tr>
<td style="text-align: left;">S3 Tokenizer (FSQ) (Du et al., 2024a)</td>
<td style="text-align: center;">6561</td>
<td style="text-align: center;">25</td>
<td style="text-align: center;">10.67</td>
<td style="text-align: center;"><strong>7.29</strong></td>
<td style="text-align: center;">6.58</td>
<td style="text-align: center;">4.43</td>
</tr>
<tr>
<td style="text-align: left;">Qwen-TTS-Tokenizer-25Hz (Stage 1)</td>
<td style="text-align: center;">32768</td>
<td style="text-align: center;">25</td>
<td style="text-align: center;"><strong>7.51</strong></td>
<td style="text-align: center;">10.73</td>
<td style="text-align: center;"><strong>3.07</strong></td>
<td style="text-align: center;"><strong>4.23</strong></td>
</tr>
<tr>
<td style="text-align: left;">Qwen-TTS-Tokenizer-25Hz (Stage 2)</td>
<td style="text-align: center;">32768</td>
<td style="text-align: center;">25</td>
<td style="text-align: center;">10.40</td>
<td style="text-align: center;">14.99</td>
<td style="text-align: center;">4.14</td>
<td style="text-align: center;">4.67</td>
</tr>
</tbody>
</table>

<br>

*Comparison between different semantic-related speech tokenizers.*

<table>
<thead>
<tr>
<th style="text-align: left;">Model</th>
<th style="text-align: center;">NQ</th>
<th style="text-align: center;">Codebook Size</th>
<th style="text-align: center;">FPS</th>
<th style="text-align: center;">PESQ_WB</th>
<th style="text-align: center;">PESQ_NB</th>
<th style="text-align: center;">STOI</th>
<th style="text-align: center;">UTMOS</th>
<th style="text-align: center;">SIM</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">SpeechTokenizer (Zhang et al., 2023a)</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">1024</td>
<td style="text-align: center;">50</td>
<td style="text-align: center;">2.60</td>
<td style="text-align: center;">3.05</td>
<td style="text-align: center;">0.92</td>
<td style="text-align: center;">3.90</td>
<td style="text-align: center;">0.85</td>
</tr>
<tr>
<td style="text-align: left;">X-codec (Ye et al., 2025a)</td>
<td style="text-align: center;">2</td>
<td style="text-align: center;">1024</td>
<td style="text-align: center;">50</td>
<td style="text-align: center;">2.68</td>
<td style="text-align: center;">3.27</td>
<td style="text-align: center;">0.86</td>
<td style="text-align: center;">4.11</td>
<td style="text-align: center;">0.84</td>
</tr>
<tr>
<td style="text-align: left;">X-codec 2 (Ye et al., 2025b)</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">65536</td>
<td style="text-align: center;">50</td>
<td style="text-align: center;">2.43</td>
<td style="text-align: center;">3.04</td>
<td style="text-align: center;">0.92</td>
<td style="text-align: center;">4.13</td>
<td style="text-align: center;">0.82</td>
</tr>
<tr>
<td style="text-align: left;">XY-Tokenizer (Gong et al., 2025)</td>
<td style="text-align: center;">8</td>
<td style="text-align: center;">1024</td>
<td style="text-align: center;">12.5</td>
<td style="text-align: center;">2.41</td>
<td style="text-align: center;">3.00</td>
<td style="text-align: center;">0.91</td>
<td style="text-align: center;">3.98</td>
<td style="text-align: center;">0.83</td>
</tr>
<tr>
<td style="text-align: left;">Mimi (Défossez et al., 2024)</td>
<td style="text-align: center;">16</td>
<td style="text-align: center;">2048</td>
<td style="text-align: center;">12.5</td>
<td style="text-align: center;">2.88</td>
<td style="text-align: center;">3.42</td>
<td style="text-align: center;">0.94</td>
<td style="text-align: center;">3.87</td>
<td style="text-align: center;">0.87</td>
</tr>
<tr>
<td style="text-align: left;">FireredTTS 2 Tokenizer (Xie et al., 2025)</td>
<td style="text-align: center;">16</td>
<td style="text-align: center;">2048</td>
<td style="text-align: center;">12.5</td>
<td style="text-align: center;">2.73</td>
<td style="text-align: center;">3.28</td>
<td style="text-align: center;">0.94</td>
<td style="text-align: center;">3.88</td>
<td style="text-align: center;">0.87</td>
</tr>
<tr>
<td style="text-align: left;">Qwen-TTS-Tokenizer-12Hz</td>
<td style="text-align: center;">16</td>
<td style="text-align: center;">2048</td>
<td style="text-align: center;">12.5</td>
<td style="text-align: center;"><strong>3.21</strong></td>
<td style="text-align: center;"><strong>3.68</strong></td>
<td style="text-align: center;"><strong>0.96</strong></td>
<td style="text-align: center;"><strong>4.16</strong></td>
<td style="text-align: center;"><strong>0.95</strong></td>
</tr>
</tbody>
</table>
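A quick way to read the NQ / Codebook Size / FPS columns together is as a token bitrate: each frame emits NQ codes of log2(codebook size) bits. A small sketch of that arithmetic (the bitrate column itself is not in the table; this is just how the three columns combine):

```python
import math

def codec_bitrate_bps(nq, codebook_size, fps):
    # total bitrate = codes per frame * bits per code * frames per second
    bits_per_code = math.log2(codebook_size)
    return nq * bits_per_code * fps

# Qwen-TTS-Tokenizer-12Hz row: 16 codebooks of 2048 entries at 12.5 fps
# -> 16 * 11 * 12.5 = 2200.0 bits/s
```

By the same arithmetic, the SpeechTokenizer row (8 codebooks of 1024 entries at 50 fps) spends 4,000 bits/s, so the 12Hz tokenizer reaches its reconstruction scores at roughly half that rate.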

</details>


## Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen3-TTS,
title={Qwen3-TTS Technical Report},
author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
journal={arXiv preprint arXiv:2601.15621},
year={2026}
}
```

<br>