---
license: apache-2.0
pipeline_tag: text-to-speech
library_name: qwen-tts
tags:
- audio
- tts
- qwen
- multilingual
---

# Qwen3-TTS

<br>

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/qwen3_tts_logo.png" width="400"/>
</p>

<p align="center">
&nbsp;&nbsp;🤗 <a href="https://huggingface.co/collections/Qwen/qwen3-tts">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://modelscope.cn/collections/Qwen/Qwen3-TTS">ModelScope</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://qwen.ai/blog?id=qwen3tts-0115">Blog</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://huggingface.co/papers/2601.15621">Paper</a>&nbsp;&nbsp; | &nbsp;&nbsp;💻 <a href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>
</p>

We release **Qwen3-TTS**, a series of speech generation models developed by Qwen that supports voice cloning, voice design, ultra-high-quality human-like speech generation, and natural-language voice control.

## Overview
Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles. Key features:

* **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, the model achieves efficient acoustic compression and high-dimensional semantic modeling.
* **Universal End-to-End Architecture**: Uses a discrete multi-codebook LM architecture to bypass traditional information bottlenecks.
* **Extreme Low-Latency Streaming Generation**: Supports streaming generation with end-to-end synthesis latency as low as 97 ms.
* **Intelligent Voice Control**: Supports speech generation driven by natural language instructions for flexible control over timbre, emotion, and prosody.

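To make the tokenizer rate concrete, here is a rough back-of-the-envelope sketch. It assumes that the `12Hz` in `Qwen3-TTS-Tokenizer-12Hz` means roughly 12 token frames per second of audio, and it uses a hypothetical codebook count purely for illustration. Neither value is confirmed in this README, so consult the technical report for the real figures.

```python
import math


def token_budget(
    duration_s: float,
    frame_rate_hz: float = 12.0,  # assumption: read off the "12Hz" in the tokenizer name
    num_codebooks: int = 4,       # hypothetical value, for illustration only
) -> tuple[int, int]:
    """Return (frames, total discrete codes) needed to represent an audio clip."""
    frames = math.ceil(duration_s * frame_rate_hz)
    # A multi-codebook LM emits one code per codebook for every frame.
    return frames, frames * num_codebooks


frames, codes = token_budget(10.0)
print(frames, codes)  # 120 480
```

Under these assumptions, ten seconds of speech costs only 120 frames, which is the sense in which a low frame rate gives efficient acoustic compression.
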
## Quickstart

### Environment Setup

Install the `qwen-tts` Python package from PyPI:

```bash
pip install -U qwen-tts
```

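### Python Package Usage

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the model onto the first GPU in bfloat16.
# attn_implementation="flash_attention_2" needs the flash-attn package installed.
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Custom voice generation: a named speaker plus a natural-language style instruction
wavs, sr = model.generate_custom_voice(
    # "Actually, I've really noticed that I'm someone who is very good at reading other people's emotions."
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
    language="Chinese",
    speaker="Vivian",
    instruct="用特别愤怒的语气说",  # "Say it in a very angry tone."
)
# Write the first generated waveform to disk at the returned sample rate
sf.write("output.wav", wavs[0], sr)
```
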
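The example above hard-codes `attn_implementation="flash_attention_2"`, which depends on the separately installed `flash-attn` package. As a hedged sketch, the hypothetical helpers below pick a value based on whether that package can be imported. The `"sdpa"` fallback is an assumption borrowed from Hugging Face Transformers conventions; this README does not confirm that `qwen-tts` accepts it.

```python
import importlib.util


def has_flash_attn() -> bool:
    """Return True if the flash_attn package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attn_implementation(flash_attn_available: bool) -> str:
    """Hypothetical helper: choose an attention backend name.

    "sdpa" is an assumed fallback (PyTorch scaled-dot-product attention),
    not a value documented by this README.
    """
    return "flash_attention_2" if flash_attn_available else "sdpa"


print(pick_attn_implementation(has_flash_attn()))
```

The returned string would then be passed as `attn_implementation=` in the `from_pretrained` call shown earlier.
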
## Evaluation

Zero-shot speech generation on the Seed-TTS test set, measured by Word Error Rate (WER, lower is better):

| Model | test-zh | test-en |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |

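For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and a transcript of the generated speech, divided by the number of reference words. The sketch below illustrates only that final scoring step. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the benchmark's actual pipeline, which would also need an ASR system and text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.
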
## Citation

If you find our paper and code useful in your research, please consider giving the repository a star ⭐ and citing the paper 📝:

```BibTeX
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```