---
license: apache-2.0
pipeline_tag: text-to-speech
language:
- zh
- en
- ja
- ko
- de
- fr
- ru
- pt
- es
- it
tags:
- audio
- tts
- voice-clone
---

# Qwen3-TTS-12Hz-0.6B-Base

[**Qwen3-TTS Technical Report**](https://huggingface.co/papers/2601.15621) | [**GitHub Repository**](https://github.com/QwenLM/Qwen3-TTS) | [**Hugging Face Demo**](https://huggingface.co/spaces/Qwen/Qwen3-TTS)

Qwen3-TTS is a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based voice control.

This checkpoint is the **0.6B Base model**, which supports rapid voice cloning from a short user-provided reference recording.

## Quickstart

### Installation

```bash
pip install -U qwen-tts
# Optional: for optimized performance
pip install -U flash-attn --no-build-isolation
```
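After installing, a quick stdlib-only check confirms that the packages are importable (this helper is illustrative, not part of qwen-tts):

```python
import importlib.util

# Report whether each package is importable in the current environment.
for name in ("qwen_tts", "flash_attn"):
    spec = importlib.util.find_spec(name)
    status = "ok" if spec is not None else "missing"
    print(f"{name}: {status}")
```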

### Sample Usage (Voice Clone)

To clone a voice and synthesize new content using the Base model, you can use the following code snippet:

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Reference audio (and its transcript) for cloning
ref_audio = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone.wav"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."

# Generate speech in the cloned voice
wavs, sr = model.generate_voice_clone(
    text="I am solving the equation: x = [-b ± √(b²-4ac)] / 2a? Nobody can — it's a disaster (◍•͈⌔•͈◍), very sad!",
    language="English",
    ref_audio=ref_audio,
    ref_text=ref_text,
)

# Save the resulting audio
sf.write("output_voice_clone.wav", wavs[0], sr)
```
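Since cloning works from roughly 3 seconds of reference speech, you may want to trim a longer local recording before passing it in. A stdlib-only sketch (the `trim_wav` helper and file names are illustrative, not part of the qwen-tts API; the demo synthesizes a tone in place of a real clip):

```python
import math
import struct
import wave

def trim_wav(src: str, dst: str, seconds: float = 3.0) -> float:
    """Copy the first `seconds` of a PCM WAV file; return the trimmed duration."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        n_frames = min(r.getnframes(), int(seconds * r.getframerate()))
        frames = r.readframes(n_frames)
    with wave.open(dst, "wb") as w:
        w.setparams(params)  # the header's frame count is fixed up on close
        w.writeframes(frames)
    return n_frames / params.framerate

# Demo on a synthetic 5-second 440 Hz tone standing in for a real reference clip.
sr = 16000
with wave.open("ref_full.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit PCM
    w.setframerate(sr)
    tone = (int(12000 * math.sin(2 * math.pi * 440 * t / sr)) for t in range(5 * sr))
    w.writeframes(b"".join(struct.pack("<h", s) for s in tone))

print(trim_wav("ref_full.wav", "ref_3s.wav"))  # 3.0
```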

## Overview

### Introduction

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/qwen3_tts_introduction.png" width="90%"/>
</p>

Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles to meet global application needs. Key features:

* **Powerful Speech Representation**: Powered by the self-developed Qwen3-TTS-Tokenizer-12Hz, it achieves efficient acoustic compression and high-dimensional semantic modeling.
* **Universal End-to-End Architecture**: A discrete multi-codebook LM architecture enables full-information end-to-end speech modeling.
* **Ultra-Low-Latency Streaming Generation**: End-to-end synthesis latency as low as 97 ms, meeting the demands of real-time interactive scenarios.
* **Intelligent Text Understanding and Voice Control**: Supports speech generation driven by natural language instructions, allowing flexible control over multi-dimensional acoustic attributes.
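As a rough back-of-the-envelope reading of those numbers (assuming the "12Hz" in the tokenizer name means 12 codec frames per second of audio, which the card does not spell out):

```python
# One frame of a 12 Hz speech tokenizer covers 1/12 s of audio.
frame_ms = 1000 / 12
print(f"one codec frame spans {frame_ms:.1f} ms of audio")

# The quoted 97 ms end-to-end latency is thus on the order of a single
# codec frame plus model/vocoder overhead (illustrative arithmetic only).
print(f"latency beyond one frame: {97 - frame_ms:.1f} ms")
```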

### Model Architecture

<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/overview.png" width="80%"/>
</p>

## Citation

If you find this work useful, please consider citing the technical report:

```bibtex
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```