---
license: apache-2.0
pipeline_tag: text-to-speech
language:
- zh
- en
- ja
- ko
- de
- fr
- ru
- pt
- es
- it
tags:
- tts
- qwen
- audio
arxiv: 2601.15621
---

# Qwen3-TTS-12Hz-0.6B-CustomVoice

[Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) is a series of advanced multilingual, controllable, robust, streaming text-to-speech models developed by the Qwen team.

This checkpoint is the **0.6B CustomVoice** variant, built on the **12Hz** tokenizer. It provides 9 premium timbres (preset voices) and supports fine-grained style control over the target voice through natural language instructions, across 10 major languages.

- **Paper:** [Qwen3-TTS Technical Report](https://huggingface.co/papers/2601.15621)
- **GitHub:** [QwenLM/Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS)
- **Demo:** [Hugging Face Spaces](https://huggingface.co/spaces/Qwen/Qwen3-TTS)

## Key Features
* **Multilingual Synthesis**: Supports Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian.
* **Intelligent Control**: Adapts tone, rhythm, and emotional expression based on natural language instructions (e.g., "Speak in a very happy tone").
* **Low Latency**: Optimized for streaming generation with the Qwen3-TTS-Tokenizer-12Hz, achieving end-to-end synthesis latency as low as 97 ms.

## Quickstart

To use Qwen3-TTS, install the `qwen-tts` package:

```bash
pip install -U qwen-tts
```

### Sample Usage

```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Generate speech with specific instructions
wavs, sr = model.generate_custom_voice(
    # "Actually, I've really noticed that I'm someone who is especially
    # good at observing other people's emotions."
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
    language="Chinese",
    speaker="Vivian",
    # "Say it in a very angry tone."
    instruct="用特别愤怒的语气说",
)

# Save the generated audio
sf.write("output_custom_voice.wav", wavs[0], sr)
```

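The sample above hard-codes `attn_implementation="flash_attention_2"`, which in Hugging Face Transformers requires the separately installed `flash-attn` package. The following is a minimal sketch of a fallback. It assumes that `Qwen3TTSModel.from_pretrained` accepts the same `attn_implementation` values as Transformers models, including `"sdpa"`; this model card does not confirm that. `pick_attn_implementation` is a hypothetical helper name, not part of `qwen-tts`.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash-attn is importable, else "sdpa".

    Assumption: the loader accepts the same attn_implementation values as
    Hugging Face Transformers. Hypothetical helper, not part of qwen-tts.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash_attn else "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

If the assumption holds, passing `attn_implementation=pick_attn_implementation()` in place of the literal string lets the same script run on machines without `flash-attn`.
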
## Supported Speakers

For `Qwen3-TTS-12Hz-0.6B-CustomVoice`, the following speakers are supported. We recommend using each speaker’s native language for the best results:

| Speaker | Voice Description | Native Language |
| --- | --- | --- |
| Vivian | Bright young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice. | Chinese (Beijing) |
| Eric | Lively Chengdu male voice. | Chinese (Sichuan) |
| Ryan | Dynamic male voice with rhythm. | English |
| Aiden | Sunny American male voice. | English |
| Ono_Anna | Playful Japanese female voice. | Japanese |
| Sohee | Warm Korean female voice. | Korean |

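To follow the native-language recommendation programmatically, the table can be encoded as a lookup. This is a minimal sketch: `SPEAKER_LANGUAGE`, `speakers_for`, and `build_request` are hypothetical names, not part of `qwen-tts`. It also assumes that dialect speakers (Dylan, Eric) are requested with `language="Chinese"`, which this model card does not state explicitly.

```python
from typing import Dict, List

# Speaker -> native language, copied from the table above.
SPEAKER_LANGUAGE: Dict[str, str] = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese (Beijing)",
    "Eric": "Chinese (Sichuan)",
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> List[str]:
    """List speakers whose native language matches, dialects included."""
    return [
        name
        for name, native in SPEAKER_LANGUAGE.items()
        if native == language or native.startswith(language + " (")
    ]


def build_request(text: str, speaker: str, instruct: str) -> Dict[str, str]:
    """Build keyword arguments using the speaker's native language.

    Assumption: the dialect note in parentheses is dropped, so Dylan and
    Eric are requested with language="Chinese".
    """
    if speaker not in SPEAKER_LANGUAGE:
        raise ValueError(f"Unsupported speaker: {speaker!r}")
    language = SPEAKER_LANGUAGE[speaker].split(" (")[0]
    return {
        "text": text,
        "language": language,
        "speaker": speaker,
        "instruct": instruct,
    }


print(speakers_for("English"))
print(build_request("Hello there.", "Ryan", "Speak in a very happy tone"))
```

With a model loaded as in Sample Usage, the result can be unpacked into the call: `model.generate_custom_voice(**build_request(...))`.
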
## Citation
If you find Qwen3-TTS useful for your research, please consider citing:

```bibtex
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}
```