---
license: apache-2.0
language:
- zh
- en
- fr
- es
- ja
- ko
- it
- ru
- de
pipeline_tag: text-to-speech
---

![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)

## 👉🏻 CosyVoice 👈🏻

**Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [HuggingFace](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)

**CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)

**CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)

## Highlight🔥

**Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLMs), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.

### Key Features
- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multilingual and cross-lingual zero-shot voice cloning.
- **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
- **Pronunciation Inpainting**: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, providing finer control and making the system suitable for production use.
- **Text Normalization**: Reads numbers, special symbols, and various text formats without a traditional frontend module.
- **Bi-Streaming**: Supports both streaming text input and streaming audio output, achieving latency as low as 150 ms while maintaining high-quality audio; see the sketch after this list.
- **Instruct Support**: Supports instructions for language, dialect, emotion, speed, volume, and more.

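The Basic Usage section below shows the non-streaming API (`stream=False`). The following is a minimal streaming sketch, assuming that passing `stream=True` to the same call yields audio chunks incrementally; actual chunk sizes and latency depend on your hardware.

``` python
# Minimal streaming sketch (assumption: stream=True yields incremental audio chunks,
# mirroring the stream=False calls shown in the Basic Usage section below).
import sys
sys.path.append('third_party/Matcha-TTS')
import torch
import torchaudio
from cosyvoice.cli.cosyvoice import AutoModel

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
chunks = []
for chunk in cosyvoice.inference_zero_shot(
        'CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.',
        'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
        './asset/zero_shot_prompt.wav', stream=True):
    # Each chunk is available as soon as it is generated; play or forward it here for low latency.
    chunks.append(chunk['tts_speech'])
torchaudio.save('zero_shot_streaming.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)
```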

## Roadmap

- [x] 2025/12

    - [x] Release the Fun-CosyVoice3-0.5B-2512 base model, the RL model, and their training/inference scripts
    - [x] Release the Fun-CosyVoice3-0.5B ModelScope Gradio space

- [x] 2025/08

    - [x] Add Triton TensorRT-LLM runtime support and CosyVoice2 GRPO training support, thanks to the contribution from Yuekai Zhang (NVIDIA)

- [x] 2025/07

    - [x] Release the Fun-CosyVoice 3.0 evaluation set

- [x] 2025/05

    - [x] Add CosyVoice2-0.5B vLLM support

- [x] 2024/12

    - [x] Release the 25 Hz CosyVoice2-0.5B model

- [x] 2024/09

    - [x] 25 Hz CosyVoice-300M base model
    - [x] 25 Hz CosyVoice-300M voice conversion function

- [x] 2024/08

    - [x] Repetition-aware sampling (RAS) inference for LLM stability
    - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization

- [x] 2024/07

    - [x] Flow matching training support
    - [x] WeTextProcessing support when ttsfrd is not available
    - [x] FastAPI server and client

## Evaluation

| Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |


## Install

### Clone and install

- Clone the repo
``` sh
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If cloning the submodule fails due to network issues, run the following command until it succeeds
cd CosyVoice
git submodule update --init --recursive
```

- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create Conda env:

``` sh
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

# If you encounter sox compatibility issues
# Ubuntu
sudo apt-get install sox libsox-dev
# CentOS
sudo yum install sox sox-devel
```

### Model download

``` python
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

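If you prefer to download from the ModelScope page linked above, a mirror of the snippet should work; this is a sketch assuming a recent `modelscope` SDK whose `snapshot_download` accepts `local_dir`.

``` python
# ModelScope mirror of the Hugging Face download above (assumption: the installed
# modelscope SDK is recent enough for snapshot_download to accept local_dir).
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
```
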
Optionally, you can unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization performance.

Note that this step is not required; if the `ttsfrd` package is not installed, wetext is used by default.

``` sh
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```

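As an illustrative sanity check (not part of the official tooling), you can confirm which text-normalization backend will be picked up; per the note above, wetext is the fallback when `ttsfrd` is absent.

``` python
# Illustrative check only: ttsfrd is used for text normalization when importable,
# otherwise CosyVoice falls back to wetext (see the note above).
try:
    import ttsfrd  # noqa: F401
    print('ttsfrd is installed and will be used for text normalization')
except ImportError:
    print('ttsfrd not found; wetext will be used by default')
```
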
### Basic Usage

``` python
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

""" CosyVoice3 Usage, check https://funaudiollm.github.io/cosyvoice3/ for more details
"""
cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
# en zero_shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# zh zero_shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L280
for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
                                                        './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# instruct usage, for supported control, check cosyvoice/utils/common.py#L28
for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# hotfix usage
for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
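
The instruct examples above demonstrate dialect and speaking-rate control. Emotion control follows the same `inference_instruct2` pattern; the instruction wording below (asking for a happy tone, in Chinese) is an illustrative assumption rather than a fixed keyword, and the snippet reuses the `cosyvoice` model and `torchaudio` import from the session above.

``` python
# Sketch of an emotion instruction via inference_instruct2, following the pattern above.
# The instruction phrasing is an assumed example; reuses cosyvoice and torchaudio
# from the Basic Usage session above.
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。', 'You are a helpful assistant. 请用开心的语气说这句话。<|endofprompt|>',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('instruct_emotion_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```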

## Discussion & Communication

You can discuss directly on [GitHub Issues](https://github.com/FunAudioLLM/CosyVoice/issues).

You can also scan the QR code below to join our official DingTalk chat group.

<img src="./asset/dingding.png" width="250px">

## Acknowledgements

1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).

## Citations

``` bibtex
@article{du2024cosyvoice,
  title={CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
  author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
  journal={arXiv preprint arXiv:2407.05407},
  year={2024}
}

@article{du2024cosyvoice2,
  title={CosyVoice 2: Scalable streaming speech synthesis with large language models},
  author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
  journal={arXiv preprint arXiv:2412.10117},
  year={2024}
}

@article{du2025cosyvoice,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}

@inproceedings{lyu2025build,
  title={Build LLM-Based Zero-Shot Streaming TTS System with CosyVoice},
  author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--2},
  year={2025},
  organization={IEEE}
}
```

## Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.