---
license: apache-2.0
language:
- zh
- en
- fr
- es
- ja
- ko
- it
- ru
- de
pipeline_tag: text-to-speech
---

![SVG Banners](https://svg-banners.vercel.app/api?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)

## 👉🏻 CosyVoice 👈🏻

**Fun-CosyVoice 3.0**: [Demos](https://funaudiollm.github.io/cosyvoice3/); [Paper](https://arxiv.org/abs/2505.17589); [Modelscope](https://www.modelscope.cn/models/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [HuggingFace](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512); [CV3-Eval](https://github.com/FunAudioLLM/CV3-Eval)

**CosyVoice 2.0**: [Demos](https://funaudiollm.github.io/cosyvoice2/); [Paper](https://arxiv.org/abs/2412.10117); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice2-0.5B); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice2-0.5B)

**CosyVoice 1.0**: [Demos](https://fun-audio-llm.github.io); [Paper](https://funaudiollm.github.io/pdf/CosyVoice_v1.pdf); [Modelscope](https://www.modelscope.cn/models/iic/CosyVoice-300M); [HuggingFace](https://huggingface.co/FunAudioLLM/CosyVoice-300M)

## Highlight🔥

**Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLMs), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.

### Key Features
- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), and supports both multilingual and cross-lingual zero-shot voice cloning.
- **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.
- **Pronunciation Inpainting**: Supports pronunciation inpainting with Chinese Pinyin and English CMU phonemes, providing finer control and making the system suitable for production use.
- **Text Normalization**: Reads numbers, special symbols, and various text formats without a traditional frontend module.
- **Bi-Streaming**: Supports both streaming text input and streaming audio output, achieving latency as low as 150 ms while maintaining high-quality audio; see the sketch after this list.
- **Instruct Support**: Supports instructions for language, dialect, emotion, speed, volume, and more.

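The Basic Usage section below shows the non-streaming API (`stream=False`). The following is a minimal streaming sketch, assuming that passing `stream=True` to the same call yields audio chunks incrementally; actual chunk sizes and latency depend on your hardware.

``` python
# Minimal streaming sketch (assumption: stream=True yields incremental audio chunks,
# mirroring the stream=False calls shown in the Basic Usage section below).
import sys
sys.path.append('third_party/Matcha-TTS')
import torch
import torchaudio
from cosyvoice.cli.cosyvoice import AutoModel

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
chunks = []
for chunk in cosyvoice.inference_zero_shot(
        'CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.',
        'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
        './asset/zero_shot_prompt.wav', stream=True):
    # Each chunk is available as soon as it is generated; play or forward it here for low latency.
    chunks.append(chunk['tts_speech'])
torchaudio.save('zero_shot_streaming.wav', torch.cat(chunks, dim=1), cosyvoice.sample_rate)
```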

## Roadmap

- [x] 2025/12

    - [x] Release the Fun-CosyVoice3-0.5B-2512 base model, the RL model, and their training/inference scripts
    - [x] Release the Fun-CosyVoice3-0.5B ModelScope Gradio space

- [x] 2025/08

    - [x] Add Triton TensorRT-LLM runtime support and CosyVoice2 GRPO training support, thanks to the contribution from Yuekai Zhang (NVIDIA)

- [x] 2025/07

    - [x] Release the Fun-CosyVoice 3.0 evaluation set

- [x] 2025/05

    - [x] Add CosyVoice2-0.5B vLLM support

- [x] 2024/12

    - [x] Release the 25 Hz CosyVoice2-0.5B model

- [x] 2024/09

    - [x] 25 Hz CosyVoice-300M base model
    - [x] 25 Hz CosyVoice-300M voice conversion function

- [x] 2024/08

    - [x] Repetition-aware sampling (RAS) inference for LLM stability
    - [x] Streaming inference mode support, including KV cache and SDPA for RTF optimization

- [x] 2024/07

    - [x] Flow matching training support
    - [x] WeTextProcessing support when ttsfrd is not available
    - [x] FastAPI server and client

## Evaluation

| Model | Open-Source | Model Size | test-zh<br>CER (%) ↓ | test-zh<br>Speaker Similarity (%) ↑ | test-en<br>WER (%) ↓ | test-en<br>Speaker Similarity (%) ↑ | test-hard<br>CER (%) ↓ | test-hard<br>Speaker Similarity (%) ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |


## Install

### Clone and install

- Clone the repo
``` sh
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If cloning the submodule fails due to network issues, run the following command until it succeeds
cd CosyVoice
git submodule update --init --recursive
```

- Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
- Create Conda env:

``` sh
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

# If you encounter sox compatibility issues
# Ubuntu
sudo apt-get install sox libsox-dev
# CentOS
sudo yum install sox sox-devel
```

### Model download

``` python
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

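If you prefer to download from the ModelScope page linked above, a mirror of the snippet should work; this is a sketch assuming a recent `modelscope` SDK whose `snapshot_download` accepts `local_dir`.

``` python
# ModelScope mirror of the Hugging Face download above (assumption: the installed
# modelscope SDK is recent enough for snapshot_download to accept local_dir).
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
```
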
Optionally, you can unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization performance.

Note that this step is not required; if the `ttsfrd` package is not installed, wetext is used by default.

``` sh
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```

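As an illustrative sanity check (not part of the official tooling), you can confirm which text-normalization backend will be picked up; per the note above, wetext is the fallback when `ttsfrd` is absent.

``` python
# Illustrative check only: ttsfrd is used for text normalization when importable,
# otherwise CosyVoice falls back to wetext (see the note above).
try:
    import ttsfrd  # noqa: F401
    print('ttsfrd is installed and will be used for text normalization')
except ImportError:
    print('ttsfrd not found; wetext will be used by default')
```
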
### Basic Usage

``` python
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

""" CosyVoice3 Usage, check https://funaudiollm.github.io/cosyvoice3/ for more details
"""
cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
# en zero_shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
# zh zero_shot usage
for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L280
for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
                                                        './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# instruct usage, for supported control, check cosyvoice/utils/common.py#L28
for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。', 'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。', 'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

# hotfix usage
for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。', 'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
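
The instruct examples above demonstrate dialect and speaking-rate control. Emotion control follows the same `inference_instruct2` pattern; the instruction wording below (asking for a happy tone, in Chinese) is an illustrative assumption rather than a fixed keyword, and the snippet reuses the `cosyvoice` model and `torchaudio` import from the session above.

``` python
# Sketch of an emotion instruction via inference_instruct2, following the pattern above.
# The instruction phrasing is an assumed example; reuses cosyvoice and torchaudio
# from the Basic Usage session above.
for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐。', 'You are a helpful assistant. 请用开心的语气说这句话。<|endofprompt|>',
                                                    './asset/zero_shot_prompt.wav', stream=False)):
    torchaudio.save('instruct_emotion_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```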

## Discussion & Communication

You can discuss directly on [GitHub Issues](https://github.com/FunAudioLLM/CosyVoice/issues).

You can also scan the QR code below to join our official DingTalk chat group.

<img src="./asset/dingding.png" width="250px">

## Acknowledgements

1. We borrowed a lot of code from [FunASR](https://github.com/modelscope/FunASR).
2. We borrowed a lot of code from [FunCodec](https://github.com/modelscope/FunCodec).
3. We borrowed a lot of code from [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS).
4. We borrowed a lot of code from [AcademiCodec](https://github.com/yangdongchao/AcademiCodec).
5. We borrowed a lot of code from [WeNet](https://github.com/wenet-e2e/wenet).

## Citations

``` bibtex
@article{du2024cosyvoice,
  title={CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
  author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
  journal={arXiv preprint arXiv:2407.05407},
  year={2024}
}

@article{du2024cosyvoice2,
  title={CosyVoice 2: Scalable streaming speech synthesis with large language models},
  author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
  journal={arXiv preprint arXiv:2412.10117},
  year={2024}
}

@article{du2025cosyvoice,
  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
  author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
  journal={arXiv preprint arXiv:2505.17589},
  year={2025}
}

@inproceedings{lyu2025build,
  title={Build LLM-Based Zero-Shot Streaming TTS System with CosyVoice},
  author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--2},
  year={2025},
  organization={IEEE}
}
```

## Disclaimer

The content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. If any content infringes on your rights, please contact us to request its removal.