---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---

# VoxCPM2

VoxCPM2 is a tokenizer-free, diffusion autoregressive text-to-speech model with 2B parameters. It supports 30 languages, outputs 48kHz audio, and was trained on over 2 million hours of multilingual speech data.

[![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
[![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
[![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
[![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
[![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)
[![Lark](https://img.shields.io/badge/飞书群-VoxCPM-00D6B9?logo=lark&logoColor=white)](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=acds0b9d-23d8-4d7e-b696-d200f3e22a7f)

## Highlights

- 🌍 30-Language Multilingual: No language tag needed; input text in any supported language directly.
- 🎨 Voice Design: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace, and so on); no reference audio required.
- 🎛️ Controllable Cloning: Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre.
- 🎙️ Ultimate Cloning: Provide reference audio plus its transcript for audio-continuation cloning that faithfully reproduces the vocal nuances of the reference.
- 🔊 48kHz Studio-Quality Output: Accepts 16kHz reference audio and outputs 48kHz via AudioVAE V2's built-in super-resolution; no external upsampler needed.
- 🧠 Context-Aware Synthesis: Automatically infers appropriate prosody and expressiveness from the text content.
- ⚡ Real-Time Streaming: RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm).
- 📜 Fully Open-Source & Commercial-Ready: Apache-2.0 license, free for commercial use.

<details>
<summary><b>Supported Languages (30)</b></summary>

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan dialect (河南话), Shaanxi dialect (陕西话), Shandong dialect (山东话), Tianjin dialect (天津话), Hokkien (闽南话)

</details>

## Quick Start

### Installation

```bash
pip install voxcpm
```

Requirements: Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)

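The stated minimums can be checked before the model is loaded. The following is a minimal sketch, not part of the `voxcpm` package: `meets_minimum` and `check_environment` are hypothetical helper names introduced here for illustration. It assumes PyTorch reports its own version as `torch.__version__` and the CUDA version it was built against as `torch.version.cuda` (which is `None` on CPU-only builds).

```python
import re
import sys


def meets_minimum(version: str, minimum: tuple) -> bool:
    """Return True if a dotted version string is at least `minimum`.

    Only the leading numeric components are compared, so a local build
    suffix such as "+cu121" is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return False
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad the shorter tuple with zeros so "12" compares equal to (12, 0).
    width = max(len(parts), len(minimum))
    parts += (0,) * (width - len(parts))
    minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return parts >= minimum


def check_environment() -> dict:
    """Report which of the stated minimums this environment meets."""
    report = {"python": sys.version_info[:2] >= (3, 10)}
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(str(torch.__version__), (2, 5, 0))
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = cuda is not None and meets_minimum(cuda, (12, 0))
    return report
```

Calling `check_environment()` returns a dictionary with one boolean per requirement, which makes it easy to print a readable report or fail early in a setup script.
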
### Text-to-Speech

```python
from voxcpm import VoxCPM
import soundfile as sf

# Load the pretrained checkpoint; the optional denoiser is skipped here.
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,           # guidance scale
    inference_timesteps=10,  # number of diffusion inference steps
)
# Save at the model's output sample rate.
sf.write("output.wav", wav, model.tts_model.sample_rate)
```

### Voice Design

Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:

```python
# Reuses `model` and `sf` from the Text-to-Speech example.
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```

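When descriptions come from user input or a configuration file, it can help to build the prefixed string in one place. This is a small sketch of the parenthesized-prefix convention described above; `with_description` is a hypothetical helper, not a `voxcpm` function. Rejecting parentheses inside the description is an assumption made here to keep the end of the description unambiguous.

```python
def with_description(description: str, content: str) -> str:
    """Prefix `content` with a parenthesized voice or style description."""
    description = description.strip()
    if not description:
        return content  # nothing to prepend; plain text-to-speech
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could blur where the description ends.
        raise ValueError("description must not contain parentheses")
    return f"({description}){content}"


# Usage with a loaded model (not executed here):
# wav = model.generate(
#     text=with_description("A young woman, gentle and sweet voice",
#                           "Hello, welcome to VoxCPM2!"),
#     cfg_value=2.0,
#     inference_timesteps=10,
# )
```
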
### Controllable Voice Cloning

```python
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```

### Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for highest similarity:

```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```

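The two cloning modes differ only in which keyword arguments are passed to `model.generate`. As a sketch of that difference, the hypothetical helper below (`clone_kwargs` is not part of `voxcpm`) assembles the arguments shown in the examples above. Combining a style prefix with a transcript is not documented here, so the sketch refuses that combination rather than guessing at its behavior.

```python
from typing import Optional


def clone_kwargs(
    text: str,
    reference_wav: str,
    transcript: Optional[str] = None,
    style: Optional[str] = None,
) -> dict:
    """Build keyword arguments for `model.generate` for either cloning mode."""
    if transcript is not None and style:
        # Not documented above, so this sketch does not guess.
        raise ValueError("pass either a transcript or a style, not both")
    if style:
        # Controllable cloning: style guidance goes in a parenthesized prefix.
        text = f"({style}){text}"
    kwargs = {"text": text, "reference_wav_path": reference_wav}
    if transcript is not None:
        # Ultimate cloning: the same clip doubles as the continuation prompt.
        kwargs["prompt_wav_path"] = reference_wav
        kwargs["prompt_text"] = transcript
    return kwargs


# Usage with a loaded model (not executed here):
# wav = model.generate(**clone_kwargs("Hello.", "speaker.wav", style="cheerful tone"))
```
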
### Streaming

```python
import numpy as np

# Reuses `model` and `sf` from the Text-to-Speech example.
chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```

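To see how streaming behaves on your own hardware, you can time the chunk iterator. The real-time factor (RTF) is wall-clock synthesis time divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below uses a hypothetical helper, `measure_stream`, and assumes each chunk is a one-dimensional array of samples, as the concatenation in the example above suggests. The `clock` parameter exists only so the function can be tested with a fake timer.

```python
import time
from typing import Callable, Iterable, Sized


def measure_stream(
    chunks: Iterable[Sized],
    sample_rate: int,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a chunk iterator and report latency figures.

    Returns the time to the first chunk, the total audio duration, and the
    real-time factor (wall-clock synthesis time / audio duration).
    """
    start = clock()
    first_chunk_at = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()
        total_samples += len(chunk)  # assumes 1-D chunks of samples
    elapsed = clock() - start
    if total_samples == 0:
        raise ValueError("the stream produced no audio")
    audio_seconds = total_samples / sample_rate
    return {
        "first_chunk_seconds": first_chunk_at - start,
        "audio_seconds": audio_seconds,
        "rtf": elapsed / audio_seconds,
    }


# Usage with a loaded model (not executed here):
# stats = measure_stream(
#     model.generate_streaming(text="Streaming is easy with VoxCPM!"),
#     model.tts_model.sample_rate,
# )
```

Note that this helper discards the audio; collect the chunks separately if you also want to save the result.
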
## Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4, 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours of multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |

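A few practical figures follow from the table by simple arithmetic. At 6.25 Hz, each LM token spans 160 ms of audio, which is 7680 output samples at 48kHz. If every position in an 8192-token sequence were an audio token, that would cover roughly 21.8 minutes; treat this as an upper bound under that assumption, since the table does not say how the sequence is divided between text and audio. At the quoted RTFs, 10 seconds of audio would take about 3.0 s (standard) or 1.3 s (Nano-vLLM) to synthesize on an RTX 4090. The sketch below reproduces these numbers; the function names are illustrative only.

```python
TOKEN_RATE_HZ = 6.25         # LM token rate from the table
OUTPUT_RATE_HZ = 48_000      # decoder output sample rate
REFERENCE_RATE_HZ = 16_000   # encoder input sample rate
MAX_TOKENS = 8192            # maximum sequence length


def derived_figures() -> dict:
    """Derive per-token and sequence-level figures from the table values."""
    return {
        "ms_per_token": 1000.0 / TOKEN_RATE_HZ,
        "output_samples_per_token": OUTPUT_RATE_HZ / TOKEN_RATE_HZ,
        "upsampling_factor": OUTPUT_RATE_HZ / REFERENCE_RATE_HZ,
        # Upper bound only: assumes every sequence position is an audio token.
        "max_audio_minutes": MAX_TOKENS / TOKEN_RATE_HZ / 60.0,
    }


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to synthesize `audio_seconds` at a given RTF."""
    return audio_seconds * rtf
```
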
## Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.

See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).

## Fine-tuning

VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:

```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```

See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.

## Limitations

- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
- Performance varies across languages depending on training data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Using VoxCPM2 for impersonation, fraud, or disinformation is strictly forbidden. AI-generated content should be clearly labeled.

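Because Voice Design and Style Control results vary between runs, one simple workflow is to generate a few candidates and pick the best by ear. This is a minimal sketch with a hypothetical helper, `generate_candidates`, which takes the generation function as an argument so it does not depend on any particular model object. It assumes that repeated calls with the same arguments can yield different audio, as the first limitation above implies.

```python
from typing import Any, Callable, List


def generate_candidates(
    generate: Callable[..., Any], attempts: int = 3, **kwargs: Any
) -> List[Any]:
    """Call `generate(**kwargs)` several times and return every result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return [generate(**kwargs) for _ in range(attempts)]


# Usage with a loaded model (not executed here):
# candidates = generate_candidates(
#     model.generate,
#     attempts=3,
#     text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
# )
# for i, wav in enumerate(candidates):
#     sf.write(f"candidate_{i}.wav", wav, model.tts_model.sample_rate)
```
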
## Citation

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

## License

Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.