---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---

# VoxCPM2

**VoxCPM2** is a tokenizer-free, diffusion autoregressive text-to-speech (TTS) model with **2B parameters**, support for **30 languages**, and **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.

[![GitHub](https://img.shields.io/badge/GitHub-VoxCPM-blue?logo=github)](https://github.com/OpenBMB/VoxCPM)
[![Docs](https://img.shields.io/badge/Docs-ReadTheDocs-8CA1AF)](https://voxcpm.readthedocs.io/en/latest/)
[![Demo](https://img.shields.io/badge/Live%20Playground-Demo-orange)](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo)
[![Audio Samples](https://img.shields.io/badge/Audio%20Samples-Demo%20Page-green)](https://openbmb.github.io/voxcpm2-demopage)
[![Discord](https://img.shields.io/badge/Discord-VoxCPM-5865F2?logo=discord&logoColor=white)](https://discord.gg/KZUx7tVNwz)
[![Lark](https://img.shields.io/badge/飞书群-VoxCPM-00D6B9?logo=lark&logoColor=white)](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=acds0b9d-23d8-4d7e-b696-d200f3e22a7f)

## Highlights

- 🌍 **30-Language Multilingual**: No language tag needed; pass text in any supported language directly
- 🎨 **Voice Design**: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
- 🎛️ **Controllable Cloning**: Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- 🎙️ **Ultimate Cloning**: Provide reference audio plus its transcript for audio-continuation cloning that faithfully reproduces vocal nuances
- 🔊 **48kHz Studio-Quality Output**: Accepts 16kHz reference audio and outputs 48kHz via AudioVAE V2's built-in super-resolution; no external upsampler needed
- 🧠 **Context-Aware Synthesis**: Automatically infers appropriate prosody and expressiveness from text content
- ⚡ **Real-Time Streaming**: Real-time factor (RTF) as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
- 📜 **Fully Open-Source & Commercial-Ready**: Apache-2.0 license, free for commercial use

<details>
<summary><b>Supported Languages (30)</b></summary>

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese Dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Hokkien (Southern Min)

</details>

## Quick Start

### Installation

```bash
pip install voxcpm
```

**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)

### Text-to-Speech

```python
from voxcpm import VoxCPM
import soundfile as sf

# Load the pretrained checkpoint; load_denoiser=False skips loading the denoiser
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
# Save using the sample rate reported by the model
sf.write("output.wav", wav, model.tts_model.sample_rate)
```

### Voice Design

Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:

```python
# Reuses `model` and `sf` from the Text-to-Speech example
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```

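The parenthesized-prefix convention above can be wrapped in a small helper so that the description and the content stay separate in calling code. This is a minimal sketch: `build_design_text` is a hypothetical name and not part of the `voxcpm` package, and rejecting parentheses inside the description is a conservative assumption rather than a documented requirement.

```python
def build_design_text(description: str, content: str) -> str:
    """Return "(description)content", the prompt layout described above.

    Hypothetical helper; not part of the voxcpm package.
    """
    description = description.strip()
    content = content.strip()
    if not description:
        # No description given: fall back to plain text-to-speech input.
        return content
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could blur where the description ends.
        raise ValueError("description must not contain parentheses")
    return f"({description}){content}"


text = build_design_text(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(text)
```

The resulting string can then be passed as `text` to `model.generate(...)` in the same way as the example above.
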
### Controllable Voice Cloning

```python
# Basic cloning from a reference clip
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control: the parenthesized prefix steers the delivery
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```

### Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for the highest similarity:

```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```

### Streaming

```python
import numpy as np

# Collect audio chunks as they are produced, then join them into one waveform
chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```

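As an alternative to collecting every chunk before saving, chunks can be written to disk as they arrive. The sketch below uses only the Python standard library (`wave`, `struct`) and a stand-in generator, `fake_stream`, in place of `model.generate_streaming(...)`. It assumes each chunk is a sequence of mono float samples in [-1, 1], and it hard-codes 48000 Hz where real code would read `model.tts_model.sample_rate`. Both `write_stream_to_wav` and `fake_stream` are illustrative names, not part of `voxcpm`.

```python
import math
import struct
import wave
from typing import Iterable, Iterator, List, Sequence


def write_stream_to_wav(chunks: Iterable[Sequence[float]], path: str,
                        sample_rate: int) -> int:
    """Write mono float chunks to a 16-bit PCM WAV file as they arrive.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit samples
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1], then scale to the int16 range.
            pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            # "<" forces little-endian, which is what WAV expects.
            wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


def fake_stream(sample_rate: int, n_chunks: int = 5,
                chunk_len: int = 4800) -> Iterator[List[float]]:
    """Stand-in for model.generate_streaming(...): a 440 Hz tone in chunks."""
    for i in range(n_chunks):
        start = i * chunk_len
        yield [
            0.2 * math.sin(2 * math.pi * 440 * (start + n) / sample_rate)
            for n in range(chunk_len)
        ]


written = write_stream_to_wav(fake_stream(48000), "streaming_incremental.wav", 48000)
print(f"wrote {written} samples")
```

Because the writer only iterates over its input, the same function should accept the chunks from the streaming loop above, provided they are one-dimensional float arrays.
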
## Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4; 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours of multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |

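Two figures in the table lend themselves to quick arithmetic. The real-time factor is conventionally the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below also reads the 6.25 Hz LM token rate as 6.25 LM tokens per second of audio, which is an interpretation of the table rather than something this README states explicitly. The function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 48_000    # output sample rate from the table
LM_TOKEN_RATE_HZ = 6.25    # LM token rate from the table


def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def lm_tokens_for(audio_seconds: float, rate_hz: float = LM_TOKEN_RATE_HZ) -> int:
    """Assumed reading: one LM token per 1/rate_hz seconds of audio."""
    return math.ceil(audio_seconds * rate_hz)


# 3 s of compute for 10 s of 48 kHz audio corresponds to an RTF of 0.3.
rtf = real_time_factor(3.0, 10 * SAMPLE_RATE_HZ)
# Output samples covered by one LM token under the assumption above.
samples_per_token = SAMPLE_RATE_HZ / LM_TOKEN_RATE_HZ
print(rtf, lm_tokens_for(10.0), samples_per_token)
```

In practice, `synthesis_seconds` would come from timing a `model.generate(...)` call with `time.perf_counter()`, and `num_samples` from `len(wav)`, assuming `wav` is a one-dimensional array.
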
## Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.

See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).

## Fine-tuning

VoxCPM2 supports both full supervised fine-tuning (SFT) and LoRA fine-tuning with as little as 5–10 minutes of audio:

```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```

See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.

## Limitations

- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
- Performance varies across languages depending on training data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Using VoxCPM2 for impersonation, fraud, or disinformation is **strictly forbidden**. AI-generated content should be clearly labeled.

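The advice to generate a few times can be made routine with a small loop that collects several candidates for side-by-side listening. This sketch accepts any callable that maps text to audio, so `generate_fn` stands in for something like `lambda t: model.generate(text=t, cfg_value=2.0, inference_timesteps=10)`. `collect_candidates` and `dummy_generate` are hypothetical names used only for illustration.

```python
from typing import Callable, List, TypeVar

Audio = TypeVar("Audio")


def collect_candidates(generate_fn: Callable[[str], Audio], text: str,
                       attempts: int = 3) -> List[Audio]:
    """Call generate_fn `attempts` times and return every result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return [generate_fn(text) for _ in range(attempts)]


calls: List[str] = []


def dummy_generate(text: str) -> str:
    """Stand-in for a real synthesis call; records each invocation."""
    calls.append(text)
    return f"take-{len(calls)}"


candidates = collect_candidates(dummy_generate, "(calm, slow pace)Good evening.")
print(candidates)
```

With a real model, each candidate could be saved to its own file and the preferred take chosen by ear.
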
## Citation

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

## License

Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.