---
language:
- zh
- en
- ar
- my
- da
- nl
- fi
- fr
- de
- el
- he
- hi
- id
- it
- ja
- km
- ko
- lo
- ms
- no
- pl
- pt
- ru
- es
- sw
- sv
- tl
- th
- tr
- vi
license: apache-2.0
library_name: voxcpm
tags:
- text-to-speech
- tts
- multilingual
- voice-cloning
- voice-design
- diffusion
- audio
pipeline_tag: text-to-speech
---

# VoxCPM2

**VoxCPM2** is a tokenizer-free, diffusion-autoregressive text-to-speech (TTS) model with **2B parameters**, support for **30 languages**, and **48kHz** audio output, trained on over **2 million hours** of multilingual speech data.

[GitHub](https://github.com/OpenBMB/VoxCPM) ·
[Documentation](https://voxcpm.readthedocs.io/en/latest/) ·
[Demo](https://huggingface.co/spaces/OpenBMB/VoxCPM-Demo) ·
[Demo Page](https://openbmb.github.io/voxcpm2-demopage) ·
[Discord](https://discord.gg/KZUx7tVNwz) ·
[Feishu](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=acds0b9d-23d8-4d7e-b696-d200f3e22a7f)

## Highlights

- 🌍 **30-Language Multilingual**: No language tag is needed; enter text in any supported language directly
- 🎨 **Voice Design**: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
- 🎛️ **Controllable Cloning**: Clone a voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- 🎙️ **Ultimate Cloning**: Provide reference audio plus its transcript for audio-continuation cloning, so that vocal nuances are faithfully reproduced
- 🔊 **48kHz Studio-Quality Output**: Accepts 16kHz reference audio and outputs 48kHz audio via AudioVAE V2's built-in super-resolution; no external upsampler needed
- 🧠 **Context-Aware Synthesis**: Automatically infers appropriate prosody and expressiveness from the text content
- ⚡ **Real-Time Streaming**: Real-time factor (RTF) as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by [Nano-vLLM](https://github.com/a710128/nanovllm-voxcpm)
- 📜 **Fully Open-Source & Commercial-Ready**: Apache-2.0 license, free for commercial use

<details>
<summary><b>Supported Languages (30)</b></summary>

Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan dialect (河南话), Shaanxi dialect (陕西话), Shandong dialect (山东话), Tianjin dialect (天津话), Southern Min (闽南话)

</details>

## Quick Start

### Installation

```bash
pip install voxcpm
```

**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · [Full Quick Start →](https://voxcpm.readthedocs.io/en/latest/quickstart.html)

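Before installing, it can help to confirm that the environment meets the minimums listed above. The following is a minimal sketch, not part of the `voxcpm` package: the helper names `parse_version`, `meets_minimum`, and `check_environment` are hypothetical, and the CUDA check assumes that the CUDA version PyTorch was built with (`torch.version.cuda`) is a reasonable stand-in for the installed toolkit version.

```python
import re
import sys


def parse_version(text: str) -> tuple:
    """Return the leading numeric components, e.g. '2.5.1+cu121' -> (2, 5, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare two dotted versions, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Report which of the stated minimum requirements are satisfied."""
    report = {"python": sys.version_info >= (3, 10), "torch": False, "cuda": False}
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed
    report["torch"] = meets_minimum(str(torch.__version__), "2.5.0")
    # Assumption: the CUDA version PyTorch was built with (None on CPU-only
    # builds) stands in for the installed CUDA version.
    built_cuda = torch.version.cuda
    report["cuda"] = built_cuda is not None and meets_minimum(built_cuda, "12.0")
    return report


print(check_environment())
```

Any entry reported as `False` points to a requirement worth checking by hand; the sketch does not inspect the GPU driver itself.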
### Text-to-Speech

```python
from voxcpm import VoxCPM
import soundfile as sf

# Load the pretrained model without the denoiser
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
# Save the waveform at the model's output sample rate
sf.write("output.wav", wav, model.tts_model.sample_rate)
```

### Voice Design

Put the voice description in parentheses at the start of `text`, followed by the content to synthesize:

```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```

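The parenthesized prefix can also be assembled programmatically. This is a minimal sketch using a hypothetical helper, `build_design_text`, which is not part of the `voxcpm` package. It assumes the description should not itself contain parentheses, since the source does not say how nested parentheses would be handled.

```python
def build_design_text(description: str, content: str) -> str:
    """Prefix `content` with a parenthesized voice description.

    Hypothetical helper, not part of voxcpm. It mirrors the
    "(description)content" layout used in the Voice Design example.
    """
    description = description.strip()
    content = content.strip()
    if not description:
        # No description given: fall back to plain text-to-speech input.
        return content
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could make the prefix ambiguous.
        raise ValueError("description must not contain parentheses")
    return f"({description}){content}"


text = build_design_text(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(text)
```

The resulting string can be passed as `text` to `model.generate(...)`; the same layout applies to style control when cloning.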
### Controllable Voice Cloning

```python
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```

### Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both `reference_wav_path` and `prompt_wav_path` for the highest similarity:

```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```

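Because the same clip is passed twice and the transcript must be supplied, the arguments can be built in one place to avoid mismatches. The sketch below uses a hypothetical helper, `ultimate_clone_kwargs`, that is not part of the `voxcpm` package; it only enforces the two conditions stated in this section.

```python
def ultimate_clone_kwargs(text: str, clip_path: str, transcript: str) -> dict:
    """Build keyword arguments for the Ultimate Cloning call.

    Hypothetical helper, not part of voxcpm. It checks that a transcript
    is supplied and reuses one clip as both prompt and reference.
    """
    transcript = transcript.strip()
    if not transcript:
        raise ValueError("prompt_text must be the transcript of the clip")
    return {
        "text": text,
        "prompt_wav_path": clip_path,
        "prompt_text": transcript,
        "reference_wav_path": clip_path,  # same clip as the prompt
    }


kwargs = ultimate_clone_kwargs(
    "This is an ultimate cloning demonstration using VoxCPM2.",
    "speaker_reference.wav",
    "The transcript of the reference audio.",
)
print(kwargs)
# With a loaded model and a real clip: wav = model.generate(**kwargs)
```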
### Streaming

```python
import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```

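The streaming example collects every chunk before writing. As an alternative, chunks can be written to disk as they arrive. The following is a minimal sketch using only the Python standard library; `write_stream_to_wav` is a hypothetical helper, and it assumes each chunk is a 1-D sequence of float samples in the range [-1.0, 1.0], which the source does not state explicitly.

```python
import os
import sys
import tempfile
import wave
from array import array


def write_stream_to_wav(chunks, path, sample_rate):
    """Write float chunks to a 16-bit mono WAV file as they arrive.

    Assumption: each chunk is a 1-D sequence of floats in [-1.0, 1.0];
    a 1-D NumPy array also works because it is iterable.
    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(int(sample_rate))
        for chunk in chunks:
            # Clip each sample to [-1, 1], then scale to the int16 range.
            pcm = array(
                "h",
                (int(max(-1.0, min(1.0, float(x))) * 32767) for x in chunk),
            )
            if sys.byteorder == "big":
                pcm.byteswap()     # WAV sample data is little-endian
            wav_file.writeframes(pcm.tobytes())
            total += len(pcm)
    return total


# Demo with synthetic chunks so the sketch runs without a model.
demo_path = os.path.join(tempfile.gettempdir(), "voxcpm_streaming_demo.wav")
written = write_stream_to_wav([[0.0, 0.5], [1.0, -2.0]], demo_path, 48000)
print(written, "samples written to", demo_path)
```

With a loaded model, the generator returned by `model.generate_streaming(text=...)` and `model.tts_model.sample_rate` would take the place of the synthetic inputs. The per-sample Python loop is slow for long audio, so a NumPy-based conversion is likely preferable in practice.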
## Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4; 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours of multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |

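
A few figures in the table can be related by simple arithmetic. The sketch below assumes the LM token rate means tokens per second of audio, and it treats the maximum sequence length as if every position held an audio token, which gives only an upper bound (text tokens presumably share the same budget). Under those assumptions, one token covers 0.16 s of audio, or 2560 input samples at 16kHz and 7680 output samples at 48kHz, and 8192 tokens correspond to at most about 1311 s (roughly 21.8 minutes). RTF is taken in its usual sense of synthesis time divided by audio duration.

```python
LM_TOKEN_RATE_HZ = 6.25       # assumed: LM tokens per second of audio
MAX_SEQUENCE_TOKENS = 8192    # from the table
INPUT_SAMPLE_RATE = 16_000    # AudioVAE V2 input rate
OUTPUT_SAMPLE_RATE = 48_000   # AudioVAE V2 output rate

seconds_per_token = 1 / LM_TOKEN_RATE_HZ
input_samples_per_token = INPUT_SAMPLE_RATE / LM_TOKEN_RATE_HZ
output_samples_per_token = OUTPUT_SAMPLE_RATE / LM_TOKEN_RATE_HZ
# Upper bound only: assumes every position in the sequence holds audio.
max_audio_seconds = MAX_SEQUENCE_TOKENS / LM_TOKEN_RATE_HZ


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """RTF = synthesis time / audio duration, so time = duration * RTF."""
    return audio_seconds * rtf


print(f"{seconds_per_token:.2f} s of audio per LM token")
print(f"{input_samples_per_token:.0f} input / {output_samples_per_token:.0f} output samples per token")
print(f"at most {max_audio_seconds / 60:.1f} min of audio per sequence")
print(f"10 s of audio: {synthesis_seconds(10, 0.30):.1f} s standard, "
      f"{synthesis_seconds(10, 0.13):.1f} s with Nano-vLLM")
```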
## Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.

See the [GitHub repo](https://github.com/OpenBMB/VoxCPM#-performance) for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).

## Fine-tuning

VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:

```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```

See the [Fine-tuning Guide](https://voxcpm.readthedocs.io/en/latest/finetuning/finetune.html) for full instructions.

## Limitations

- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
- Performance varies across languages depending on training data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Using the model for impersonation, fraud, or disinformation is **strictly forbidden**. AI-generated content should be clearly labeled.

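
The first limitation suggests generating a few times and keeping the best result. A minimal sketch of that loop follows; `generate_candidates` is a hypothetical helper that accepts any callable, and the stand-in generator exists only so the example runs without a model.

```python
import random


def generate_candidates(generate_fn, text, attempts=3):
    """Call `generate_fn(text)` `attempts` times and collect the outputs."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return [generate_fn(text) for _ in range(attempts)]


def fake_generate(text):
    """Stand-in for a model call; returns a random number, not audio."""
    return random.random()


candidates = generate_candidates(fake_generate, "(cheerful tone)Hello!")
print(len(candidates), "candidates generated")

# With a loaded model, as in Quick Start:
# candidates = generate_candidates(
#     lambda t: model.generate(text=t, cfg_value=2.0, inference_timesteps=10),
#     "(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
# )
# for i, wav in enumerate(candidates):
#     sf.write(f"candidate_{i}.wav", wav, model.tts_model.sample_rate)
```

Each candidate can then be auditioned and the preferred one kept.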
## Citation

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

## License

Released under the [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.
| 227 | |