---
license: other
language:
- en
- zh
- de
- ko
pipeline_tag: text-to-speech
library_name: transformers
---

# Higgs Audio V2: Redefining Expressiveness in Audio Generation

<div align="center" style="display: flex; justify-content: center; margin-top: 10px; flex-wrap: wrap; gap: 8px;">
<a href="https://boson.ai/blog/higgs-audio-v2"><img src='https://img.shields.io/badge/🚀-Launch Blogpost-228B22' style="margin-right: 5px;"></a>
<a href="https://github.com/boson-ai/higgs-audio"><img src="https://img.shields.io/badge/💻-Github%20Repo-9C276A" style="margin-right: 5px;"></a>
<a href="https://huggingface.co/spaces/smola/higgs_audio_v2"><img src="https://img.shields.io/badge/🎮-HF%20Space%20Playground-8A2BE2" style="margin-right: 5px;"></a>
<a href="https://huggingface.co/bosonai/higgs-audio-v2-tokenizer"><img src="https://img.shields.io/badge/🎧-Audio%20Tokenizer-6A5ACD.svg" style="margin-right: 5px;"></a>
</div>

Check our open-source repository https://github.com/boson-ai/higgs-audio for more details!

We are open-sourcing Higgs Audio v2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data.
Despite having no post-training or fine-tuning, Higgs Audio v2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.

On [EmergentTTS-Eval](https://github.com/boson-ai/emergenttts-eval-public), the model achieves win rates of **75.7%** and **55.7%** over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and Emotional Speech Dataset (ESD). Moreover, the model demonstrates capabilities rarely seen in previous systems, including automatic prosody adaptation during narration, zero-shot generation of natural multi-speaker dialogues in multiple languages, melodic humming with the cloned voice, and simultaneous generation of speech and background music.

<p>
<img src="./emergent-tts-emotions-win-rate.png" width=900>
</p>

Here's the demo video that shows some of its emergent capabilities (remember to unmute):

<div align="left">
<video width="95%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/64fa072a52e82dd432460767/bjbWGg1IKoMtWXnl0Od8G.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>

Here's another demo video that showcases the model's multilingual capability and how it enables live translation (remember to unmute):

<div align="left">
<video width="95%" controls>
<source src="https://cdn-uploads.huggingface.co/production/uploads/64fa072a52e82dd432460767/9cN-ky02GzmUgogsIh1Wg.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>

## Technical Details

<p>
<img src="./higgs_audio_v2_architecture_combined.png" width=900>
</p>

Higgs Audio v2 adopts the "generation variant" depicted in the architecture figure above. Its strong performance is driven by three key technical innovations:

- We developed an automated annotation pipeline that leverages multiple ASR models, sound event classification models, and our in-house audio understanding model. Using this pipeline, we cleaned and annotated 10 million hours of audio data, which we refer to as AudioVerse. The in-house understanding model is fine-tuned on top of Higgs Audio v1 Understanding, which adopts the "understanding variant" shown in the architecture figure.
- We trained a unified audio tokenizer from scratch that captures both semantic and acoustic features.
- We proposed the DualFFN architecture, which enhances the LLM's ability to model acoustic tokens with minimal computational overhead.

### Audio Tokenizer

<p>
<img src="./higgs_audio_tokenizer_architecture.png" width=900>
</p>

We introduce a new discretized audio tokenizer that runs at just 25 frames per second while maintaining, or even improving, audio quality compared to tokenizers with twice the bitrate.
Our model is the first to be trained on 24 kHz data covering speech, music, and sound events in one unified system.
It also uses a simple non-diffusion encoder/decoder for fast batch inference, and it achieves state-of-the-art performance in semantic and acoustic evaluations.
Check https://huggingface.co/bosonai/higgs-audio-v2-tokenizer for more information about the tokenizer.
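
As a quick illustration of what the 25 Hz frame rate implies for sequence length, the back-of-the-envelope arithmetic below only uses the figures quoted above; the number of codebooks per frame is intentionally left out, since it depends on the tokenizer configuration.

```python
# At 24 kHz input and 25 tokenizer frames per second, each discrete frame
# covers 960 waveform samples, i.e. 40 ms of audio.
sample_rate_hz = 24_000
frames_per_second = 25

samples_per_frame = sample_rate_hz // frames_per_second  # 960
frame_duration_ms = 1_000 / frames_per_second            # 40.0 ms

# A 10-second clip is therefore represented by 250 frames along the time axis.
clip_seconds = 10
print(samples_per_frame, frame_duration_ms, clip_seconds * frames_per_second)
```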

### Model Architecture -- Dual FFN

Higgs Audio v2 is built on top of [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B). To enhance the model's ability to process audio tokens,
we incorporate the "DualFFN" architecture as an audio adapter.
DualFFN acts as an audio-specific expert, boosting the LLM's performance with minimal computational overhead.
Our implementation preserves 91% of the original LLM's training speed with the inclusion of DualFFN, which has 2.2B parameters.
Thus, the total number of parameters for Higgs Audio v2 is 3.6B (LLM) + 2.2B (Audio Dual FFN), and it has the same training / inference FLOPs as Llama-3.2-3B.
Ablation studies show that the model equipped with DualFFN consistently outperforms its counterpart in terms of word error rate (WER) and speaker similarity.
See [our architecture blog](https://github.com/boson-ai/higgs-audio/blob/main/tech_blogs/ARCHITECTURE_BLOG.md) for more information.
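
To make the routing idea concrete, here is a minimal, hypothetical PyTorch sketch of a dual-FFN block: text tokens pass through the original FFN while audio tokens take an audio-specific FFN. It only illustrates the concept; the module names, shapes, and token-type mask are assumptions, not the actual Higgs Audio implementation.

```python
import torch
import torch.nn as nn


class DualFFNBlock(nn.Module):
    """Illustrative only: route text tokens through the original LLM FFN and
    audio tokens through a separate, audio-specific FFN ("audio expert")."""

    def __init__(self, text_ffn: nn.Module, audio_ffn: nn.Module):
        super().__init__()
        self.text_ffn = text_ffn    # the LLM's original feed-forward network
        self.audio_ffn = audio_ffn  # additional FFN applied to audio tokens only

    def forward(self, hidden_states: torch.Tensor, audio_token_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        # audio_token_mask: (batch, seq_len), True where the token is an audio token
        output = torch.empty_like(hidden_states)
        output[~audio_token_mask] = self.text_ffn(hidden_states[~audio_token_mask])
        output[audio_token_mask] = self.audio_ffn(hidden_states[audio_token_mask])
        return output
```

In the full model the audio-specific FFN would sit next to the shared attention inside each decoder layer; the sketch above only captures the token-routing step.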

## Evaluation

Here is the performance of Higgs Audio v2 on four benchmarks: [Seed-TTS Eval](https://github.com/BytedanceSpeech/seed-tts-eval), [Emotional Speech Dataset (ESD)](https://paperswithcode.com/dataset/esd), [EmergentTTS-Eval](https://arxiv.org/abs/2505.23009), and our Multi-speaker Eval.

#### Seed-TTS Eval & ESD

We prompt Higgs Audio v2 with the reference text, reference audio, and target text for zero-shot TTS. We use the standard evaluation metrics from Seed-TTS Eval and ESD.

| Model                      | SeedTTS-Eval WER ↓ | SeedTTS-Eval SIM ↑ | ESD WER ↓ | ESD SIM (emo2vec) ↑ |
|----------------------------|--------------------|--------------------|-----------|---------------------|
| Cosyvoice2                 | 2.28               | 65.49              | 2.71      | 80.48               |
| Qwen2.5-omni†              | 2.33               | 64.10              | -         | -                   |
| ElevenLabs Multilingual V2 | **1.43**           | 50.00              | 1.66      | 65.87               |
| Higgs Audio v1             | 2.18               | 66.27              | **1.49**  | 82.84               |
| Higgs Audio v2 (base)      | 2.44               | **67.70**          | 1.78      | **86.13**           |

#### EmergentTTS-Eval ("Emotions" and "Questions")

Following the [EmergentTTS-Eval paper](https://arxiv.org/abs/2505.23009), we report the win rate over "gpt-4o-mini-tts" with the "alloy" voice. Results for Higgs Audio v2 are obtained with the "belinda" voice. The judge model is Gemini 2.5 Pro.

| Model | Emotions (%) ↑ | Questions (%) ↑ |
|------------------------------------|--------------|----------------|
| Higgs Audio v2 (base) | **75.71%** | **55.71%** |
| [gpt-4o-audio-preview†](https://platform.openai.com/docs/models/gpt-4o-audio-preview) | 61.64% | 47.85% |
| [Hume.AI](https://www.hume.ai/research) | 61.60% | 43.21% |
| **BASELINE:** [gpt-4o-mini-tts](https://platform.openai.com/docs/models/gpt-4o-mini-tts) | 50.00% | 50.00% |
| [Qwen 2.5 Omni†](https://github.com/QwenLM/Qwen2.5-Omni) | 41.60% | 51.78% |
| [minimax/speech-02-hd](https://replicate.com/minimax/speech-02-hd) | 40.86% | 47.32% |
| [ElevenLabs Multilingual v2](https://elevenlabs.io/blog/eleven-multilingual-v2) | 30.35% | 39.46% |
| [DeepGram Aura-2](https://deepgram.com/learn/introducing-aura-2-enterprise-text-to-speech) | 29.28% | 48.21% |
| [Sesame csm-1B](https://github.com/SesameAILabs/csm) | 15.96% | 31.78% |

<sup><sub>'†' means using the strong-prompting method described in the paper.</sub></sup>

#### Multi-speaker Eval

We also designed a multi-speaker evaluation benchmark to assess Higgs Audio v2's capability for multi-speaker dialog generation. The benchmark contains three subsets:

- `two-speaker-conversation`: 1000 synthetic dialogues involving two speakers. We fix two reference audio clips to evaluate the model's ability to clone both voices across dialogues of 4 to 10 turns between two randomly chosen personas.
- `small talk (no ref)`: 250 synthetic dialogues curated in the same way as above, but characterized by short utterances and a limited number of turns (4–6). We do not fix reference audio in this case; this set is designed to evaluate the model's ability to automatically assign appropriate voices to speakers.
- `small talk (ref)`: 250 synthetic dialogues similar to the above, but with even shorter utterances, since this set is meant to include reference clips in its context, similar to `two-speaker-conversation`.

We report the word error rate (WER) and the geometric mean of intra-speaker similarity and inter-speaker dissimilarity on these three subsets. In addition to Higgs Audio v2, we also evaluated [MoonCast](https://github.com/jzq2000/MoonCast) and [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626), two of the most popular open-source models capable of multi-speaker dialog generation.
Results are summarized in the following table. We were not able to run [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626) on the `two-speaker-conversation` subset due to its strict limits on utterance and output-audio length.

|                                                 | two-speaker-conversation |                     | small talk (ref) |                     | small talk (no ref) |                     |
| ----------------------------------------------- | ------------------------ | ------------------- | ---------------- | ------------------- | ------------------- | ------------------- |
|                                                 | WER ↓                    | Mean Sim & Dissim ↑ | WER ↓            | Mean Sim & Dissim ↑ | WER ↓               | Mean Sim & Dissim ↑ |
| [MoonCast](https://github.com/jzq2000/MoonCast) | 38.77                    | 46.02               | **8.33**         | 63.68               | 24.65               | 53.94               |
| [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626) | \-     | \-                  | 17.62            | 63.15               | 19.46               | **61.14**           |
| Higgs Audio v2 (base)                           | **18.88**                | **51.95**           | 11.89            | **67.92**           | **14.65**           | 55.28               |
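
For reference, here is a minimal sketch of how the combined speaker metric above can be computed as the geometric mean of intra-speaker similarity and inter-speaker dissimilarity; the function name and the 0-100 scale are assumptions for illustration.

```python
import math

# Hypothetical helper: combine intra-speaker similarity with inter-speaker
# dissimilarity (both assumed to be on a 0-100 scale) via their geometric mean.
def combined_speaker_score(intra_speaker_sim: float, inter_speaker_dissim: float) -> float:
    return math.sqrt(intra_speaker_sim * inter_speaker_dissim)

print(round(combined_speaker_score(70.0, 40.0), 2))  # 52.92
```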

## Usage

### Transformers 🤗

Higgs Audio V2 is supported natively in `transformers`: [see the doc](https://huggingface.co/docs/transformers/en/model_doc/higgs_audio_v2).

```bash
uv pip install "transformers>=5.3.0"
```

<details>
<summary>Single-speaker smart voice</summary>

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_single_speaker.wav")
```

</details>

<details>
<summary>Multi-speaker smart voice</summary>

Use `[SPEAKER*]` tags to generate a multi-speaker dialogue. Speaker characteristics are described in the `scene` role.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

system_message = """You are an AI assistant designed to convert text into speech.
If the user's message includes a [SPEAKER*] tag, do not read out the tag and generate speech for the following text, using the specified voice.
If no speaker tag is present, select a suitable voice on your own."""

user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": system_message}],
    },
    {
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0: feminine"},
            {"type": "text", "text": "SPEAKER1: masculine"},
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker.wav")
```

</details>

<details>
<summary>Zero-shot voice cloning</summary>

Clone a voice by providing a reference audio in the conversation history.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_voice_cloning.wav")
```

</details>

<details>
<summary>Multi-speaker voice cloning</summary>

Clone multiple voices by providing reference audio clips in the `scene` role.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0:"},
            {
                "type": "audio",
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav",
            },
            {"type": "text", "text": "SPEAKER1:"},
            {
                "type": "audio",
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac",
            },
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker_cloning.wav")
```

</details>

<details>
<summary>Batched inference</summary>

Process multiple conversations in a single forward pass.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation1 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

conversation2 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " It's super important to assess fairly the fact that our former model is over. And this is not a question of adjustment. This is not the same world, 2024, 2025. And on top of that, we are making the same mistakes, on top of the key elements I mentioned. We are over-regulating and under-investing. So just if, in the two to three years to come, if we follow our classical agenda, we will be out of the market. I have no doubts.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/macron.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Hey, here is a clone from the given voice."}],
    },
]

inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, ["output_batched_1.wav", "output_batched_2.wav"])
```

</details>

<details>
<summary>Training</summary>

By default, the model does not load the text language modeling head to save memory (~1.5GiB reduction), as it's not required for generation. When training, set `use_text_head=True` to compute loss on text tokens.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto", use_text_head=True)

conversation1 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
]

conversation2 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " I would imagine so. A wand with a dragon heartstring core is capable of dazzling magic, and the bond between you and your wand should only grow stronger. Do not be surprised at your new wand's ability to perceive your intentions, particularly in a moment of need",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/broom_salesman.wav",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
    output_labels=True,
).to(model.device)

outputs = model(**inputs)
outputs.loss.backward()
```

</details>

### Original codebase

You first need to install the [higgs-audio](https://github.com/boson-ai/higgs-audio) package:

```bash
git clone https://github.com/boson-ai/higgs-audio.git

cd higgs-audio
python3 -m venv higgs_audio_env
source higgs_audio_env/bin/activate
pip install -r requirements.txt
pip install -e .
```

Afterwards, run the following Python snippet to convert text to speech.

```python
from boson_multimodal.serve.serve_engine import HiggsAudioServeEngine, HiggsAudioResponse
from boson_multimodal.data_types import ChatMLSample, Message

import torch
import torchaudio

MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"
AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"

system_prompt = (
    "Generate audio following instruction.\n\n<|scene_desc_start|>\nAudio is recorded from a quiet room.\n<|scene_desc_end|>"
)

messages = [
    Message(
        role="system",
        content=system_prompt,
    ),
    Message(
        role="user",
        content="The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
    ),
]
device = "cuda" if torch.cuda.is_available() else "cpu"

serve_engine = HiggsAudioServeEngine(MODEL_PATH, AUDIO_TOKENIZER_PATH, device=device)

output: HiggsAudioResponse = serve_engine.generate(
    chat_ml_sample=ChatMLSample(messages=messages),
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    top_k=50,
    stop_strings=["<|end_of_text|>", "<|eot_id|>"],
)
torchaudio.save("output.wav", torch.from_numpy(output.audio)[None, :], output.sampling_rate)
```

You can also check https://github.com/boson-ai/higgs-audio/tree/main/examples for more example scripts.

## License

See [LICENSE](./LICENSE)