---
license: other
language:
- en
- zh
- de
- ko
pipeline_tag: text-to-speech
library_name: transformers
---

# Higgs Audio V2: Redefining Expressiveness in Audio Generation

<div align="center" style="display: flex; justify-content: center; margin-top: 10px; flex-wrap: wrap; gap: 8px;">
  <a href="https://boson.ai/blog/higgs-audio-v2"><img src='https://img.shields.io/badge/🚀-Launch Blogpost-228B22' style="margin-right: 5px;"></a>
  <a href="https://github.com/boson-ai/higgs-audio"><img src="https://img.shields.io/badge/💻-Github%20Repo-9C276A" style="margin-right: 5px;"></a>
  <a href="https://huggingface.co/spaces/smola/higgs_audio_v2"><img src="https://img.shields.io/badge/🎮-HF%20Space%20Playground-8A2BE2" style="margin-right: 5px;"></a>
  <a href="https://huggingface.co/bosonai/higgs-audio-v2-tokenizer"><img src="https://img.shields.io/badge/🎧-Audio%20Tokenizer-6A5ACD.svg" style="margin-right: 5px;"></a>
</div>

Check our open-source repository https://github.com/boson-ai/higgs-audio for more details!

We are open-sourcing Higgs Audio v2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data.
Despite having no post-training or fine-tuning, Higgs Audio v2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.

On [EmergentTTS-Eval](https://github.com/boson-ai/emergenttts-eval-public), the model achieves win rates of **75.7%** and **55.7%** over "gpt-4o-mini-tts" on the "Emotions" and "Questions" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and Emotional Speech Dataset (ESD). Moreover, the model demonstrates capabilities rarely seen in previous systems, including automatic prosody adaptation during narration, zero-shot generation of natural multi-speaker dialogues in multiple languages, melodic humming with the cloned voice, and simultaneous generation of speech and background music.

<p>
<img src="./emergent-tts-emotions-win-rate.png" width=900>
</p>

Here's the demo video that shows some of its emergent capabilities (remember to unmute):

<div align="left">
<video width="95%" controls>
        <source src="https://cdn-uploads.huggingface.co/production/uploads/64fa072a52e82dd432460767/bjbWGg1IKoMtWXnl0Od8G.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>

Here's another demo video that showcases the model's multilingual capability and how it enables live translation (remember to unmute):

<div align="left">
<video width="95%" controls>
        <source src="https://cdn-uploads.huggingface.co/production/uploads/64fa072a52e82dd432460767/9cN-ky02GzmUgogsIh1Wg.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>
</div>

## Technical Details

<p>
<img src="./higgs_audio_v2_architecture_combined.png" width=900>
</p>

Higgs Audio v2 adopts the "generation variant" depicted in the architecture figure above. Its strong performance is driven by three key technical innovations:

- We developed an automated annotation pipeline that leverages multiple ASR models, sound event classification models, and our in-house audio understanding model. Using this pipeline, we cleaned and annotated 10 million hours of audio data, which we refer to as AudioVerse. The in-house understanding model is fine-tuned on top of Higgs Audio v1 Understanding, which adopts the "understanding variant" shown in the architecture figure.
- We trained a unified audio tokenizer from scratch that captures both semantic and acoustic features.
- We proposed the DualFFN architecture, which enhances the LLM’s ability to model acoustic tokens with minimal computational overhead.

### Audio Tokenizer

<p>
<img src="./higgs_audio_tokenizer_architecture.png" width=900>
</p>

We introduce a new discretized audio tokenizer that runs at just 25 frames per second while matching or even improving audio quality compared to tokenizers with twice the bitrate.
Our tokenizer is the first to train on 24 kHz data covering speech, music, and sound events in one unified system.
It also uses a simple non-diffusion encoder/decoder for fast, batched inference, and it achieves state-of-the-art performance in semantic and acoustic evaluations.
Check https://huggingface.co/bosonai/higgs-audio-v2-tokenizer for more information about the tokenizer.
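
The two numbers quoted above (25 frames per second, 24 kHz audio) are enough for some quick arithmetic. The sketch below is only illustrative: the helper names are made up for this example, and the assumption that one generation step yields one audio frame is not confirmed by this README.

```python
import math

SAMPLE_RATE_HZ = 24_000  # sampling rate stated above
FRAME_RATE_HZ = 25       # tokenizer frame rate stated above


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of waveform samples summarized by one tokenizer frame."""
    return sample_rate_hz // frame_rate_hz


def frames_for_duration(seconds: float, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of frames needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * frame_rate_hz)


def seconds_for_frames(num_frames: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Audio duration covered by `num_frames` frames."""
    return num_frames / frame_rate_hz


print(samples_per_frame())       # 960 samples per frame
print(frames_for_duration(10))   # 250 frames for 10 seconds
# Assumption: if one generation step produces one frame, a budget of
# 1000 steps corresponds to roughly 40 seconds of audio.
print(seconds_for_frames(1000))  # 40.0
```

Under that assumption, this kind of estimate can help pick a generation budget for a target clip length.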

### Model Architecture -- Dual FFN

Higgs Audio v2 is built on top of [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B). To enhance the model’s ability to process audio tokens,
we incorporate the "DualFFN" architecture as an audio adapter.
DualFFN acts as an audio-specific expert, boosting the LLM's performance with minimal computational overhead.
Our implementation preserves 91% of the original LLM’s training speed with the inclusion of DualFFN, which has 2.2B parameters.
Thus, the total number of parameters for Higgs Audio v2 is 3.6B (LLM) + 2.2B (Audio Dual FFN), and it has the same training / inference FLOPs as Llama-3.2-3B.
Our ablation study shows that the model equipped with DualFFN consistently outperforms its counterpart without DualFFN in terms of word error rate (WER) and speaker similarity.
See [our architecture blog](https://github.com/boson-ai/higgs-audio/blob/main/tech_blogs/ARCHITECTURE_BLOG.md) for more information.
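
One way to read "audio-specific expert" together with "same FLOPs" is that every token passes through exactly one feed-forward block, chosen by modality. The toy sketch below illustrates only that routing idea. The two FFN functions are placeholders, the names are hypothetical, and the real layer is described in the architecture blog.

```python
from typing import List, Sequence

Vector = List[float]


def text_ffn(h: Vector) -> Vector:
    # Placeholder for the original LLM feed-forward block (toy: scale by 2).
    return [2.0 * x for x in h]


def audio_ffn(h: Vector) -> Vector:
    # Placeholder for the audio-specific feed-forward block (toy: add 1).
    return [x + 1.0 for x in h]


def dual_ffn(hidden: Sequence[Vector], is_audio: Sequence[bool]) -> List[Vector]:
    """Send each token through exactly one FFN, selected by its modality flag."""
    if len(hidden) != len(is_audio):
        raise ValueError("hidden and is_audio must have the same length")
    return [audio_ffn(h) if audio else text_ffn(h) for h, audio in zip(hidden, is_audio)]


hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_audio = [False, True, True]  # one text token followed by two audio tokens
routed = dual_ffn(hidden, is_audio)
print(routed)  # [[2.0, 4.0], [4.0, 5.0], [6.0, 7.0]]
```

Because each token visits a single FFN, the second block adds parameters (3.6B + 2.2B = 5.8B in total) without adding per-token FFN compute, which is consistent with the FLOPs statement above.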

## Evaluation

Here's the performance of Higgs Audio v2 on four benchmarks: [Seed-TTS Eval](https://github.com/BytedanceSpeech/seed-tts-eval), [Emotional Speech Dataset (ESD)](https://paperswithcode.com/dataset/esd), [EmergentTTS-Eval](https://arxiv.org/abs/2505.23009), and Multi-speaker Eval.

#### Seed-TTS Eval & ESD

We prompt Higgs Audio v2 with the reference text, reference audio, and target text for zero-shot TTS. We use the standard evaluation metrics from Seed-TTS Eval and ESD.

| | SeedTTS-Eval | | ESD | |
|------------------------------|--------|--------|---------|-------------------|
| | WER ↓ | SIM ↑ | WER ↓ | SIM (emo2vec) ↑ |
| Cosyvoice2 | 2.28 | 65.49 | 2.71 | 80.48 |
| Qwen2.5-omni† | 2.33 | 64.10 | - | - |
| ElevenLabs Multilingual V2 | **1.43** | 50.00 | 1.66 | 65.87 |
| Higgs Audio v1 | 2.18 | 66.27 | **1.49** | 82.84 |
| Higgs Audio v2 (base) | 2.44 | **67.70** | 1.78 | **86.13** |
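
For readers unfamiliar with the WER columns: word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a plain implementation of that definition, assuming the table reports percentages. It is not the benchmark's own script, which also involves ASR transcription and text normalization that are omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length, as a percentage."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("rises" -> "sets") and one deletion ("the") over 6 reference words.
print(word_error_rate("the sun rises in the east", "the sun sets in east"))
```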

#### EmergentTTS-Eval ("Emotions" and "Questions")

Following the [EmergentTTS-Eval Paper](https://arxiv.org/abs/2505.23009), we report the win rate over "gpt-4o-mini-tts" with the "alloy" voice. Results for Higgs Audio v2 are obtained with the voice of "belinda". The judge model is Gemini 2.5 Pro.

| Model | Emotions (%) ↑ | Questions (%) ↑ |
|------------------------------------|--------------|----------------|
| Higgs Audio v2 (base) | **75.71%** | **55.71%** |
| [gpt-4o-audio-preview†](https://platform.openai.com/docs/models/gpt-4o-audio-preview) | 61.64% | 47.85% |
| [Hume.AI](https://www.hume.ai/research) | 61.60% | 43.21% |
| **BASELINE:** [gpt-4o-mini-tts](https://platform.openai.com/docs/models/gpt-4o-mini-tts) | 50.00% | 50.00% |
| [Qwen 2.5 Omni†](https://github.com/QwenLM/Qwen2.5-Omni) | 41.60% | 51.78% |
| [minimax/speech-02-hd](https://replicate.com/minimax/speech-02-hd) | 40.86% | 47.32% |
| [ElevenLabs Multilingual v2](https://elevenlabs.io/blog/eleven-multilingual-v2) | 30.35% | 39.46% |
| [DeepGram Aura-2](https://deepgram.com/learn/introducing-aura-2-enterprise-text-to-speech) | 29.28% | 48.21% |
| [Sesame csm-1B](https://github.com/SesameAILabs/csm) | 15.96% | 31.78% |

<sup><sub>'†' means using the strong-prompting method described in the paper.</sub></sup>
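
A win rate against a fixed baseline can be computed from per-sample judge verdicts. The sketch below assumes that a tie counts as half a win, which is a common convention and would explain why the baseline sits at 50%. The exact tie handling should be checked against the paper, and the function name is hypothetical.

```python
def win_rate(wins: int, ties: int, losses: int) -> float:
    """Win rate in percent over a baseline, counting each tie as half a win (assumption)."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("at least one comparison is required")
    return 100.0 * (wins + 0.5 * ties) / total


print(win_rate(wins=7, ties=2, losses=1))   # 80.0
print(win_rate(wins=0, ties=10, losses=0))  # 50.0, a system indistinguishable from the baseline
```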

#### Multi-speaker Eval

We also designed a multi-speaker evaluation benchmark to evaluate the capability of Higgs Audio v2 for multi-speaker dialog generation. The benchmark contains three subsets:

- `two-speaker-conversation`: 1000 synthetic dialogues involving two speakers. We fix two reference audio clips to evaluate the model's ability to clone two voices at once, in dialogues of 4 to 10 turns between two randomly chosen personas.
- `small talk (no ref)`: 250 synthetic dialogues curated in the same way as above, but characterized by short utterances and a limited number of turns (4–6). We do not fix reference audio clips in this case; this set is designed to evaluate the model's ability to automatically assign appropriate voices to speakers.
- `small talk (ref)`: 250 synthetic dialogues similar to the above, but with even shorter utterances, as this set is meant to include reference clips in its context, similar to `two-speaker-conversation`.

We report the word error rate (WER) and the geometric mean of intra-speaker similarity and inter-speaker dissimilarity on these three subsets. In addition to Higgs Audio v2, we also evaluated [MoonCast](https://github.com/jzq2000/MoonCast) and [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626), two of the most popular open-source models capable of multi-speaker dialog generation.
Results are summarized in the following table. We were not able to run [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626) on our `two-speaker-conversation` subset due to its strict limitation on the length of the utterances and output audio.

| | two-speaker-conversation | | small talk (ref) | | small talk (no ref) | |
| --- | --- | --- | --- | --- | --- | --- |
| | WER ↓ | Mean Sim & Dis-sim ↑ | WER ↓ | Mean Sim & Dis-sim ↑ | WER ↓ | Mean Sim & Dis-sim ↑ |
| [MoonCast](https://github.com/jzq2000/MoonCast) | 38.77 | 46.02 | **8.33** | 63.68 | 24.65 | 53.94 |
| [nari-labs/Dia-1.6B-0626](https://huggingface.co/nari-labs/Dia-1.6B-0626) | \- | \- | 17.62 | 63.15 | 19.46 | **61.14** |
| Higgs Audio v2 (base) | **18.88** | **51.95** | 11.89 | **67.92** | **14.65** | 55.28 |
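
The "Mean Sim & Dis-sim" columns combine two quantities with a geometric mean, so a model scores well only if each speaker stays consistent and the two speakers remain distinguishable. The sketch below shows one possible form of that score. Defining dissimilarity as one minus similarity, restricting similarities to [0, 1], and scaling to 0-100 are all assumptions made for illustration, and the function name is hypothetical.

```python
import math


def sim_dissim_score(intra_speaker_sim: float, inter_speaker_sim: float) -> float:
    """Geometric mean of intra-speaker similarity and inter-speaker dissimilarity (0-100)."""
    for value in (intra_speaker_sim, inter_speaker_sim):
        if not 0.0 <= value <= 1.0:
            raise ValueError("similarities are assumed to lie in [0, 1]")
    inter_speaker_dissim = 1.0 - inter_speaker_sim  # assumption: dissimilarity = 1 - similarity
    return 100.0 * math.sqrt(intra_speaker_sim * inter_speaker_dissim)


# Consistent voices that are also easy to tell apart score high.
print(sim_dissim_score(0.5, 0.5))  # 50.0
# Two speakers that sound identical score zero, however consistent each one is.
print(sim_dissim_score(1.0, 1.0))  # 0.0
```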

## Usage

### Transformers 🤗

Higgs Audio V2 is supported natively in `transformers`: [see the doc](https://huggingface.co/docs/transformers/en/model_doc/higgs_audio_v2).

```bash
uv pip install "transformers>=5.3.0"
```

<details>
<summary>Single-speaker smart voice</summary>

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_single_speaker.wav")
```

</details>

<details>
<summary>Multi-speaker smart voice</summary>

Use `[SPEAKER*]` tags to generate a multi-speaker dialogue. Speaker characteristics are described in the `scene` role.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

system_message = """You are an AI assistant designed to convert text into speech.
If the user's message includes a [SPEAKER*] tag, do not read out the tag and generate speech for the following text, using the specified voice.
If no speaker tag is present, select a suitable voice on your own."""

# Each line starts with the tag of the speaker who says it.
user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": system_message}],
    },
    {
        # The scene role carries the recording conditions and one description per speaker.
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0: feminine"},
            {"type": "text", "text": "SPEAKER1: masculine"},
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker.wav")
```

</details>
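
When dialogues come from structured data, the `[SPEAKER*]` message and the per-speaker scene entries used in the multi-speaker example can be assembled programmatically. The helpers below are a minimal sketch. `format_dialogue` and `speaker_descriptions` are hypothetical names for this example and are not part of `transformers`.

```python
from typing import Dict, Iterable, List, Tuple


def format_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Render (speaker_id, text) turns as one '[SPEAKER<id>] text' line per turn."""
    lines = []
    for speaker_id, text in turns:
        if speaker_id < 0:
            raise ValueError("speaker_id must be non-negative")
        lines.append(f"[SPEAKER{speaker_id}] {text.strip()}")
    return "\n".join(lines)


def speaker_descriptions(descriptions: Dict[int, str]) -> List[dict]:
    """Build 'SPEAKER<id>: description' text items for the scene role, ordered by id."""
    return [
        {"type": "text", "text": f"SPEAKER{speaker_id}: {description}"}
        for speaker_id, description in sorted(descriptions.items())
    ]


user_message = format_dialogue([(0, "Hello there!"), (1, "Hi!")])
scene_items = speaker_descriptions({1: "masculine", 0: "feminine"})
print(user_message)
print(scene_items)
```

The resulting `user_message` string and `scene_items` list have the same shape as the hand-written values in the multi-speaker example.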

<details>
<summary>Zero-shot voice cloning</summary>

Clone a voice by providing a reference audio in the conversation history.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    # Reference pair: a transcript (user turn) and the matching audio (assistant turn).
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    # Target text to synthesize in the reference voice.
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_voice_cloning.wav")
```

</details>

<details>
<summary>Multi-speaker voice cloning</summary>

Clone multiple voices by providing reference audio clips in the `scene` role.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0:"},
            {
                "type": "audio",
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav",
            },
            {"type": "text", "text": "SPEAKER1:"},
            {
                "type": "audio",
                "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac",
            },
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker_cloning.wav")
```

</details>

<details>
<summary>Batched inference</summary>

Process multiple conversations in a single forward pass.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation1 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

conversation2 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " It's super important to assess fairly the fact that our former model is over. And this is not a question of adjustment. This is not the same world, 2024, 2025. And on top of that, we are making the same mistakes, on top of the key elements I mentioned. We are over-regulating and under-investing. So just if, in the two to three years to come, if we follow our classical agenda, we will be out of the market. I have no doubts.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/macron.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Hey, here is a clone from the given voice."}],
    },
]

# Pass a list of conversations to build one batch.
inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
# One output path per conversation in the batch.
processor.save_audio(decoded, ["output_batched_1.wav", "output_batched_2.wav"])
```

</details>

<details>
<summary>Training</summary>

To save memory (about 1.5 GiB), the model does not load the text language modeling head by default, since it is not required for generation. When training, set `use_text_head=True` to compute the loss on text tokens.

```python
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "bosonai/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
# use_text_head=True also loads the text language modeling head.
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto", use_text_head=True)

conversation1 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
]

conversation2 = [
    {"role": "system", "content": [{"type": "text", "text": "Generate audio following instruction."}]},
    {"role": "scene", "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}]},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " I would imagine so. A wand with a dragon heartstring core is capable of dazzling magic, and the bond between you and your wand should only grow stronger. Do not be surprised at your new wand's ability to perceive your intentions, particularly in a moment of need",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/broom_salesman.wav",
            }
        ],
    },
]

# output_labels=True makes the processor return training labels with the inputs.
inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
    output_labels=True,
).to(model.device)

outputs = model(**inputs)
outputs.loss.backward()
```

</details>

### Original codebase

First install the [higgs-audio](https://github.com/boson-ai/higgs-audio) codebase:

```bash
git clone https://github.com/boson-ai/higgs-audio.git

cd higgs-audio
python3 -m venv higgs_audio_env
source higgs_audio_env/bin/activate
pip install -r requirements.txt
pip install -e .
```

Afterwards, run the following Python snippet to convert text to speech.

```python
from boson_multimodal.serve.serve_engine import HiggsAudioServeEngine, HiggsAudioResponse
from boson_multimodal.data_types import ChatMLSample, Message

import torch
import torchaudio

MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"
AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"

# The scene description is embedded in the system prompt between special tags.
system_prompt = (
    "Generate audio following instruction.\n\n<|scene_desc_start|>\nAudio is recorded from a quiet room.\n<|scene_desc_end|>"
)

messages = [
    Message(
        role="system",
        content=system_prompt,
    ),
    Message(
        role="user",
        content="The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
    ),
]
device = "cuda" if torch.cuda.is_available() else "cpu"

serve_engine = HiggsAudioServeEngine(MODEL_PATH, AUDIO_TOKENIZER_PATH, device=device)

output: HiggsAudioResponse = serve_engine.generate(
    chat_ml_sample=ChatMLSample(messages=messages),
    max_new_tokens=1024,
    temperature=0.3,
    top_p=0.95,
    top_k=50,
    stop_strings=["<|end_of_text|>", "<|eot_id|>"],
)
# output.audio is a 1-D NumPy array; add a channel dimension before saving.
torchaudio.save("output.wav", torch.from_numpy(output.audio)[None, :], output.sampling_rate)
```

You can also check https://github.com/boson-ai/higgs-audio/tree/main/examples for more example scripts.
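
In the snippet above, the scene description is written by hand between the `<|scene_desc_start|>` and `<|scene_desc_end|>` tags. A small helper can build the same string from a list of scene lines. This is a minimal sketch: `build_system_prompt` is a hypothetical name, not a function from the higgs-audio codebase, and whether per-speaker lines are handled the same way there is an assumption to verify against the example scripts.

```python
from typing import Sequence

SCENE_START = "<|scene_desc_start|>"
SCENE_END = "<|scene_desc_end|>"


def build_system_prompt(
    scene_lines: Sequence[str],
    instruction: str = "Generate audio following instruction.",
) -> str:
    """Wrap non-empty scene lines in the scene-description tags used in the snippet above."""
    scene = "\n".join(line.strip() for line in scene_lines if line.strip())
    return f"{instruction}\n\n{SCENE_START}\n{scene}\n{SCENE_END}"


prompt = build_system_prompt(["Audio is recorded from a quiet room."])
print(prompt)
```

With the single scene line shown, the result is identical to the hand-written `system_prompt` string.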

## License

See [LICENSE](./LICENSE)