---
language:
- en
license: mit
library_name: transformers
tags:
- audio
- automatic-speech-recognition
- transformers.js
widget:
- example_title: LibriSpeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
---

# Distil-Whisper: distil-large-v3

Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

This is the third and final installment of the Distil-Whisper English series. It is the knowledge-distilled version of
OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3), the latest and most performant Whisper model
to date.

Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
**superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.

The result is a distilled model that performs to within 1% WER of large-v3 on long-form audio using both the sequential
and chunked algorithms, and outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster
than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.

| Model | Params / M | Rel. Latency | Short-Form WER / % | Sequential Long-Form WER / % | Chunked Long-Form WER / % |
|------------------------------------------------------------------------------|------------|--------------|--------------------|------------------------------|---------------------------|
| [large-v3](https://huggingface.co/openai/whisper-large-v3)                    | 1550       | 1.0          | 8.4                | 10.0                         | 11.0                      |
| **[distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)**  | **756**    | **6.3**      | **9.7**            | **10.8**                     | **10.9**                  |
| [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2)      | 756        | 5.8          | 10.1               | 15.6                         | 11.6                      |

Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
(Whisper.cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
You can expect significant performance gains by switching from previous Distil-Whisper checkpoints to distil-large-v3
when using these libraries. For convenience, the weights for the most popular libraries are already converted,
with instructions for getting started below.

## Table of Contents

1. [Transformers Usage](#transformers-usage)
    * [Short-Form Transcription](#short-form-transcription)
    * [Sequential Long-Form](#sequential-long-form)
    * [Chunked Long-Form](#chunked-long-form)
    * [Speculative Decoding](#speculative-decoding)
    * [Additional Speed and Memory Improvements](#additional-speed--memory-improvements)
2. [Library Integrations](#library-integrations)
    * [Whisper cpp](#whispercpp)
    * [Faster Whisper](#faster-whisper)
    * [OpenAI Whisper](#openai-whisper)
    * [Transformers.js](#transformersjs)
    * [Candle](#candle)
3. [Model Details](#model-details)
4. [License](#license)

## Transformers Usage

distil-large-v3 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset
from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
```

### Short-Form Transcription

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe short-form audio files (< 30-seconds) as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```
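
You can also pass a waveform that has already been loaded in Python. As a minimal sketch (the filename and the use of `soundfile` here are illustrative, not from the original card), the pipeline accepts a dictionary containing the raw array and its sampling rate:

```python
import soundfile as sf

# load a local file into a float array and its sampling rate (hypothetical filename)
audio_array, sampling_rate = sf.read("audio.wav")

# the ASR pipeline accepts a dict with the raw waveform and its sampling rate
result = pipe({"raw": audio_array, "sampling_rate": sampling_rate})
print(result["text"])
```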

For segment-level timestamps, pass the argument `return_timestamps=True` and return the `"chunks"` output:
```python
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```
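
For word-level timestamps, the same argument can instead be set to `"word"`. This is a short sketch of the standard Transformers option rather than something shown on the original card:

```python
# word-level timestamps: each entry in "chunks" corresponds to a single word
result = pipe(sample, return_timestamps="word")
print(result["chunks"])
```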

<details>

<summary> For more control over the generation parameters, use the model + processor API directly: </summary>

Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
for more details.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

input_features = input_features.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": False,
}

pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])

print(pred_text)
```
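
The prompting mentioned above can be sketched as follows; the prompt text is purely illustrative and the snippet assumes the `model`, `processor` and `input_features` defined in the example above:

```python
# hypothetical prompt: bias the transcription towards a particular spelling or context
prompt_ids = processor.get_prompt_ids("Mister Quilter", return_tensors="pt").to(device)

pred_ids = model.generate(input_features, prompt_ids=prompt_ids, max_new_tokens=128)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```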

</details>

### Sequential Long-Form

Unlike previous Distil-Whisper releases, distil-large-v3 is specifically designed to be compatible with OpenAI's sequential
long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds),
and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).

The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and latency is less of a consideration
2. You are transcribing **batches** of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate

If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm
described [below](#chunked-long-form). For a detailed explanation of the different algorithms, refer to Section 5 of
the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf).

The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class can be used to transcribe long audio files with the sequential algorithm as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

<details>

<summary> For more control over the generation parameters, use the model + processor API directly: </summary>

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
```

</details>

### Chunked Long-Form

distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
[Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
is optimal. To activate batching over long audio files, pass the argument `batch_size`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
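
The same pipeline also accepts a list of inputs, so several files can be transcribed in a single call. The filenames below are placeholders:

```python
# transcribe multiple local files in one batched call (hypothetical filenames)
results = pipe(["audio_1.mp3", "audio_2.mp3"])
print([result["text"] for result in results])
```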

### Speculative Decoding

distil-large-v3 is the first Distil-Whisper model that can be used as an assistant to Whisper large-v3 for [speculative decoding](https://huggingface.co/blog/whisper-speculative-decoding).
Speculative decoding mathematically ensures that exactly the same outputs as Whisper are obtained, while being 2 times faster.
This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.

In the following code snippet, we load the assistant Distil-Whisper model standalone, alongside the main Whisper pipeline. We then
specify it as the "assistant model" for generation:

```python
from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v3"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

For more details on speculative decoding, refer to the blog post [Speculative Decoding for 2x Faster Whisper Inference](https://huggingface.co/blog/whisper-speculative-decoding).

### Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference time and VRAM
requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
more efficient flash attention version.

#### Flash Attention 2

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
```
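
If you are unsure whether your environment supports Flash Attention 2, you can check programmatically. A minimal sketch using the utility shipped with Transformers:

```python
from transformers.utils import is_flash_attn_2_available

# True only if the flash-attn package is installed and the GPU architecture supports it
print(is_flash_attn_2_available())
```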

#### Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
whether you have a compatible PyTorch version, run the following Python code snippet:

```python
from transformers.utils import is_torch_sdpa_available

print(is_torch_sdpa_available())
```

If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/).

Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
`attn_implementation="sdpa"` as follows:

```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
```

For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).

#### Torch compile

Coming soon...

#### 4-bit and 8-bit Inference

Coming soon...

## Library Integrations

### Whisper.cpp

Distil-Whisper can be run with the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) package with the original
sequential long-form transcription algorithm. In a provisional benchmark on an M1 Mac, distil-large-v3 is over 5x faster
than Whisper large-v3, while performing to within 0.8% WER on long-form audio.

Steps for getting started:

1. Clone the Whisper.cpp repository:
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
```
2. Install the Hugging Face Hub Python package:
```bash
pip install --upgrade huggingface_hub
```
And download the GGML weights for distil-large-v3 using the following Python snippet:

```python
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id='distil-whisper/distil-large-v3-ggml', filename='ggml-distil-large-v3.bin', local_dir='./models')
```

Note that if you do not have a Python environment set up, you can also download the weights directly with `wget`:

```bash
wget https://huggingface.co/distil-whisper/distil-large-v3-ggml/resolve/main/ggml-distil-large-v3.bin -P ./models
```

3. Run inference using the provided sample audio:

```bash
make -j && ./main -m models/ggml-distil-large-v3.bin -f samples/jfk.wav
```

### Faster-Whisper

Faster-Whisper is a reimplementation of Whisper using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), a fast
inference engine for Transformer models.

First, install the Faster-Whisper package according to the [official instructions](https://github.com/SYSTRAN/faster-whisper#installation).
For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade git+https://github.com/SYSTRAN/faster-whisper datasets[audio]
```

The following code snippet loads the distil-large-v3 model and runs inference on an example file from the LibriSpeech ASR
dataset:

```python
import torch
from faster_whisper import WhisperModel
from datasets import load_dataset

# define our torch configuration
device = "cuda:0" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "float32"

# load model on GPU if available, else cpu
model = WhisperModel("distil-large-v3", device=device, compute_type=compute_type)

# load toy dataset for example
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[1]["audio"]["path"]

segments, info = model.transcribe(sample, beam_size=1)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```

To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:

```python
segments, info = model.transcribe("audio.mp3", beam_size=1)
```
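
Faster-Whisper can also return word-level timestamps. A minimal sketch using its `word_timestamps` argument (the filename is a placeholder):

```python
# request word-level timestamps alongside the segment text
segments, info = model.transcribe("audio.mp3", beam_size=1, word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```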

### OpenAI Whisper

To use the model in the original Whisper format, first ensure you have the [`openai-whisper`](https://pypi.org/project/openai-whisper/) package installed.
For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade openai-whisper datasets[audio]
```

The following code snippet demonstrates how to transcribe a sample file from the LibriSpeech dataset loaded using
🤗 Datasets:

```python
from huggingface_hub import hf_hub_download
from datasets import load_dataset
from whisper import load_model, transcribe

model_path = hf_hub_download(repo_id="distil-whisper/distil-large-v3-openai", filename="model.bin")
model = load_model(model_path)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]["path"]

pred_out = transcribe(model, audio=sample, language="en")
print(pred_out["text"])
```

Note that the model weights will be downloaded and saved to your cache the first time you run the example. Subsequently,
you can re-use the same example, and the weights will be loaded directly from your cache without having to download them
again.

To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:

```python
pred_out = transcribe(model, audio="audio.mp3", language="en")
```

The Distil-Whisper model can also be used with the OpenAI Whisper CLI. Refer to the [following instructions](https://huggingface.co/distil-whisper/distil-large-v3-openai#cli-usage)
for details.

### Transformers.js

Distil-Whisper can be run completely in your web browser with [Transformers.js](http://github.com/xenova/transformers.js):

1. Install Transformers.js from [NPM](https://www.npmjs.com/package/@xenova/transformers):

```bash
npm i @xenova/transformers
```

2. Import the library and perform inference with the pipeline API.

```js
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v3');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: " And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." }
```

Check out the online [Distil-Whisper Web Demo](https://huggingface.co/spaces/Xenova/distil-whisper-web) to try it out yourself.
As you'll see, it runs locally in your browser: no server required!

Refer to the Transformers.js [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)
for further information.

### Candle

Through an integration with Hugging Face [Candle](https://github.com/huggingface/candle/tree/main) 🕯️, Distil-Whisper is
available in the Rust library 🦀

Benefit from:
* Optimised CPU backend with optional MKL support for Linux x86 and Accelerate for Macs
* Metal support for efficiently running on Macs
* CUDA backend for efficiently running on GPUs, multiple GPU distribution via NCCL
* WASM support: run Distil-Whisper in a browser

Steps for getting started:
1. Install [`candle-core`](https://github.com/huggingface/candle/tree/main/candle-core) as explained [here](https://huggingface.github.io/candle/guide/installation.html)
2. Clone the `candle` repository locally:
```
git clone https://github.com/huggingface/candle.git
```
3. Enter the example directory for [Whisper](https://github.com/huggingface/candle/tree/main/candle-examples/examples/whisper):
```
cd candle/candle-examples/examples/whisper
```
4. Run an example:
```
cargo run --example whisper --release --features symphonia -- --model distil-large-v3
```
5. To specify your own audio file, add the `--input` flag:
```
cargo run --example whisper --release --features symphonia -- --model distil-large-v3 --input audio.wav
```

**Tip:** For compilation using Apple Metal, specify the `metal` feature when you run the example:
```
cargo run --example whisper --release --features="symphonia,metal" -- --model distil-large-v3
```

Note that if you encounter the error:
```
error: target `whisper` in package `candle-examples` requires the features: `symphonia`
Consider enabling them by passing, e.g., `--features="symphonia"`
```
you should clean your `cargo` installation:
```
cargo clean
```
and subsequently recompile:
```
cargo run --example whisper --release --features symphonia -- --model distil-large-v3
```

## Model Details

Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all
previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder
is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of
total inference time. Thus, to optimise for latency, the focus is on minimising the inference time of the decoder.

To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed.
The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training.
The student's decoder consists of a subset of the teacher decoder layers, which are initialised from maximally spaced layers.
The model is then trained on a weighted sum of the KL divergence and pseudo-label loss terms.
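
As an illustrative sketch of that objective (the exact weighting coefficients are reported in the paper, not here), the training loss can be written as a weighted sum of the two terms:

$$ \mathcal{L} = \alpha_{\mathrm{KL}} \, \mathcal{L}_{\mathrm{KL}} + \alpha_{\mathrm{PL}} \, \mathcal{L}_{\mathrm{PL}} $$

where \\( \mathcal{L}_{\mathrm{KL}} \\) is the KL divergence between the student and teacher next-token distributions, and \\( \mathcal{L}_{\mathrm{PL}} \\) is the cross-entropy loss against the pseudo-labelled transcriptions.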

<p align="center">
  <img src="https://huggingface.co/datasets/distil-whisper/figures/resolve/main/architecture.png?raw=true" width="600"/>
</p>

## Differences with distil-large-v2

Compared to previous versions of Distil-Whisper, distil-large-v3 is specifically designed to target the OpenAI sequential
long-form transcription algorithm. There are no architectural differences compared to distil-large-v2, other than the fact
that the model layers are initialised from the latest large-v3 model rather than the older large-v2 one. The differences lie
in the way the model was trained.

Previous Distil-Whisper models were trained on a mean input length of 7-seconds, whereas the original Whisper models were
pre-trained on 30-second inputs. During distillation, we shift the distribution of the model weights to the distribution
of our training data. If our training data contains shorter utterances (e.g. on average 7-seconds of audio instead of 30-seconds),
then the predicted distribution shifts to this shorter context length. At inference time, the optimal context window for
distil-large-v2 was an interpolation of these two values: 15-seconds. Beyond this time, the predictions for the distil-large-v2
model were largely inaccurate, particularly for the timestamp predictions. However, the sequential long-form algorithm
uses 30-second sliding windows for inference, with the window shifted according to the last predicted timestamp. Since the
last timestamp typically occurs after the 15-second mark, it was predicted with low accuracy, causing the long-form
transcription to often fail.

To preserve Whisper's ability to transcribe sliding 30-second windows, as is done with sequential decoding, we need to
ensure the context length of distil-large-v3 is also 30-seconds. This was primarily achieved with four strategies:

1. **Packing the audio samples in the training dataset to 30-seconds:** since the model is both pre-trained and distilled on audio data packed to 30-seconds, distil-large-v3 now operates on the same ideal context window as Whisper, predicting accurate timestamps up to and including 30-seconds.
2. **Freezing the decoder input embeddings:** we use the same input embedding representation as the original model, which is designed to handle longer context lengths than previous Distil-Whisper iterations.
3. **Using a longer maximum context length during training:** instead of training on a maximum target length of 128, we train on a maximum of 256. This helps distil-large-v3 transcribe 30-second segments where the number of tokens possibly exceeds 128.
4. **Appending prompt conditioning to 50% of the training samples:** enables the model to be used with the `condition_on_prev_tokens` argument, and context windows up to 448 tokens (a usage sketch is given below).
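
As an illustrative sketch (not from the original card), prompt conditioning can be switched on at inference time through the same long-form generation arguments used in the sequential long-form example above, reusing its `model`, `processor` and `inputs`:

```python
# enable conditioning on previously generated tokens during sequential long-form decoding
gen_kwargs = {
    "max_new_tokens": 448,
    "condition_on_prev_tokens": True,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```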

Further tricks were employed to improve the performance of distil-large-v3 under the sequential decoding
algorithm, which will be explained fully in an upcoming blog post.

## Evaluation

The following code snippet demonstrates how to evaluate the Distil-Whisper model on the LibriSpeech validation-clean
dataset with [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet), meaning no
audio data has to be downloaded to your local device.

First, we need to install the required packages, including 🤗 Datasets to stream and load the audio data, and 🤗 Evaluate to
perform the WER calculation:

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] evaluate jiwer
```

Evaluation can then be run end-to-end with the following example:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import load_dataset
from evaluate import load
import torch
from tqdm import tqdm

# define our torch configuration
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

# load the model + processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True, low_cpu_mem_usage=True)
model = model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# load the dataset with streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# define the evaluation metric
wer_metric = load("wer")

def inference(batch):
    # 1. Pre-process the audio data to log-mel spectrogram inputs
    audio = [sample["array"] for sample in batch["audio"]]
    input_features = processor(audio, sampling_rate=batch["audio"][0]["sampling_rate"], return_tensors="pt").input_features
    input_features = input_features.to(device, dtype=torch_dtype)

    # 2. Auto-regressively generate the predicted token ids
    pred_ids = model.generate(input_features, max_new_tokens=128)

    # 3. Decode the token ids to the final transcription
    batch["transcription"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
    batch["reference"] = batch["text"]
    return batch

# batch size 16 inference
dataset = dataset.map(function=inference, batched=True, batch_size=16)

all_transcriptions = []
all_references = []

# iterate over the dataset and run inference
for result in tqdm(dataset, desc="Evaluating..."):
    all_transcriptions.append(result["transcription"])
    all_references.append(result["reference"])

# normalize predictions and references
all_transcriptions = [processor.normalize(transcription) for transcription in all_transcriptions]
all_references = [processor.normalize(reference) for reference in all_references]

# compute the WER metric
wer = 100 * wer_metric.compute(predictions=all_transcriptions, references=all_references)
print(wer)
```

**Print Output:**
```
2.428920763531516
```

## Intended Use

Distil-Whisper is intended to be a drop-in replacement for Whisper large-v3 on English speech recognition. In particular, it
achieves comparable WER results over out-of-distribution (OOD) test data, while being 6x faster on both short and long-form audio.

## Data

Distil-Whisper is trained on 22,000 hours of audio data from nine open-source, permissively licensed speech datasets on the
Hugging Face Hub:

| Dataset                                                                                  | Size / h | Speakers | Domain                      | Licence         |
|------------------------------------------------------------------------------------------|----------|----------|-----------------------------|-----------------|
| [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech)               | 12,000   | unknown  | Internet Archive            | CC-BY-SA-4.0    |
| [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0)   | 3,000    | unknown  | Narrated Wikipedia          | CC0-1.0         |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech)                      | 2,500    | unknown  | Audiobook, podcast, YouTube | apache-2.0      |
| Fisher                                                                                    | 1,960    | 11,900   | Telephone conversations     | LDC             |
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)                            | 960      | 2,480    | Audiobooks                  | CC-BY-4.0       |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)                           | 540      | 1,310    | European Parliament         | CC0             |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium)                                  | 450      | 2,030    | TED talks                   | CC-BY-NC-ND 3.0 |
| SwitchBoard                                                                               | 260      | 540      | Telephone conversations     | LDC             |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami)                                  | 100      | unknown  | Meetings                    | CC-BY-4.0       |
||||||
| **Total**                                                                                 | 21,770   | 18,260+  |                             |                 |

The combined dataset spans 10 distinct domains and over 50k speakers. The diversity of this dataset is crucial to ensuring
the distilled model is robust to audio distributions and noise.

The audio data is then pseudo-labelled using the Whisper large-v3 model: we use Whisper to generate predictions for all
the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
transcriptions are consistently formatted across datasets and provides sequence-level distillation signal during training.
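
A simplified sketch of this pseudo-labelling step (the column name and the use of `dataset.map` are illustrative; the actual training pipeline lives in the Distil-Whisper repository linked below):

```python
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# the teacher model whose predictions are used as training targets
teacher = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch_dtype,
    device=device,
)

def pseudo_label(batch):
    # transcribe the training example with the teacher and keep the text as the label
    batch["pseudo_label"] = teacher(batch["audio"])["text"]
    return batch

# applied over the training corpus, e.g. dataset = dataset.map(pseudo_label)
```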

## WER Filter

The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
and the ground truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds
a specified threshold, we discard the training example. Otherwise, we keep it for training.
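
A minimal sketch of this heuristic, assuming the `jiwer` package for the WER computation; the normaliser and threshold value here are illustrative placeholders rather than the exact ones used in training:

```python
from jiwer import wer

def normalise(text: str) -> str:
    # placeholder normaliser: lower-case and collapse whitespace
    # (training used a proper Whisper-style English text normaliser)
    return " ".join(text.lower().split())

def keep_example(pseudo_label: str, ground_truth: str, threshold: float = 0.1) -> bool:
    # discard the training example if the pseudo-label deviates too far from the reference
    return wer(normalise(ground_truth), normalise(pseudo_label)) <= threshold
```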

Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demonstrates the effectiveness of this filter
for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
hallucinations to this filter.

## Training

The model was trained for 80,000 optimisation steps (or 11 epochs) with batch size 256. The Tensorboard training logs can
be found under: https://huggingface.co/distil-whisper/distil-large-v3/tensorboard?params=scalars#frame

## Results

The distilled model performs to within 1.5% WER of Whisper large-v3 on out-of-distribution (OOD) short-form audio, within
1% WER on sequential long-form decoding, and outperforms large-v3 by 0.1% on chunked long-form. This performance gain is
attributed to lower hallucinations.

For a detailed per-dataset breakdown of the evaluation results, refer to Tables 16 and 17 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430).

Distil-Whisper is also evaluated on the [ESB benchmark](https://arxiv.org/abs/2210.13352) datasets as part of the [OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard),
where it performs to within 0.2% WER of Whisper.

## Reproducing Distil-Whisper

Training and evaluation code to reproduce Distil-Whisper is available under the Distil-Whisper repository: https://github.com/huggingface/distil-whisper/tree/main/training

This code will shortly be updated to include the training updates described in the section [Differences with distil-large-v2](#differences-with-distil-large-v2).

## License

Distil-Whisper inherits the [MIT license](https://github.com/huggingface/distil-whisper/blob/main/LICENSE) from OpenAI's Whisper model.

## Citation

If you use this model, please consider citing the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430):
```
@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Acknowledgements
* OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3), in particular Jong Wook Kim for the [original codebase](https://github.com/openai/whisper) and training discussions
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration
* [Georgi Gerganov](https://huggingface.co/ggerganov) for the Whisper cpp integration
* [Systran team](https://github.com/SYSTRAN) for the Faster-Whisper integration
* [Joshua Lochner](https://huggingface.co/xenova) for the Transformers.js integration
* [Laurent Mazare](https://huggingface.co/lmz) for the Candle integration
* [Vaibhav Srivastav](https://huggingface.co/reach-vb) for Distil-Whisper distribution
* Google's [TPU Research Cloud (TRC)](https://sites.research.google/trc/about/) programme for Cloud TPU v4 compute resource
* [Raghav Sonavane](https://huggingface.co/rsonavane/distil-whisper-large-v2-8-ls) for an early iteration of Distil-Whisper on the LibriSpeech dataset