---
language:
- en
license: mit
library_name: transformers
tags:
- audio
- automatic-speech-recognition
- transformers.js
widget:
- example_title: LibriSpeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: LibriSpeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
---

# Distil-Whisper: distil-large-v3

Distil-Whisper was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

This is the third and final installment of the Distil-Whisper English series. It is the knowledge-distilled version of
OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3), the latest and most performant Whisper model
to date.

Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
**superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**.

The result is a distilled model that performs to within 1% WER of large-v3 on long-form audio using both the sequential
and chunked algorithms, and outperforms distil-large-v2 by 4.8% when using the sequential algorithm. The model is also faster
than previous Distil-Whisper models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.

| Model | Params / M | Rel. Latency | Short-Form WER (%) | Sequential Long-Form WER (%) | Chunked Long-Form WER (%) |
|------------------------------------------------------------------------------|------------|--------------|--------------------|------------------------------|---------------------------|
| [large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 | 8.4 | 10.0 | 11.0 |
| **[distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)** | **756** | **6.3** | **9.7** | **10.8** | **10.9** |
| [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.8 | 10.1 | 15.6 | 11.6 |

Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
(Whisper.cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible with these libraries.
You can expect significant performance gains by switching from previous Distil-Whisper checkpoints to distil-large-v3
when using these libraries. For convenience, the weights for the most popular libraries are already converted,
with instructions for getting started below.

## Table of Contents

1. [Transformers Usage](#transformers-usage)
    * [Short-Form Transcription](#short-form-transcription)
    * [Sequential Long-Form](#sequential-long-form)
    * [Chunked Long-Form](#chunked-long-form)
    * [Speculative Decoding](#speculative-decoding)
    * [Additional Speed & Memory Improvements](#additional-speed--memory-improvements)
2. [Library Integrations](#library-integrations)
    * [Whisper.cpp](#whispercpp)
    * [Faster Whisper](#faster-whisper)
    * [OpenAI Whisper](#openai-whisper)
    * [Transformers.js](#transformersjs)
    * [Candle](#candle)
3. [Model Details](#model-details)
4. [License](#license)

## Transformers Usage

distil-large-v3 is supported in the Hugging Face 🤗 Transformers library from version 4.39 onwards. To run the model, first
install the latest version of Transformers. For this example, we'll also install 🤗 Datasets to load a toy audio dataset
from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
```

### Short-Form Transcription

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe short-form audio files (< 30-seconds) as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```

For segment-level timestamps, pass the argument `return_timestamps=True` and read the `"chunks"` output:
```python
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```

<details>

<summary> For more control over the generation parameters, use the model + processor API directly: </summary>

Ad-hoc generation arguments can be passed to `model.generate`, including `num_beams` for beam-search, `return_timestamps`
for segment-level timestamps, and `prompt_ids` for prompting. See the [docstrings](https://huggingface.co/docs/transformers/en/model_doc/whisper#transformers.WhisperForConditionalGeneration.generate)
for more details.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

input_features = input_features.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "return_timestamps": False,
}

pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])

print(pred_text)
```
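
As an illustration, prompting might look as follows (a sketch re-using `model`, `processor`, `input_features` and `device` from the snippet above; the prompt string is purely illustrative):

```python
# illustrative prompt to bias the transcription towards particular spellings or terms
prompt_ids = processor.get_prompt_ids("Mr. Quilter", return_tensors="pt").to(device)

pred_ids = model.generate(input_features, prompt_ids=prompt_ids, max_new_tokens=128)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```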

</details>

### Sequential Long-Form

Unlike previous Distil-Whisper releases, distil-large-v3 is specifically designed to be compatible with OpenAI's sequential
long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds),
and returns more accurate transcriptions compared to the [chunked long-form algorithm](#chunked-long-form).

The sequential long-form algorithm should be used in either of the following scenarios:
1. Transcription accuracy is the most important factor, and latency is less of a consideration
2. You are transcribing **batches** of long audio files, in which case the latency of the sequential algorithm is comparable to chunked, while being up to 0.5% WER more accurate

If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm
described [below](#chunked-long-form). For a detailed explanation of the different algorithms, refer to Section 5 of
the [Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf).

The [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class can be used to transcribe long audio files with the sequential algorithm as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
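
To transcribe a **batch** of long files (scenario 2 above), multiple inputs can be passed to the pipeline in one call. A minimal sketch, where the file names are placeholders:

```python
# sketch of batched inference over several long files; the file names are placeholders
results = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
print([result["text"] for result in results])
```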

<details>

<summary> For more control over the generation parameters, use the model + processor API directly: </summary>

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]

inputs = processor(
    sample["array"],
    sampling_rate=sample["sampling_rate"],
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)

gen_kwargs = {
    "max_new_tokens": 448,
    "num_beams": 1,
    "condition_on_prev_tokens": False,
    "compression_ratio_threshold": 1.35,  # zlib compression ratio threshold (in token space)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "logprob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "return_timestamps": True,
}

pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)

print(pred_text)
```

</details>

### Chunked Long-Form

distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when
a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances,
the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the
[Distil-Whisper paper](https://arxiv.org/pdf/2311.00430.pdf)).

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For distil-large-v3, a chunk length of 25-seconds
is optimal. To activate batching over long audio files, pass the argument `batch_size`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

### Speculative Decoding

distil-large-v3 is the first Distil-Whisper model that can be used as an assistant to Whisper large-v3 for [speculative decoding](https://huggingface.co/blog/whisper-speculative-decoding).
Speculative decoding mathematically ensures that exactly the same outputs as Whisper are obtained, while being 2 times faster.
This makes it the perfect drop-in replacement for existing Whisper pipelines, since the same outputs are guaranteed.

In the following code snippet, we load the Distil-Whisper assistant model standalone, alongside the main Whisper model. We then
specify it as the "assistant model" for generation:

```python
from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v3"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

For more details on speculative decoding, refer to the blog post [Speculative Decoding for 2x Faster Whisper Inference](https://huggingface.co/blog/whisper-speculative-decoding).

### Additional Speed & Memory Improvements

You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference time and VRAM
requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
more efficient flash attention version.

#### Flash Attention 2

We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):

```bash
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:

```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
```

#### Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
whether you have a compatible PyTorch version, run the following Python code snippet:

```python
from transformers.utils import is_torch_sdpa_available

print(is_torch_sdpa_available())
```

If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
returns `False`, you need to upgrade your PyTorch version according to the [official instructions](https://pytorch.org/get-started/locally/).

Once a valid PyTorch version is installed, SDPA is activated by default. It can also be set explicitly by specifying
`attn_implementation="sdpa"` as follows:

```diff
- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
```

For more information about how to use SDPA, refer to the [Transformers SDPA documentation](https://huggingface.co/docs/transformers/en/perf_infer_gpu_one#pytorch-scaled-dot-product-attention).

#### Torch compile

Coming soon...
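
In the meantime, the following minimal sketch shows how `torch.compile` can typically be applied to the model, assuming PyTorch 2.x and re-using `model`, `pipe` and `sample` from the short-form example above (this is not the official recipe for this section):

```python
# provisional sketch, assuming PyTorch 2.x; not the official recipe for this section
import torch

# compile the model's forward pass; the first call is slow (compilation),
# subsequent calls re-use the compiled graph
model.forward = torch.compile(model.forward)

result = pipe(sample)
print(result["text"])
```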

#### 4-bit and 8-bit Inference

Coming soon...
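
In the meantime, a minimal sketch of 8-bit loading via bitsandbytes (assuming `pip install bitsandbytes accelerate`; the configuration shown is illustrative, not the official recipe for this section):

```python
# provisional sketch, assuming bitsandbytes and accelerate are installed
from transformers import AutoModelForSpeechSeq2Seq, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # or load_in_4bit=True for 4-bit

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v3",
    quantization_config=quant_config,
    device_map="auto",
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
```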

## Library Integrations

### Whisper.cpp

Distil-Whisper can be run with the [Whisper.cpp](https://github.com/ggerganov/whisper.cpp) package with the original
sequential long-form transcription algorithm. In a provisional benchmark on Mac M1, distil-large-v3 is over 5x faster
than Whisper large-v3, while performing to within 0.8% WER on long-form audio.

Steps for getting started:

1. Clone the Whisper.cpp repository:
```
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
```
2. Install the Hugging Face Hub Python package:
```bash
pip install --upgrade huggingface_hub
```
And download the GGML weights for distil-large-v3 using the following Python snippet:

```python
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id='distil-whisper/distil-large-v3-ggml', filename='ggml-distil-large-v3.bin', local_dir='./models')
```

Note that if you do not have a Python environment set up, you can also download the weights directly with `wget`:

```bash
wget https://huggingface.co/distil-whisper/distil-large-v3-ggml/resolve/main/ggml-distil-large-v3.bin -P ./models
```

3. Run inference using the provided sample audio:

```bash
make -j && ./main -m models/ggml-distil-large-v3.bin -f samples/jfk.wav
```

### Faster-Whisper

Faster-Whisper is a reimplementation of Whisper using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), a fast
inference engine for Transformer models.

First, install the Faster-Whisper package according to the [official instructions](https://github.com/SYSTRAN/faster-whisper#installation).
For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade git+https://github.com/SYSTRAN/faster-whisper datasets[audio]
```

The following code snippet loads the distil-large-v3 model and runs inference on an example file from the LibriSpeech ASR
dataset:

```python
import torch
from faster_whisper import WhisperModel
from datasets import load_dataset

# define our torch configuration
device = "cuda:0" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if torch.cuda.is_available() else "float32"

# load model on GPU if available, else cpu
model = WhisperModel("distil-large-v3", device=device, compute_type=compute_type)

# load toy dataset for example
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[1]["audio"]["path"]

segments, info = model.transcribe(sample, beam_size=1)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```

To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:

```python
segments, info = model.transcribe("audio.mp3", beam_size=1)
```

### OpenAI Whisper

To use the model in the original Whisper format, first ensure you have the [`openai-whisper`](https://pypi.org/project/openai-whisper/) package installed.
For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade openai-whisper datasets[audio]
```

The following code snippet demonstrates how to transcribe a sample file from the LibriSpeech dataset loaded using
🤗 Datasets:

```python
from huggingface_hub import hf_hub_download
from datasets import load_dataset
from whisper import load_model, transcribe

model_path = hf_hub_download(repo_id="distil-whisper/distil-large-v3-openai", filename="model.bin")
model = load_model(model_path)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]["path"]

pred_out = transcribe(model, audio=sample, language="en")
print(pred_out["text"])
```

Note that the model weights will be downloaded and saved to your cache the first time you run the example. Subsequently,
you can re-use the same example, and the weights will be loaded directly from your cache without having to download them
again.

To transcribe a local audio file, simply pass the path to the audio file as the `audio` argument to transcribe:

```python
pred_out = transcribe(model, audio="audio.mp3", language="en")
```

The Distil-Whisper model can also be used with the OpenAI Whisper CLI. Refer to the [following instructions](https://huggingface.co/distil-whisper/distil-large-v3-openai#cli-usage)
for details.

### Transformers.js

Distil-Whisper can be run completely in your web browser with [Transformers.js](http://github.com/xenova/transformers.js):

1. Install Transformers.js from [NPM](https://www.npmjs.com/package/@xenova/transformers):

```bash
npm i @xenova/transformers
```

2. Import the library and perform inference with the pipeline API.

```js
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v3');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: " And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." }
```

Check out the online [Distil-Whisper Web Demo](https://huggingface.co/spaces/Xenova/distil-whisper-web) to try it out yourself.
As you'll see, it runs locally in your browser: no server required!

Refer to the Transformers.js [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)
for further information.

### Candle

Through an integration with Hugging Face [Candle](https://github.com/huggingface/candle/tree/main) 🕯️, Distil-Whisper is
available in the Rust library 🦀.

Benefit from:
* Optimised CPU backend with optional MKL support for Linux x86 and Accelerate for Macs
* Metal support for efficiently running on Macs
* CUDA backend for efficiently running on GPUs, multiple GPU distribution via NCCL
* WASM support: run Distil-Whisper in a browser

Steps for getting started:
1. Install [`candle-core`](https://github.com/huggingface/candle/tree/main/candle-core) as explained [here](https://huggingface.github.io/candle/guide/installation.html)
2. Clone the `candle` repository locally:
```
git clone https://github.com/huggingface/candle.git
```
3. Enter the example directory for [Whisper](https://github.com/huggingface/candle/tree/main/candle-examples/examples/whisper):
```
cd candle/candle-examples/examples/whisper
```
4. Run an example:
```
cargo run --example whisper --release --features symphonia -- --model distil-large-v3
```
5. To specify your own audio file, add the `--input` flag:
```
cargo run --example whisper --release --features symphonia -- --model distil-large-v3 --input audio.wav
```

**Tip:** for compiling using Apple Metal, specify the `metal` feature when you run the example:
```
cargo run --example whisper --release --features="symphonia,metal" -- --model distil-large-v3
```

Note that if you encounter the error:
```
error: target `whisper` in package `candle-examples` requires the features: `symphonia`
Consider enabling them by passing, e.g., `--features="symphonia"`
```
You should clean your `cargo` installation:
```
cargo clean
```
And subsequently recompile:
```
cargo run --example whisper --release --features symphonia -- --model distil-large-v3
```

## Model Details

Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector
inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all
previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder
is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of
total inference time. Thus, to optimise for latency, the focus is on minimising the inference time of the decoder.

To distil the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed.
The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training.
The student's decoder consists of a subset of the teacher decoder layers, which are initialised from maximally spaced layers.
The model is then trained on a weighted sum of the KL divergence and pseudo-label loss terms.

<p align="center">
  <img src="https://huggingface.co/datasets/distil-whisper/figures/resolve/main/architecture.png?raw=true" width="600"/>
</p>
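
A schematic sketch of this layer-copying scheme (an illustration only, not the actual training code; the 2-layer student configuration mirrors the description above):

```python
# schematic sketch of the initialisation described above; not the actual training code
from transformers import WhisperConfig, WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# student: same architecture as the teacher, but with only 2 decoder layers
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v3", decoder_layers=2)
student = WhisperForConditionalGeneration(student_config)

# the encoder is copied in full from the teacher and frozen during training
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# the 2 student decoder layers are seeded from maximally spaced teacher layers (first and last)
teacher_layer_ids = [0, teacher.config.decoder_layers - 1]
for student_idx, teacher_idx in enumerate(teacher_layer_ids):
    student.model.decoder.layers[student_idx].load_state_dict(
        teacher.model.decoder.layers[teacher_idx].state_dict()
    )
```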

## Differences with distil-large-v2

Compared to previous versions of Distil-Whisper, distil-large-v3 is specifically designed to target the OpenAI sequential
long-form transcription algorithm. There are no architectural differences compared to distil-large-v2, other than the fact that
the model layers are initialised from the latest large-v3 model rather than the older large-v2 one. The differences lie
in the way the model was trained.

Previous Distil-Whisper models were trained on a mean input length of 7-seconds, whereas the original Whisper models were
pre-trained on 30-second inputs. During distillation, we shift the distribution of the model weights to the distribution
of our training data. If our training data contains shorter utterances (e.g. on average 7-seconds of audio instead of 30-seconds),
then the predicted distribution shifts to this shorter context length. At inference time, the optimal context window for
distil-large-v2 was an interpolation of these two values: 15-seconds. Beyond this time, the predictions for the distil-large-v2
model were largely inaccurate, particularly for the timestamp predictions. However, the sequential long-form algorithm
uses 30-second sliding windows for inference, with the window shifted according to the last predicted timestamp. Since the
last timestamp typically occurs after the 15-second mark, it was predicted with low accuracy, causing the long-form
transcription to often fail.

To preserve Whisper's ability to transcribe sliding 30-second windows, as is done with sequential decoding, we need to
ensure the context length of distil-large-v3 is also 30-seconds. This was primarily achieved with four strategies:

1. **Packing the audio samples in the training dataset to 30-seconds:** since the model is both pre-trained and distilled on audio data packed to 30-seconds, distil-large-v3 now operates on the same ideal context window as Whisper, predicting accurate timestamps up to and including 30-seconds.
2. **Freezing the decoder input embeddings:** we use the same input embedding representation as the original model, which is designed to handle longer context lengths than previous Distil-Whisper iterations.
3. **Using a longer maximum context length during training:** instead of training on a maximum target length of 128, we train on a maximum of 256. This helps distil-large-v3 transcribe 30-second segments where the number of tokens possibly exceeds 128.
4. **Appending prompt conditioning to 50% of the training samples:** this enables the model to be used with the `condition_on_prev_tokens` argument, and context windows of up to 448 tokens.

Further tricks were employed to improve the performance of distil-large-v3 under the sequential decoding
algorithm, which will be explained fully in an upcoming blog post.

## Evaluation

The following code snippet demonstrates how to evaluate the Distil-Whisper model on the LibriSpeech validation-clean
dataset with [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet), meaning no
audio data has to be downloaded to your local device.

First, we need to install the required packages, including 🤗 Datasets to stream and load the audio data, and 🤗 Evaluate to
perform the WER calculation:

```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] evaluate jiwer
```

Evaluation can then be run end-to-end with the following example:

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import load_dataset
from evaluate import load
import torch
from tqdm import tqdm

# define our torch configuration
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"

# load the model + processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, use_safetensors=True, low_cpu_mem_usage=True)
model = model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# load the dataset with streaming mode
dataset = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)

# define the evaluation metric
wer_metric = load("wer")

def inference(batch):
    # 1. Pre-process the audio data to log-mel spectrogram inputs
    audio = [sample["array"] for sample in batch["audio"]]
    input_features = processor(audio, sampling_rate=batch["audio"][0]["sampling_rate"], return_tensors="pt").input_features
    input_features = input_features.to(device, dtype=torch_dtype)

    # 2. Auto-regressively generate the predicted token ids
    pred_ids = model.generate(input_features, max_new_tokens=128)

    # 3. Decode the token ids to the final transcription
    batch["transcription"] = processor.batch_decode(pred_ids, skip_special_tokens=True)
    batch["reference"] = batch["text"]
    return batch

# batch size 16 inference
dataset = dataset.map(function=inference, batched=True, batch_size=16)

all_transcriptions = []
all_references = []

# iterate over the dataset and run inference
for result in tqdm(dataset, desc="Evaluating..."):
    all_transcriptions.append(result["transcription"])
    all_references.append(result["reference"])

# normalize predictions and references with the Whisper English normaliser
all_transcriptions = [processor.tokenizer.normalize(transcription) for transcription in all_transcriptions]
all_references = [processor.tokenizer.normalize(reference) for reference in all_references]

# compute the WER metric
wer = 100 * wer_metric.compute(predictions=all_transcriptions, references=all_references)
print(wer)
```
**Print Output:**
```
2.428920763531516
```

## Intended Use

Distil-Whisper is intended to be a drop-in replacement for Whisper large-v3 on English speech recognition. In particular, it
achieves comparable WER results over out-of-distribution (OOD) test data, while being 6x faster on both short and long-form audio.

## Data

Distil-Whisper is trained on 22,000 hours of audio data from nine open-source, permissively licensed speech datasets on the
Hugging Face Hub:

| Dataset | Size / h | Speakers | Domain | Licence |
|-----------------------------------------------------------------------------------------|----------|----------|-----------------------------|-----------------|
| [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech) | 12,000 | unknown | Internet Archive | CC-BY-SA-4.0 |
| [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | 3,000 | unknown | Narrated Wikipedia | CC0-1.0 |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | 2,500 | unknown | Audiobook, podcast, YouTube | apache-2.0 |
| Fisher | 1,960 | 11,900 | Telephone conversations | LDC |
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | 960 | 2,480 | Audiobooks | CC-BY-4.0 |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 540 | 1,310 | European Parliament | CC0 |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | 450 | 2,030 | TED talks | CC-BY-NC-ND 3.0 |
| SwitchBoard | 260 | 540 | Telephone conversations | LDC |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | 100 | unknown | Meetings | CC-BY-4.0 |
||||||
| **Total** | 21,770 | 18,260+ | | |

The combined dataset spans 10 distinct domains and over 50k speakers. The diversity of this dataset is crucial to ensuring
the distilled model is robust to different audio distributions and noise conditions.

The audio data is then pseudo-labelled using the Whisper large-v3 model: we use Whisper to generate predictions for all
the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
transcriptions are consistently formatted across datasets and provides sequence-level distillation signal during training.
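
A schematic sketch of this pseudo-labelling step (an illustration only, not the actual training pipeline; the toy dataset below stands in for the full training corpus):

```python
# schematic sketch of pseudo-labelling with the teacher; not the actual training pipeline
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

teacher_id = "openai/whisper-large-v3"
processor = AutoProcessor.from_pretrained(teacher_id)
teacher = AutoModelForSpeechSeq2Seq.from_pretrained(teacher_id, torch_dtype=torch_dtype).to(device)

# toy dataset standing in for the full training corpus
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))

def pseudo_label(batch):
    audio = batch["audio"]
    input_features = processor(
        audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
    ).input_features.to(device, dtype=torch_dtype)
    pred_ids = teacher.generate(input_features, max_new_tokens=128)
    # the teacher transcription becomes the training target for the student
    batch["pseudo_label"] = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
    return batch

dataset = dataset.map(pseudo_label)
print(dataset[0]["pseudo_label"])
```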

## WER Filter

The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
and the ground truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds
a specified threshold, we discard the training example. Otherwise, we keep it for training.
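
A minimal sketch of this heuristic, assuming `jiwer` for the WER computation (the 10% threshold shown here is purely illustrative):

```python
# minimal sketch of the WER filter; the threshold value is illustrative
from jiwer import wer

def keep_example(ground_truth: str, pseudo_label: str, threshold: float = 0.1) -> bool:
    # both strings are assumed to already be normalised, as described above
    return wer(ground_truth, pseudo_label) <= threshold

print(keep_example("the cat sat on the mat", "the cat sat on the mat"))  # True: WER = 0%
print(keep_example("the cat sat on the mat", "the cat sat on a hat"))    # False: WER ≈ 33%
```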

Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demonstrates the effectiveness of this filter
for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
hallucinations to this filter.

## Training

The model was trained for 80,000 optimisation steps (or 11 epochs) with a batch size of 256. The Tensorboard training logs can
be found under: https://huggingface.co/distil-whisper/distil-large-v3/tensorboard?params=scalars#frame

## Results

The distilled model performs to within 1.5% WER of Whisper large-v3 on out-of-distribution (OOD) short-form audio, within
1% WER on sequential long-form decoding, and outperforms large-v3 by 0.1% on chunked long-form. This performance gain is
attributed to lower hallucinations.

For a detailed per-dataset breakdown of the evaluation results, refer to Tables 16 and 17 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430).

Distil-Whisper is also evaluated on the [ESB benchmark](https://arxiv.org/abs/2210.13352) datasets as part of the [OpenASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard),
where it performs to within 0.2% WER of Whisper.

## Reproducing Distil-Whisper

Training and evaluation code to reproduce Distil-Whisper is available under the Distil-Whisper repository: https://github.com/huggingface/distil-whisper/tree/main/training

This code will shortly be updated to include the training updates described in the section [Differences with distil-large-v2](#differences-with-distil-large-v2).

## License

Distil-Whisper inherits the [MIT license](https://github.com/huggingface/distil-whisper/blob/main/LICENSE) from OpenAI's Whisper model.

## Citation

If you use this model, please consider citing the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430):
```
@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Acknowledgements
* OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3), in particular Jong Wook Kim for the [original codebase](https://github.com/openai/whisper) and training discussions
* Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration
* [Georgi Gerganov](https://huggingface.co/ggerganov) for the Whisper.cpp integration
* [Systran team](https://github.com/SYSTRAN) for the Faster-Whisper integration
* [Joshua Lochner](https://huggingface.co/xenova) for the Transformers.js integration
* [Laurent Mazare](https://huggingface.co/lmz) for the Candle integration
* [Vaibhav Srivastav](https://huggingface.co/reach-vb) for Distil-Whisper distribution
* Google's [TPU Research Cloud (TRC)](https://sites.research.google/trc/about/) programme for Cloud TPU v4 compute resources
* [Raghav Sonavane](https://huggingface.co/rsonavane/distil-whisper-large-v2-8-ls) for an early iteration of Distil-Whisper on the LibriSpeech dataset