---
library_name: vllm
language:
- en
- fr
- es
- de
- ru
- zh
- ja
- it
- pt
- nl
- ar
- hi
- ko
license: apache-2.0
inference: false
base_model:
- mistralai/Ministral-3-3B-Base-2512
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
pipeline_tag: automatic-speech-recognition
tags:
- mistral-common
---

# Voxtral Mini 4B Realtime 2602

Voxtral Mini 4B Realtime 2602 is a **multilingual, realtime speech-transcription model** and among the first open-source solutions to achieve accuracy comparable to offline systems with a delay of **<500ms**.
It supports **13 languages** and outperforms existing open-source baselines across a range of tasks, making it ideal for applications like voice assistants and live subtitling.

Built with a **natively streaming architecture** and a custom causal audio encoder, it allows configurable transcription delays (240ms to 2.4s), enabling users to balance **latency and accuracy** based on their needs.
At a **480ms delay**, it matches the performance of leading offline open-source transcription models, as well as realtime APIs.

As a **4B-parameter model**, it is optimized for **on-device deployment**, requiring minimal hardware resources.
It runs in realtime on devices with minimal hardware, with throughput exceeding 12.5 tokens/second.

This model is released in **BF16** under the **Apache-2.0 license**, ensuring flexibility for both research and commercial use.

For more details, see our:
- [Blog post](https://mistral.ai/news/voxtral-transcribe-2)
- [Demo](https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtime)
- [Technical report](https://arxiv.org/abs/2602.11298)
- [vLLM's blog on streaming input](https://blog.vllm.ai/2026/01/31/streaming-realtime.html)


## Key Features
Voxtral Mini 4B Realtime consists of two main architectural components:
- **≈3.4B Language Model**
- **≈970M Audio Encoder**
- The audio encoder was trained from scratch with causal attention, enabling streaming capability
- Both the audio encoder and the LLM backbone use sliding-window attention, allowing for "infinite" streaming
- For more details, refer to the [technical report](https://arxiv.org/abs/2602.11298)

![Voxtral-Realtime Architecture](https://raw.githubusercontent.com/sanchit-gandhi/notebooks/refs/heads/main/voxtral-realtime.jpeg)

The Voxtral Mini 4B Realtime model offers the following capabilities:
- **High-Quality Transcription**: Transcribe audio to text with confidence.
- **Multilingual**: Supports 13 languages, making it perfect for multilingual transcription tasks.
- **Real-Time**: Fast streaming ASR model, enabling real-time transcription use cases.
- **Configurable Transcription Delays**: Customize the transcription delay to balance quality and latency, from 80ms to 2.4s.

### Use Cases
**Real-Time Transcription Purposes:**
- Private meeting transcriptions
- Live subtitle creation
- Real-time assistants with speech understanding
- And more

Bringing real-time transcription capabilities to all.

### Recommended Settings

We recommend deploying with the following best practices:
- Always set the temperature to 0.0.
- A single text token is worth 80ms of audio. Hence, make sure to set your `--max-model-len` accordingly. To live-record a 1h meeting, you need to set `--max-model-len >= 3600 / 0.08 = 45000`.
In theory, you should be able to record with no limit; in practice, pre-allocation of RoPE parameters, among other things, limits `--max-model-len`.
For the best user experience, we recommend simply instantiating vLLM with the default parameters, which will automatically set a maximum model length of 131072 (~3h).
- We strongly recommend using websockets to set up audio streaming sessions. For more info on how to do so, check [Usage](#usage).
- We recommend using a delay of 480ms, as we found it to be the sweet spot between performance and low latency. If, however, you want to adapt the delay, you can change the `"transcription_delay_ms": 480` parameter
in the [tekken.json](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602/blob/main/tekken.json) file to any multiple of 80ms between 80 and 1200, as well as 2400 as a standalone value.
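The token-budget arithmetic and the valid delay values above can be sketched as follows (a minimal illustration, assuming the stated one-token-per-80ms rate; the helper names are hypothetical, not part of any library):

```python
# One transcription token covers 80 ms of audio.
TOKEN_MS = 80

def min_max_model_len(seconds: float) -> int:
    """Minimum --max-model-len needed to live-record `seconds` of audio."""
    return int(seconds * 1000 / TOKEN_MS)

def is_valid_delay(delay_ms: int) -> bool:
    """Valid transcription delays: multiples of 80 ms in [80, 1200], plus 2400."""
    return delay_ms == 2400 or (80 <= delay_ms <= 1200 and delay_ms % 80 == 0)

print(min_max_model_len(3600))      # 1 h meeting -> 45000 tokens
print(min_max_model_len(3 * 3600))  # 3 h -> 135000, just above the 131072 default
```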

## Benchmark Results

We compare Voxtral Mini 4B Realtime to similar models, both offline and realtime.
Voxtral Mini 4B Realtime is competitive with leading offline models and shows significant gains over existing open-source realtime solutions.

### Fleurs

| Model | Delay | AVG | Arabic | German | English | Spanish | French | Hindi | Italian | Dutch | Portuguese | Chinese | Japanese | Korean | Russian |
|-----------------------------------------|---------|---------|--------|--------|---------|---------|--------|--------|---------|-------|------------|---------|----------|--------|---------|
| Voxtral Mini Transcribe 2.0 | Offline | 5.90% | 13.54% | 3.54% | 3.32% | 2.63% | 4.32% | 10.33% | 2.17% | 4.78% | 3.56% | 7.30% | 4.14% | 12.29% | 4.75% |
| **Voxtral Mini 4B Realtime 2602** | 480 ms | 8.72% | 22.53% | 6.19% | 4.90% | 3.31% | 6.42% | 12.88% | 3.27% | 7.07% | 5.03% | 10.45% | 9.59% | 15.74% | 6.02% |
| | 160 ms | 12.60% | 24.33% | 9.50% | 6.46% | 5.34% | 9.75% | 15.28% | 5.59% | 11.39% | 10.01% | 17.67% | 19.17% | 19.81% | 9.53% |
| | 240 ms | 10.80% | 23.95% | 8.15% | 5.91% | 4.59% | 8.00% | 14.26% | 4.41% | 9.23% | 7.51% | 13.84% | 15.17% | 17.56% | 7.87% |
| | 960 ms | 7.70% | 20.32% | 4.87% | 4.34% | 2.98% | 5.68% | 11.82% | 2.46% | 6.76% | 4.57% | 8.99% | 6.80% | 14.90% | 5.56% |
| | 2400 ms | 6.73% | 14.71% | 4.15% | 4.05% | 2.71% | 5.23% | 10.73% | 2.37% | 5.91% | 3.93% | 8.48% | 5.50% | 14.30% | 5.41% |

### Long-form English

| Model | Delay | Meanwhile (<10m) | E-21 (<10m) | E-22 (<10m) | TEDLIUM (<20m) |
| ---------------------------------- | ------- | ---------------- | ----------- | ----------- | -------------- |
| Voxtral Mini Transcribe 2.0 | Offline | 4.08% | 9.81% | 11.69% | 2.86% |
| **Voxtral Mini 4B Realtime 2602** | 480ms | 5.05% | 10.23% | 12.30% | 3.17% |


### Short-form English

| Model | Delay | CHiME-4 | GigaSpeech 2k Subset | AMI IHM | SwitchBoard | CHiME-4 SP | GISpeech 2k Subset |
| ---------------------------------- | ------- | ------- | -------------------- | ------- | ----------- | ---------- | ------------------ |
| Voxtral Mini Transcribe 2.0 | Offline | 10.39% | 6.81% | 14.43% | 11.54% | 10.42% | 1.74% |
| **Voxtral Mini 4B Realtime 2602** | 480ms | 10.50% | 7.35% | 15.05% | 11.65% | 12.41% | 1.73% |

## Usage

The model can be deployed with the following libraries:
- [`vllm`](https://github.com/vllm-project/vllm) (recommended): See [here](#vllm-recommended)
- [`transformers`](https://github.com/huggingface/transformers): See [here](#transformers)
- [`executorch`](https://github.com/pytorch/executorch/tree/main/examples/models/voxtral_realtime) (untested): See [here](#executorch-untested)
- *Community Contributions*: See [here](#community-contributions-untested)

### vLLM (recommended)

> [!Tip]
> We've worked hand-in-hand with the vLLM team to provide production-grade support for Voxtral Mini 4B Realtime 2602 in vLLM.
> Special thanks go out to [Joshua Deng](https://github.com/joshuadeng), [Yu Luo](https://github.com/ErickLuo90), [Chen Zhang](https://github.com/heheda12345), [Nick Hill](https://github.com/njhill), [Nicolò Lucchesi](https://github.com/NickLucche), [Roger Wang](https://github.com/ywang96), and [Cyrus Leung](https://github.com/DarkLight1337)
> for the amazing work and help on building a production-ready audio streaming and realtime system in vLLM.

> [!Warning]
> Due to its novel architecture, Voxtral Realtime is currently only supported in vLLM and [Transformers](https://github.com/huggingface/transformers) (from v5.2.0). We very much welcome community contributions
> to add the architecture to [Llama.cpp](https://github.com/ggml-org/llama.cpp).

[vLLM](https://github.com/vllm-project/vllm)'s [new Realtime API](https://docs.vllm.ai/en/latest/serving/openai_compatible_server/?h=realtime#realtime-api) is perfectly suited to
run audio streaming sessions with the model.

#### Installation

Make sure to install [vllm](https://github.com/vllm-project/vllm) from the nightly PyPI package.
See [here](https://docs.vllm.ai/en/latest/getting_started/installation/) for a full installation guide.

```bash
uv pip install -U vllm
```

Doing so should automatically install [`mistral_common >= 1.9.0`](https://github.com/mistralai/mistral-common/releases/tag/v1.9.0).

To check:
```bash
python -c "import mistral_common; print(mistral_common.__version__)"
```

You can also make use of a ready-to-go Docker image from the [Dockerfile](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile) or on [Docker Hub](https://hub.docker.com/layers/vllm/vllm-openai/nightly/images/sha256-6ae33f5001ab9d32346ce2c82c660fe57021c4f0c162ed0c60b843319829b810).

Make sure to also install all required audio-processing libraries:

```bash
uv pip install soxr librosa soundfile
```

We also recommend using Transformers v5, as v4 can clutter the terminal with unnecessary warnings (see [here](https://github.com/vllm-project/vllm/issues/34642)):

```bash
uv pip install --upgrade transformers
```

#### Serve

Thanks to its size and the BF16 format of the weights, `Voxtral-Mini-4B-Realtime-2602` can run on a single GPU with >= 16GB of memory.

The model can be launched in "eager" mode:

```bash
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 --compilation_config '{"cudagraph_mode": "PIECEWISE"}'
```

Additional flags:
* You can set `--max-num-batched-tokens` to balance throughput and latency: higher values mean higher throughput but also higher latency.
* You can reduce the default `--max-model-len` to allocate less memory for the pre-computed RoPE frequencies
if you are certain that you won't have to transcribe for more than X hours. By default the model uses a `--max-model-len` of 131072 (> 3h).

#### Client

After serving with `vllm`, you should see that the model is compatible with vLLM's new realtime endpoint:
```
...
(APIServer pid=3246965) INFO 02-03 17:04:43 [launcher.py:58] Route: /v1/realtime, Endpoint: realtime_endpoint
...
```

We have added two simple example files that allow you to:
- [Stream audio files](https://docs.vllm.ai/en/latest/examples/online_serving/openai_realtime_client/?h=realtime#openai-realtime-client)
- [Run a simple Gradio live-transcription demo](https://docs.vllm.ai/en/latest/examples/online_serving/openai_realtime_microphone_client/#openai-realtime-microphone-client)
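To give a feel for what a streaming session sends over the websocket, here is a minimal sketch that chunks raw PCM audio into base64-encoded `input_audio_buffer.append` events, following the OpenAI-style realtime event format that vLLM's `/v1/realtime` endpoint mirrors. The event name, 16kHz sample rate, and PCM16 framing here are assumptions for illustration; refer to the linked example clients for the exact protocol.

```python
import base64
import json

SAMPLE_RATE = 16_000   # assumed input rate; check the vLLM example clients
BYTES_PER_SAMPLE = 2   # 16-bit PCM

def chunk_to_events(pcm: bytes, chunk_ms: int = 80) -> list[str]:
    """Split raw PCM16 audio into OpenAI-realtime-style append events."""
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    events = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        events.append(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    return events

# 1 second of silence, sent in 80 ms chunks (the last chunk may be shorter)
events = chunk_to_events(b"\x00" * SAMPLE_RATE * BYTES_PER_SAMPLE)
print(len(events))  # 13
```

In a real session, each of these JSON strings would be sent over the websocket connection to `/v1/realtime` as the microphone produces audio.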

[![Screenshot 2026-02-03 at 18.30.08](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/STM6r9lsL8_NRmS3bcZ9x.png)](https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtime)

**To try out a demo, click [here](https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtime)**

### Transformers

Starting with `transformers >= 5.2.0`, you can run Voxtral Realtime natively in Transformers!

For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/voxtral_realtime).

#### Installation

Install Transformers:

```bash
pip install --upgrade transformers
```

Make sure to have `mistral-common` installed with audio dependencies:

```bash
pip install --upgrade "mistral-common[audio]"
```

#### Usage

```python
from transformers import VoxtralRealtimeForConditionalGeneration, AutoProcessor
from mistral_common.tokens.tokenizers.audio import Audio
from huggingface_hub import hf_hub_download

repo_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(repo_id, device_map="auto")

# Download a sample audio file and resample it to the model's sampling rate
audio_file = hf_hub_download(
    repo_id="patrickvonplaten/audio_samples", filename="bcn_weather.mp3", repo_type="dataset"
)

audio = Audio.from_file(audio_file, strict=False)
audio.resample(processor.feature_extractor.sampling_rate)

inputs = processor(audio.audio_array, return_tensors="pt")
inputs = inputs.to(model.device, dtype=model.dtype)

outputs = model.generate(**inputs)
decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)

print(decoded_outputs[0])
```

### ExecuTorch (Untested)

> [!Warning]
> Running Voxtral-Realtime on-device with ExecuTorch is not thoroughly tested, and hence
> there might be some sharp edges. If you encounter any problems, please file a bug report directly on
> [ExecuTorch's GitHub](https://github.com/pytorch/executorch/issues/new/choose).

[ExecuTorch](https://github.com/pytorch/executorch) enables you to deploy **Voxtral-Realtime** locally, either on-device or on your laptop.

For a quick, offline demo on your MacBook, check out [Voxtral-Mini-4B-Realtime-2602-ExecuTorch](https://huggingface.co/mistral-labs/Voxtral-Mini-4B-Realtime-2602-ExecuTorch).

To deploy **Voxtral-Realtime** in a custom environment or on any device, refer to the [official README](https://github.com/pytorch/executorch/blob/main/examples/models/voxtral_realtime/README.md).

> [!Tip]
> If you're looking for an implementation that is purely written in C,
> we recommend taking a look at [voxtral.c](https://github.com/antirez/voxtral.c).


### Community Contributions (Untested)

Voxtral Realtime integrations in:
- [Pure C](https://github.com/antirez/voxtral.c) - thanks [Salvatore Sanfilippo](https://github.com/antirez)
- [mlx-audio framework](https://github.com/Blaizzy/mlx-audio) - thanks [Shreyas Karnik](https://github.com/shreyaskarnik)
- [MLX](https://github.com/awni/voxmlx) - thanks [Awni Hannun](https://github.com/awni)
- [Rust](https://github.com/TrevorS/voxtral-mini-realtime-rs) - thanks [TrevorS](https://github.com/TrevorS)

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.txt).

*You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.*