---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
---

# Qwen3-ASR

## Overview

### Introduction

<p align="center">
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/qwen3_asr_introduction.png" width="90%"/>
</p>

The Qwen3-ASR family includes Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, which support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding capability of their foundation model, Qwen3-Omni. Experiments show that the 1.7B version achieves state-of-the-art performance among open-source ASR models and is competitive with the strongest proprietary commercial APIs. Here are the main features:

* **All-in-one**: Qwen3-ASR-1.7B and Qwen3-ASR-0.6B support language identification and speech recognition for 30 languages and 22 Chinese dialects, as well as English accents from multiple countries and regions.

* **Excellent and Fast**: The Qwen3-ASR family maintains high-quality, robust recognition under complex acoustic environments and challenging text patterns. Qwen3-ASR-1.7B achieves strong performance on both open-source and internal benchmarks, while the 0.6B version strikes an accuracy/efficiency trade-off, reaching 2000x real-time throughput at a concurrency of 128. Both models perform unified streaming/offline inference with a single model and support transcribing long audio.

* **Novel and strong forced-alignment solution**: We introduce Qwen3-ForcedAligner-0.6B, which supports timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages. Evaluations show that its timestamp accuracy surpasses that of E2E forced-alignment models.

* **Comprehensive inference toolkit**: In addition to open-sourcing the architectures and weights of the Qwen3-ASR series, we also release a powerful, full-featured inference framework that supports vLLM-based batch inference, asynchronous serving, streaming inference, timestamp prediction, and more.

### Model Architecture

<p align="center">
  <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/overview.jpg" width="100%"/>
</p>


### Released Models Description and Download

Below is an introduction and download information for the Qwen3-ASR models. Please select and download the model that fits your needs.

| Model | Supported Languages | Supported Dialects | Inference Mode | Audio Types |
|---|---|---|---|---|
| Qwen3-ASR-1.7B & Qwen3-ASR-0.6B | Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro) | Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent), Wu language, Minnan language. | Offline / Streaming | Speech, Singing Voice, Songs with BGM |
| Qwen3-ForcedAligner-0.6B | Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish | -- | NAR (non-autoregressive) | Speech |

During model loading in the `qwen-asr` package or vLLM, model weights will be downloaded automatically based on the model name. However, if your runtime environment does not allow downloading weights during execution, you can use the following commands to manually download the model weights to a local directory:

```bash
# Download through ModelScope (recommended for users in Mainland China)
pip install -U modelscope
modelscope download --model Qwen/Qwen3-ASR-1.7B --local_dir ./Qwen3-ASR-1.7B
modelscope download --model Qwen/Qwen3-ASR-0.6B --local_dir ./Qwen3-ASR-0.6B
modelscope download --model Qwen/Qwen3-ForcedAligner-0.6B --local_dir ./Qwen3-ForcedAligner-0.6B
# Download through Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./Qwen3-ASR-1.7B
huggingface-cli download Qwen/Qwen3-ASR-0.6B --local-dir ./Qwen3-ASR-0.6B
huggingface-cli download Qwen/Qwen3-ForcedAligner-0.6B --local-dir ./Qwen3-ForcedAligner-0.6B
```


## Quickstart

### Environment Setup

The easiest way to use Qwen3-ASR is to install the `qwen-asr` Python package from PyPI. This will pull in the required runtime dependencies and allow you to load any released Qwen3-ASR model. If you’d like to simplify environment setup further, you can also use our official [Docker image](#docker). The `qwen-asr` package provides two backends: the transformers backend and the vLLM backend. For usage instructions for different backends, please refer to [Python Package Usage](#python-package-usage). We recommend using a **fresh, isolated environment** to avoid dependency conflicts with existing packages. You can create a clean Python 3.12 environment like this:

```bash
conda create -n qwen3-asr python=3.12 -y
conda activate qwen3-asr
```

Run the following command to get the minimal installation with transformers-backend support:

```bash
pip install -U qwen-asr
```

To enable the vLLM backend for faster inference and streaming support, run:

```bash
pip install -U "qwen-asr[vllm]"
```

If you want to develop or modify the code locally, install from source in editable mode:

```bash
git clone https://github.com/QwenLM/Qwen3-ASR.git
cd Qwen3-ASR
pip install -e .
# To enable the vLLM backend:
# pip install -e ".[vllm]"
```

Additionally, we recommend using FlashAttention 2 to reduce GPU memory usage and accelerate inference, especially for long inputs and large batch sizes.

```bash
pip install -U flash-attn --no-build-isolation
```

If your machine has less than 96 GB of RAM and many CPU cores, limit the number of parallel build jobs:

```bash
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation
```

Also, your hardware must be compatible with FlashAttention 2. Read more in the official documentation of the [FlashAttention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

### Python Package Usage

#### Quick Inference

The `qwen-asr` package provides two backends: **transformers backend** and **vLLM backend**. You can pass audio inputs as a local path, a URL, base64 data, or a `(np.ndarray, sr)` tuple, and run batch inference. To quickly try Qwen3-ASR, you can use `Qwen3ASRModel.from_pretrained(...)` for the transformers backend with the following code:

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
    max_inference_batch_size=32,  # Batch size limit for inference. -1 means unlimited. Smaller values can help avoid OOM.
    max_new_tokens=256,  # Maximum number of tokens to generate. Set a larger value for long audio input.
)

results = model.transcribe(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    language=None,  # set "English" to force the language
)

print(results[0].language)
print(results[0].text)
```
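As noted above, `transcribe` also accepts in-memory audio as a `(np.ndarray, sr)` tuple. Here is a minimal stdlib sketch for loading a local 16-bit PCM WAV file into that form; the `load_wav` helper is our own illustration, not part of the `qwen-asr` package:

```python
import wave

import numpy as np


def load_wav(path: str) -> tuple[np.ndarray, int]:
    """Read a 16-bit PCM WAV file into (float32 samples in [-1, 1], sample_rate)."""
    with wave.open(path, "rb") as f:
        sr = f.getframerate()
        n_channels = f.getnchannels()
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    if n_channels > 1:
        # Downmix to mono by averaging the interleaved channels.
        pcm = pcm.reshape(-1, n_channels).mean(axis=1)
    return pcm.astype(np.float32) / 32768.0, sr
```

The resulting tuple can then be passed directly, e.g. `model.transcribe(audio=load_wav("sample.wav"))`.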

If you want to return timestamps, pass `forced_aligner` and its init kwargs. Here is an example of batch inference with timestamps output:

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "Qwen/Qwen3-ASR-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
    max_inference_batch_size=32,  # Batch size limit for inference. -1 means unlimited. Smaller values can help avoid OOM.
    max_new_tokens=256,  # Maximum number of tokens to generate. Set a larger value for long audio input.
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
        # attn_implementation="flash_attention_2",
    ),
)

results = model.transcribe(
    audio=[
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
        "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
    ],
    language=["Chinese", "English"],  # can also be set to None for automatic language detection
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text, r.time_stamps[0])
```

For more detailed usage examples, please refer to the [example code](https://github.com/QwenLM/Qwen3-ASR/blob/main/examples/example_qwen3_asr_transformers.py) for the transformers backend.

#### vLLM Backend

If you want the fastest inference speed with Qwen3-ASR, we strongly recommend using the vLLM backend by initializing the model with `Qwen3ASRModel.LLM(...)`. Example code is provided below. Note that you must install it via `pip install -U "qwen-asr[vllm]"`. If you want the model to output timestamps, it’s best to install FlashAttention via `pip install -U flash-attn --no-build-isolation` to speed up inference for the forced aligner model. Remember to wrap your code under `if __name__ == '__main__':` to avoid the `spawn` error described in [vLLM Troubleshooting](https://docs.vllm.ai/en/latest/usage/troubleshooting/#python-multiprocessing).

```python
import torch
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="Qwen/Qwen3-ASR-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,  # Batch size limit for inference. -1 means unlimited. Smaller values can help avoid OOM.
        max_new_tokens=4096,  # Maximum number of tokens to generate. Set a larger value for long audio input.
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
            # attn_implementation="flash_attention_2",
        ),
    )

    results = model.transcribe(
        audio=[
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
            "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav",
        ],
        language=["Chinese", "English"],  # can also be set to None for automatic language detection
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])
```

For more detailed usage examples, please refer to the [example code](https://github.com/QwenLM/Qwen3-ASR/blob/main/examples/example_qwen3_asr_vllm.py) for the vLLM backend. In addition, you can start a vLLM server via the `qwen-asr-serve` command, which is a wrapper around `vllm serve`. You can pass any arguments supported by `vllm serve`, for example:

```bash
qwen-asr-serve Qwen/Qwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
```

And send requests to the server via:

```python
import requests

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()['choices'][0]['message']['content']
print(content)

# parse ASR output if you want
from qwen_asr import parse_asr_output
language, text = parse_asr_output(content)
print(language)
print(text)
```

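The package also accepts base64-encoded audio (see the input types listed under Quick Inference). A small stdlib sketch for turning a local file into a base64 `data:` URL, a common way to embed audio directly in a request body; the `wav_to_data_url` helper and the exact URL scheme expected by the server are assumptions, so check the repo examples for the form your deployment accepts:

```python
import base64


def wav_to_data_url(path: str) -> str:
    """Encode a local audio file as a base64 data URL (assumed scheme: data:audio/wav;base64,...)."""
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:audio/wav;base64,{payload}"
```

Such a string could then stand in for the remote URL above, e.g. `"audio_url": {"url": wav_to_data_url("sample.wav")}`.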
#### Streaming Inference

Qwen3-ASR fully supports streaming inference. Currently, streaming inference is only available with the vLLM backend. Note that streaming inference does not support batch inference or returning timestamps. Please refer to the [example code](https://github.com/QwenLM/Qwen3-ASR/blob/main/examples/example_qwen3_asr_vllm_streaming.py) for details. You can also launch a streaming web demo through the [guide](#streaming-demo) to experience Qwen3-ASR’s streaming transcription capabilities.

#### ForcedAligner Usage

`Qwen3-ForcedAligner-0.6B` can align text–speech pairs and return word- or character-level timestamps. Here is an example of using the forced aligner directly:

```python
import torch
from qwen_asr import Qwen3ForcedAligner

model = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    # attn_implementation="flash_attention_2",
)

results = model.align(
    audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_zh.wav",
    text="甚至出现交易几乎停滞的情况。",
    language="Chinese",
)

print(results[0])
print(results[0][0].text, results[0][0].start_time, results[0][0].end_time)
```

In addition, the forced aligner supports local paths / URLs / base64 data / `(np.ndarray, sr)` inputs and batch inference. Please refer to the [example code](https://github.com/QwenLM/Qwen3-ASR/blob/main/examples/example_qwen3_forced_aligner.py) for details.

### DashScope API Usage

To further explore Qwen3-ASR, we encourage you to try our DashScope API for a faster and more efficient experience. For detailed API information and documentation, please refer to the following:

| API Description | API Documentation (Mainland China) | API Documentation (International) |
|------------------|-----------------------------------|------------------------------------|
| Real-time API for Qwen3-ASR. | [https://help.aliyun.com/zh/model-studio/qwen-real-time-speech-recognition](https://help.aliyun.com/zh/model-studio/qwen-real-time-speech-recognition) | [https://www.alibabacloud.com/help/en/model-studio/qwen-real-time-speech-recognition](https://www.alibabacloud.com/help/en/model-studio/qwen-real-time-speech-recognition) |
| FileTrans API for Qwen3-ASR. | [https://help.aliyun.com/zh/model-studio/qwen-speech-recognition](https://help.aliyun.com/zh/model-studio/qwen-speech-recognition) | [https://www.alibabacloud.com/help/en/model-studio/qwen-speech-recognition](https://www.alibabacloud.com/help/en/model-studio/qwen-speech-recognition) |

## Launch Local Web UI Demo

### Gradio Demo

To launch the Qwen3-ASR web UI gradio demo, install the `qwen-asr` package and run `qwen-asr-demo`. Use the command below for help:

```bash
qwen-asr-demo --help
```

To launch the demo, you can use the following commands:

```bash
# Transformers backend
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000

# Transformers backend + Forced Aligner (enable timestamps)
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"device_map":"cuda:0","dtype":"bfloat16","max_inference_batch_size":8,"max_new_tokens":256}' \
  --aligner-kwargs '{"device_map":"cuda:0","dtype":"bfloat16"}' \
  --ip 0.0.0.0 --port 8000

# vLLM backend + Forced Aligner (enable timestamps)
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B \
  --backend vllm \
  --cuda-visible-devices 0 \
  --backend-kwargs '{"gpu_memory_utilization":0.7,"max_inference_batch_size":8,"max_new_tokens":2048}' \
  --aligner-kwargs '{"device_map":"cuda:0","dtype":"bfloat16"}' \
  --ip 0.0.0.0 --port 8000
```

Then open `http://<your-ip>:8000`, or access it via port forwarding in tools like VS Code.

#### Backend Notes

This demo supports two backends: transformers and vLLM. All backend-specific initialization parameters should be passed via `--backend-kwargs` as a JSON dict. If not provided, the demo will use sensible defaults.

```bash
# Example: override transformers init args without flash attention
--backend-kwargs '{"device_map":"cuda:0","dtype":"bfloat16"}'

# Example: override vLLM init args with 65% GPU memory
--backend-kwargs '{"gpu_memory_utilization":0.65}'
```

#### CUDA Device Notes

Because vLLM does not follow `cuda:0`-style device selection, this demo selects GPUs by setting `CUDA_VISIBLE_DEVICES` via `--cuda-visible-devices`.

```bash
# Use GPU 0
--cuda-visible-devices 0

# Use GPU 1
--cuda-visible-devices 1
```

#### Timestamps Notes

Timestamps are only available when `--aligner-checkpoint` is provided. If you launch the demo without a forced aligner, the timestamps UI will be hidden automatically.

```bash
# No forced aligner
qwen-asr-demo --asr-checkpoint Qwen/Qwen3-ASR-1.7B

# With forced aligner
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --aligner-checkpoint Qwen/Qwen3-ForcedAligner-0.6B
```

#### HTTPS Notes

To avoid browser microphone permission issues after deploying the server, we recommend running the gradio service over HTTPS; modern browsers only allow microphone access in secure contexts (except on localhost), so HTTPS is effectively required for remote access. Use `--ssl-certfile` and `--ssl-keyfile` to enable HTTPS. First, generate a private key and a self-signed certificate (valid for 365 days):

```bash
openssl req -x509 -newkey rsa:2048 \
  -keyout key.pem -out cert.pem \
  -days 365 -nodes \
  -subj "/CN=localhost"
```

Then run the demo with HTTPS:

```bash
qwen-asr-demo \
  --asr-checkpoint Qwen/Qwen3-ASR-1.7B \
  --backend transformers \
  --cuda-visible-devices 0 \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify
```

Then open `https://<your-ip>:8000` to use it. If your browser shows a warning, that’s expected for self-signed certificates. For production, use a real certificate.

### Streaming Demo

To experience Qwen3-ASR’s streaming transcription capability in a web UI, we provide a minimal Flask-based streaming demo. The demo captures microphone audio in the browser, resamples it to 16,000 Hz, and continuously pushes PCM chunks to the model. Run the demo with the following command:

```bash
qwen-asr-demo-streaming \
  --asr-model-path Qwen/Qwen3-ASR-1.7B \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9
```

Then open `http://<your-ip>:8000`, or access it via port forwarding in tools like VS Code.
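
The client-side processing described above (float samples in, 16 kHz 16-bit PCM chunks out) can be sketched as follows. This is an illustration of the data format, not the demo's actual code; the function names and 100 ms chunk size are our own choices:

```python
import numpy as np


def float_to_pcm16(samples: np.ndarray) -> bytes:
    """Convert float32 samples in [-1, 1] to little-endian 16-bit PCM bytes."""
    clipped = np.clip(samples, -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2").tobytes()


def chunk_pcm(pcm: bytes, chunk_ms: int = 100, sr: int = 16000) -> list[bytes]:
    """Split a PCM byte stream into fixed-duration chunks (2 bytes per sample)."""
    chunk_bytes = sr * chunk_ms // 1000 * 2
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

Each chunk is then what a streaming client would push to the server at roughly real-time pace.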

## Deployment with vLLM

vLLM officially provides day-0 model support for Qwen3-ASR for efficient inference.

### Installation

You can run Qwen3-ASR with the vLLM nightly wheel or Docker image. To install the nightly version of vLLM, we recommend using `uv` as the environment manager:

```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
  --extra-index-url https://download.pytorch.org/whl/cu129 \
  --index-strategy unsafe-best-match
uv pip install "vllm[audio]"  # For additional audio dependencies
```

### Online Serving

You can easily deploy Qwen3-ASR with vLLM by running the following command:

```bash
vllm serve Qwen/Qwen3-ASR-1.7B
```

After the model server is successfully deployed, you can interact with it in multiple ways.

#### Using OpenAI SDK

```python
from openai import OpenAI

# Initialize client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

# Create multimodal chat completion request
response = client.chat.completions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"},
                }
            ],
        }
    ],
)

print(response.choices[0].message.content)
```
This model is also supported by vLLM's OpenAI-compatible transcription API.

```python
import httpx
from openai import OpenAI

# Initialize client
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"
audio_file = httpx.get(audio_url).content

transcription = client.audio.transcriptions.create(
    model="Qwen/Qwen3-ASR-1.7B",
    file=("audio.wav", audio_file),  # (filename, bytes) so the server can infer the audio format
)

print(transcription.text)
```

#### Using cURL

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": [
        {"type": "audio_url", "audio_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav"}}
      ]}
    ]
  }'
```

### Offline Inference

See the following example of using vLLM to run offline inference with Qwen3-ASR:

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# Initialize the LLM
llm = LLM(model="Qwen/Qwen3-ASR-1.7B")

# Load audio
audio_asset = AudioAsset("winning_call")

# Create conversation with audio content
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio_url",
                "audio_url": {"url": audio_asset.url},
            }
        ],
    }
]

sampling_params = SamplingParams(temperature=0.01, max_tokens=256)

# Run inference using .chat()
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```


## Docker

To make it easier to use our `qwen-asr` Python package, we provide a pre-built Docker image: [qwenllm/qwen3-asr](https://hub.docker.com/r/qwenllm/qwen3-asr). You only need to install the GPU driver and download the model files to run the code. Please follow the [NVIDIA Container Toolkit installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) to ensure Docker can access your GPU. If you are in Mainland China and have trouble reaching Docker Hub, you may use a registry mirror to accelerate image pulls.

First, pull the image and start a container:

```bash
LOCAL_WORKDIR=/path/to/your/workspace
HOST_PORT=8000
CONTAINER_PORT=80
docker run --gpus all --name qwen3-asr \
  -v /var/run/docker.sock:/var/run/docker.sock -p $HOST_PORT:$CONTAINER_PORT \
  --mount type=bind,source=$LOCAL_WORKDIR,target=/data/shared/Qwen3-ASR \
  --shm-size=4gb \
  -it qwenllm/qwen3-asr:latest
```

After running the command, you will enter the container’s bash shell. Your local workspace (**replace** `/path/to/your/workspace` **with the actual path**) will be mounted inside the container at `/data/shared/Qwen3-ASR`. Port `8000` on the host is mapped to port `80` in the container, so you can access services running in the container via `http://<host-ip>:8000`. Note that services inside the container must bind to `0.0.0.0` (not `127.0.0.1`) for port forwarding to work.

If you exit the container, you can start it again and re-enter it with:

```bash
docker start qwen3-asr
docker exec -it qwen3-asr bash
```

To remove the container completely, run:

```bash
docker rm -f qwen3-asr
```


## Evaluation

During evaluation, we ran inference for all models with `dtype=torch.bfloat16` and set `max_new_tokens=1024` using vLLM. Greedy search was used for all decoding, and none of the tests specified a language parameter. The detailed evaluation results are shown below.
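
For reference, the WER reported in the tables below is the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal sketch (not the exact scoring pipeline used for these benchmarks, which typically also applies text normalization first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

For Chinese, scoring is typically done at the character level (CER) with the same edit-distance computation.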
570
571 <details>
572 <summary>ASR Benchmarks on Public Datasets (WER ↓)</summary>
573
574 <table>
575 <thead>
576 <tr>
577 <th colspan="2" style="text-align: left;"></th>
578 <th style="text-align: center;">GPT-4o<br>-Transcribe</th>
579 <th style="text-align: center;">Gemini-2.5<br>-Pro</th>
580 <th style="text-align: center;">Doubao-ASR</th>
581 <th style="text-align: center;">Whisper<br>-large-v3</th>
582 <th style="text-align: center;">Fun-ASR<br>-MLT-Nano</th>
583 <th style="text-align: center;">Qwen3-ASR<br>-0.6B</th>
584 <th style="text-align: center;">Qwen3-ASR<br>-1.7B</th>
585 </tr>
586 </thead>
587 <tbody>
588 <tr>
589 <td colspan="9" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">English (en)</td>
590 </tr>
591 <tr>
592 <td colspan="2" style="text-align: left;">Librispeech<br>clean | other</td>
593 <td style="text-align: center;"><strong>1.39</strong> | 3.75</td>
594 <td style="text-align: center;">2.89 | 3.56</td>
595 <td style="text-align: center;">2.78 | 5.70</td>
596 <td style="text-align: center;">1.51 | 3.97</td>
597 <td style="text-align: center;">1.68 | 4.03</td>
598 <td style="text-align: center;">2.11 | 4.55</td>
599 <td style="text-align: center;">1.63 | <strong>3.38</strong></td>
600 </tr>
601 <tr>
602 <td colspan="2" style="text-align: left;">GigaSpeech</td>
603 <td style="text-align: center;">25.50</td>
604 <td style="text-align: center;">9.37</td>
605 <td style="text-align: center;">9.55</td>
606 <td style="text-align: center;">9.76</td>
607 <td style="text-align: center;">-</td>
608 <td style="text-align: center;">8.88</td>
609 <td style="text-align: center;"><strong>8.45</strong></td>
610 </tr>
611 <tr>
612 <td colspan="2" style="text-align: left;">CV-en</td>
613 <td style="text-align: center;">9.08</td>
614 <td style="text-align: center;">14.49</td>
615 <td style="text-align: center;">13.78</td>
616 <td style="text-align: center;">9.90</td>
617 <td style="text-align: center;">9.90</td>
618 <td style="text-align: center;">9.92</td>
619 <td style="text-align: center;"><strong>7.39</strong></td>
620 </tr>
621 <tr>
622 <td colspan="2" style="text-align: left;">Fleurs-en</td>
623 <td style="text-align: center;"><strong>2.40</strong></td>
624 <td style="text-align: center;">2.94</td>
625 <td style="text-align: center;">6.31</td>
626 <td style="text-align: center;">4.08</td>
627 <td style="text-align: center;">5.49</td>
628 <td style="text-align: center;">4.39</td>
629 <td style="text-align: center;">3.35</td>
630 </tr>
631 <tr>
632 <td colspan="2" style="text-align: left;">MLS-en</td>
633 <td style="text-align: center;">5.12</td>
634 <td style="text-align: center;"><strong>3.68</strong></td>
635 <td style="text-align: center;">7.09</td>
636 <td style="text-align: center;">4.87</td>
637 <td style="text-align: center;">-</td>
638 <td style="text-align: center;">6.00</td>
639 <td style="text-align: center;">4.58</td>
640 </tr>
641 <tr>
642 <td colspan="2" style="text-align: left;">Tedlium</td>
643 <td style="text-align: center;">7.69</td>
644 <td style="text-align: center;">6.15</td>
645 <td style="text-align: center;">4.91</td>
646 <td style="text-align: center;">6.84</td>
647 <td style="text-align: center;">-</td>
648 <td style="text-align: center;"><strong>3.85<strong></td>
649 <td style="text-align: center;"><strong>4.50</strong></td>
650 </tr>
651 <tr>
652 <td colspan="2" style="text-align: left;">VoxPopuli</td>
653 <td style="text-align: center;">10.29</td>
654 <td style="text-align: center;">11.36</td>
655 <td style="text-align: center;">12.12</td>
656 <td style="text-align: center;">12.05</td>
657 <td style="text-align: center;">-</td>
658 <td style="text-align: center;"><strong>9.96<strong></td>
659 <td style="text-align: center;"><strong>9.15</strong></td>
660 </tr>
661 <tr>
662 <td colspan="9" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Chinese (zh)</td>
663 </tr>
664 <tr>
665 <td colspan="2" style="text-align: left;">WenetSpeech<br>net | meeting</td>
666 <td style="text-align: center;">15.30 | 32.27</td>
667 <td style="text-align: center;">14.43 | 13.47</td>
668 <td style="text-align: center;">N/A</td>
669 <td style="text-align: center;">9.86 | 19.11</td>
670 <td style="text-align: center;">6.35 | -</td>
671 <td style="text-align: center;">5.97 | 6.88</td>
672 <td style="text-align: center;"><strong>4.97</strong> | <strong>5.88</strong></td>
673 </tr>
674 <tr>
675 <td colspan="2" style="text-align: left;">AISHELL-2-test</td>
676 <td style="text-align: center;">4.24</td>
677 <td style="text-align: center;">11.62</td>
678 <td style="text-align: center;">2.85</td>
679 <td style="text-align: center;">5.06</td>
680 <td style="text-align: center;">-</td>
681 <td style="text-align: center;">3.15</td>
682 <td style="text-align: center;"><strong>2.71</strong></td>
683 </tr>
684 <tr>
685 <td colspan="2" style="text-align: left;">SpeechIO</td>
686 <td style="text-align: center;">12.86</td>
687 <td style="text-align: center;">5.30</td>
688 <td style="text-align: center;">2.93</td>
689 <td style="text-align: center;">7.56</td>
690 <td style="text-align: center;">-</td>
691 <td style="text-align: center;">3.44</td>
692 <td style="text-align: center;"><strong>2.88</strong></td>
693 </tr>
694 <tr>
<td colspan="2" style="text-align: left;">Fleurs-zh</td>
<td style="text-align: center;">2.44</td>
<td style="text-align: center;">2.71</td>
<td style="text-align: center;">2.69</td>
<td style="text-align: center;">4.09</td>
<td style="text-align: center;">3.51</td>
<td style="text-align: center;">2.88</td>
<td style="text-align: center;"><strong>2.41</strong></td>
</tr>
<tr>
<td colspan="2" style="text-align: left;">CV-zh</td>
<td style="text-align: center;">6.32</td>
<td style="text-align: center;">7.70</td>
<td style="text-align: center;">5.95</td>
<td style="text-align: center;">12.91</td>
<td style="text-align: center;">6.20</td>
<td style="text-align: center;">6.89</td>
<td style="text-align: center;"><strong>5.35</strong></td>
</tr>
<tr>
<td colspan="9" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Chinese Dialect</td>
</tr>
<tr>
<td colspan="2" style="text-align: left;">KeSpeech</td>
<td style="text-align: center;">26.87</td>
<td style="text-align: center;">24.71</td>
<td style="text-align: center;">5.27</td>
<td style="text-align: center;">28.79</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">7.08</td>
<td style="text-align: center;"><strong>5.10</strong></td>
</tr>
<tr>
<td colspan="2" style="text-align: left;">Fleurs-yue</td>
<td style="text-align: center;">4.98</td>
<td style="text-align: center;">9.43</td>
<td style="text-align: center;">4.98</td>
<td style="text-align: center;">9.18</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">5.79</td>
<td style="text-align: center;"><strong>3.98</strong></td>
</tr>
<tr>
<td colspan="2" style="text-align: left;">CV-yue</td>
<td style="text-align: center;">11.36</td>
<td style="text-align: center;">18.76</td>
<td style="text-align: center;">13.20</td>
<td style="text-align: center;">16.23</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">9.50</td>
<td style="text-align: center;"><strong>7.57</strong></td>
</tr>
<tr>
<td colspan="2" style="text-align: left;">CV-zh-tw</td>
<td style="text-align: center;">6.32</td>
<td style="text-align: center;">7.31</td>
<td style="text-align: center;">4.06</td>
<td style="text-align: center;">7.84</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">5.59</td>
<td style="text-align: center;"><strong>3.77</strong></td>
</tr>
<tr>
<td colspan="2" style="text-align: left;">WenetSpeech-Yue<br>short | long</td>
<td style="text-align: center;">15.62 | 25.29</td>
<td style="text-align: center;">25.19 | 11.23</td>
<td style="text-align: center;">9.74 | 11.40</td>
<td style="text-align: center;">32.26 | 46.64</td>
<td style="text-align: center;">- | -</td>
<td style="text-align: center;">7.54 | 9.92</td>
<td style="text-align: center;"><strong>5.82</strong> | <strong>8.85</strong></td>
</tr>
<tr>
<td colspan="2" style="text-align: left;">WenetSpeech-Chuan<br>easy | hard</td>
<td style="text-align: center;">34.81 | 53.98</td>
<td style="text-align: center;">43.79 | 67.30</td>
<td style="text-align: center;"><strong>11.40</strong> | <strong>20.20</strong></td>
<td style="text-align: center;">14.35 | 26.80</td>
<td style="text-align: center;">- | -</td>
<td style="text-align: center;">13.92 | 24.45</td>
<td style="text-align: center;">11.99 | 21.63</td>
</tr>
</tbody>
</table>

</details>
<details>
<summary>ASR Benchmarks on Internal Datasets (WER ↓)</summary>

<table>
<thead>
<tr>
<th style="text-align: left;"></th>
<th style="text-align: center;">GPT-4o<br>-Transcribe</th>
<th style="text-align: center;">Gemini-2.5<br>-Pro</th>
<th style="text-align: center;">Doubao-ASR</th>
<th style="text-align: center;">Whisper<br>-large-v3</th>
<th style="text-align: center;">Fun-ASR<br>-MLT-Nano</th>
<th style="text-align: center;">Qwen3-ASR<br>-0.6B</th>
<th style="text-align: center;">Qwen3-ASR<br>-1.7B</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="8" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Accented English</td>
</tr>
<tr>
<td style="text-align: left;">Dialog-Accented English</td>
<td style="text-align: center;">28.56</td>
<td style="text-align: center;">23.85</td>
<td style="text-align: center;">20.41</td>
<td style="text-align: center;">21.30</td>
<td style="text-align: center;">19.96</td>
<td style="text-align: center;"><strong>16.62</strong></td>
<td style="text-align: center;"><strong>16.07</strong></td>
</tr>
<tr>
<td colspan="8" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Chinese Mandarin</td>
</tr>
<tr>
<td style="text-align: left;">Elders&Kids</td>
<td style="text-align: center;">14.27</td>
<td style="text-align: center;">36.93</td>
<td style="text-align: center;">4.17</td>
<td style="text-align: center;">10.61</td>
<td style="text-align: center;">4.54</td>
<td style="text-align: center;">4.48</td>
<td style="text-align: center;"><strong>3.81</strong></td>
</tr>
<tr>
<td style="text-align: left;">ExtremeNoise</td>
<td style="text-align: center;">36.11</td>
<td style="text-align: center;">29.06</td>
<td style="text-align: center;">17.04</td>
<td style="text-align: center;">63.17</td>
<td style="text-align: center;">36.55</td>
<td style="text-align: center;">17.88</td>
<td style="text-align: center;"><strong>16.17</strong></td>
</tr>
<tr>
<td style="text-align: left;">TongueTwister</td>
<td style="text-align: center;">20.87</td>
<td style="text-align: center;">4.97</td>
<td style="text-align: center;">3.47</td>
<td style="text-align: center;">16.63</td>
<td style="text-align: center;">9.02</td>
<td style="text-align: center;">4.06</td>
<td style="text-align: center;"><strong>2.44</strong></td>
</tr>
<tr>
<td style="text-align: left;">Dialog-Mandarin</td>
<td style="text-align: center;">20.73</td>
<td style="text-align: center;">12.50</td>
<td style="text-align: center;">6.61</td>
<td style="text-align: center;">14.01</td>
<td style="text-align: center;">7.32</td>
<td style="text-align: center;">7.06</td>
<td style="text-align: center;"><strong>6.54</strong></td>
</tr>
<tr>
<td colspan="8" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Chinese Dialect</td>
</tr>
<tr>
<td style="text-align: left;">Dialog-Cantonese</td>
<td style="text-align: center;">16.05</td>
<td style="text-align: center;">14.98</td>
<td style="text-align: center;">7.56</td>
<td style="text-align: center;">31.04</td>
<td style="text-align: center;">5.85</td>
<td style="text-align: center;"><strong>4.80</strong></td>
<td style="text-align: center;"><strong>4.12</strong></td>
</tr>
<tr>
<td style="text-align: left;">Dialog-Chinese Dialects</td>
<td style="text-align: center;">45.37</td>
<td style="text-align: center;">47.70</td>
<td style="text-align: center;">19.85</td>
<td style="text-align: center;">44.55</td>
<td style="text-align: center;">19.41</td>
<td style="text-align: center;"><strong>18.24</strong></td>
<td style="text-align: center;"><strong>15.94</strong></td>
</tr>
</tbody>
</table>
<p><strong>Dialect coverage:</strong> Results for <em>Dialog-Accented English</em> are averaged over 16 accents, and results for <em>Dialog-Chinese Dialects</em> are averaged over 22 Chinese dialects.</p>

</details>

<details>
<summary>Multilingual ASR Benchmarks (WER ↓)</summary>

<table>
<thead>
<tr>
<th style="text-align: left;"></th>
<th style="text-align: center;">GLM-ASR<br>-Nano-2512</th>
<th style="text-align: center;">Whisper<br>-large-v3</th>
<th style="text-align: center;">Fun-ASR<br>-MLT-Nano</th>
<th style="text-align: center;">Qwen3-ASR<br>-0.6B</th>
<th style="text-align: center;">Qwen3-ASR<br>-1.7B</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="6" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Open-sourced Benchmarks</td>
</tr>
<tr>
<td style="text-align: left;">MLS</td>
<td style="text-align: center;">13.32</td>
<td style="text-align: center;">8.62</td>
<td style="text-align: center;">28.70</td>
<td style="text-align: center;">13.19</td>
<td style="text-align: center;"><strong>8.55</strong></td>
</tr>
<tr>
<td style="text-align: left;">CommonVoice</td>
<td style="text-align: center;">19.40</td>
<td style="text-align: center;">10.77</td>
<td style="text-align: center;">17.25</td>
<td style="text-align: center;">12.75</td>
<td style="text-align: center;"><strong>9.18</strong></td>
</tr>
<tr>
<td style="text-align: left;">MLC-SLM</td>
<td style="text-align: center;">34.93</td>
<td style="text-align: center;">15.68</td>
<td style="text-align: center;">29.94</td>
<td style="text-align: center;">15.84</td>
<td style="text-align: center;"><strong>12.74</strong></td>
</tr>
<tr>
<td style="text-align: left;">Fleurs</td>
<td style="text-align: center;">16.08</td>
<td style="text-align: center;">5.27</td>
<td style="text-align: center;">10.03</td>
<td style="text-align: center;">7.57</td>
<td style="text-align: center;"><strong>4.90</strong></td>
</tr>
<tr>
<td style="text-align: left;">Fleurs<sup>†</sup></td>
<td style="text-align: center;">20.05</td>
<td style="text-align: center;">6.85</td>
<td style="text-align: center;">31.89</td>
<td style="text-align: center;">10.37</td>
<td style="text-align: center;"><strong>6.62</strong></td>
</tr>
<tr>
<td style="text-align: left;">Fleurs<sup>††</sup></td>
<td style="text-align: center;">24.83</td>
<td style="text-align: center;"><strong>8.16</strong></td>
<td style="text-align: center;">47.84</td>
<td style="text-align: center;">21.80</td>
<td style="text-align: center;">12.60</td>
</tr>
<tr>
<td colspan="6" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Qwen-ASR Internal Benchmarks</td>
</tr>
<tr>
<td style="text-align: left;">News-Multilingual</td>
<td style="text-align: center;">49.40</td>
<td style="text-align: center;">14.80</td>
<td style="text-align: center;">65.07</td>
<td style="text-align: center;">17.39</td>
<td style="text-align: center;"><strong>12.80</strong></td>
</tr>
</tbody>
</table>
<p><strong>Language coverage:</strong> <em>MLS</em> includes 8 languages: {da, de, en, es, fr, it, pl, pt}.<br><em>CommonVoice</em> includes 13 languages: {en, zh, yue, zh_TW, ar, de, es, fr, it, ja, ko, pt, ru}.<br><em>MLC-SLM</em> includes 11 languages: {en, fr, de, it, pt, es, ja, ko, ru, th, vi}.<br><em>Fleurs</em> includes 12 languages: {en, zh, yue, ar, de, es, fr, it, ja, ko, pt, ru}.<br><em>Fleurs<sup>†</sup></em> includes 8 additional languages beyond Fleurs: {hi, id, ms, nl, pl, th, tr, vi}.<br><em>Fleurs<sup>††</sup></em> includes 10 additional languages beyond Fleurs<sup>†</sup>: {cs, da, el, fa, fi, fil, hu, mk, ro, sv}.<br><em>News-Multilingual</em> includes 15 languages: {ar, de, es, fr, hi, id, it, ja, ko, nl, pl, pt, ru, th, vi}.</p>

</details>

<details>
<summary>Language Identification Accuracy (%) ↑</summary>

<table>
<thead>
<tr>
<th style="text-align: left;"></th>
<th style="text-align: center;">Whisper-large-v3</th>
<th style="text-align: center;">Qwen3-ASR-0.6B</th>
<th style="text-align: center;">Qwen3-ASR-1.7B</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left;">MLS</td>
<td style="text-align: center;"><strong>99.9</strong></td>
<td style="text-align: center;">99.3</td>
<td style="text-align: center;"><strong>99.9</strong></td>
</tr>
<tr>
<td style="text-align: left;">CommonVoice</td>
<td style="text-align: center;">92.7</td>
<td style="text-align: center;"><strong>98.2</strong></td>
<td style="text-align: center;"><strong>98.7</strong></td>
</tr>
<tr>
<td style="text-align: left;">MLC-SLM</td>
<td style="text-align: center;">89.2</td>
<td style="text-align: center;"><strong>92.7</strong></td>
<td style="text-align: center;"><strong>94.1</strong></td>
</tr>
<tr>
<td style="text-align: left;">Fleurs</td>
<td style="text-align: center;">94.6</td>
<td style="text-align: center;"><strong>97.1</strong></td>
<td style="text-align: center;"><strong>98.7</strong></td>
</tr>
<tr style="border-top: 1px solid #ddd;">
<td style="text-align: left;"><em>Avg.</em></td>
<td style="text-align: center;">94.1</td>
<td style="text-align: center;"><strong>96.8</strong></td>
<td style="text-align: center;"><strong>97.9</strong></td>
</tr>
</tbody>
</table>
<p><strong>Language coverage:</strong> The language sets follow Multilingual ASR Benchmarks. Here, Fleurs corresponds to Fleurs<sup>††</sup> in Multilingual ASR Benchmarks and covers 30 languages.</p>

</details>

<details>
<summary>Singing Voice & Song Transcription (WER ↓)</summary>

<table>
<thead>
<tr>
<th style="text-align: left;"></th>
<th style="text-align: center;">GPT-4o<br>-Transcribe</th>
<th style="text-align: center;">Gemini-2.5<br>-Pro</th>
<th style="text-align: center;">Doubao-ASR<br>-1.0</th>
<th style="text-align: center;">Whisper<br>-large-v3</th>
<th style="text-align: center;">Fun-ASR-MLT<br>-Nano</th>
<th style="text-align: center;">Qwen3-ASR<br>-1.7B</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Singing</td>
</tr>
<tr>
<td style="text-align: left;">M4Singer</td>
<td style="text-align: center;">16.77</td>
<td style="text-align: center;">20.88</td>
<td style="text-align: center;">7.88</td>
<td style="text-align: center;">13.58</td>
<td style="text-align: center;">7.29</td>
<td style="text-align: center;"><strong>5.98</strong></td>
</tr>
<tr>
<td style="text-align: left;">MIR-1k-vocal</td>
<td style="text-align: center;">11.87</td>
<td style="text-align: center;">9.85</td>
<td style="text-align: center;">6.56</td>
<td style="text-align: center;">11.71</td>
<td style="text-align: center;">8.17</td>
<td style="text-align: center;"><strong>6.25</strong></td>
</tr>
<tr>
<td style="text-align: left;">Opencpop</td>
<td style="text-align: center;">7.93</td>
<td style="text-align: center;">6.49</td>
<td style="text-align: center;">3.80</td>
<td style="text-align: center;">9.52</td>
<td style="text-align: center;"><strong>2.98</strong></td>
<td style="text-align: center;">3.08</td>
</tr>
<tr>
<td style="text-align: left;">Popcs</td>
<td style="text-align: center;">32.84</td>
<td style="text-align: center;">15.13</td>
<td style="text-align: center;">8.97</td>
<td style="text-align: center;">13.77</td>
<td style="text-align: center;">9.42</td>
<td style="text-align: center;"><strong>8.52</strong></td>
</tr>
<tr>
<td colspan="7" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Songs with BGM</td>
</tr>
<tr>
<td style="text-align: left;">EntireSongs-en</td>
<td style="text-align: center;">30.71</td>
<td style="text-align: center;"><strong>12.18</strong></td>
<td style="text-align: center;">33.51</td>
<td style="text-align: center;">N/A</td>
<td style="text-align: center;">N/A</td>
<td style="text-align: center;">14.60</td>
</tr>
<tr>
<td style="text-align: left;">EntireSongs-zh</td>
<td style="text-align: center;">34.86</td>
<td style="text-align: center;">18.68</td>
<td style="text-align: center;">23.99</td>
<td style="text-align: center;">N/A</td>
<td style="text-align: center;">N/A</td>
<td style="text-align: center;"><strong>13.91</strong></td>
</tr>
</tbody>
</table>

</details>

<details>
<summary>ASR Inference Mode Performance (WER ↓)</summary>

<table>
<thead>
<tr>
<th style="text-align: left;">Model</th>
<th style="text-align: left;">Infer. Mode</th>
<th style="text-align: center;">LibriSpeech<br>clean | other</th>
<th style="text-align: center;">Fleurs-en</th>
<th style="text-align: center;">Fleurs-zh</th>
<th style="text-align: center;">Avg.</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2" style="text-align: left; vertical-align: middle;">Qwen3-ASR-1.7B</td>
<td style="text-align: left;">Offline</td>
<td style="text-align: center;">1.63 | 3.38</td>
<td style="text-align: center;">3.35</td>
<td style="text-align: center;">2.41</td>
<td style="text-align: center;">2.69</td>
</tr>
<tr>
<td style="text-align: left;">Streaming</td>
<td style="text-align: center;">1.95 | 4.51</td>
<td style="text-align: center;">4.02</td>
<td style="text-align: center;">2.84</td>
<td style="text-align: center;">3.33</td>
</tr>
<tr style="border-top: 1px solid #ddd;">
<td rowspan="2" style="text-align: left; vertical-align: middle;">Qwen3-ASR-0.6B</td>
<td style="text-align: left;">Offline</td>
<td style="text-align: center;">2.11 | 4.55</td>
<td style="text-align: center;">4.39</td>
<td style="text-align: center;">2.88</td>
<td style="text-align: center;">3.48</td>
</tr>
<tr>
<td style="text-align: left;">Streaming</td>
<td style="text-align: center;">2.54 | 6.27</td>
<td style="text-align: center;">5.38</td>
<td style="text-align: center;">3.40</td>
<td style="text-align: center;">4.40</td>
</tr>
</tbody>
</table>

</details>

<details>
<summary>Forced Alignment Benchmarks (AAS ms ↓)</summary>

<table>
<thead>
<tr>
<th style="text-align: left;"></th>
<th style="text-align: center;">Monotonic-Aligner</th>
<th style="text-align: center;">NFA</th>
<th style="text-align: center;">WhisperX</th>
<th style="text-align: center;">Qwen3-ForcedAligner-0.6B</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="5" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">MFA-Labeled Raw</td>
</tr>
<tr>
<td style="text-align: left;">Chinese</td>
<td style="text-align: center;">161.1</td>
<td style="text-align: center;">109.8</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>33.1</strong></td>
</tr>
<tr>
<td style="text-align: left;">English</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">107.5</td>
<td style="text-align: center;">92.1</td>
<td style="text-align: center;"><strong>37.5</strong></td>
</tr>
<tr>
<td style="text-align: left;">French</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">100.7</td>
<td style="text-align: center;">145.3</td>
<td style="text-align: center;"><strong>41.7</strong></td>
</tr>
<tr>
<td style="text-align: left;">German</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">122.7</td>
<td style="text-align: center;">165.1</td>
<td style="text-align: center;"><strong>46.5</strong></td>
</tr>
<tr>
<td style="text-align: left;">Italian</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">142.7</td>
<td style="text-align: center;">155.5</td>
<td style="text-align: center;"><strong>75.5</strong></td>
</tr>
<tr>
<td style="text-align: left;">Japanese</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>42.2</strong></td>
</tr>
<tr>
<td style="text-align: left;">Korean</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>37.2</strong></td>
</tr>
<tr>
<td style="text-align: left;">Portuguese</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>38.4</strong></td>
</tr>
<tr>
<td style="text-align: left;">Russian</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">200.7</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>40.2</strong></td>
</tr>
<tr>
<td style="text-align: left;">Spanish</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">124.7</td>
<td style="text-align: center;">108.0</td>
<td style="text-align: center;"><strong>36.8</strong></td>
</tr>
<tr>
<td style="text-align: left;"><em>Avg.</em></td>
<td style="text-align: center;">161.1</td>
<td style="text-align: center;">129.8</td>
<td style="text-align: center;">133.2</td>
<td style="text-align: center;"><strong>42.9</strong></td>
</tr>
<tr>
<td colspan="5" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">MFA-Labeled Concat-300s</td>
</tr>
<tr>
<td style="text-align: left;">Chinese</td>
<td style="text-align: center;">1742.4</td>
<td style="text-align: center;">235.0</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>36.5</strong></td>
</tr>
<tr>
<td style="text-align: left;">English</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">226.7</td>
<td style="text-align: center;">227.2</td>
<td style="text-align: center;"><strong>58.6</strong></td>
</tr>
<tr>
<td style="text-align: left;">French</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">230.6</td>
<td style="text-align: center;">2052.2</td>
<td style="text-align: center;"><strong>53.4</strong></td>
</tr>
<tr>
<td style="text-align: left;">German</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">220.3</td>
<td style="text-align: center;">993.4</td>
<td style="text-align: center;"><strong>62.4</strong></td>
</tr>
<tr>
<td style="text-align: left;">Italian</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">290.5</td>
<td style="text-align: center;">5719.4</td>
<td style="text-align: center;"><strong>81.6</strong></td>
</tr>
<tr>
<td style="text-align: left;">Japanese</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>81.3</strong></td>
</tr>
<tr>
<td style="text-align: left;">Korean</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>42.2</strong></td>
</tr>
<tr>
<td style="text-align: left;">Portuguese</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>50.0</strong></td>
</tr>
<tr>
<td style="text-align: left;">Russian</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">283.3</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>43.0</strong></td>
</tr>
<tr>
<td style="text-align: left;">Spanish</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">240.2</td>
<td style="text-align: center;">4549.9</td>
<td style="text-align: center;"><strong>39.6</strong></td>
</tr>
<tr>
<td style="text-align: left;">Cross-lingual</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>34.2</strong></td>
</tr>
<tr>
<td style="text-align: left;"><em>Avg.</em></td>
<td style="text-align: center;">1742.4</td>
<td style="text-align: center;">246.7</td>
<td style="text-align: center;">2708.4</td>
<td style="text-align: center;"><strong>52.9</strong></td>
</tr>
<tr>
<td colspan="5" style="text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;">Human-Labeled</td>
</tr>
<tr>
<td style="text-align: left;">Raw</td>
<td style="text-align: center;">49.9</td>
<td style="text-align: center;">88.6</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>27.8</strong></td>
</tr>
<tr>
<td style="text-align: left;">Raw-Noisy</td>
<td style="text-align: center;">53.3</td>
<td style="text-align: center;">89.5</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>41.8</strong></td>
</tr>
<tr>
<td style="text-align: left;">Concat-60s</td>
<td style="text-align: center;">51.1</td>
<td style="text-align: center;">86.7</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>25.3</strong></td>
</tr>
<tr>
<td style="text-align: left;">Concat-300s</td>
<td style="text-align: center;">410.8</td>
<td style="text-align: center;">140.0</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>24.8</strong></td>
</tr>
<tr>
<td style="text-align: left;">Concat-Cross-lingual</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>42.5</strong></td>
</tr>
<tr>
<td style="text-align: left;"><em>Avg.</em></td>
<td style="text-align: center;">141.3</td>
<td style="text-align: center;">101.2</td>
<td style="text-align: center;">-</td>
<td style="text-align: center;"><strong>32.4</strong></td>
</tr>
</tbody>
</table>

</details>
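
The benchmarks above report word error rate (WER), the standard ASR metric: the word-level Levenshtein distance (substitutions + deletions + insertions) between hypothesis and reference, divided by the reference length. As an illustration only (this is not the evaluation code used for these tables, which typically also applies text normalization), a minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# "sat" -> "sit" (substitution) and a dropped "the" (deletion): 2 edits / 6 words
print(f"{wer('the cat sat on the mat', 'the cat sit on mat') * 100:.2f}")  # 33.33
```

For Chinese and other unsegmented languages, the same computation is usually run at the character level (CER); the mixed-language tables above report whichever is conventional for each test set.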


## Citation

If you find our paper and code useful in your research, please consider giving us a star :star: and a citation :pencil: :)

```BibTeX
@article{Qwen3-ASR,
  title={Qwen3-ASR Technical Report},
  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.21337},
  year={2026}
}
```


<br>