---
tags:
- mms
language:
- ab
- af
- ak
- am
- ar
- as
- av
- ay
- az
- ba
- bm
- be
- bn
- bi
- bo
- sh
- br
- bg
- ca
- cs
- ce
- cv
- ku
- cy
- da
- de
- dv
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- fj
- fi
- fr
- fy
- ff
- ga
- gl
- gn
- gu
- zh
- ht
- ha
- he
- hi
- sh
- hu
- hy
- ig
- ia
- ms
- is
- it
- jv
- ja
- kn
- ka
- kk
- kr
- km
- ki
- rw
- ky
- ko
- kv
- lo
- la
- lv
- ln
- lt
- lb
- lg
- mh
- ml
- mr
- ms
- mk
- mg
- mt
- mn
- mi
- my
- zh
- nl
- 'no'
- 'no'
- ne
- ny
- oc
- om
- or
- os
- pa
- pl
- pt
- ms
- ps
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- ro
- rn
- ru
- sg
- sk
- sl
- sm
- sn
- sd
- so
- es
- sq
- su
- sv
- sw
- ta
- tt
- te
- tg
- tl
- th
- ti
- ts
- tr
- uk
- ms
- vi
- wo
- xh
- ms
- yo
- ms
- zu
- za
license: cc-by-nc-4.0
datasets:
- google/fleurs
metrics:
- acc
---

# Massively Multilingual Speech (MMS) - Finetuned LID

This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's [Massively Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
It is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and maps raw audio input to a probability distribution over 256 output classes, each class representing a language.
The checkpoint has **1 billion parameters** and was fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 256 languages.

## Table of Contents

- [Example](#example)
- [Supported Languages](#supported-languages)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to identify
the spoken language of an audio clip. It can recognize the [following 256 languages](#supported-languages).

Let's look at a simple example.

First, we install transformers and some other libraries:
```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

**Note**: To use MMS you need `transformers >= 4.30`. If version `4.30`
is not yet available [on PyPI](https://pypi.org/project/transformers/), install `transformers` from
source:
```
pip install git+https://github.com/huggingface/transformers.git
```

Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).

```py
from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
```
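
If you work with local files instead of a streamed dataset, it can help to check the sampling rate before running the model. The following is a minimal sketch, assuming uncompressed WAV input and using only the Python standard library. The helper names (`wav_sampling_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of `transformers` or `datasets`.

```python
import io
import wave

EXPECTED_RATE = 16_000  # Hz, the sampling rate this README asks for


def wav_sampling_rate(source):
    """Return the sampling rate (in Hz) stored in a WAV header.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected_rate=EXPECTED_RATE):
    """Return True if the WAV data is not already at the expected rate."""
    return wav_sampling_rate(source) != expected_rate


def make_silent_wav(rate, seconds=0.1):
    """Build a short mono 16-bit silent WAV in memory (demo input only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer


# A 44.1 kHz recording would have to be resampled before use.
print(needs_resampling(make_silent_wav(44_100)))
print(needs_resampling(make_silent_wav(16_000)))
```

When a file does need resampling, the `cast_column("audio", Audio(sampling_rate=16000))` call shown above handles it for data loaded through `datasets`.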

Next, we load the model and processor:

```py
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-256"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
```

Now we process the audio data and pass it to the model to classify the spoken language, just as we usually do for Wav2Vec2 audio classification models such as [harshit345/xlsr-wav2vec-speech-emotion-recognition](https://huggingface.co/harshit345/xlsr-wav2vec-speech-emotion-recognition):

```py
# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    outputs = model(**inputs).logits

# Pick the class with the highest logit and map it to its language code
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'

# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
```
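
The `argmax` above returns only the single best guess. Since the model outputs one logit per language, you may also want the most likely candidates together with their probabilities. The following is a minimal sketch of that idea in plain Python. The function name `top_k_languages` and the toy values are hypothetical and stand in for the real logits and label mapping.

```python
import math


def top_k_languages(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs.

    `logits` is a flat list of floats, one per class, and `id2label`
    maps integer class ids to language codes.
    """
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Sort class ids by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy stand-ins for illustration only; real values come from the model.
toy_id2label = {0: "ara", 1: "eng", 2: "fra"}
toy_logits = [0.5, 3.0, 1.0]

for label, prob in top_k_languages(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

With the real model, something like `top_k_languages(outputs[0].tolist(), model.config.id2label)` should work, assuming `outputs` holds the logits of a single audio sample as in the example above.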

To see all the supported languages of a checkpoint, you can print out the language codes as follows:
```py
model.config.id2label.values()
```
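
To check whether a particular language is covered, you can look it up in that mapping. The following is a minimal sketch, assuming a dictionary shaped like `model.config.id2label` (integer class id to ISO 639-3 code). The helper names and the toy mapping are hypothetical.

```python
def build_label2id(id2label):
    """Invert an id -> language-code mapping into language-code -> id."""
    return {label: class_id for class_id, label in id2label.items()}


def is_supported(code, id2label):
    """Return True if the ISO 639-3 code is one of the output classes."""
    return code.lower() in set(id2label.values())


# Toy mapping for illustration only; use model.config.id2label in practice.
toy_id2label = {0: "ara", 1: "cmn", 2: "eng"}

print(is_supported("ENG", toy_id2label))
print(is_supported("xyz", toy_id2label))
print(build_label2id(toy_id2label)["cmn"])
```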

For more details about the architecture, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).

## Supported Languages

This model supports 256 languages. Click the following to toggle all supported languages of this checkpoint, listed by [ISO 639-3 code](https://en.wikipedia.org/wiki/ISO_639-3).
You can find more details about the languages and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html).
<details>
<summary>Click to toggle</summary>

- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- ita
- srp
- mri
- nno
- pus
- eus
- ory
- lug
- bre
- luo
- slk
- ewe
- fin
- rif
- dan
- yid
- yao
- mos
- hne
- est
- dyu
- bam
- uig
- sck
- tso
- mup
- ctg
- ceb
- war
- bbc
- vmw
- sid
- tpi
- mag
- san
- kri
- lon
- kir
- run
- ubl
- kin
- rkt
- xmm
- tir
- mai
- nan
- nyn
- bcc
- hak
- suk
- bem
- rmy
- awa
- pcm
- bgc
- shn
- oci
- wol
- bci
- kab
- ilo
- bcl
- haw
- mad
- nod
- sag
- sas
- jam
- mey
- shi
- hil
- ace
- kam
- min
- umb
- hno
- ban
- syl
- bxg
- xho
- mww
- epo
- tzm
- zul
- ibo
- abk
- guz
- ckb
- knc
- nso
- bho
- dje
- tiv
- gle
- lua
- skr
- bto
- kea
- glk
- ast
- sat
- ktu
- bhb
- emk
- kng
- kmb
- tsn
- gom
- ven
- sco
- glv
- sot
- sou
- gno
- nde
- bjn
- ina
- fmu
- esg
- wes
- pnb
- phr
- mui
- bug
- mrr
- kas
- lir
- vah
- ssw
- rwr
- pcc
- hms
- wbr
- swv
- mtr
- haz
- aii
- bns
- msi
- wuu
- hsn
- bgp
- tts
- lmn
- dcc
- bew
- bjj
- ibb
- tji
- hoj
- cpx
- cdo
- daq
- mut
- nap
- czh
- gdx
- sdh
- scn
- mnp
- bar
- mzn
- gsw

</details>

## Model details

- **Developed by:** Vineel Pratap et al.
- **Model type:** Multilingual speech language identification (LID) model
- **Language(s):** 256 languages, see [supported languages](#supported-languages)
- **License:** CC-BY-NC 4.0 license
- **Num parameters**: 1 billion
- **Audio sampling rate**: 16,000 Hz
- **Cite as:**

      @article{pratap2023mms,
        title={Scaling Speech Technology to 1,000+ Languages},
        author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
        journal={arXiv},
        year={2023}
      }

## Additional Links

- [Blog post](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)
- [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms)
- [Paper](https://arxiv.org/abs/2305.13516)
- [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
- [Other **MMS** checkpoints](https://huggingface.co/models?other=mms)
- MMS base checkpoints:
  - [facebook/mms-1b](https://huggingface.co/facebook/mms-1b)
  - [facebook/mms-300m](https://huggingface.co/facebook/mms-300m)
- [Official Space](https://huggingface.co/spaces/facebook/MMS)