---
tags:
- mms
language:
- ab
- af
- ak
- am
- ar
- as
- av
- ay
- az
- ba
- bm
- be
- bn
- bi
- bo
- sh
- br
- bg
- ca
- cs
- ce
- cv
- ku
- cy
- da
- de
- dv
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- fj
- fi
- fr
- fy
- ff
- ga
- gl
- gn
- gu
- zh
- ht
- ha
- he
- hi
- sh
- hu
- hy
- ig
- ia
- ms
- is
- it
- jv
- ja
- kn
- ka
- kk
- kr
- km
- ki
- rw
- ky
- ko
- kv
- lo
- la
- lv
- ln
- lt
- lb
- lg
- mh
- ml
- mr
- ms
- mk
- mg
- mt
- mn
- mi
- my
- zh
- nl
- 'no'
- 'no'
- ne
- ny
- oc
- om
- or
- os
- pa
- pl
- pt
- ms
- ps
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- ro
- rn
- ru
- sg
- sk
- sl
- sm
- sn
- sd
- so
- es
- sq
- su
- sv
- sw
- ta
- tt
- te
- tg
- tl
- th
- ti
- ts
- tr
- uk
- ms
- vi
- wo
- xh
- ms
- yo
- ms
- zu
- za
license: cc-by-nc-4.0
datasets:
- google/fleurs
metrics:
- acc
---

# Massively Multilingual Speech (MMS) - Finetuned LID

This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's [Massively Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
This checkpoint is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and maps raw audio input to a probability distribution over 256 output classes, each representing a language.
The checkpoint consists of **1 billion parameters** and has been fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 256 languages.

## Table of Contents

- [Example](#example)
- [Supported Languages](#supported-languages)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to identify
the spoken language of an audio sample. It can recognize the [following 256 languages](#supported-languages).

Let's look at a simple example.

First, we install `transformers` and some other libraries:
```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

**Note**: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the `4.30` version
is not yet available [on PyPI](https://pypi.org/project/transformers/), make sure to install `transformers` from
source:
```
pip install git+https://github.com/huggingface/transformers.git
```

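The version requirement above can also be checked programmatically. Below is a small stdlib-only sketch; the `parse_release` and `has_min_transformers` helpers are our own illustration, not part of the `transformers` API, and the version parsing deliberately ignores dev/rc suffixes:

```python
import re
from importlib.metadata import PackageNotFoundError, version

def parse_release(v: str) -> tuple:
    """Turn a version string like '4.30.0.dev0' into (4, 30, 0) for comparison."""
    parts = []
    for piece in v.split("."):
        m = re.match(r"\d+", piece)
        if not m:
            # Stop at the first non-numeric segment (e.g. 'dev0', 'rc1').
            break
        parts.append(int(m.group()))
    return tuple(parts)

def has_min_transformers(min_version: str = "4.30.0") -> bool:
    """Return True if an installed transformers release meets the minimum."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False
    return parse_release(installed) >= parse_release(min_version)

print(has_min_transformers())
```

If this prints `False`, install from source as shown above.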
Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).

```py
from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
```

Next, we load the model and processor

```py
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-256"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
```

Now we process the audio data and pass it to the model to classify it into a language, just as we usually do for Wav2Vec2 audio classification models such as [harshit345/xlsr-wav2vec-speech-emotion-recognition](https://huggingface.co/harshit345/xlsr-wav2vec-speech-emotion-recognition):

```py
# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'

# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
```
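The model returns a full distribution over all 256 classes, so you can also inspect the runner-up languages instead of only the argmax. A minimal sketch, using a small dummy logits tensor in place of `model(**inputs).logits` from the example above:

```python
import torch

# Dummy logits standing in for `model(**inputs).logits`: shape (1, num_classes).
logits = torch.tensor([[2.0, 0.5, -1.0, 0.0]])

# Softmax turns the logits into a probability distribution over the classes.
probs = torch.nn.functional.softmax(logits, dim=-1)[0]

# Top-3 most likely classes; with the real model, map each id through
# `model.config.id2label` to get the language code.
top = torch.topk(probs, k=3)
for score, class_id in zip(top.values, top.indices):
    print(f"class {class_id.item()}: {score.item():.3f}")
```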

To see all the supported languages of a checkpoint, you can print out the language ids as follows:
```py
model.config.id2label.values()
```

For more details about the architecture, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).

## Supported Languages

This model supports 256 languages. Click below to toggle all supported languages of this checkpoint in [ISO 639-3 code](https://en.wikipedia.org/wiki/ISO_639-3).
You can find more details about the languages and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html).
<details>
<summary>Click to toggle</summary>

- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- ita
- srp
- mri
- nno
- pus
- eus
- ory
- lug
- bre
- luo
- slk
- ewe
- fin
- rif
- dan
- yid
- yao
- mos
- hne
- est
- dyu
- bam
- uig
- sck
- tso
- mup
- ctg
- ceb
- war
- bbc
- vmw
- sid
- tpi
- mag
- san
- kri
- lon
- kir
- run
- ubl
- kin
- rkt
- xmm
- tir
- mai
- nan
- nyn
- bcc
- hak
- suk
- bem
- rmy
- awa
- pcm
- bgc
- shn
- oci
- wol
- bci
- kab
- ilo
- bcl
- haw
- mad
- nod
- sag
- sas
- jam
- mey
- shi
- hil
- ace
- kam
- min
- umb
- hno
- ban
- syl
- bxg
- xho
- mww
- epo
- tzm
- zul
- ibo
- abk
- guz
- ckb
- knc
- nso
- bho
- dje
- tiv
- gle
- lua
- skr
- bto
- kea
- glk
- ast
- sat
- ktu
- bhb
- emk
- kng
- kmb
- tsn
- gom
- ven
- sco
- glv
- sot
- sou
- gno
- nde
- bjn
- ina
- fmu
- esg
- wes
- pnb
- phr
- mui
- bug
- mrr
- kas
- lir
- vah
- ssw
- rwr
- pcc
- hms
- wbr
- swv
- mtr
- haz
- aii
- bns
- msi
- wuu
- hsn
- bgp
- tts
- lmn
- dcc
- bew
- bjj
- ibb
- tji
- hoj
- cpx
- cdo
- daq
- mut
- nap
- czh
- gdx
- sdh
- scn
- mnp
- bar
- mzn
- gsw

</details>

## Model details

- **Developed by:** Vineel Pratap et al.
- **Model type:** Multi-lingual speech language identification (audio classification) model
- **Language(s):** 256 languages, see [supported languages](#supported-languages)
- **License:** CC-BY-NC 4.0 license
- **Num parameters**: 1 billion
- **Audio sampling rate**: 16,000 Hz (16 kHz)
- **Cite as:**

      @article{pratap2023mms,
        title={Scaling Speech Technology to 1,000+ Languages},
        author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
        journal={arXiv},
        year={2023}
      }

## Additional Links

- [Blog post](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)
- [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms)
- [Paper](https://arxiv.org/abs/2305.13516)
- [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
- [Other **MMS** checkpoints](https://huggingface.co/models?other=mms)
- MMS base checkpoints:
  - [facebook/mms-1b](https://huggingface.co/facebook/mms-1b)
  - [facebook/mms-300m](https://huggingface.co/facebook/mms-300m)
- [Official Space](https://huggingface.co/spaces/facebook/MMS)
| 556 | |