---
tags:
- mms
language:
- ab
- af
- ak
- am
- ar
- as
- av
- ay
- az
- ba
- bm
- be
- bn
- bi
- bo
- sh
- br
- bg
- ca
- cs
- ce
- cv
- ku
- cy
- da
- de
- dv
- dz
- el
- en
- eo
- et
- eu
- ee
- fo
- fa
- fj
- fi
- fr
- fy
- ff
- ga
- gl
- gn
- gu
- zh
- ht
- ha
- he
- hi
- sh
- hu
- hy
- ig
- ia
- ms
- is
- it
- jv
- ja
- kn
- ka
- kk
- kr
- km
- ki
- rw
- ky
- ko
- kv
- lo
- la
- lv
- ln
- lt
- lb
- lg
- mh
- ml
- mr
- ms
- mk
- mg
- mt
- mn
- mi
- my
- zh
- nl
- 'no'
- 'no'
- ne
- ny
- oc
- om
- or
- os
- pa
- pl
- pt
- ms
- ps
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- qu
- ro
- rn
- ru
- sg
- sk
- sl
- sm
- sn
- sd
- so
- es
- sq
- su
- sv
- sw
- ta
- tt
- te
- tg
- tl
- th
- ti
- ts
- tr
- uk
- ms
- vi
- wo
- xh
- ms
- yo
- ms
- zu
- za
license: cc-by-nc-4.0
datasets:
- google/fleurs
metrics:
- acc
---

# Massively Multilingual Speech (MMS) - Finetuned LID

This checkpoint is a model fine-tuned for speech language identification (LID) and is part of Facebook's [Massively Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
It is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and classifies raw audio input into a probability distribution over 256 output classes, each class representing a language.
The checkpoint has 1 billion parameters and was fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 256 languages.

## Table of Contents

- [Example](#example)
- [Supported Languages](#supported-languages)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to identify
the spoken language of an audio sample. It can recognize the [following 256 languages](#supported-languages).

Let's look at a simple example.

First, install Transformers and a few other libraries:
```
pip install torch accelerate torchaudio datasets
pip install --upgrade transformers
```

Note: MMS requires `transformers >= 4.30`. If version `4.30` is not yet available [on PyPI](https://pypi.org/project/transformers/), install `transformers` from
source:
```
pip install git+https://github.com/huggingface/transformers.git
```

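To confirm that the installed release meets the `4.30` minimum stated in the note above, a check along the following lines may help. This is a minimal sketch: `parse_major_minor` and `supports_mms` are hypothetical helper names introduced here for illustration, and the comparison assumes the version string starts with `major.minor`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (4, 30)  # minimum release stated in the note above


def parse_major_minor(version_string):
    """Extract (major, minor) from a string such as '4.31.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def supports_mms(version_string, minimum=MIN_TRANSFORMERS):
    """Return True if the given version is at least the required minimum."""
    return parse_major_minor(version_string) >= minimum


try:
    installed = version("transformers")
    status = "ok" if supports_mms(installed) else "too old, upgrade required"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```

Comparing `(major, minor)` tuples rather than raw strings avoids the pitfall where `"4.9"` sorts after `"4.30"` lexicographically.
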
Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).

```py
from datasets import load_dataset, Audio

# English
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
en_sample = next(iter(stream_data))["audio"]["array"]

# Arabic
stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "ar", split="test", streaming=True)
stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
ar_sample = next(iter(stream_data))["audio"]["array"]
```

Next, we load the model and processor:

```py
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

model_id = "facebook/mms-lid-256"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id)
```

Now we process the audio data and pass it to the model to classify the language, just as we would for other Wav2Vec2 audio classification models such as [harshit345/xlsr-wav2vec-speech-emotion-recognition](https://huggingface.co/harshit345/xlsr-wav2vec-speech-emotion-recognition):

```py
# English
inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs).logits

# Pick the class with the highest logit and map it to its language code
lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'eng'

# Arabic
inputs = processor(ar_sample, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs).logits

lang_id = torch.argmax(outputs, dim=-1)[0].item()
detected_lang = model.config.id2label[lang_id]
# 'ara'
```

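The model returns one logit per language, so applying a softmax gives the probability distribution over the 256 classes, which is useful when you want the runner-up languages and not only the top prediction. The following is a minimal sketch, assuming the logits have been converted to a plain Python list; `top_k_languages` and the toy values are hypothetical and not part of Transformers.

```python
import math


def top_k_languages(logits, id2label, k=3):
    """Return the k most likely (label, probability) pairs from raw logits."""
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank class indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy logits and labels for illustration only, not real model output.
toy_id2label = {0: "ara", 1: "eng", 2: "fra"}
toy_logits = [0.5, 2.0, 1.0]
top2 = top_k_languages(toy_logits, toy_id2label, k=2)
for label, prob in top2:
    print(f"{label}: {prob:.3f}")
```

With the real model, the same helper could be called as `top_k_languages(outputs[0].tolist(), model.config.id2label)`, assuming `outputs` holds the logits for a single audio sample as in the example above.
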
To see all the languages a checkpoint supports, print the language IDs as follows:
```py
model.config.id2label.values()
```

For more details about the architecture, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).

## Supported Languages

This model supports 256 languages. Click the following to toggle all supported languages of this checkpoint, listed by [ISO 639-3 code](https://en.wikipedia.org/wiki/ISO_639-3).
You can find more details about the languages and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html).
<details>
<summary>Click to toggle</summary>

- ara
- cmn
- eng
- spa
- fra
- mlg
- swe
- por
- vie
- ful
- sun
- asm
- ben
- zlm
- kor
- ind
- hin
- tuk
- urd
- aze
- slv
- mon
- hau
- tel
- swh
- bod
- rus
- tur
- heb
- mar
- som
- tgl
- tat
- tha
- cat
- ron
- mal
- bel
- pol
- yor
- nld
- bul
- hat
- afr
- isl
- amh
- tam
- hun
- hrv
- lit
- cym
- fas
- mkd
- ell
- bos
- deu
- sqi
- jav
- kmr
- nob
- uzb
- snd
- lat
- nya
- grn
- mya
- orm
- lin
- hye
- yue
- pan
- jpn
- kaz
- npi
- kik
- kat
- guj
- kan
- tgk
- ukr
- ces
- lav
- bak
- khm
- fao
- glg
- ltz
- xog
- lao
- mlt
- sin
- aka
- sna
- ita
- srp
- mri
- nno
- pus
- eus
- ory
- lug
- bre
- luo
- slk
- ewe
- fin
- rif
- dan
- yid
- yao
- mos
- hne
- est
- dyu
- bam
- uig
- sck
- tso
- mup
- ctg
- ceb
- war
- bbc
- vmw
- sid
- tpi
- mag
- san
- kri
- lon
- kir
- run
- ubl
- kin
- rkt
- xmm
- tir
- mai
- nan
- nyn
- bcc
- hak
- suk
- bem
- rmy
- awa
- pcm
- bgc
- shn
- oci
- wol
- bci
- kab
- ilo
- bcl
- haw
- mad
- nod
- sag
- sas
- jam
- mey
- shi
- hil
- ace
- kam
- min
- umb
- hno
- ban
- syl
- bxg
- xho
- mww
- epo
- tzm
- zul
- ibo
- abk
- guz
- ckb
- knc
- nso
- bho
- dje
- tiv
- gle
- lua
- skr
- bto
- kea
- glk
- ast
- sat
- ktu
- bhb
- emk
- kng
- kmb
- tsn
- gom
- ven
- sco
- glv
- sot
- sou
- gno
- nde
- bjn
- ina
- fmu
- esg
- wes
- pnb
- phr
- mui
- bug
- mrr
- kas
- lir
- vah
- ssw
- rwr
- pcc
- hms
- wbr
- swv
- mtr
- haz
- aii
- bns
- msi
- wuu
- hsn
- bgp
- tts
- lmn
- dcc
- bew
- bjj
- ibb
- tji
- hoj
- cpx
- cdo
- daq
- mut
- nap
- czh
- gdx
- sdh
- scn
- mnp
- bar
- mzn
- gsw

</details>

## Model details

- Developed by: Vineel Pratap et al.
- Model type: Multilingual speech language identification (LID) model
- Language(s): 256 languages, see [supported languages](#supported-languages)
- License: CC-BY-NC 4.0
- Num parameters: 1 billion
- Audio sampling rate: 16,000 Hz (16 kHz)
- Cite as:

      @article{pratap2023mms,
        title={Scaling Speech Technology to 1,000+ Languages},
        author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
        journal={arXiv},
        year={2023}
      }

## Additional Links

- [Blog post](https://ai.facebook.com/blog/multilingual-model-speech-recognition/)
- [Transformers documentation](https://huggingface.co/docs/transformers/main/en/model_doc/mms)
- [Paper](https://arxiv.org/abs/2305.13516)
- [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms#asr)
- [Other MMS checkpoints](https://huggingface.co/models?other=mms)
- MMS base checkpoints:
  - [facebook/mms-1b](https://huggingface.co/facebook/mms-1b)
  - [facebook/mms-300m](https://huggingface.co/facebook/mms-300m)
- [Official Space](https://huggingface.co/spaces/facebook/MMS)