---
language: pt
license: apache-2.0
datasets:
- common_voice
- mozilla-foundation/common_voice_6_0
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
- mozilla-foundation/common_voice_6_0
- pt
- robust-speech-event
- speech
- xlsr-fine-tuning-week
model-index:
- name: XLSR Wav2Vec2 Portuguese by Jonatas Grosman
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice pt
      type: common_voice
      args: pt
    metrics:
    - name: Test WER
      type: wer
      value: 11.31
    - name: Test CER
      type: cer
      value: 3.74
    - name: Test WER (+LM)
      type: wer
      value: 9.01
    - name: Test CER (+LM)
      type: cer
      value: 3.21
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: pt
    metrics:
    - name: Dev WER
      type: wer
      value: 42.1
    - name: Dev CER
      type: cer
      value: 17.93
    - name: Dev WER (+LM)
      type: wer
      value: 36.92
    - name: Dev CER (+LM)
      type: cer
      value: 16.88
---

# Fine-tuned XLSR-53 large model for speech recognition in Portuguese

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Portuguese using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice).
When using this model, make sure that your speech input is sampled at 16 kHz.

This model was fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint
## Usage

The model can be used directly (without a language model) as follows...

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-portuguese")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pt"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```
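In the script above, `torch.argmax` picks one token id per audio frame, and `processor.batch_decode` turns those frame-level ids into text by collapsing consecutive repeats and dropping the CTC blank token. A toy sketch of that collapse step (illustrative only, not the processor's actual implementation):

```python
def greedy_ctc_collapse(frame_ids, blank_id=0):
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks."""
    decoded = []
    prev = None
    for token in frame_ids:
        if token != prev and token != blank_id:
            decoded.append(token)
        prev = token
    return decoded

# Frames: blank, 'a', 'a', blank, 'b'  ->  decodes to ['a', 'b']
print(greedy_ctc_collapse([0, 1, 1, 0, 2]))  # [1, 2]
```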

| Reference | Prediction |
| ------------- | ------------- |
| NEM O RADAR NEM OS OUTROS INSTRUMENTOS DETECTARAM O BOMBARDEIRO STEALTH. | NEMHUM VADAN OS OLTWES INSTRUMENTOS DE TTÉÃN UM BOMBERDEIRO OSTER |
| PEDIR DINHEIRO EMPRESTADO ÀS PESSOAS DA ALDEIA | E DIR ENGINHEIRO EMPRESTAR AS PESSOAS DA ALDEIA |
| OITO | OITO |
| TRANCÁ-LOS | TRANCAUVOS |
| REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA | REALIZAR UMA INVESTIGAÇÃO PARA RESOLVER O PROBLEMA |
| O YOUTUBE AINDA É A MELHOR PLATAFORMA DE VÍDEOS. | YOUTUBE AINDA É A MELHOR PLATAFOMA DE VÍDEOS |
| MENINA E MENINO BEIJANDO NAS SOMBRAS | MENINA E MENINO BEIJANDO NAS SOMBRAS |
| EU SOU O SENHOR | EU SOU O SENHOR |
| DUAS MULHERES QUE SENTAM-SE PARA BAIXO LENDO JORNAIS. | DUAS MIERES QUE SENTAM-SE PARA BAICLANE JODNÓI |
| EU ORIGINALMENTE ESPERAVA | EU ORIGINALMENTE ESPERAVA |

## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-portuguese --dataset mozilla-foundation/common_voice_6_0 --config pt --split test
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-portuguese --dataset speech-recognition-community-v2/dev_data --config pt --split validation --chunk_length_s 5.0 --stride_length_s 1.0
```
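The `--chunk_length_s 5.0 --stride_length_s 1.0` flags make the evaluation transcribe long recordings in overlapping windows rather than all at once. A rough sketch of how such windows could be laid out (`chunk_windows` is a hypothetical helper for illustration, not part of `eval.py`):

```python
def chunk_windows(total_s, chunk_s=5.0, stride_s=1.0):
    """Lay out chunk boundaries (in seconds) over a recording so that
    each window overlaps its neighbours by `stride_s` on both sides;
    the overlapped edges can then be discarded when merging text."""
    windows = []
    step = chunk_s - 2 * stride_s  # useful (non-overlapped) part per window
    start = 0.0
    while start < total_s:
        windows.append((start, min(start + chunk_s, total_s)))
        start += step
    return windows

# A 10-second recording covered by 5 s windows with 1 s stride on each side
print(chunk_windows(10.0))  # [(0.0, 5.0), (3.0, 8.0), (6.0, 10.0), (9.0, 10.0)]
```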

## Citation
If you want to cite this model, you can use this:

```bibtex
@misc{grosman2021xlsr53-large-portuguese,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}ortuguese},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese}},
  year={2021}
}
```