---
language: hu
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Hungarian by Jonatas Grosman
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice hu
      type: common_voice
      args: hu
    metrics:
    - name: Test WER
      type: wer
      value: 31.40
    - name: Test CER
      type: cer
      value: 6.20
---

# Fine-tuned XLSR-53 large model for speech recognition in Hungarian

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Hungarian using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice) and [CSS10](https://github.com/Kyubyong/css10).
When using this model, make sure that your speech input is sampled at 16 kHz.
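
As a quick sanity check on the sampling-rate requirement, the sketch below reads the rate from a WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or any library, and the `wave` module only handles uncompressed WAV files, so treat this as an illustration rather than a general audio loader.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # rate expected by the model, per the note above


def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True when the file's rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate


# Demo on a throwaway file: one second of 8 kHz, 16-bit mono silence.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_8khz.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(8_000)
        wav_file.writeframes(b"\x00\x00" * 8_000)
    print(wav_sample_rate(demo_path), needs_resampling(demo_path))  # 8000 True
```

Files flagged this way can be resampled while loading, for example with `librosa.load(path, sr=16_000)` as the scripts in this README do.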

This model has been fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

## Usage

The model can be used directly (without a language model) as follows.

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-hungarian")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "hu"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```

| Reference | Prediction |
| ------------- | ------------- |
| BÜSZKÉK VAGYUNK A MAGYAR EMBEREK NAGYSZERŰ SZELLEMI ALKOTÁSAIRA. | BÜSZKÉK VAGYUNK A MAGYAR EMBEREK NAGYSZERŰ SZELLEMI ALKOTÁSAIRE |
| A NEMZETSÉG TAGJAI KÖZÜL EZT TERMESZTIK A LEGSZÉLESEBB KÖRBEN ÍZLETES TERMÉSÉÉRT. | A NEMZETSÉG TAGJAI KÖZÜL ESZSZERMESZTIK A LEGSZELESEBB KÖRBEN IZLETES TERMÉSSÉÉRT |
| A VÁROSBA VÁGYÓDOTT A LEGJOBBAN, ÉPPEN MERT ODA NEM JUTHATOTT EL SOHA. | A VÁROSBA VÁGYÓDOTT A LEGJOBBAN ÉPPEN MERT ODA NEM JUTHATOTT EL SOHA |
| SÍRJA MÁRA MEGSEMMISÜLT. | SIMGI A MANDO MEG SEMMICSEN |
| MINDEN ZENESZÁMOT DRÁGAKŐNEK NEVEZETT. | MINDEN ZENA SZÁMODRAGAKŐNEK NEVEZETT |
| ÍGY MÚLT EL A DÉLELŐTT. | ÍGY MÚLT EL A DÍN ELŐTT |
| REMEK POFA! | A REMEG PUFO |
| SZEMET SZEMÉRT, FOGAT FOGÉRT. | SZEMET SZEMÉRT FOGADD FOGÉRT |
| BIZTOSAN LAKIK ITT NÉHÁNY ATYÁMFIA. | BIZTOSAN LAKIKÉT NÉHANY ATYAMFIA |
| A SOROK KÖZÖTT OLVAS. | A SOROG KÖZÖTT OLVAS |
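
Taking the `argmax` over the logits and then decoding, as the inference script does, amounts to greedy CTC decoding: pick the most likely token per audio frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that idea on a toy example. The function `ctc_greedy_decode`, the `toy_vocab` mapping, and the blank id of 0 are hypothetical values chosen for illustration; they are not the model's real vocabulary or the tokenizer's actual implementation.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop blank tokens."""
    chars = []
    previous_id = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous_id and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous_id = token_id
    return "".join(chars)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<blank>", 1: "O", 2: "L", 3: "V", 4: "A", 5: "S"}

# One id per audio frame, as produced by an argmax over the logits.
frame_ids = [1, 1, 0, 2, 3, 3, 0, 4, 5, 5]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # OLVAS
```

A blank between two identical ids keeps both characters, so `[2, 0, 2]` decodes to `LL` while `[2, 2]` decodes to `L`.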

## Evaluation

The model can be evaluated as follows on the Hungarian test data of Common Voice.

```python
import torch
import re
import warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "hu"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-hungarian"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    # Silence audio-loading warnings while reading each file
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference on each batch and decoding the predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
```
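
The reference normalization in the evaluation script (strip ignored characters, then upper-case) can be tried in isolation. The sketch below uses a shortened subset of `CHARS_TO_IGNORE` purely to keep the example small; the names `CHARS_TO_IGNORE_SUBSET` and `normalize_sentence` are hypothetical and only serve the illustration.

```python
import re

# Shortened, illustrative subset of the CHARS_TO_IGNORE list used above.
CHARS_TO_IGNORE_SUBSET = [",", "?", ".", "!", ";", ":", '"', "(", ")", "[", "]"]

# re.escape keeps characters such as "]" from closing the character class early.
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE_SUBSET))}]"


def normalize_sentence(sentence):
    """Drop the ignored characters, then upper-case the remaining text."""
    return re.sub(chars_to_ignore_regex, "", sentence).upper()


print(normalize_sentence("Szemet szemért, fogat fogért."))  # SZEMET SZEMÉRT FOGAT FOGÉRT
```

Normalizing references this way keeps punctuation, which the model does not emit, from being counted as errors.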

Test Result:

In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-04-22). Note that the table below may show results that differ from those already reported elsewhere; this may be due to specifics of the other evaluation scripts used.

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| jonatasgrosman/wav2vec2-large-xlsr-53-hungarian | 31.40% | 6.20% |
| anton-l/wav2vec2-large-xlsr-53-hungarian | 42.39% | 9.39% |
| gchhablani/wav2vec2-large-xlsr-hu | 46.42% | 10.04% |
| birgermoell/wav2vec2-large-xlsr-hungarian | 46.93% | 10.31% |
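
For readers unfamiliar with the two metrics, both are edit distances divided by the reference length: WER counts word-level edits, CER counts character-level edits. The sketch below is a minimal, self-contained illustration for a single sentence pair. It is not the `wer.py` or `cer.py` used for the table above, which may aggregate over the whole test set differently, and it assumes spaces count as characters for CER. The function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (lists or strings)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_item != hyp_item)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edits divided by the number of reference words."""
    reference_words = reference.split()
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


def char_error_rate(reference, hypothesis):
    """Character-level edits divided by the reference length (spaces included)."""
    return edit_distance(reference, hypothesis) / len(reference)


# Sentence pair from the sample output earlier in this README, punctuation removed.
reference = "A SOROK KÖZÖTT OLVAS"
prediction = "A SOROG KÖZÖTT OLVAS"
print(word_error_rate(reference, prediction))  # 0.25 (1 wrong word out of 4)
print(char_error_rate(reference, prediction))  # 0.05 (1 wrong character out of 20)
```

A single wrong character costs a whole word in WER but only one character in CER, which is why CER is typically much lower than WER.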

## Citation
If you want to cite this model you can use this:

```bibtex
@misc{grosman2021xlsr53-large-hungarian,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {H}ungarian},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-hungarian}},
  year={2021}
}
```