---
language: fa
datasets:
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Persian by Jonatas Grosman
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice fa
      type: common_voice
      args: fa
    metrics:
    - name: Test WER
      type: wer
      value: 30.12
    - name: Test CER
      type: cer
      value: 7.37
---

# Fine-tuned XLSR-53 large model for speech recognition in Persian

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Persian using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice).
When using this model, make sure that your speech input is sampled at 16 kHz.
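
As a quick sanity check before transcription, the sample rate of a WAV file can be read with Python's standard `wave` module. The sketch below is only illustrative: the helper name `is_16khz_wav` and the constant `EXPECTED_SAMPLE_RATE` are hypothetical, and it handles uncompressed PCM WAV files only. Other formats such as MP3 would need an audio library, for example `librosa.load(path, sr=16_000)`, which resamples on load.

```python
import wave

EXPECTED_SAMPLE_RATE = 16_000  # the rate this model expects


def is_16khz_wav(path: str) -> bool:
    """Return True if the WAV file at `path` is sampled at 16 kHz.

    Hypothetical helper for illustration; works only for PCM WAV files.
    """
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == EXPECTED_SAMPLE_RATE
```

Files that fail this check should be resampled to 16 kHz before being passed to the model.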

This model was fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

## Usage

The model can be used directly (without a language model) as follows.

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-persian")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fa"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays (resampled to 16 kHz on load)
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

# Forward pass without gradient tracking
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: most likely token per frame, then decode to text
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```

| Reference | Prediction |
| ------------- | ------------- |
| از مهمونداری کنار بکشم | از مهمانداری کنار بکشم |
| برو از مهرداد بپرس. | برو از ماقدعاد به پرس |
| خب ، تو چیكار می كنی؟ | خوب تو چیکار می کنی |
| مسقط پایتخت عمان در عربی به معنای محل سقوط است | مسقط پایتخت عمان در عربی به بعنای محل سقوط است |
| آه، نه اصلاُ! | اهنه اصلا |
| توانست | توانست |
| قصیده فن شعر میگوید ای دوستان | قصیده فن شعر میگوید ایدوستون |
| دو استایل متفاوت دارین | دوبوست داریل و متفاوت بری |
| دو روز قبل از کریسمس ؟ | اون مفتود پش پشش |
| ساعت های کاری چیست؟ | این توری که موشیکل خب |
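
The last two steps of the inference script (`torch.argmax` followed by `processor.batch_decode`) amount to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. Below is a toy sketch in plain Python. The four-symbol vocabulary, the blank id of 0, and the function name `greedy_ctc_decode` are all made up for illustration; the real vocabulary and blank/pad id come from the model's processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<blank>", "a", "b", " "]
BLANK_ID = 0


def greedy_ctc_decode(frame_scores):
    """Decode a list of per-frame score lists into a string."""
    # 1. argmax over the vocabulary for every frame
    best_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # 2. collapse consecutive repeats
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. drop blanks and map the remaining ids to symbols
    return "".join(VOCAB[i] for i in collapsed if i != BLANK_ID)


# Six frames whose argmax ids are [1, 1, 0, 1, 2, 2]
frames = [
    [0.1, 0.8, 0.1, 0.0],
    [0.2, 0.7, 0.1, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.6, 0.3, 0.0],
    [0.1, 0.2, 0.7, 0.0],
    [0.0, 0.1, 0.9, 0.0],
]
print(greedy_ctc_decode(frames))  # aab
```

The blank between the two runs of `a` is what lets the decoder keep a genuine double letter instead of merging it into one.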

## Evaluation

The model can be evaluated on the Persian test split of Common Voice as follows.

```python
import torch
import re
import warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fa"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-persian"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    # Strip ignored punctuation from the reference transcription
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and greedy-decode the predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
```
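
To see what the reference normalization in `speech_file_to_array_fn` does, here is a small standalone sketch that builds the same kind of character-class regex from a shortened, illustrative ignore list. The helper name `normalize` is hypothetical, and the evaluation itself uses the full `CHARS_TO_IGNORE` list.

```python
import re

# Shortened ignore list, for illustration only
chars_to_ignore = [",", "?", ".", "!", "؟", "،"]
chars_to_ignore_regex = f"[{re.escape(''.join(chars_to_ignore))}]"


def normalize(sentence: str) -> str:
    """Strip ignored punctuation; upper() leaves Persian script unchanged."""
    return re.sub(chars_to_ignore_regex, "", sentence).upper()


# Prints the sentence without its trailing period
print(normalize("برو از مهرداد بپرس."))
```

Because punctuation is removed from the references before scoring, a prediction is not penalized for omitting it.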

Test Result:

In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-04-22). Note that the table below may show results that differ from those already reported; this may be due to particularities of the other evaluation scripts used.

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| jonatasgrosman/wav2vec2-large-xlsr-53-persian | 30.12% | 7.37% |
| m3hrdadfi/wav2vec2-large-xlsr-persian-v2 | 33.85% | 8.79% |
| m3hrdadfi/wav2vec2-large-xlsr-persian | 34.37% | 8.98% |
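
For reference, both metrics are edit-distance ratios: WER is the number of word-level substitutions, deletions and insertions divided by the number of reference words, and CER is the same ratio computed over characters. A minimal pure-Python sketch follows. It is not the `wer.py`/`cer.py` implementation used for the numbers above, which may differ in details such as chunking or whitespace handling, and all function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def error_rate(references, predictions, tokenize):
    """Corpus-level rate: total edits / total reference tokens."""
    edits = sum(edit_distance(tokenize(r), tokenize(p)) for r, p in zip(references, predictions))
    total = sum(len(tokenize(r)) for r in references)
    return edits / total


def wer(references, predictions):
    return error_rate(references, predictions, str.split)


def cer(references, predictions):
    return error_rate(references, predictions, list)


refs = ["the cat sat down"]
preds = ["the cat sit down"]
print(f"WER: {wer(refs, preds) * 100:.2f}")  # 1 of 4 words wrong -> 25.00
print(f"CER: {cer(refs, preds) * 100:.2f}")  # 1 of 16 characters wrong -> 6.25
```

A single wrong character costs a whole word under WER but only one character under CER, which is why the CER values in the table are much lower than the WER values.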

## Citation
If you want to cite this model you can use this:

```bibtex
@misc{grosman2021xlsr53-large-persian,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}ersian},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-persian}},
  year={2021}
}
```