---
language: ar
datasets:
- common_voice
- arabic_speech_corpus
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Arabic by Jonatas Grosman
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ar
      type: common_voice
      args: ar
    metrics:
    - name: Test WER
      type: wer
      value: 39.59
    - name: Test CER
      type: cer
      value: 18.18
---

# Fine-tuned XLSR-53 large model for speech recognition in Arabic

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Arabic using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice) and the [Arabic Speech Corpus](https://huggingface.co/datasets/arabic_speech_corpus).
When using this model, make sure that your speech input is sampled at 16 kHz.

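If your audio was recorded at a different rate, resample it to 16 kHz before feeding it to the model. Note that `librosa.load(..., sr=16_000)` in the examples below already does this for you when loading from a file; if you are working with raw arrays instead, a minimal sketch using `scipy` (the 48 kHz source rate and the synthetic tone are placeholders, not part of this model card's pipeline) could look like:

```python
import numpy as np
from scipy.signal import resample_poly

# Placeholder input: one second of a 440 Hz tone "recorded" at 48 kHz
orig_sr, target_sr = 48_000, 16_000
signal = np.sin(2 * np.pi * 440 * np.arange(orig_sr) / orig_sr)

# Polyphase resampling from 48 kHz down to 16 kHz (ratio 1:3)
speech_16k = resample_poly(signal, up=target_sr, down=orig_sr)
print(len(speech_16k))  # 16000 samples, i.e. one second at 16 kHz
```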
This model has been fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

## Usage

The model can be used directly (without a language model) as follows...

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import librosa
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ar"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```
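The `torch.argmax` plus `processor.batch_decode` step above is greedy CTC decoding: the best token is picked per frame, consecutive repeats are merged, and blank tokens are removed. A toy sketch of that collapse logic (the frame sequence below is made up for illustration; it is not the model's actual output):

```python
# Toy greedy CTC collapse: merge consecutive repeats, then drop blanks.
BLANK = "<pad>"  # wav2vec2 models use the padding token as the CTC blank

def ctc_collapse(frame_tokens):
    collapsed = []
    previous = None
    for token in frame_tokens:
        if token != previous:  # keep only the first of each run of repeats
            collapsed.append(token)
        previous = token
    return "".join(t for t in collapsed if t != BLANK)

# Hypothetical per-frame argmax output for a short utterance
frames = ["<pad>", "ق", "ق", "<pad>", "ل", "م", "م", "<pad>"]
print(ctc_collapse(frames))  # قلم
```

Separating repeated characters with a blank frame is what lets CTC emit genuinely doubled letters, which is why repeats are merged before blanks are removed and not the other way around.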

| Reference | Prediction |
| ------------- | ------------- |
| ألديك قلم ؟ | ألديك قلم |
| ليست هناك مسافة على هذه الأرض أبعد من يوم أمس. | ليست نالك مسافة على هذه الأرض أبعد من يوم الأمس م |
| إنك تكبر المشكلة. | إنك تكبر المشكلة |
| يرغب أن يلتقي بك. | يرغب أن يلتقي بك |
| إنهم لا يعرفون لماذا حتى. | إنهم لا يعرفون لماذا حتى |
| سيسعدني مساعدتك أي وقت تحب. | سيسئدنيمساعدتك أي وقد تحب |
| أَحَبُّ نظريّة علمية إليّ هي أن حلقات زحل مكونة بالكامل من الأمتعة المفقودة. | أحب نظرية علمية إلي هي أن حل قتزح المكوينا بالكامل من الأمت عن المفقودة |
| سأشتري له قلماً. | سأشتري له قلما |
| أين المشكلة ؟ | أين المشكل |
| وَلِلَّهِ يَسْجُدُ مَا فِي السَّمَاوَاتِ وَمَا فِي الْأَرْضِ مِنْ دَابَّةٍ وَالْمَلَائِكَةُ وَهُمْ لَا يَسْتَكْبِرُونَ | ولله يسجد ما في السماوات وما في الأرض من دابة والملائكة وهم لا يستكبرون |

## Evaluation

The model can be evaluated as follows on the Arabic test data of Common Voice.

```python
import re
import warnings

import librosa
import torch
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ar"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py")  # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference and store the predicted transcriptions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
```
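The `wer.py`/`cer.py` files linked above are custom metric loaders, but the underlying metric is the standard one: word-level (or character-level, for CER) edit distance divided by the reference length. A minimal self-contained sketch of WER (for illustration; not the exact implementation used above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words -> 0.333...
```

CER is the same computation over characters instead of words, which is why it is usually lower than WER on the same output.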

**Test Result**:

The table below reports the Word Error Rate (WER) and Character Error Rate (CER) of the model. I also ran the evaluation script described above on other models (on 2021-05-14). Note that the table may show results that differ from those already reported elsewhere; this may be due to specificities of the other evaluation scripts used.

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| jonatasgrosman/wav2vec2-large-xlsr-53-arabic | **39.59%** | **18.18%** |
| bakrianoo/sinai-voice-ar-stt | 45.30% | 21.84% |
| othrif/wav2vec2-large-xlsr-arabic | 45.93% | 20.51% |
| kmfoda/wav2vec2-large-xlsr-arabic | 54.14% | 26.07% |
| mohammed/wav2vec2-large-xlsr-arabic | 56.11% | 26.79% |
| anas/wav2vec2-large-xlsr-arabic | 62.02% | 27.09% |
| elgeish/wav2vec2-large-xlsr-53-arabic | 100.00% | 100.56% |

## Citation

If you want to cite this model, you can use:

```bibtex
@misc{grosman2021xlsr53-large-arabic,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {A}rabic},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-arabic}},
  year={2021}
}
```