---
language:
- ro
license: apache-2.0
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_8_0
- gigant/romanian_speech_synthesis_0_8_1
base_model: facebook/wav2vec2-xls-r-300m
model-index:
- name: wav2vec2-ro-300m_01
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event
      type: speech-recognition-community-v2/dev_data
      args: ro
    metrics:
    - type: wer
      value: 46.99
      name: Dev WER (without LM)
    - type: cer
      value: 16.04
      name: Dev CER (without LM)
    - type: wer
      value: 38.63
      name: Dev WER (with LM)
    - type: cer
      value: 14.52
      name: Dev CER (with LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice
      type: mozilla-foundation/common_voice_8_0
      args: ro
    metrics:
    - type: wer
      value: 11.73
      name: Test WER (without LM)
    - type: cer
      value: 2.93
      name: Test CER (without LM)
    - type: wer
      value: 7.31
      name: Test WER (with LM)
    - type: cer
      value: 2.17
      name: Test CER (with LM)
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Robust Speech Event - Test Data
      type: speech-recognition-community-v2/eval_data
      args: ro
    metrics:
    - type: wer
      value: 43.23
      name: Test WER
---

You can test this model online with the [**Space for Romanian Speech Recognition**](https://huggingface.co/spaces/gigant/romanian-speech-recognition).

The model ranked **TOP-1** on Romanian Speech Recognition during Hugging Face's Robust Speech Challenge:

* [**The 🤗 Speech Bench**](https://huggingface.co/spaces/huggingface/hf-speech-bench)

* [**Speech Challenge Leaderboard**](https://huggingface.co/spaces/speech-recognition-community-v2/FinalLeaderboard)

# Romanian Wav2Vec2

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on the [Common Voice 8.0 - Romanian subset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0) dataset, with extra training data from the [Romanian Speech Synthesis](https://huggingface.co/datasets/gigant/romanian_speech_synthesis_0_8_1) dataset.

Without 5-gram language model decoding, it achieves the following results on the evaluation set (Common Voice 8.0, Romanian subset, test split):
- Loss: 0.1553
- Wer: 0.1174
- Cer: 0.0294

## Model description

The architecture is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) with a speech recognition CTC head and an added 5-gram language model (using [pyctcdecode](https://github.com/kensho-technologies/pyctcdecode) and [kenlm](https://github.com/kpu/kenlm)) trained on the [Romanian Corpora Parliament](gigant/ro_corpora_parliament_processed) dataset. Both libraries are required for the language model-boosted decoder to work.

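To illustrate what a CTC head produces before any language model is involved, here is a toy sketch of greedy CTC decoding: repeated frame predictions are merged, blank tokens are dropped, and the word delimiter becomes a space. The vocabulary, the token ids and the `ctc_greedy_decode` helper are hypothetical and only for illustration. The real model uses its own tokenizer, and the language model-boosted decoder runs a beam search with `pyctcdecode` instead.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join the tokens."""
    # 1. Merge consecutive duplicates: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove the blank symbol that CTC uses to separate repeated characters
    kept = [token_id for token_id in collapsed if token_id != blank_id]
    # 3. Map ids to characters; the delimiter token marks word boundaries
    text = "".join(id_to_token[token_id] for token_id in kept)
    return text.replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the vocabulary shipped with the model
vocab = {0: "<pad>", 1: "|", 2: "b", 3: "u", 4: "n", 5: "ă"}

# One predicted token id per audio frame (argmax of the CTC logits)
frames = [2, 2, 0, 3, 3, 4, 0, 5, 1, 1, 2, 3, 4]

decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

Greedy decoding only looks at the most likely token of each frame, which is why adding an n-gram language model on top can correct words that are acoustically plausible but unlikely in context.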
## Intended uses & limitations

The model is intended for Romanian speech recognition from audio clips sampled at **16 kHz**. The predicted text is lowercase and contains no punctuation.

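Because the output is lowercase and unpunctuated, reference transcripts usually need the same treatment before you compare them with predictions. The sketch below is one possible normalization, not the preprocessing used to train this model: the `normalize_transcript` helper and its `keep` parameter are hypothetical, and whether hyphens or apostrophes should be kept is an assumption you should check against your own data.

```python
import unicodedata


def normalize_transcript(text, keep=""):
    """Lowercase the text and drop punctuation, except characters listed in `keep`."""
    cleaned = []
    for ch in text.lower():
        # Unicode categories starting with "P" are punctuation marks
        if unicodedata.category(ch).startswith("P") and ch not in keep:
            continue
        cleaned.append(ch)
    # Collapse the whitespace runs left behind by removed punctuation
    return " ".join("".join(cleaned).split())


print(normalize_transcript("Bună ziua, ce mai faci?"))
print(normalize_transcript("Într-o zi!", keep="-"))
```

Romanian diacritics (ă, â, î, ș, ț) are letters, so they pass through unchanged.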
## How to use

The language model-boosted decoder requires the `kenlm` and `pyctcdecode` libraries. You can install them with this command:

```
pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
```

With the `transformers` framework, you can load the model with the following code:

```
from transformers import AutoProcessor, AutoModelForCTC

# Load the processor (audio preprocessing and decoding) and the CTC model
processor = AutoProcessor.from_pretrained("gigant/romanian-wav2vec2")

model = AutoModelForCTC.from_pretrained("gigant/romanian-wav2vec2")
```

Or, if you want to test the model, you can load the automatic speech recognition pipeline from `transformers` with:

```
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="gigant/romanian-wav2vec2")
```

## Example use with the `datasets` library

First, you need to load your data. This example uses the [Romanian Speech Synthesis](https://huggingface.co/datasets/gigant/romanian_speech_synthesis_0_8_1) dataset.

```
from datasets import load_dataset

dataset = load_dataset("gigant/romanian_speech_synthesis_0_8_1")
```

You can listen to the samples with the `IPython.display` library:

```
from IPython.display import Audio

# Pick a sample and play it at its native sampling rate
i = 0
sample = dataset["train"][i]
Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])
```

The model is trained to work with audio sampled at 16 kHz, so audio with a different sampling rate has to be resampled first.

In this example, the audio is sampled at 48 kHz. You can verify this by checking `dataset["train"][0]["audio"]["sampling_rate"]`.

The following code resamples the audio using the `torchaudio` library:

```
import torchaudio
import torch

# Select the sample and read its waveform and native sampling rate
i = 0
sample = dataset["train"][i]
audio = sample["audio"]["array"]
rate = sample["audio"]["sampling_rate"]

# Resample from the native rate to the 16 kHz expected by the model
resampler = torchaudio.transforms.Resample(rate, 16_000)
audio_16 = resampler(torch.tensor(audio, dtype=torch.float32)).numpy()
```

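Resampling changes the number of samples but not the duration of the clip. As a quick sanity check, you can estimate the expected length of the resampled array as sketched below. The `resampled_length` helper is hypothetical, and rounding up is an assumption about how the resampler handles fractional lengths, so allow for an off-by-one difference when comparing with `len(audio_16)`.

```python
def resampled_length(num_samples, orig_rate, target_rate=16_000):
    """Estimate the number of samples after resampling (rounded up, by assumption)."""
    # Integer ceiling division avoids floating-point rounding surprises
    return (num_samples * target_rate + orig_rate - 1) // orig_rate


# A hypothetical 3-second clip recorded at 48 kHz
num_samples_48k = 3 * 48_000
num_samples_16k = resampled_length(num_samples_48k, orig_rate=48_000)

print(num_samples_16k)            # number of samples at 16 kHz
print(num_samples_16k / 16_000)   # duration in seconds, unchanged by resampling
```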
To listen to the resampled sample:

```
Audio(audio_16, rate=16000)
```

Now you can get the model prediction by running:

```
# The pipeline returns a dict; the transcription is stored under the "text" key
predicted_text = asr(audio_16)["text"]
ground_truth = dataset["train"][i]["sentence"]

print(f"Predicted text : {predicted_text}")
print(f"Ground truth : {ground_truth}")
```

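To put a number on how close a prediction is to its ground truth, you can compute the word error rate (WER) and character error rate (CER) reported throughout this card. The sketch below uses a plain Levenshtein distance and is not the evaluation script used for the leaderboard. The `edit_distance`, `wer` and `cer` helpers and the example sentences are hypothetical, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical sentences: one wrong word out of three, one wrong character out of twelve
reference = "ana are mere"
hypothesis = "ana are pere"
print(f"WER: {wer(reference, hypothesis):.4f}")
print(f"CER: {cer(reference, hypothesis):.4f}")
```

Both metrics are fractions here, as in the training results table; the metadata at the top of the card reports them as percentages.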
## Training and evaluation data

Training data:
- [Common Voice 8.0 - Romanian subset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0): train + validation + other splits
- [Romanian Speech Synthesis](https://huggingface.co/datasets/gigant/romanian_speech_synthesis_0_8_1): train + test splits

Evaluation data:
- [Common Voice 8.0 - Romanian subset](https://huggingface.co/datasets/mozilla-foundation/common_voice_8_0): test split

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.003
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 3
- total_train_batch_size: 48
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 50.0
- mixed_precision_training: Native AMP

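The relationship between these values can be checked with a little arithmetic. The sketch below assumes training ran on a single device (the card does not state the number of GPUs) and uses the last logged step of the training results (step 32000 at epoch 49.69) to estimate the number of optimizer steps and training samples per epoch. These are rough estimates, not figures reported by the author.

```python
train_batch_size = 16
gradient_accumulation_steps = 3
num_devices = 1  # assumption: not stated in the card

# Effective batch size seen by each optimizer step
total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)

# Last logged row of the training results: step 32000 at epoch 49.69
steps_per_epoch = 32000 / 49.69
approx_samples_per_epoch = steps_per_epoch * total_train_batch_size

print(round(steps_per_epoch))
print(round(approx_samples_per_epoch))
```

The computed effective batch size matches the `total_train_batch_size` listed above, which is consistent with the single-device assumption.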
### Training results

| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|:-------------:|:-----:|:-----:|:---------------:|:------:|:------:|
| 2.9272 | 0.78 | 500 | 0.7603 | 0.7734 | 0.2355 |
| 0.6157 | 1.55 | 1000 | 0.4003 | 0.4866 | 0.1247 |
| 0.4452 | 2.33 | 1500 | 0.2960 | 0.3689 | 0.0910 |
| 0.3631 | 3.11 | 2000 | 0.2580 | 0.3205 | 0.0796 |
| 0.3153 | 3.88 | 2500 | 0.2465 | 0.2977 | 0.0747 |
| 0.2795 | 4.66 | 3000 | 0.2274 | 0.2789 | 0.0694 |
| 0.2615 | 5.43 | 3500 | 0.2277 | 0.2685 | 0.0675 |
| 0.2389 | 6.21 | 4000 | 0.2135 | 0.2518 | 0.0627 |
| 0.2229 | 6.99 | 4500 | 0.2054 | 0.2449 | 0.0614 |
| 0.2067 | 7.76 | 5000 | 0.2096 | 0.2378 | 0.0597 |
| 0.1977 | 8.54 | 5500 | 0.2042 | 0.2387 | 0.0600 |
| 0.1896 | 9.32 | 6000 | 0.2110 | 0.2383 | 0.0595 |
| 0.1801 | 10.09 | 6500 | 0.1909 | 0.2165 | 0.0548 |
| 0.174 | 10.87 | 7000 | 0.1883 | 0.2206 | 0.0559 |
| 0.1685 | 11.65 | 7500 | 0.1848 | 0.2097 | 0.0528 |
| 0.1591 | 12.42 | 8000 | 0.1851 | 0.2039 | 0.0514 |
| 0.1537 | 13.2 | 8500 | 0.1881 | 0.2065 | 0.0518 |
| 0.1504 | 13.97 | 9000 | 0.1840 | 0.1972 | 0.0499 |
| 0.145 | 14.75 | 9500 | 0.1845 | 0.2029 | 0.0517 |
| 0.1417 | 15.53 | 10000 | 0.1884 | 0.2003 | 0.0507 |
| 0.1364 | 16.3 | 10500 | 0.2010 | 0.2037 | 0.0517 |
| 0.1331 | 17.08 | 11000 | 0.1838 | 0.1923 | 0.0483 |
| 0.129 | 17.86 | 11500 | 0.1818 | 0.1922 | 0.0489 |
| 0.1198 | 18.63 | 12000 | 0.1760 | 0.1861 | 0.0465 |
| 0.1203 | 19.41 | 12500 | 0.1686 | 0.1839 | 0.0465 |
| 0.1225 | 20.19 | 13000 | 0.1828 | 0.1920 | 0.0479 |
| 0.1145 | 20.96 | 13500 | 0.1673 | 0.1784 | 0.0446 |
| 0.1053 | 21.74 | 14000 | 0.1802 | 0.1810 | 0.0456 |
| 0.1071 | 22.51 | 14500 | 0.1769 | 0.1775 | 0.0444 |
| 0.1053 | 23.29 | 15000 | 0.1920 | 0.1783 | 0.0457 |
| 0.1024 | 24.07 | 15500 | 0.1904 | 0.1775 | 0.0446 |
| 0.0987 | 24.84 | 16000 | 0.1793 | 0.1762 | 0.0446 |
| 0.0949 | 25.62 | 16500 | 0.1801 | 0.1766 | 0.0443 |
| 0.0942 | 26.4 | 17000 | 0.1731 | 0.1659 | 0.0423 |
| 0.0906 | 27.17 | 17500 | 0.1776 | 0.1698 | 0.0424 |
| 0.0861 | 27.95 | 18000 | 0.1716 | 0.1600 | 0.0406 |
| 0.0851 | 28.73 | 18500 | 0.1662 | 0.1630 | 0.0410 |
| 0.0844 | 29.5 | 19000 | 0.1671 | 0.1572 | 0.0393 |
| 0.0792 | 30.28 | 19500 | 0.1768 | 0.1599 | 0.0407 |
| 0.0798 | 31.06 | 20000 | 0.1732 | 0.1558 | 0.0394 |
| 0.0779 | 31.83 | 20500 | 0.1694 | 0.1544 | 0.0388 |
| 0.0718 | 32.61 | 21000 | 0.1709 | 0.1578 | 0.0399 |
| 0.0732 | 33.38 | 21500 | 0.1697 | 0.1523 | 0.0391 |
| 0.0708 | 34.16 | 22000 | 0.1616 | 0.1474 | 0.0375 |
| 0.0678 | 34.94 | 22500 | 0.1698 | 0.1474 | 0.0375 |
| 0.0642 | 35.71 | 23000 | 0.1681 | 0.1459 | 0.0369 |
| 0.0661 | 36.49 | 23500 | 0.1612 | 0.1411 | 0.0357 |
| 0.0629 | 37.27 | 24000 | 0.1662 | 0.1414 | 0.0355 |
| 0.0587 | 38.04 | 24500 | 0.1659 | 0.1408 | 0.0351 |
| 0.0581 | 38.82 | 25000 | 0.1612 | 0.1382 | 0.0352 |
| 0.0556 | 39.6 | 25500 | 0.1647 | 0.1376 | 0.0345 |
| 0.0543 | 40.37 | 26000 | 0.1658 | 0.1335 | 0.0337 |
| 0.052 | 41.15 | 26500 | 0.1716 | 0.1369 | 0.0343 |
| 0.0513 | 41.92 | 27000 | 0.1600 | 0.1317 | 0.0330 |
| 0.0491 | 42.7 | 27500 | 0.1671 | 0.1311 | 0.0328 |
| 0.0463 | 43.48 | 28000 | 0.1613 | 0.1289 | 0.0324 |
| 0.0468 | 44.25 | 28500 | 0.1599 | 0.1260 | 0.0315 |
| 0.0435 | 45.03 | 29000 | 0.1556 | 0.1232 | 0.0308 |
| 0.043 | 45.81 | 29500 | 0.1588 | 0.1240 | 0.0309 |
| 0.0421 | 46.58 | 30000 | 0.1567 | 0.1217 | 0.0308 |
| 0.04 | 47.36 | 30500 | 0.1533 | 0.1198 | 0.0302 |
| 0.0389 | 48.14 | 31000 | 0.1582 | 0.1185 | 0.0297 |
| 0.0387 | 48.91 | 31500 | 0.1576 | 0.1187 | 0.0297 |
| 0.0376 | 49.69 | 32000 | 0.1560 | 0.1182 | 0.0295 |


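If you want to inspect the training log programmatically, for example to see which checkpoint had the lowest validation loss or WER, the table rows are easy to parse. The sketch below copies only the last four rows of the table above; the `parse_row` helper and the column keys are hypothetical names chosen for this example.

```python
# Last four rows of the training results table, copied verbatim
rows_markdown = """
| 0.04 | 47.36 | 30500 | 0.1533 | 0.1198 | 0.0302 |
| 0.0389 | 48.14 | 31000 | 0.1582 | 0.1185 | 0.0297 |
| 0.0387 | 48.91 | 31500 | 0.1576 | 0.1187 | 0.0297 |
| 0.0376 | 49.69 | 32000 | 0.1560 | 0.1182 | 0.0295 |
"""

columns = ["training_loss", "epoch", "step", "validation_loss", "wer", "cer"]


def parse_row(line):
    """Turn one markdown table row into a dict of floats keyed by column name."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return dict(zip(columns, map(float, cells)))


rows = [parse_row(line) for line in rows_markdown.strip().splitlines()]

# The best checkpoint depends on which metric you select on
best_by_loss = min(rows, key=lambda row: row["validation_loss"])
best_by_wer = min(rows, key=lambda row: row["wer"])

print(int(best_by_loss["step"]), best_by_loss["validation_loss"])
print(int(best_by_wer["step"]), best_by_wer["wer"])
```

Among these rows, the lowest validation loss and the lowest WER fall on different steps, so the choice of selection metric matters.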
### Framework versions

- Transformers 4.16.2
- Pytorch 1.10.0+cu111
- Tokenizers 0.11.0
- pyctcdecode 0.3.0
- kenlm
