---
language: pl
license: apache-2.0
datasets:
- common_voice
- mozilla-foundation/common_voice_6_0
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
- mozilla-foundation/common_voice_6_0
- pl
- robust-speech-event
- speech
- xlsr-fine-tuning-week
model-index:
- name: XLSR Wav2Vec2 Polish by Jonatas Grosman
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice pl
      type: common_voice
      args: pl
    metrics:
    - name: Test WER
      type: wer
      value: 14.21
    - name: Test CER
      type: cer
      value: 3.49
    - name: Test WER (+LM)
      type: wer
      value: 10.98
    - name: Test CER (+LM)
      type: cer
      value: 2.93
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: pl
    metrics:
    - name: Dev WER
      type: wer
      value: 33.18
    - name: Dev CER
      type: cer
      value: 15.92
    - name: Dev WER (+LM)
      type: wer
      value: 29.31
    - name: Dev CER (+LM)
      type: cer
      value: 15.17
---

# Fine-tuned XLSR-53 large model for speech recognition in Polish

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Polish using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice).
When using this model, make sure that your speech input is sampled at 16 kHz.
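If your audio is stored at another rate, resample it before inference. Below is a minimal, illustrative resampling sketch using `scipy.signal.resample_poly` (the 48 kHz input rate and silent waveform are stand-ins; `librosa.load(..., sr=16_000)` in the inference script further down does this directly from a file):

```python
import numpy as np
from scipy.signal import resample_poly

# Stand-in for a loaded waveform: 1 second of silence at 48 kHz.
# In practice this array would come from your audio file.
audio_48k = np.zeros(48_000, dtype=np.float32)

# 48 kHz -> 16 kHz is an exact 3:1 ratio, so up=1, down=3.
audio_16k = resample_poly(audio_48k, up=1, down=3)

print(len(audio_16k))  # 16000 samples = 1 second at 16 kHz
```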

This model was fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

## Usage

The model can be used directly (without a language model) as follows...

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-polish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pl"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-polish"
SAMPLES = 5

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```

| Reference | Prediction |
| ------------- | ------------- |
| "CZY DRZWI BYŁY ZAMKNIĘTE?" | PRZY DRZWI BYŁY ZAMKNIĘTE |
| GDZIEŻ TU POWÓD DO WYRZUTÓW? | WGDZIEŻ TO POM DO WYRYDÓ |
| "O TEM JEDNAK NIE BYŁO MOWY." | O TEM JEDNAK NIE BYŁO MOWY |
| LUBIĘ GO. | LUBIĄ GO |
| — TO MI NIE POMAGA. | TO MNIE NIE POMAGA |
| WCIĄŻ LUDZIE WYSIADAJĄ PRZED ZAMKIEM, Z MIASTA, Z PRAGI. | WCIĄŻ LUDZIE WYSIADAJĄ PRZED ZAMKIEM Z MIASTA Z PRAGI |
| ALE ON WCALE INACZEJ NIE MYŚLAŁ. | ONY MONITCENIE PONACZUŁA NA MASU |
| A WY, CO TAK STOICIE? | A WY CO TAK STOICIE |
| A TEN PRZYRZĄD DO CZEGO SŁUŻY? | A TEN PRZYRZĄD DO CZEGO SŁUŻY |
| NA JUTRZEJSZYM KOLOKWIUM BĘDZIE PIĘĆ PYTAŃ OTWARTYCH I TEST WIELOKROTNEGO WYBORU. | NAJUTRZEJSZYM KOLOKWIUM BĘDZIE PIĘĆ PYTAŃ OTWARTYCH I TEST WIELOKROTNEGO WYBORU |

## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_6_0` with split `test`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-polish --dataset mozilla-foundation/common_voice_6_0 --config pl --split test
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-polish --dataset speech-recognition-community-v2/dev_data --config pl --split validation --chunk_length_s 5.0 --stride_length_s 1.0
```
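The reported WER and CER are word- and character-level edit distances divided by the reference length. As a reference point, here is a small dependency-free sketch of both metrics (the actual `eval.py` may apply extra text normalization, so treat this only as an illustration of the formulas):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev = d[0]                       # distance(ref[:i-1], hyp[:j-1])
        d[0] = i
        for j, h in enumerate(hyp, start=1):
            cur = d[j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (or exact match)
            prev = cur
    return d[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
# Using one row of the transcription table above:
print(round(cer("LUBIĘ GO.", "LUBIĄ GO"), 3))                         # 0.222
```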

## Citation

If you want to cite this model, you can use:

```bibtex
@misc{grosman2021xlsr53-large-polish,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {P}olish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-polish}},
  year={2021}
}
```