---
language:
- id
- jv
- sun
datasets:
- mozilla-foundation/common_voice_7_0
- openslr
- magic_data
- titml
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
- id
- jv
- robust-speech-event
- speech
- su
license: apache-2.0
model-index:
- name: Wav2Vec2 Indonesian Javanese and Sundanese by Indonesian NLP
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 6.1
      type: common_voice
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 4.056
    - name: Test CER
      type: cer
      value: 1.472
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 7
      type: mozilla-foundation/common_voice_7_0
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 4.492
    - name: Test CER
      type: cer
      value: 1.577
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 48.94
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Test Data
      type: speech-recognition-community-v2/eval_data
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 68.95
---

# Multilingual Speech Recognition for Indonesian Languages

This is the model built for the project
[Multilingual Speech Recognition for Indonesian Languages](https://github.com/indonesian-nlp/multilingual-asr).
It is a [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
model fine-tuned on the [Indonesian Common Voice dataset](https://huggingface.co/datasets/common_voice),
[High-quality TTS data for Javanese - SLR41](https://huggingface.co/datasets/openslr), and
[High-quality TTS data for Sundanese - SLR44](https://huggingface.co/datasets/openslr) datasets.

We also provide a [live demo](https://huggingface.co/spaces/indonesian-nlp/multilingual-asr) to test the model.

When using this model, make sure that your speech input is sampled at 16 kHz.
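
A quick way to catch a mismatched sampling rate before inference is to read it from the file header. The sketch below uses Python's standard `wave` module, so it only applies to uncompressed WAV files. `check_sample_rate` and `EXPECTED_RATE` are hypothetical names introduced for illustration; they are not part of this model's code.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, in Hz


def check_sample_rate(path, expected=EXPECTED_RATE):
    """Return (matches, actual_rate) for the WAV file at `path`."""
    with wave.open(path, "rb") as wav_file:
        actual = wav_file.getframerate()
    return actual == expected, actual
```

If the check fails, resample the audio to 16 kHz (for example with `torchaudio.transforms.Resample`) before passing it to the processor.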

## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "id", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")

# Resample from 48 kHz (the rate this example assumes for the source clips)
# to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token for every frame
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
```
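
Decoding without a language model, as in the usage code, amounts to greedy CTC decoding: take the most likely token per frame (the `torch.argmax` step), merge consecutive repeats, and drop the blank token. `processor.batch_decode` does the equivalent internally. The sketch below only illustrates the idea on plain Python lists; `ctc_greedy_collapse`, the toy vocabulary, and the blank id are made up for the example and are not taken from this model's tokenizer.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a token id sequence (greedy CTC)."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            tokens.append(token_id)
        previous = token_id
    return tokens


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "s", 2: "a", 3: "t", 4: "u"}
frame_ids = [0, 1, 1, 0, 2, 2, 2, 3, 0, 4, 4]
decoded = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(decoded)  # satu
```

Note that a blank between two identical ids keeps both, which is how CTC represents genuinely doubled characters.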


## Evaluation

The model can be evaluated as follows on the Indonesian test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "id", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model.to("cuda")

# Punctuation stripped from the reference sentences before scoring
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\'\”\�]'

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on each batch and decode the predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

Test Result: 11.57 %
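
For reference, WER is the word-level edit distance between prediction and reference, divided by the number of reference words. The following is a minimal pure-Python sketch of that definition for a single sentence pair. It is not the implementation behind `load_metric("wer")`, and `word_error_rate` is a name chosen for this example.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference prefix seen so far
    # and the first j predicted words.
    previous_row = list(range(len(pred_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, pred_word in enumerate(pred_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != pred_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word out of four reference words.
print(word_error_rate("saya suka makan nasi", "saya suka makan roti"))  # 0.25
```

Over a whole test set, errors and reference words are typically summed across all sentences before dividing, rather than averaging per-sentence rates.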

## Training

The Common Voice `train`, `validation`, and ... datasets were used for training as well as ... and ... # TODO

The script used for training can be found [here](https://github.com/cahya-wirawan/indonesian-speech-recognition)
(will be available soon)