---
language:
- id
- jv
- sun
datasets:
- mozilla-foundation/common_voice_7_0
- openslr
- magic_data
- titml
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
- id
- jv
- robust-speech-event
- speech
- su
license: apache-2.0
model-index:
- name: Wav2Vec2 Indonesian Javanese and Sundanese by Indonesian NLP
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 6.1
      type: common_voice
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 4.056
    - name: Test CER
      type: cer
      value: 1.472
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 7
      type: mozilla-foundation/common_voice_7_0
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 4.492
    - name: Test CER
      type: cer
      value: 1.577
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 48.94
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Test Data
      type: speech-recognition-community-v2/eval_data
      args: id
    metrics:
    - name: Test WER
      type: wer
      value: 68.95
---

# Multilingual Speech Recognition for Indonesian Languages

This is the model built for the project
[Multilingual Speech Recognition for Indonesian Languages](https://github.com/indonesian-nlp/multilingual-asr).
It is a [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
model fine-tuned on the [Indonesian Common Voice dataset](https://huggingface.co/datasets/common_voice),
[High-quality TTS data for Javanese - SLR41](https://huggingface.co/datasets/openslr), and
[High-quality TTS data for Sundanese - SLR44](https://huggingface.co/datasets/openslr).

We also provide a [live demo](https://huggingface.co/spaces/indonesian-nlp/multilingual-asr) to test the model.

When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "id", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")

# The resampler assumes 48 kHz input clips and converts them to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
```
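
The `argmax` followed by `batch_decode` in the snippet above amounts to greedy CTC decoding: consecutive repeated ids are merged, then the blank token is dropped. The following is a rough sketch of that idea in plain Python, not the library's actual implementation. The function name `ctc_greedy_collapse`, the toy vocabulary, and the choice of id 0 as the blank are all assumptions for illustration.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive repeats, then drop blanks (greedy CTC decoding)."""
    return [token for token, _ in groupby(ids) if token != blank_id]


# Hypothetical toy vocabulary; the real one ships with the processor.
vocab = {0: "<pad>", 1: "a", 2: "p", 3: "i"}

# Per-frame argmax ids, as they might come out of the model for a short clip.
frame_ids = [1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

decoded = "".join(vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(decoded)  # api
```

A blank between two identical ids keeps them separate, so `[1, 0, 1]` collapses to `[1, 1]` rather than `[1]`; this is how doubled letters survive decoding.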

## Evaluation

The model can be evaluated as follows on the Indonesian test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "id", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model = Wav2Vec2ForCTC.from_pretrained("indonesian-nlp/wav2vec2-indonesian-javanese-sundanese")
model.to("cuda")

# Punctuation stripped from the reference transcripts before scoring.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\'\”\�]'

# The resampler assumes 48 kHz input clips and converts them to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 11.57 %
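
For readers unfamiliar with the metric, WER is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below works through one sentence in plain Python. It is illustrative only: `normalize` and `word_error_rate` are hypothetical helpers, the punctuation set is reduced to ASCII for brevity, the example sentence is made up, and the function scores a single utterance, whereas the `wer` metric used above aggregates over the whole test set.

```python
import re

# Reduced, ASCII-only punctuation set (an assumption for this sketch).
PUNCTUATION = r'[,?.!\-;:"%\']'


def normalize(text):
    """Strip punctuation and lowercase, mirroring the preprocessing step."""
    return re.sub(PUNCTUATION, "", text).lower()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = normalize("Saya suka makan nasi goreng.")
hypothesis = normalize("saya suka makan nasi")
print("WER: {:.2f}".format(100 * word_error_rate(reference, hypothesis)))  # WER: 20.00
```

Here one of five reference words is missing from the hypothesis, so the score is one deletion over five words.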

## Training

The Common Voice `train`, `validation`, and ... datasets were used for training as well as ... and ... # TODO

The script used for training can be found [here](https://github.com/cahya-wirawan/indonesian-speech-recognition)
(will be available soon)