---
language:
- hi
metrics:
- wer
base_model:
- facebook/wav2vec2-large-xlsr-53
pipeline_tag: automatic-speech-recognition
---

# Wav2Vec2-Large-XLSR-53-hindi

[facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) fine-tuned on Hindi using data from the [Multilingual and code-switching ASR challenges for low resource Indian languages](https://navana-tech.github.io/IS21SS-indicASRchallenge/data.html).
When using this model, make sure that your speech input is sampled at 16 kHz.
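
A quick way to confirm the sampling rate before running inference is to read it from the file header. The sketch below is a minimal illustration using only the Python standard library, and it assumes the input is an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of this model or of any library.

```python
import wave

# Rate the model expects, as stated above.
TARGET_SAMPLE_RATE = 16_000


def wav_sample_rate(path):
    """Return the sampling rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns `True`, the audio should be resampled to 16 kHz before it is passed to the processor.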

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("theainerd/Wav2Vec2-large-xlsr-hindi")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/Wav2Vec2-large-xlsr-hindi")
# Resample from 48 kHz (the rate this script assumes for the input audio) to 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token for every frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
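
To make the decoding step above more concrete: the per-frame `argmax` yields one token id per audio frame, and CTC-style decoding then collapses consecutive repeats and drops the blank token. The following is a simplified sketch of that idea, not the actual implementation inside `transformers`; the function name `greedy_ctc_decode`, the toy vocabulary, and the choice of `blank_id=0` are assumptions made purely for illustration.

```python
def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary; the real model ships its own vocabulary.
toy_vocab = {0: "<pad>", 1: "न", 2: "म", 3: "स", 4: "्", 5: "त", 6: "े"}

# One id per frame, as produced by a per-frame argmax over the logits.
frames = [1, 1, 0, 2, 2, 2, 0, 3, 4, 5, 5, 6]
print(greedy_ctc_decode(frames, toy_vocab))
```

A blank between two identical ids keeps both characters, which is how CTC represents genuinely doubled letters.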

## Evaluation

The model can be evaluated as follows on the Hindi test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("theainerd/Wav2Vec2-large-xlsr-hindi")
model = Wav2Vec2ForCTC.from_pretrained("theainerd/Wav2Vec2-large-xlsr-hindi")
model.to("cuda")

# Resample from 48 kHz (the rate this script assumes for the input audio) to 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Punctuation stripped from the reference transcripts before scoring.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference on a batch and decode the predictions to text.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

Test Result (WER): 72.62 %
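
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between reference and prediction (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below computes it for a single sentence pair in plain Python; the function name `word_error_rate` is hypothetical, and the metric library used in the evaluation script aggregates over the whole test set and may differ in details.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One deleted word and one substituted word against five reference words.
print(word_error_rate("a b c d e", "a b d x"))
```

Because insertions count as errors, WER can exceed 100 % when the prediction contains many more words than the reference.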

## Training

The script used for training can be found in the [Hindi ASR Fine Tuning Wav2Vec2](https://colab.research.google.com/drive/1m-F7et3CHT_kpFqg7UffTIwnUV9AKgrg?usp=sharing) Colab notebook.