---
language: te
datasets:
- openslr
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Anurag Singh XLSR Wav2Vec2 Large 53 Telugu
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR te
      type: openslr
      args: te
    metrics:
    - name: Test WER
      type: wer
      value: 44.98
---
# Wav2Vec2-Large-XLSR-53-Telugu
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Telugu using the [OpenSLR SLR66](http://openslr.org/66/) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.
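
One way to check a file before passing it to the model is to read the sampling rate from its header. The sketch below is a minimal illustration that assumes uncompressed PCM WAV input and uses only the standard library; `wav_sample_rate` and `needs_resampling` are hypothetical helper names, not part of this model or of any library.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """True when the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling("clip.wav")` returns `True` for some file (the file name is a placeholder), the audio should be resampled to 16 kHz before inference.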
## Usage
The model can be used directly (without a language model) as follows:
```python
import pandas as pd
import torch
import torchaudio
from datasets import Dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# The evaluation notebook contains the procedure to download the data.
df = pd.read_csv("/content/te/test.tsv", sep="\t")
df["path"] = "/content/te/clips/" + df["path"]
test_dataset = Dataset.from_pandas(df)

processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")

# Resample from 48 kHz (the rate this script assumes for the source clips)
# to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Evaluation
```python
import re

import pandas as pd
import torch
import torchaudio
from datasets import Dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# The evaluation notebook contains the procedure to download the data.
df = pd.read_csv("/content/te/test.tsv", sep="\t")
df["path"] = "/content/te/clips/" + df["path"]
test_dataset = Dataset.from_pandas(df)

# Note: load_metric is deprecated in recent releases of datasets;
# the evaluate library's evaluate.load("wer") is its replacement.
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model.to("cuda")

# Raw string so the backslashes reach the regex engine unchanged.
chars_to_ignore_regex = r'[\,\?\.\!\-\_\;\:\"\“\%\‘\”\।\’\'\&]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def normalizer(text):
    # Use your custom normalizer
    text = text.replace("\\n", "\n")  # turn literal "\n" sequences into newlines
    text = ' '.join(text.split())  # collapse runs of whitespace
    text = re.sub(r'''([a-z]+)''', '', text, flags=re.IGNORECASE)  # remove runs of ASCII letters
    text = re.sub(r'''%''', " శాతం ", text)  # replace % with the Telugu word శాతం
    # As written, this pattern matches the literal sequence "/|-|_".
    text = re.sub(r'''(/\|-\|_)''', " ", text)
    # Presumably unifies two encodings of the same vowel sign; they may render identically.
    text = re.sub("ై","ై", text)
    text = text.strip()
    return text

# Preprocessing the dataset.
# We need to normalize the transcripts and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = normalizer(batch["sentence"])
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on a batch and store the greedily decoded transcripts.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

Test Result: 44.98%
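
For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the number of reference words. The sketch below is an illustrative pure-Python stand-in, not the metric implementation used in the evaluation script; `word_error_rate` is a hypothetical name, and the function assumes whitespace-separated words and at least one non-empty reference.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        hyp_words = prediction.split()
        # Row-by-row Levenshtein distance computed over words, not characters.
        previous = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            current = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = previous[j - 1] + (ref_word != hyp_word)
                current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
            previous = current
        total_edits += previous[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# One substitution and one deletion against four reference words.
print("WER: {:.2f}".format(100 * word_error_rate(["a b c d"], ["a x c"])))  # WER: 50.00
```

A result of 44.98% therefore means roughly 45 word edits per 100 reference words.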
## Training
70% of the OpenSLR Telugu dataset was used for training.
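
A 70/30 split of this kind can be sketched as follows. This is only an illustration: the split actually used is defined by the annotation files linked in this section, and the helper name `split_rows`, the seed, and the assumption that the remaining 30% forms the test set are all hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Calling `split_rows(rows)` on a list of annotation rows returns two disjoint lists that together contain every row exactly once.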

The train split of the annotations is [here](https://www.dropbox.com/s/xqc0wtour7f9h4c/train.tsv).

The test split of the annotations is [here](https://www.dropbox.com/s/qw1uy63oj4qdiu4/test.tsv).

The training data preparation notebook can be found [here](https://colab.research.google.com/drive/1_VR1QtY9qoiabyXBdJcOI29-xIKGdIzU?usp=sharing).

The training notebook can be found [here](https://colab.research.google.com/drive/14N-j4m0Ng_oktPEBN5wiUhDDbyrKYt8I?usp=sharing).

The evaluation notebook is [here](https://colab.research.google.com/drive/1SLEvbTWBwecIRTNqpQ0fFTqmr1-7MnSI?usp=sharing).