---
language: te
datasets:
- openslr
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Anurag Singh XLSR Wav2Vec2 Large 53 Telugu
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: OpenSLR te
      type: openslr
      args: te
    metrics:
    - name: Test WER
      type: wer
      value: 44.98
---
# Wav2Vec2-Large-XLSR-53-Telugu
Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Telugu using the [OpenSLR SLR66](http://openslr.org/66/) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.
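
Because the model expects 16 kHz input, it can help to check a file's sampling rate before running inference. The following is a minimal sketch using only Python's standard `wave` module. It assumes uncompressed PCM WAV files, and the helper names `needs_resampling` and `write_silence` are hypothetical, not part of this model's code.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # sampling rate the model expects


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return (source_rate, must_resample) for a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav_file:
        source_rate = wav_file.getframerate()
    return source_rate, source_rate != target_rate


def write_silence(path, sample_rate, seconds=0.1):
    """Write a short mono 16-bit silent WAV file, used here only as demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(sample_rate * seconds))
```

If the detected rate differs from 16 kHz, it can be passed to `torchaudio.transforms.Resample(source_rate, 16_000)` instead of a hard-coded source rate.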
## Usage
The model can be used directly (without a language model) as follows:
```python
import pandas as pd
import torch
import torchaudio
from datasets import Dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# The evaluation notebook contains the procedure to download the data.
df = pd.read_csv("/content/te/test.tsv", sep="\t")
df["path"] = "/content/te/clips/" + df["path"]
test_dataset = Dataset.from_pandas(df)

processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")

# Resample from 48 kHz (the rate this code assumes for the clips) to 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
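Decoding without a language model, as above, amounts to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, and drop the blank token. The pure-Python sketch below illustrates the idea on a toy vocabulary. The vocabulary, the blank index `0`, and the function name `greedy_ctc_decode` are illustrative assumptions and are not taken from this model's tokenizer.

```python
def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated ids, then drop blanks (toy illustration of CTC decoding)."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Keep an id only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens)


# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
toy_vocab = {0: "<pad>", 1: "a", 2: "b", 3: " "}
# Per-frame argmax ids, in the shape torch.argmax(logits, dim=-1) yields for one clip.
frame_ids = [1, 1, 0, 1, 2, 2, 0, 3, 3, 2]
decoded = greedy_ctc_decode(frame_ids, toy_vocab)
print(decoded)  # prints "aab b"
```

The blank between the two `a` frames is what lets a doubled letter survive the merge step.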
## Evaluation
```python
import re

import pandas as pd
import torch
import torchaudio
from datasets import Dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# The evaluation notebook contains the procedure to download the data.
df = pd.read_csv("/content/te/test.tsv", sep="\t")
df["path"] = "/content/te/clips/" + df["path"]
test_dataset = Dataset.from_pandas(df)

wer = load_metric("wer")
processor = Wav2Vec2Processor.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model = Wav2Vec2ForCTC.from_pretrained("anuragshas/wav2vec2-large-xlsr-53-telugu")
model.to("cuda")

chars_to_ignore_regex = '[\,\?\.\!\-\_\;\:\"\“\%\‘\”\।\’\'\&]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def normalizer(text):
    # Use your custom normalizer
    text = text.replace("\\n","\n")  # literal "\n" sequences become newlines
    text = ' '.join(text.split())  # collapse runs of whitespace
    text = re.sub(r'''([a-z]+)''','',text,flags=re.IGNORECASE)  # drop Latin-script words
    text = re.sub(r'''%'''," శాతం ", text)  # spell out the percent sign in Telugu
    text = re.sub(r'''(/|-|_)'''," ", text)  # treat separators as spaces
    text = re.sub("ై","ై", text)
    text = text.strip()
    return text

# Preprocessing the datasets.
# We need to normalize the transcripts and read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = normalizer(batch["sentence"])
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower() + " "
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run the model on a batch and store the decoded predictions.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

**Test Result**: 44.98%
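
The reported number is a word error rate (WER): the word-level edit distance between predictions and references, divided by the number of reference words. As a rough illustration (not the implementation behind `load_metric("wer")`), a minimal pure-Python sketch could look like this. The function names and the example sentences are made up for the demonstration.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            insertion = current_row[j - 1] + 1  # extra word in the prediction
            deletion = previous_row[j] + 1  # reference word missing from the prediction
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1]


def word_error_rate(references, predictions):
    """Total word edits divided by total reference words, over all sentence pairs."""
    total_edits = 0
    total_words = 0
    for reference, prediction in zip(references, predictions):
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, prediction.split())
        total_words += len(ref_words)
    return total_edits / total_words


# Made-up example: one substitution and one deletion against five reference words.
references = ["the cat sat on mat"]
predictions = ["the cat sit on"]
print("WER: {:.2f}".format(100 * word_error_rate(references, predictions)))  # prints "WER: 40.00"
```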
## Training
70% of the OpenSLR Telugu dataset was used for training.
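
The notebooks linked below are the reference for how the split was actually made. As a rough idea of what a reproducible 70/30 split looks like, here is a standard-library sketch. The seed, the function name `split_70_30`, and the placeholder clip names are assumptions for illustration, not the values used to train this model.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle a copy of items with a fixed seed and cut it at 70% (illustrative only)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


# Placeholder clip names standing in for rows of the annotation file.
clips = ["clip_{:03d}.wav".format(i) for i in range(10)]
train_clips, test_clips = split_70_30(clips)
print(len(train_clips), len(test_clips))  # prints "7 3"
```

Fixing the seed keeps the train and test sets identical across runs, which matters when a reported WER is tied to a specific test split.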

The train split of the annotations is [here](https://www.dropbox.com/s/xqc0wtour7f9h4c/train.tsv).

The test split of the annotations is [here](https://www.dropbox.com/s/qw1uy63oj4qdiu4/test.tsv).

The training data preparation notebook can be found [here](https://colab.research.google.com/drive/1_VR1QtY9qoiabyXBdJcOI29-xIKGdIzU?usp=sharing).

The training notebook can be found [here](https://colab.research.google.com/drive/14N-j4m0Ng_oktPEBN5wiUhDDbyrKYt8I?usp=sharing).

The evaluation notebook is [here](https://colab.research.google.com/drive/1SLEvbTWBwecIRTNqpQ0fFTqmr1-7MnSI?usp=sharing).