1	`---`
2	`language: zh`
3	`datasets:`
4	`- common_voice`
5	`metrics:`
6	`- wer`
7	`- cer`
8	`tags:`
9	`- audio`
10	`- automatic-speech-recognition`
11	`- speech`
12	`- xlsr-fine-tuning-week`
13	`license: apache-2.0`
14	`model-index:`
15	`- name: XLSR Wav2Vec2 Chinese (zh-CN) by Jonatas Grosman`
16	`results:`
17	`- task:`
18	`name: Speech Recognition`
19	`type: automatic-speech-recognition`
20	`dataset:`
21	`name: Common Voice zh-CN`
22	`type: common_voice`
23	`args: zh-CN`
24	`metrics:`
25	`- name: Test WER`
26	`type: wer`
27	`value: 82.37`
28	`- name: Test CER`
29	`type: cer`
30	`value: 19.03`
31	`---`

# Fine-tuned XLSR-53 large model for speech recognition in Chinese

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Chinese using the train and validation splits of [Common Voice 6.1](https://huggingface.co/datasets/common_voice), [CSS10](https://github.com/Kyubyong/css10) and [ST-CMDS](http://www.openslr.org/38/).
When using this model, make sure that your speech input is sampled at 16 kHz.
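
As a quick sanity check before transcription, the sample rate of a PCM WAV file can be read with Python's standard `wave` module. The sketch below is illustrative only: the helper names (`wav_sample_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of this repository, and it handles only uncompressed WAV input. Other formats such as MP3 would need an audio library instead.

```python
import io
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate this model expects, per the note above


def wav_sample_rate(source):
    """Return the sample rate (Hz) of a PCM WAV file, given a path or file object."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the WAV file must be resampled before it is fed to the model."""
    return wav_sample_rate(source) != target_rate


def make_silent_wav(sample_rate, n_frames=160):
    """Build a short mono 16-bit silent WAV in memory (demo data only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)
    return buffer


print(needs_resampling(make_silent_wav(8_000)))   # True: 8 kHz input must be resampled
print(needs_resampling(make_silent_wav(16_000)))  # False: already at 16 kHz
```

In practice a file path can be passed instead of the in-memory buffer, for example `needs_resampling("/path/to/file.wav")`.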

This model was fine-tuned thanks to the GPU credits generously provided by [OVHcloud](https://www.ovhcloud.com/en/public-cloud/ai-training/) :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

## Usage

The model can be used directly (without a language model) as follows.

Using the [HuggingSound](https://github.com/jonatasgrosman/huggingsound) library:

```python
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)
```

Writing your own inference script:

```python
import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "zh-CN"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
```

| Reference | Prediction |
| ------------- | ------------- |
| 宋朝末年年间定居粉岭围。 | 宋朝末年年间定居分定为 |
| 渐渐行动不便 | 建境行动不片 |
| 二十一年去世。 | 二十一年去世 |
| 他们自称恰哈拉。 | 他们自称家哈<unk> |
| 局部干涩的例子包括有口干、眼睛干燥、及阴道干燥。 | 菊物干寺的例子包括有口肝眼睛干照以及阴到干<unk> |
| 嘉靖三十八年，登进士第三甲第二名。 | 嘉靖三十八年登进士第三甲第二名 |
| 这一名称一直沿用至今。 | 这一名称一直沿用是心 |
| 同时乔凡尼还得到包税合同和许多明矾矿的经营权。 | 同时桥凡妮还得到包税合同和许多民繁矿的经营权 |
| 为了惩罚西扎城和塞尔柱的结盟，盟军在抵达后将外城烧毁。 | 为了曾罚西扎城和塞尔素的节盟盟军在抵达后将外曾烧毁 |
| 河内盛产黄色无鱼鳞的鳍射鱼。 | 合类生场环色无鱼林的骑射鱼 |
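
The `torch.argmax` plus `processor.batch_decode` step in the inference script amounts to greedy CTC decoding: pick the best token per frame, collapse consecutive repeats, and drop the blank token. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The toy vocabulary, the scores, and the choice of id 0 as the blank are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding for a single utterance.

    frame_scores: list of per-frame score lists (one score per vocabulary entry).
    vocab: list mapping token id -> character (hypothetical toy vocabulary).
    blank_id: id of the CTC blank token (an assumption for this sketch).
    """
    # Step 1: pick the best-scoring token id in every frame (the argmax step).
    best_ids = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]

    # Step 2: collapse runs of the same id, then drop blanks.
    decoded = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


# Toy vocabulary and scores, invented for illustration only.
TOY_VOCAB = ["<pad>", "去", "世"]
toy_scores = [
    [0.1, 0.8, 0.1],    # 去
    [0.1, 0.7, 0.2],    # 去 again (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],    # 世
    [0.6, 0.2, 0.2],    # blank
]
print(ctc_greedy_decode(toy_scores, TOY_VOCAB))  # 去世
```

A blank between two identical tokens keeps both, which is how CTC can emit doubled characters such as the 渐渐 in the second reference above.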

## Evaluation

The model can be evaluated as follows on the Chinese (zh-CN) test data of Common Voice.

```python
import re
import warnings

import torch
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "zh-CN"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We transcribe each batch and store the predicted strings
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
```
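
To make the reference normalization in the script concrete, here is a small standalone sketch of the same `re.escape` character-class technique. It uses only a handful of the ignored characters; `PUNCTUATION_SUBSET` and `normalize_reference` are hypothetical names introduced for this example.

```python
import re

# A small subset of the punctuation list, chosen for illustration only.
PUNCTUATION_SUBSET = ["，", "。", "、", "?", "!"]

# Build a character class; re.escape keeps regex metacharacters such as "?" literal.
punctuation_regex = f"[{re.escape(''.join(PUNCTUATION_SUBSET))}]"


def normalize_reference(sentence):
    """Strip the ignored characters and upper-case the sentence."""
    return re.sub(punctuation_regex, "", sentence).upper()


print(normalize_reference("嘉靖三十八年，登进士第三甲第二名。"))
# 嘉靖三十八年登进士第三甲第二名
```

Stripping punctuation from the references matters because the model's predictions contain none, as the sample transcriptions in the Usage section show. Without this step every punctuation mark would count as an error.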

Test Result:

In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-05-13). Note that the table below may show results that differ from those already reported elsewhere; this may be due to specifics of the other evaluation scripts used.

| Model | WER | CER |
| ------------- | ------------- | ------------- |
| jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn | 82.37% | 19.03% |
| ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt | 84.01% | 20.95% |
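
The large gap between WER and CER is probably explained by tokenization: Chinese text is written without spaces, so a metric that splits on whitespace treats each sentence as a single word, and one wrong character makes the whole "word" wrong. The sketch below illustrates this with a plain Levenshtein distance on two of the sample transcriptions from the Usage section. It is not the `wer.py`/`cer.py` implementation, and it assumes the WER metric splits on whitespace.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous_row[j - 1] + (ref_item != hyp_item)
            current_row.append(min(previous_row[j] + 1, current_row[j - 1] + 1, substitution))
        previous_row = current_row
    return previous_row[-1]


def error_rate(references, predictions, tokenize):
    """Total edits divided by total reference tokens, as a fraction."""
    edits = sum(edit_distance(tokenize(r), tokenize(p)) for r, p in zip(references, predictions))
    tokens = sum(len(tokenize(r)) for r in references)
    return edits / tokens


references = ["渐渐行动不便", "二十一年去世"]
predictions = ["建境行动不片", "二十一年去世"]

cer_value = error_rate(references, predictions, tokenize=list)       # one token per character
wer_value = error_rate(references, predictions, tokenize=str.split)  # one token per whitespace-separated word
print(f"CER: {cer_value:.2%}  WER: {wer_value:.2%}")
# CER: 25.00%  WER: 50.00%
```

Three wrong characters out of twelve give a CER of 25%, while one wrong sentence out of two gives a WER of 50%. For this reason CER is usually the more informative number for Chinese.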

## Citation
If you want to cite this model you can use this:

```bibtex
@misc{grosman2021xlsr53-large-chinese,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {C}hinese},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn}},
  year={2021}
}
```