---
language:
- multilingual
- ab
- af
- am
- ar
- as
- az
- ba
- be
- bg
- bn
- bo
- br
- bs
- ca
- ceb
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fo
- fr
- gl
- gn
- gu
- gv
- ha
- haw
- hi
- hr
- ht
- hu
- hy
- ia
- id
- is
- it
- he
- ja
- jv
- ka
- kk
- km
- kn
- ko
- la
- lb
- ln
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- nn
- no
- oc
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sco
- sd
- si
- sk
- sl
- sn
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- tg
- th
- tk
- tl
- tr
- tt
- uk
- ur
- uz
- vi
- war
- yi
- yo
- zh
thumbnail:
tags:
- audio-classification
- speechbrain
- embeddings
- Language
- Identification
- pytorch
- ECAPA-TDNN
- TDNN
- VoxLingua107
license: "apache-2.0"
datasets:
- VoxLingua107
metrics:
- Accuracy
widget:
- example_title: English Sample
  src: https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac
---

# VoxLingua107 ECAPA-TDNN Spoken Language Identification Model

## Model description

This is a spoken language recognition model trained on the [VoxLingua107 dataset](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/) using SpeechBrain.
The model uses the ECAPA-TDNN architecture, which has previously been used for speaker recognition. However, it uses
more fully connected hidden layers after the embedding layer, and it was trained with cross-entropy loss.
We observed that this improved the performance of the extracted utterance embeddings for downstream tasks.

The system is trained with recordings sampled at 16 kHz (single channel).
The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *classify_file*, if needed.
143
144 The model can classify a speech utterance according to the language spoken.
145 It covers 107 different languages (
146 Abkhazian,
147 Afrikaans,
148 Amharic,
149 Arabic,
150 Assamese,
151 Azerbaijani,
152 Bashkir,
153 Belarusian,
154 Bulgarian,
155 Bengali,
156 Tibetan,
157 Breton,
158 Bosnian,
159 Catalan,
160 Cebuano,
161 Czech,
162 Welsh,
163 Danish,
164 German,
165 Greek,
166 English,
167 Esperanto,
168 Spanish,
169 Estonian,
170 Basque,
171 Persian,
172 Finnish,
173 Faroese,
174 French,
175 Galician,
176 Guarani,
177 Gujarati,
178 Manx,
179 Hausa,
180 Hawaiian,
181 Hindi,
182 Croatian,
183 Haitian,
184 Hungarian,
185 Armenian,
186 Interlingua,
187 Indonesian,
188 Icelandic,
189 Italian,
190 Hebrew,
191 Japanese,
192 Javanese,
193 Georgian,
194 Kazakh,
195 Central Khmer,
196 Kannada,
197 Korean,
198 Latin,
199 Luxembourgish,
200 Lingala,
201 Lao,
202 Lithuanian,
203 Latvian,
204 Malagasy,
205 Maori,
206 Macedonian,
207 Malayalam,
208 Mongolian,
209 Marathi,
210 Malay,
211 Maltese,
212 Burmese,
213 Nepali,
214 Dutch,
215 Norwegian Nynorsk,
216 Norwegian,
217 Occitan,
218 Panjabi,
219 Polish,
220 Pushto,
221 Portuguese,
222 Romanian,
223 Russian,
224 Sanskrit,
225 Scots,
226 Sindhi,
227 Sinhala,
228 Slovak,
229 Slovenian,
230 Shona,
231 Somali,
232 Albanian,
233 Serbian,
234 Sundanese,
235 Swedish,
236 Swahili,
237 Tamil,
238 Telugu,
239 Tajik,
240 Thai,
241 Turkmen,
242 Tagalog,
243 Turkish,
244 Tatar,
245 Ukrainian,
246 Urdu,
247 Uzbek,
248 Vietnamese,
249 Waray,
250 Yiddish,
251 Yoruba,
252 Mandarin Chinese).
253
## Intended uses & limitations

The model has two uses:

- use 'as is' for spoken language recognition
- use as an utterance-level feature (embedding) extractor, for creating a dedicated language ID model on your own data (see the sketch below)

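A minimal sketch of the embedding-extractor workflow is given below: encode each utterance into a fixed-size vector with `encode_batch`, then train any lightweight classifier on top. The file names, labels, and the scikit-learn logistic-regression classifier are illustrative assumptions, not part of this repository.

```python
# Minimal sketch: build a custom language-ID classifier on top of the
# utterance embeddings. File names, labels, and the scikit-learn classifier
# are illustrative assumptions, not part of this model card.
from sklearn.linear_model import LogisticRegression
from speechbrain.inference.classifiers import EncoderClassifier

language_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp"
)

train_files = ["my_data/utt1.wav", "my_data/utt2.wav"]   # your own recordings
train_labels = ["et", "en"]                              # your own labels

embeddings = []
for path in train_files:
    signal = language_id.load_audio(path)                # resampled to 16 kHz mono
    emb = language_id.encode_batch(signal)               # shape: [1, 1, 256]
    embeddings.append(emb.squeeze().detach().cpu().numpy())

clf = LogisticRegression(max_iter=1000).fit(embeddings, train_labels)
print(clf.predict([embeddings[0]]))
```
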
The model is trained on automatically collected YouTube data. For more
information about the dataset, see [here](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/).


#### How to use
```bash
pip install git+https://github.com/speechbrain/speechbrain.git@develop
```

```python
import torchaudio
from speechbrain.inference.classifiers import EncoderClassifier
language_id = EncoderClassifier.from_hparams(source="speechbrain/lang-id-voxlingua107-ecapa", savedir="tmp")
# Download Thai language sample from Omniglot and convert it to a suitable form
signal = language_id.load_audio("speechbrain/lang-id-voxlingua107-ecapa/udhr_th.wav")
prediction = language_id.classify_batch(signal)
print(prediction)
# (tensor([[-2.8646e+01, -3.0346e+01, -2.0748e+01, -2.9562e+01, -2.2187e+01,
# -3.2668e+01, -3.6677e+01, -3.3573e+01, -3.2545e+01, -2.4365e+01,
# -2.4688e+01, -3.1171e+01, -2.7743e+01, -2.9918e+01, -2.4770e+01,
# -3.2250e+01, -2.4727e+01, -2.6087e+01, -2.1870e+01, -3.2821e+01,
# -2.2128e+01, -2.2822e+01, -3.0888e+01, -3.3564e+01, -2.9906e+01,
# -2.2392e+01, -2.5573e+01, -2.6443e+01, -3.2429e+01, -3.2652e+01,
# -3.0030e+01, -2.4607e+01, -2.2967e+01, -2.4396e+01, -2.8578e+01,
# -2.5153e+01, -2.8475e+01, -2.6409e+01, -2.5230e+01, -2.7957e+01,
# -2.6298e+01, -2.3609e+01, -2.5863e+01, -2.8225e+01, -2.7225e+01,
# -3.0486e+01, -2.1185e+01, -2.7938e+01, -3.3155e+01, -1.9076e+01,
# -2.9181e+01, -2.2160e+01, -1.8352e+01, -2.5866e+01, -3.3636e+01,
# -4.2016e+00, -3.1581e+01, -3.1894e+01, -2.7834e+01, -2.5429e+01,
# -3.2235e+01, -3.2280e+01, -2.8786e+01, -2.3366e+01, -2.6047e+01,
# -2.2075e+01, -2.3770e+01, -2.2518e+01, -2.8101e+01, -2.5745e+01,
# -2.6441e+01, -2.9822e+01, -2.7109e+01, -3.0225e+01, -2.4566e+01,
# -2.9268e+01, -2.7651e+01, -3.4221e+01, -2.9026e+01, -2.6009e+01,
# -3.1968e+01, -3.1747e+01, -2.8156e+01, -2.9025e+01, -2.7756e+01,
# -2.8052e+01, -2.9341e+01, -2.8806e+01, -2.1636e+01, -2.3992e+01,
# -2.3794e+01, -3.3743e+01, -2.8332e+01, -2.7465e+01, -1.5085e-02,
# -2.9094e+01, -2.1444e+01, -2.9780e+01, -3.6046e+01, -3.7401e+01,
# -3.0888e+01, -3.3172e+01, -1.8931e+01, -2.2679e+01, -3.0225e+01,
# -2.4995e+01, -2.1028e+01]]), tensor([-0.0151]), tensor([94]), ['th'])
# The scores in the prediction[0] tensor can be interpreted as log-likelihoods that
# the given utterance belongs to the given language (i.e., the larger the better)
# The linear-scale likelihood can be retrieved using the following:
print(prediction[1].exp())
# tensor([0.9850])
# The identified language ISO code is given in prediction[3]
print(prediction[3])
# ['th: Thai']

# Alternatively, use the utterance embedding extractor:
emb = language_id.encode_batch(signal)
print(emb.shape)
# torch.Size([1, 1, 256])
```
To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
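For example (a minimal sketch, assuming a CUDA-capable GPU is available):

```python
# Minimal sketch: load the classifier onto a CUDA device.
from speechbrain.inference.classifiers import EncoderClassifier

language_id = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa",
    savedir="tmp",
    run_opts={"device": "cuda"},
)
```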

The system is trained with recordings sampled at 16 kHz (single channel).
The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *classify_file*, if needed. Make sure your input tensor is sampled at 16 kHz if you use *encode_batch* or *classify_batch* directly.
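
If your recordings are not already 16 kHz mono, a minimal sketch of both options is shown below: resample and downmix manually before *classify_batch*, or let *classify_file* normalize for you. It reuses the `language_id` classifier loaded above; the file name `my_audio.wav` is an illustrative assumption.

```python
# Minimal sketch: prepare arbitrary audio for classify_batch, or let
# classify_file handle normalization. "my_audio.wav" is a placeholder.
import torchaudio

waveform, sample_rate = torchaudio.load("my_audio.wav")  # shape: [channels, time]
waveform = waveform.mean(dim=0, keepdim=True)            # downmix to mono
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
prediction = language_id.classify_batch(waveform)

# Alternatively, let the model normalize the audio itself:
prediction = language_id.classify_file("my_audio.wav")
```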

Warning: in the dataset and in the defaults of this model (see [`label_encoder.txt`](label_encoder.txt)), the ISO language code used for Hebrew is obsolete (it should be `he` instead of `iw`), and the ISO language code used for Javanese is incorrect (it should be `jv` instead of `jw`). See [issue #2396](https://github.com/speechbrain/speechbrain/issues/2396).
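
If downstream code expects standards-compliant codes, one possible workaround is to remap the affected labels after prediction. The sketch below is our own workaround, not part of this repository; it assumes the label format shown above (e.g., `'th: Thai'`) and reuses the `language_id` classifier and placeholder file name from the earlier examples.

```python
# Minimal sketch: map the obsolete/incorrect ISO codes used by the label
# encoder to their current equivalents. The mapping is our own workaround.
FIX_ISO_CODES = {"iw": "he", "jw": "jv"}

def fix_label(label: str) -> str:
    # Labels look like "th: Thai"; replace only the code part when needed.
    code, sep, name = label.partition(":")
    code = FIX_ISO_CODES.get(code.strip(), code.strip())
    return f"{code}:{name}" if sep else code

prediction = language_id.classify_file("my_audio.wav")
print([fix_label(lab) for lab in prediction[3]])
```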

#### Limitations and bias

Since the model is trained on VoxLingua107, it has many limitations and biases, some of which are:

- Its accuracy on smaller languages is probably quite limited
- It probably works worse on female speech than on male speech (because the YouTube data contains much more male speech)
- Based on subjective experiments, it doesn't work well on speech with a foreign accent
- It probably doesn't work well on children's speech or on speech from persons with speech disorders


## Training data

The model is trained on [VoxLingua107](https://cs.taltech.ee/staff/tanel.alumae/data/voxlingua107/).

VoxLingua107 is a speech dataset for training spoken language identification models.
The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives.

VoxLingua107 contains data for 107 languages. The total amount of speech in the training set is 6628 hours.
The average amount of data per language is 62 hours. However, the actual amount per language varies a lot. There is also a separate development set containing 1609 speech segments from 33 languages, validated by at least two volunteers to really contain the given language.

## Training procedure

See the [SpeechBrain recipe](https://github.com/speechbrain/speechbrain/tree/voxlingua107/recipes/VoxLingua107/lang_id).

## Evaluation results

Error rate: 6.7% on the VoxLingua107 development dataset

#### Referencing SpeechBrain
```bibtex
@misc{speechbrain,
  title={{SpeechBrain}: A General-Purpose Speech Toolkit},
  author={Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
  year={2021},
  eprint={2106.04624},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  note={arXiv:2106.04624}
}
```

#### Referencing VoxLingua107

```bibtex
@inproceedings{valk2021slt,
  title={{VoxLingua107}: a Dataset for Spoken Language Recognition},
  author={J{\"o}rgen Valk and Tanel Alum{\"a}e},
  booktitle={Proc. IEEE SLT Workshop},
  year={2021},
}
```

#### About SpeechBrain
SpeechBrain is an open-source, all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly, and it achieves competitive or state-of-the-art performance in various domains.
Website: https://speechbrain.github.io/
GitHub: https://github.com/speechbrain/speechbrain