---
license: cc-by-nc-4.0
language:
- af
- am
- ar
- as
- az
- be
- bn
- bs
- bg
- ca
- cs
- zh
- cy
- da
- de
- el
- en
- et
- fi
- fr
- or
- om
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- ig
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- mn
- km
- ky
- ko
- lo
- ln
- lt
- lb
- lg
- lv
- ml
- mr
- mk
- mt
- mi
- my
- nl
- nb
- ne
- ny
- oc
- pa
- ps
- fa
- pl
- pt
- ro
- ru
- sk
- sl
- sn
- sd
- so
- es
- sr
- sv
- sw
- ta
- te
- tg
- tl
- th
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- ms
- zu
- ary
- arz
- yue
- kea
metrics:
- bleu
- wer
- chrf
inference: False
pipeline_tag: automatic-speech-recognition
tags:
- audio-to-audio
- text-to-speech
- seamless_communication
library_name: transformers
widget:
- src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  example_title: Librispeech sample 1
  output:
    text: going along slushy country roads and speaking to damp audiences in draughty schoolrooms day after day for a fortnight he'll have to put in an appearance at some place of worship on sunday morning and he can come to us immediately afterwards
- src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
  example_title: Librispeech sample 2
  output:
    text: before he had time to answer a much-encumbered vera burst into the room with the question i say can i leave these here these were a small black pig and a lusty specimen of black-red game-cock
---

# SeamlessM4T v2

**SeamlessM4T** is our foundational all-in-one **M**assively **M**ultilingual and **M**ultimodal **M**achine **T**ranslation model delivering high-quality translation for speech and text in nearly 100 languages.

SeamlessM4T models support the following tasks:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
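The five task names follow from the input modality, the output modality, and whether the output stays in the source language. The mapping can be sketched as below; the function `m4t_task` and the labels `"speech"` and `"text"` are hypothetical names used only for illustration, not part of any SeamlessM4T API.

```python
def m4t_task(source: str, target: str, translate: bool = True) -> str:
    """Return the task abbreviation for a source/target modality pair.

    `source` and `target` are "speech" or "text" (labels assumed for this
    sketch). `translate=False` means the output stays in the source
    language, which the task list only covers for speech-to-text (ASR).
    """
    tasks = {
        ("speech", "speech"): "S2ST",
        ("speech", "text"): "S2TT",
        ("text", "speech"): "T2ST",
        ("text", "text"): "T2TT",
    }
    pair = (source, target)
    if pair not in tasks:
        raise ValueError(f"unknown modality pair: {pair}")
    if not translate:
        if pair != ("speech", "text"):
            raise ValueError("the only same-language task listed is ASR")
        return "ASR"
    return tasks[pair]


print(m4t_task("text", "speech"))                   # T2ST
print(m4t_task("speech", "text", translate=False))  # ASR
```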

SeamlessM4T models support:
- 🎤 101 languages for speech input.
- 💬 96 languages for text input/output.
- 🔊 35 languages for speech output.

🌟 We are releasing SeamlessM4T v2, an updated version that is a multitask adaptation of our novel *UnitY2* architecture.
With its hierarchical character-to-unit upsampling and non-autoregressive text-to-unit decoding, *UnitY2* considerably improves over SeamlessM4T v1 in quality and in inference speed for speech generation tasks.
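As rough intuition for why upsampling pairs well with non-autoregressive decoding: if each character-level state is repeated for a known number of output positions, the output length is fixed up front and every unit can be predicted in parallel rather than one at a time. The toy below only illustrates that general idea under the assumption of a duration-based scheme. It is not the UnitY2 implementation, and the function name and duration values are made up.

```python
def upsample_by_duration(states, durations):
    """Repeat each input state `durations[i]` times, preserving order."""
    if len(states) != len(durations):
        raise ValueError("need exactly one duration per state")
    upsampled = []
    for state, repeat in zip(states, durations):
        upsampled.extend([state] * repeat)
    return upsampled


chars = list("hi!")    # stand-ins for character-level decoder states
durations = [2, 3, 1]  # made-up per-character durations, in units

frames = upsample_by_duration(chars, durations)
print(frames)  # ['h', 'h', 'i', 'i', 'i', '!']
```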

**SeamlessM4T v2 is also supported by 🤗 Transformers; see [the dedicated section below](#transformers-usage).**

![SeamlessM4T architectures](seamlessm4t_arch.svg)

## SeamlessM4T models
| Model Name | #params | checkpoint | metrics |
| ---------- | ------- | ---------- | ------- |
| [SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip) |
| [SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip) |
| [SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium) | 1.2B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |

We provide the extensive evaluation results of SeamlessM4T-Large and SeamlessM4T-Medium reported in the paper (as averages) in the `metrics` files above.

The evaluation data IDs for FLEURS, CoVoST2 and CVSS-C can be found [here](https://dl.fbaipublicfiles.com/seamless/metrics/evaluation_data_ids.zip).
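The `checkpoint` links in the table are Hub `blob` URLs. One possible way to fetch such a file from Python is `hf_hub_download` from the `huggingface_hub` package, which is assumed to be installed. The helpers `split_blob_url` and `download_checkpoint` below are hypothetical names for this sketch, and the download itself is several gigabytes.

```python
from urllib.parse import urlparse


def split_blob_url(url):
    """Split a Hub blob URL into (repo_id, revision, filename)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: <org>/<repo>/blob/<revision>/<path to file>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a Hub blob URL: {url}")
    return "/".join(parts[:2]), parts[3], "/".join(parts[4:])


def download_checkpoint(url):
    """Download the file behind a blob URL and return its local cache path."""
    # Imported lazily so the URL helper works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    repo_id, revision, filename = split_blob_url(url)
    return hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)


V2_LARGE = (
    "https://huggingface.co/facebook/seamless-m4t-v2-large"
    "/blob/main/seamlessM4T_v2_large.pt"
)
print(split_blob_url(V2_LARGE))
# local_path = download_checkpoint(V2_LARGE)  # uncomment to actually download
```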


## Evaluating SeamlessM4T models
To reproduce our results or to evaluate using the same metrics over your own test sets, please check out the [Evaluation README here](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/evaluate).


## Finetuning SeamlessM4T models
Please check out the [Finetuning README here](https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/finetune).

## Transformers usage

SeamlessM4T is available in the 🤗 Transformers library, requiring minimal dependencies. Steps to get started:

1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) from main and [sentencepiece](https://github.com/google/sentencepiece):

```
pip install git+https://github.com/huggingface/transformers.git sentencepiece
```

2. Run the following Python code to generate speech samples. Here the target language is Russian:

```py
from transformers import AutoProcessor, SeamlessM4Tv2Model
import torchaudio

# Load the processor (tokenizer + audio feature extractor) and the model weights.
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# from text: English text in, Russian speech out
text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

# from audio: speech in, Russian speech out
# If your torchaudio backend cannot read URLs, download the file first and pass a local path.
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)  # must be a 16 kHz waveform array
audio_inputs = processor(audios=audio, return_tensors="pt")
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```

3. Listen to the audio samples either in a Jupyter notebook:

```py
from IPython.display import Audio

sample_rate = model.config.sampling_rate
Audio(audio_array_from_text, rate=sample_rate)
# Audio(audio_array_from_audio, rate=sample_rate)
```

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

```py
import scipy.io.wavfile  # import the submodule explicitly

sample_rate = model.config.sampling_rate
scipy.io.wavfile.write("out_from_text.wav", rate=sample_rate, data=audio_array_from_text)
# scipy.io.wavfile.write("out_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
```
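Step 2 requires the input waveform to be sampled at 16 kHz, which can be checked before loading a file. The sketch below uses only Python's standard-library `wave` module, which reads PCM `.wav` files only. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of Transformers or torchaudio.

```python
import wave

TARGET_RATE = 16_000  # sampling rate expected for audio inputs, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """True if the file's sampling rate differs from `target_rate`."""
    return wav_sample_rate(path) != target_rate
```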

For more details on using the SeamlessM4T model for inference with the 🤗 Transformers library, refer to the
**[SeamlessM4T v2 docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2)** or to this **hands-on [Google Colab](https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/v2_seamless_m4t_hugging_face.ipynb)**.
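Step 2 only produces speech. For text output (the S2TT, T2TT and ASR tasks), the Transformers docs linked above describe passing `generate_speech=False` to `generate` and decoding the returned tokens with the processor. A minimal sketch of that pattern follows; the wrapper name `translate_to_text` is hypothetical, and `model` and `processor` are the objects loaded in step 2.

```python
def translate_to_text(model, processor, text, src_lang, tgt_lang):
    """Translate `text` and return a string instead of a waveform."""
    inputs = processor(text=text, src_lang=src_lang, return_tensors="pt")
    # generate_speech=False skips speech synthesis and returns text tokens.
    output_tokens = model.generate(**inputs, tgt_lang=tgt_lang, generate_speech=False)
    # The first element holds the generated token ids, one row per input.
    return processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```

With the model loaded, a call such as `translate_to_text(model, processor, "Hello, my dog is cute", "eng", "fra")` would return the French translation as a string.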


## Supported Languages

Listed below are the languages supported by SeamlessM4T-large (v1/v2).
The `Source` column specifies whether a language is supported as source speech (`Sp`) and/or source text (`Tx`).
The `Target` column specifies whether a language is supported as target speech (`Sp`) and/or target text (`Tx`).


| code | language | script | Source | Target |
| ---- | ---------------------- | ---------- | ------ | ------ |
| afr | Afrikaans | Latn | Sp, Tx | Tx |
| amh | Amharic | Ethi | Sp, Tx | Tx |
| arb | Modern Standard Arabic | Arab | Sp, Tx | Sp, Tx |
| ary | Moroccan Arabic | Arab | Sp, Tx | Tx |
| arz | Egyptian Arabic | Arab | Sp, Tx | Tx |
| asm | Assamese | Beng | Sp, Tx | Tx |
| ast | Asturian | Latn | Sp | \-- |
| azj | North Azerbaijani | Latn | Sp, Tx | Tx |
| bel | Belarusian | Cyrl | Sp, Tx | Tx |
| ben | Bengali | Beng | Sp, Tx | Sp, Tx |
| bos | Bosnian | Latn | Sp, Tx | Tx |
| bul | Bulgarian | Cyrl | Sp, Tx | Tx |
| cat | Catalan | Latn | Sp, Tx | Sp, Tx |
| ceb | Cebuano | Latn | Sp, Tx | Tx |
| ces | Czech | Latn | Sp, Tx | Sp, Tx |
| ckb | Central Kurdish | Arab | Sp, Tx | Tx |
| cmn | Mandarin Chinese | Hans | Sp, Tx | Sp, Tx |
| cmn_Hant | Mandarin Chinese | Hant | Sp, Tx | Sp, Tx |
| cym | Welsh | Latn | Sp, Tx | Sp, Tx |
| dan | Danish | Latn | Sp, Tx | Sp, Tx |
| deu | German | Latn | Sp, Tx | Sp, Tx |
| ell | Greek | Grek | Sp, Tx | Tx |
| eng | English | Latn | Sp, Tx | Sp, Tx |
| est | Estonian | Latn | Sp, Tx | Sp, Tx |
| eus | Basque | Latn | Sp, Tx | Tx |
| fin | Finnish | Latn | Sp, Tx | Sp, Tx |
| fra | French | Latn | Sp, Tx | Sp, Tx |
| fuv | Nigerian Fulfulde | Latn | Sp, Tx | Tx |
| gaz | West Central Oromo | Latn | Sp, Tx | Tx |
| gle | Irish | Latn | Sp, Tx | Tx |
| glg | Galician | Latn | Sp, Tx | Tx |
| guj | Gujarati | Gujr | Sp, Tx | Tx |
| heb | Hebrew | Hebr | Sp, Tx | Tx |
| hin | Hindi | Deva | Sp, Tx | Sp, Tx |
| hrv | Croatian | Latn | Sp, Tx | Tx |
| hun | Hungarian | Latn | Sp, Tx | Tx |
| hye | Armenian | Armn | Sp, Tx | Tx |
| ibo | Igbo | Latn | Sp, Tx | Tx |
| ind | Indonesian | Latn | Sp, Tx | Sp, Tx |
| isl | Icelandic | Latn | Sp, Tx | Tx |
| ita | Italian | Latn | Sp, Tx | Sp, Tx |
| jav | Javanese | Latn | Sp, Tx | Tx |
| jpn | Japanese | Jpan | Sp, Tx | Sp, Tx |
| kam | Kamba | Latn | Sp | \-- |
| kan | Kannada | Knda | Sp, Tx | Tx |
| kat | Georgian | Geor | Sp, Tx | Tx |
| kaz | Kazakh | Cyrl | Sp, Tx | Tx |
| kea | Kabuverdianu | Latn | Sp | \-- |
| khk | Halh Mongolian | Cyrl | Sp, Tx | Tx |
| khm | Khmer | Khmr | Sp, Tx | Tx |
| kir | Kyrgyz | Cyrl | Sp, Tx | Tx |
| kor | Korean | Kore | Sp, Tx | Sp, Tx |
| lao | Lao | Laoo | Sp, Tx | Tx |
| lit | Lithuanian | Latn | Sp, Tx | Tx |
| ltz | Luxembourgish | Latn | Sp | \-- |
| lug | Ganda | Latn | Sp, Tx | Tx |
| luo | Luo | Latn | Sp, Tx | Tx |
| lvs | Standard Latvian | Latn | Sp, Tx | Tx |
| mai | Maithili | Deva | Sp, Tx | Tx |
| mal | Malayalam | Mlym | Sp, Tx | Tx |
| mar | Marathi | Deva | Sp, Tx | Tx |
| mkd | Macedonian | Cyrl | Sp, Tx | Tx |
| mlt | Maltese | Latn | Sp, Tx | Sp, Tx |
| mni | Meitei | Beng | Sp, Tx | Tx |
| mya | Burmese | Mymr | Sp, Tx | Tx |
| nld | Dutch | Latn | Sp, Tx | Sp, Tx |
| nno | Norwegian Nynorsk | Latn | Sp, Tx | Tx |
| nob | Norwegian Bokmål | Latn | Sp, Tx | Tx |
| npi | Nepali | Deva | Sp, Tx | Tx |
| nya | Nyanja | Latn | Sp, Tx | Tx |
| oci | Occitan | Latn | Sp | \-- |
| ory | Odia | Orya | Sp, Tx | Tx |
| pan | Punjabi | Guru | Sp, Tx | Tx |
| pbt | Southern Pashto | Arab | Sp, Tx | Tx |
| pes | Western Persian | Arab | Sp, Tx | Sp, Tx |
| pol | Polish | Latn | Sp, Tx | Sp, Tx |
| por | Portuguese | Latn | Sp, Tx | Sp, Tx |
| ron | Romanian | Latn | Sp, Tx | Sp, Tx |
| rus | Russian | Cyrl | Sp, Tx | Sp, Tx |
| slk | Slovak | Latn | Sp, Tx | Sp, Tx |
| slv | Slovenian | Latn | Sp, Tx | Tx |
| sna | Shona | Latn | Sp, Tx | Tx |
| snd | Sindhi | Arab | Sp, Tx | Tx |
| som | Somali | Latn | Sp, Tx | Tx |
| spa | Spanish | Latn | Sp, Tx | Sp, Tx |
| srp | Serbian | Cyrl | Sp, Tx | Tx |
| swe | Swedish | Latn | Sp, Tx | Sp, Tx |
| swh | Swahili | Latn | Sp, Tx | Sp, Tx |
| tam | Tamil | Taml | Sp, Tx | Tx |
| tel | Telugu | Telu | Sp, Tx | Sp, Tx |
| tgk | Tajik | Cyrl | Sp, Tx | Tx |
| tgl | Tagalog | Latn | Sp, Tx | Sp, Tx |
| tha | Thai | Thai | Sp, Tx | Sp, Tx |
| tur | Turkish | Latn | Sp, Tx | Sp, Tx |
| ukr | Ukrainian | Cyrl | Sp, Tx | Sp, Tx |
| urd | Urdu | Arab | Sp, Tx | Sp, Tx |
| uzn | Northern Uzbek | Latn | Sp, Tx | Sp, Tx |
| vie | Vietnamese | Latn | Sp, Tx | Sp, Tx |
| xho | Xhosa | Latn | Sp | \-- |
| yor | Yoruba | Latn | Sp, Tx | Tx |
| yue | Cantonese | Hant | Sp, Tx | Tx |
| zlm | Colloquial Malay | Latn | Sp | \-- |
| zsm | Standard Malay | Latn | Tx | Tx |
| zul | Zulu | Latn | Sp, Tx | Tx |
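To check programmatically which modalities a language supports, the rows of the table above can be parsed into sets. The sketch below does this for three rows copied from the table; the function names `parse_language_row` and `speech_output_codes` are hypothetical.

```python
def parse_language_row(row):
    """Parse one markdown row of the language table into a dict."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    code, language, script, source, target = cells

    def modalities(cell):
        # r"\--" marks "not supported"; otherwise a comma-separated list.
        return {item.strip() for item in cell.split(",")} & {"Sp", "Tx"}

    return {
        "code": code,
        "language": language,
        "script": script,
        "source": modalities(source),
        "target": modalities(target),
    }


def speech_output_codes(rows):
    """Codes of the languages that can be produced as target speech."""
    parsed = (parse_language_row(row) for row in rows)
    return [entry["code"] for entry in parsed if "Sp" in entry["target"]]


ROWS = [
    r"| arb | Modern Standard Arabic | Arab | Sp, Tx | Sp, Tx |",
    r"| ast | Asturian | Latn | Sp | \-- |",
    r"| zsm | Standard Malay | Latn | Tx | Tx |",
]
print(speech_output_codes(ROWS))  # ['arb']
```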


Note that SeamlessM4T-Medium supports 200 languages in the text modality and is based on NLLB-200 (see the full list in the [asset card](https://github.com/facebookresearch/seamless_communication/blob/main/src/seamless_communication/cards/unity_nllb-200.yaml)).

## Citation
For SeamlessM4T v2, please cite:
```bibtex
@inproceedings{seamless2023,
  title="Seamless: Multilingual Expressive and Streaming Speech Translation",
  author="{Seamless Communication}, Lo{\"i}c Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-juss{\`a}, Maha Elbayad, Hongyu Gong, Francisco Guzm{\'a}n, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson",
  journal={ArXiv},
  year={2023}
}
```
[//]: # "https://arxiv.org/abs/2312.05187"