---
license: mit
language:
- af
- am
- ar
- as
- az
- be
- bn
- bs
- bg
- ca
- cs
- zh
- cy
- da
- de
- el
- en
- et
- fi
- fr
- or
- om
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- ig
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- mn
- km
- ky
- ko
- lo
- ln
- lt
- lb
- lg
- lv
- ml
- mr
- mk
- mt
- mi
- my
- nl
- nb
- ne
- ny
- oc
- pa
- ps
- fa
- pl
- pt
- ro
- ru
- sk
- sl
- sn
- sd
- so
- es
- sr
- sv
- sw
- ta
- te
- tg
- tl
- th
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- ms
- zu
- ary
- arz
- yue
- kea
inference: false
---
# W2v-BERT 2.0 speech encoder

We are open-sourcing our Conformer-based [W2v-BERT 2.0 speech encoder](#w2v-bert-20-speech-encoder) as described in Section 3.2.1 of the [paper](https://arxiv.org/pdf/2312.05187.pdf), which is at the core of our Seamless models.

This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires fine-tuning to be used for downstream tasks such as Automatic Speech Recognition (ASR) or audio classification.

| Model Name   | #params | checkpoint |
| ------------ | ------- | ---------- |
| W2v-BERT 2.0 | 600M    | [checkpoint](https://huggingface.co/reach-vb/conformer-shaw/resolve/main/conformer_shaw.pt) |

**This model and its training are supported by 🤗 Transformers; read more in the [docs](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert).**


# 🤗 Transformers usage

This is a bare checkpoint without any modeling head, and thus requires fine-tuning to be used for downstream tasks such as ASR. You can, however, use it to extract audio embeddings from the top layer with this code snippet:

```python
from transformers import AutoProcessor, Wav2Vec2BertModel
import torch
from datasets import load_dataset

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = AutoProcessor.from_pretrained("facebook/w2v-bert-2.0")
model = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")

# audio file is decoded on the fly
inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```
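The returned `outputs.last_hidden_state` holds one embedding per audio frame. To reduce these to a single fixed-size utterance vector, a common approach is masked mean pooling. A minimal sketch using random tensors as stand-ins for the model outputs (the shapes here are illustrative, not tied to this checkpoint):

```python
import torch

# Stand-in for outputs.last_hidden_state: (batch, frames, hidden)
hidden_states = torch.randn(2, 50, 1024)

# Stand-in attention mask: 1 marks valid frames, 0 marks padding
attention_mask = torch.ones(2, 50)
attention_mask[1, 40:] = 0  # pretend the second utterance is shorter

# Masked mean pooling: average embeddings over valid frames only
mask = attention_mask.unsqueeze(-1)         # (batch, frames, 1)
summed = (hidden_states * mask).sum(dim=1)  # (batch, hidden)
counts = mask.sum(dim=1)                    # (batch, 1)
utterance_embeddings = summed / counts      # (batch, hidden)

print(utterance_embeddings.shape)  # torch.Size([2, 1024])
```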

To learn more about using the model, refer to the following resources:
- [its docs](https://huggingface.co/docs/transformers/main/en/model_doc/wav2vec2-bert)
- [a blog post showing how to fine-tune it on Mongolian ASR](https://huggingface.co/blog/fine-tune-w2v2-bert)
- [a training script example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py)


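As the fine-tuning resources above describe, ASR fine-tuning typically places a linear head over the encoder's frame embeddings and optimizes a CTC objective. A minimal sketch of that objective using random tensors as stand-ins for encoder outputs (all dimensions here are illustrative, not the model's actual configuration):

```python
import torch
import torch.nn as nn

vocab_size = 32     # illustrative character vocabulary, CTC blank at index 0
hidden_size = 1024  # illustrative encoder output dimension
batch, frames = 2, 50

# Stand-in for the encoder's frame-level embeddings
hidden_states = torch.randn(batch, frames, hidden_size)

# Linear CTC head mapping each frame to vocabulary logits
ctc_head = nn.Linear(hidden_size, vocab_size)
log_probs = ctc_head(hidden_states).log_softmax(dim=-1)

# nn.CTCLoss expects (frames, batch, vocab)
log_probs = log_probs.transpose(0, 1)

targets = torch.randint(1, vocab_size, (batch, 12))  # dummy transcripts
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 12, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```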
# Seamless Communication usage

This model can be used in [Seamless Communication](https://github.com/facebookresearch/seamless_communication), where it was released.

Here's how to make a forward pass through the speech encoder, after completing the [installation steps](https://github.com/facebookresearch/seamless_communication?tab=readme-ov-file#installation):

```python
import torch

from fairseq2.data import Collater
from fairseq2.data.audio import AudioDecoder, WaveformToFbankConverter
from fairseq2.memory import MemoryBlock
from fairseq2.nn.padding import get_seqs_and_padding_mask
from pathlib import Path
from seamless_communication.models.conformer_shaw import load_conformer_shaw_model


audio_wav_path, device, dtype = ...
audio_decoder = AudioDecoder(dtype=torch.float32, device=device)
fbank_converter = WaveformToFbankConverter(
    num_mel_bins=80,
    waveform_scale=2**15,
    channel_last=True,
    standardize=True,
    device=device,
    dtype=dtype,
)
collater = Collater(pad_value=1)

model = load_conformer_shaw_model("conformer_shaw", device=device, dtype=dtype)
model.eval()

with Path(audio_wav_path).open("rb") as fb:
    block = MemoryBlock(fb.read())

decoded_audio = audio_decoder(block)
src = collater(fbank_converter(decoded_audio))["fbank"]
seqs, padding_mask = get_seqs_and_padding_mask(src)

with torch.inference_mode():
    seqs, padding_mask = model.encoder_frontend(seqs, padding_mask)
    seqs, padding_mask = model.encoder(seqs, padding_mask)
```