---
license: apache-2.0
tags:
- generated_from_trainer
datasets:
- librispeech_asr
metrics:
- f1
base_model: facebook/wav2vec2-xls-r-300m
model-index:
- name: weights
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# wav2vec2-large-xlsr-53-gender-recognition-librispeech

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on Librispeech-clean-100 for binary gender recognition (female/male).
It achieves the following results on the evaluation set:
- Loss: 0.0061
- F1: 0.9993

### Compute your inferences

```python
import os
from typing import Dict, List, Optional, Union

import numpy as np
import torch
import torchaudio
import tqdm
from torch.nn import functional as F
from torch.utils.data import DataLoader
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Wav2Vec2FeatureExtractor,
)


class CustomDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        dataset: List[str],
        basedir: Optional[str] = None,
        sampling_rate: int = 16000,
        max_audio_len: int = 5,
    ):
        self.dataset = dataset
        self.basedir = basedir

        self.sampling_rate = sampling_rate
        self.max_audio_len = max_audio_len  # clip length in seconds

    def __len__(self):
        """Return the number of audio files in the dataset."""
        return len(self.dataset)

    def __getitem__(self, index):
        if self.basedir is None:
            filepath = self.dataset[index]
        else:
            filepath = os.path.join(self.basedir, self.dataset[index])

        speech_array, sr = torchaudio.load(filepath)

        # Downmix multi-channel audio to mono
        if speech_array.shape[0] > 1:
            speech_array = torch.mean(speech_array, dim=0, keepdim=True)

        # Resample to the rate the model expects
        if sr != self.sampling_rate:
            transform = torchaudio.transforms.Resample(sr, self.sampling_rate)
            speech_array = transform(speech_array)
            sr = self.sampling_rate

        len_audio = speech_array.shape[1]
        target_len = self.max_audio_len * self.sampling_rate

        # Pad or truncate the audio to match the desired length
        if len_audio < target_len:
            # Pad the audio with zeros if it's shorter than the desired length
            padding = torch.zeros(1, target_len - len_audio)
            speech_array = torch.cat([speech_array, padding], dim=1)
        else:
            # Truncate the audio if it's longer than the desired length
            speech_array = speech_array[:, :target_len]

        speech_array = speech_array.squeeze(0).numpy()

        # The attention mask is built later by the feature extractor in CollateFunc
        return {"input_values": speech_array}


class CollateFunc:
    def __init__(
        self,
        processor: Wav2Vec2FeatureExtractor,
        padding: Union[bool, str] = True,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: bool = True,
        sampling_rate: int = 16000,
        max_length: Optional[int] = None,
    ):
        self.sampling_rate = sampling_rate
        self.processor = processor
        self.padding = padding
        self.pad_to_multiple_of = pad_to_multiple_of
        self.return_attention_mask = return_attention_mask
        self.max_length = max_length

    def __call__(self, batch: List[Dict[str, np.ndarray]]):
        # Extract input_values from the batch
        input_values = [item["input_values"] for item in batch]

        batch = self.processor(
            input_values,
            sampling_rate=self.sampling_rate,
            return_tensors="pt",
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_attention_mask=self.return_attention_mask,
        )

        return {
            "input_values": batch.input_values,
            "attention_mask": batch.attention_mask if self.return_attention_mask else None,
        }


def predict(test_dataloader, model, device: torch.device):
    """Predict the class id of each audio clip."""
    model.to(device)
    model.eval()
    preds = []

    with torch.no_grad():
        for batch in tqdm.tqdm(test_dataloader):
            input_values = batch["input_values"].to(device)
            # The mask is None when the collator was built with return_attention_mask=False
            attention_mask = batch["attention_mask"]
            if attention_mask is not None:
                attention_mask = attention_mask.to(device)

            logits = model(input_values, attention_mask=attention_mask).logits
            scores = F.softmax(logits, dim=-1)

            pred = torch.argmax(scores, dim=1).cpu().numpy()

            preds.extend(pred)

    return preds


def get_gender(model_name_or_path: str, audio_paths: List[str], label2id: Dict, id2label: Dict, device: torch.device):
    num_labels = 2

    feature_extractor = AutoFeatureExtractor.from_pretrained(model_name_or_path)
    model = AutoModelForAudioClassification.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        num_labels=num_labels,
        label2id=label2id,
        id2label=id2label,
    )

    test_dataset = CustomDataset(audio_paths, max_audio_len=5)  # for 5-second audio

    data_collator = CollateFunc(
        processor=feature_extractor,
        padding=True,
        sampling_rate=16000,
    )

    test_dataloader = DataLoader(
        dataset=test_dataset,
        batch_size=16,
        collate_fn=data_collator,
        shuffle=False,
        num_workers=2,
    )

    preds = predict(test_dataloader=test_dataloader, model=model, device=device)

    return preds


# The guard keeps DataLoader worker processes from re-running this block
# on platforms that start workers with "spawn" (Windows, macOS).
if __name__ == "__main__":
    model_name_or_path = "alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech"

    audio_paths = []  # Must be a list with absolute paths of the audios that will be used in inference
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    label2id = {
        "female": 0,
        "male": 1,
    }

    id2label = {
        0: "female",
        1: "male",
    }

    # preds holds one class id per audio file (0 = female, 1 = male)
    preds = get_gender(model_name_or_path, audio_paths, label2id, id2label, device)
```
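
`predict` returns integer class ids rather than label strings. A minimal sketch of the final mapping step, assuming the `id2label` dictionary defined in the script above; `ids_to_labels` is a hypothetical helper and not part of the model repository:

```python
from typing import Dict, Iterable, List


def ids_to_labels(pred_ids: Iterable[int], id2label: Dict[int, str]) -> List[str]:
    """Translate predicted class ids into their label strings."""
    # int() also accepts NumPy integer scalars, which is what predict() yields
    return [id2label[int(i)] for i in pred_ids]


id2label = {0: "female", 1: "male"}
print(ids_to_labels([0, 1, 1], id2label))  # ['female', 'male', 'male']
```

With the script above, this would be called as `ids_to_labels(preds, id2label)`.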


## Training and evaluation data

The model was trained and evaluated on the Librispeech-clean-100 dataset, split into 70% training, 10% validation, and 20% test data.
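
The card does not say how the split was drawn (for example, whether speakers overlap between subsets). As a rough illustration only, the sketch below assumes a simple seeded shuffle over utterance paths; `split_dataset` and its parameters are hypothetical names, and the seed of 42 is borrowed from the training hyperparameters rather than known to be the split seed.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    train_frac: float = 0.7,
    val_frac: float = 0.1,
    seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle items reproducibly and cut them into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, about 20% with the defaults
    return train, val, test


# Placeholder file names, not real LibriSpeech paths
paths = [f"utt_{i}.flac" for i in range(100)]
train, val, test = split_dataset(paths)
print(len(train), len(val), len(test))  # 70 10 20
```

A speaker-disjoint split would need to group utterances by speaker id before shuffling, which this sketch does not do.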

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
- mixed_precision_training: Native AMP
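
Several of these values are related by simple arithmetic. The sketch below shows how the total batch size of 16 follows from the per-device batch size and gradient accumulation, and what a linear schedule with a 0.1 warmup ratio looks like over the 1248 optimizer steps reported in the training results. It assumes a single device and that the warmup step count is rounded up; `effective_batch_size` and `linear_lr` are hypothetical helpers that mimic a common linear warmup/decay schedule, not the Trainer's own code.

```python
import math

LEARNING_RATE = 3e-5
TRAIN_BATCH_SIZE = 4
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.1
TOTAL_STEPS = 1248  # optimizer steps for the single epoch


def effective_batch_size(per_device: int, accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return per_device * accum_steps * num_devices


def linear_lr(step: int, total_steps: int = TOTAL_STEPS,
              peak_lr: float = LEARNING_RATE, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a given step: linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))  # 16
print(math.ceil(TOTAL_STEPS * WARMUP_RATIO))                     # 125 warmup steps
print(linear_lr(0), linear_lr(125), linear_lr(TOTAL_STEPS))      # 0.0 3e-05 0.0
```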

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.002         | 1.0   | 1248 | 0.0061          | 0.9993 |
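
For readers who want to compute an F1 score on their own predictions, a minimal sketch follows. The card does not state the averaging mode or which class was treated as positive, so the example assumes binary F1 with `male` (id 1) as the positive class; `binary_f1` is a hypothetical helper, not the metric code used during training.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # Return 0.0 when the positive class never appears in labels or predictions
    return 2 * tp / denom if denom else 0.0


# Toy labels for illustration only, not model output
reference = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
print(round(binary_f1(reference, predicted), 4))  # 0.6667
```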


### Framework versions

- Transformers 4.28.0
- Pytorch 2.0.0+cu118
- Tokenizers 0.13.3