---
license: apache-2.0
tags:
- generated_from_trainer
datasets:
- librispeech_asr
metrics:
- f1
base_model: facebook/wav2vec2-xls-r-300m
model-index:
- name: weights
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# wav2vec2-large-xlsr-53-gender-recognition-librispeech

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on Librispeech-clean-100 for gender recognition.
It achieves the following results on the evaluation set:
- Loss: 0.0061
- F1: 0.9993

### Running inference

```python
import os
from typing import Dict, List, Optional, Union

import numpy as np
import torch
import torchaudio
import tqdm
from torch.nn import functional as F
from torch.utils.data import DataLoader
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    Wav2Vec2FeatureExtractor,
)


class CustomDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        dataset: List,
        basedir: Optional[str] = None,
        sampling_rate: int = 16000,
        max_audio_len: int = 5,
    ):
        self.dataset = dataset
        self.basedir = basedir

        self.sampling_rate = sampling_rate
        self.max_audio_len = max_audio_len  # maximum clip length, in seconds

    def __len__(self):
        """
        Return the length of the dataset
        """
        return len(self.dataset)

    def __getitem__(self, index):
        if self.basedir is None:
            filepath = self.dataset[index]
        else:
            filepath = os.path.join(self.basedir, self.dataset[index])

        speech_array, sr = torchaudio.load(filepath)

        # Downmix multi-channel audio to mono
        if speech_array.shape[0] > 1:
            speech_array = torch.mean(speech_array, dim=0, keepdim=True)

        # Resample to the sampling rate the model expects
        if sr != self.sampling_rate:
            transform = torchaudio.transforms.Resample(sr, self.sampling_rate)
            speech_array = transform(speech_array)
            sr = self.sampling_rate

        len_audio = speech_array.shape[1]

        # Pad or truncate the audio to match the desired length
        if len_audio < self.max_audio_len * self.sampling_rate:
            # Pad the audio if it's shorter than the desired length
            padding = torch.zeros(1, self.max_audio_len * self.sampling_rate - len_audio)
            speech_array = torch.cat([speech_array, padding], dim=1)
        else:
            # Truncate the audio if it's longer than the desired length
            speech_array = speech_array[:, :self.max_audio_len * self.sampling_rate]

        speech_array = speech_array.squeeze().numpy()

        return {"input_values": speech_array, "attention_mask": None}


class CollateFunc:
    def __init__(
        self,
        processor: Wav2Vec2FeatureExtractor,
        padding: Union[bool, str] = True,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: bool = True,
        sampling_rate: int = 16000,
        max_length: Optional[int] = None,
    ):
        self.sampling_rate = sampling_rate
        self.processor = processor
        self.padding = padding
        self.pad_to_multiple_of = pad_to_multiple_of
        self.return_attention_mask = return_attention_mask
        self.max_length = max_length

    def __call__(self, batch: List[Dict[str, np.ndarray]]):
        # Extract input_values from the batch
        input_values = [item["input_values"] for item in batch]

        batch = self.processor(
            input_values,
            sampling_rate=self.sampling_rate,
            return_tensors="pt",
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_attention_mask=self.return_attention_mask,
        )

        return {
            "input_values": batch.input_values,
            "attention_mask": batch.attention_mask if self.return_attention_mask else None,
        }


def predict(test_dataloader, model, device: torch.device):
    """
    Predict the class of the audio
    """
    model.to(device)
    model.eval()
    preds = []

    with torch.no_grad():
        for batch in tqdm.tqdm(test_dataloader):
            input_values = batch["input_values"].to(device)
            # The mask is None when the collator was built with return_attention_mask=False
            attention_mask = batch["attention_mask"]
            if attention_mask is not None:
                attention_mask = attention_mask.to(device)

            logits = model(input_values, attention_mask=attention_mask).logits
            scores = F.softmax(logits, dim=-1)

            pred = torch.argmax(scores, dim=1).cpu().numpy()

            preds.extend(pred)

    return preds


def get_gender(model_name_or_path: str, audio_paths: List[str], label2id: Dict, id2label: Dict, device: torch.device):
    num_labels = 2

    feature_extractor = AutoFeatureExtractor.from_pretrained(model_name_or_path)
    model = AutoModelForAudioClassification.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        num_labels=num_labels,
        label2id=label2id,
        id2label=id2label,
    )

    test_dataset = CustomDataset(audio_paths, max_audio_len=5)  # for 5-second audio

    data_collator = CollateFunc(
        processor=feature_extractor,
        padding=True,
        sampling_rate=16000,
    )

    test_dataloader = DataLoader(
        dataset=test_dataset,
        batch_size=16,
        collate_fn=data_collator,
        shuffle=False,
        num_workers=2,
    )

    preds = predict(test_dataloader=test_dataloader, model=model, device=device)

    return preds


# The guard keeps DataLoader worker processes from re-running this section
# on platforms that start workers with "spawn" (Windows, macOS).
if __name__ == "__main__":
    model_name_or_path = "alefiury/wav2vec2-large-xlsr-53-gender-recognition-librispeech"

    audio_paths = []  # Must be a list with absolute paths of the audios that will be used in inference
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    label2id = {
        "female": 0,
        "male": 1,
    }

    id2label = {
        0: "female",
        1: "male",
    }

    # preds holds one integer class id per entry in audio_paths, in the same order
    preds = get_gender(model_name_or_path, audio_paths, label2id, id2label, device)
```
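
The inference code above returns integer class ids rather than label strings. A minimal sketch of turning them into readable results, assuming the predictions come back in the same order as `audio_paths` (the `DataLoader` above uses `shuffle=False`). The helper name `label_predictions` and the output keys are hypothetical and not part of the model repository:

```python
from typing import Dict, List, Sequence


def label_predictions(
    audio_paths: Sequence[str],
    preds: Sequence[int],
    id2label: Dict[int, str],
) -> List[Dict[str, str]]:
    """Pair each audio path with the label name of its predicted class id.

    Hypothetical helper: assumes preds[i] belongs to audio_paths[i].
    """
    if len(audio_paths) != len(preds):
        raise ValueError("audio_paths and preds must have the same length")
    # int() also accepts NumPy integer scalars, which predict() may return
    return [
        {"path": path, "gender": id2label[int(class_id)]}
        for path, class_id in zip(audio_paths, preds)
    ]


if __name__ == "__main__":
    example = label_predictions(
        ["/data/a.wav", "/data/b.wav"],  # placeholder paths
        [1, 0],
        {0: "female", 1: "male"},
    )
    for row in example:
        print(row["path"], row["gender"])
```

With the variables from the inference script, the call would be `label_predictions(audio_paths, preds, id2label)`.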

## Training and evaluation data

The Librispeech-clean-100 dataset was used to train the model, with 70% of the data used for training, 10% for validation, and 20% for testing.

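A 70/10/20 split like the one described above can be sketched as follows. This is an illustration only: the card does not say whether the split was made per utterance or per speaker, nor which seed was used, so the function name `split_70_10_20`, the per-item shuffling, and the seed are all assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_10_20(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items reproducibly and cut them into 70% / 10% / 20% parts.

    Hypothetical helper: the seed and the per-item shuffle are assumptions,
    not details taken from the original training run.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding surprises
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 10 // 100

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, about 20%
    return train, val, test


if __name__ == "__main__":
    files = [f"utt_{i:04d}.flac" for i in range(100)]  # placeholder file names
    train, val, test = split_70_10_20(files)
    print(len(train), len(val), len(test))
```

If speaker leakage between the subsets matters for your evaluation, split the list of speaker ids instead of the list of utterances.
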
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 3e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
- mixed_precision_training: Native AMP

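The hyperparameters above imply two derived quantities: the effective batch size and the learning rate at each step. A minimal sketch of both in plain Python, assuming a single training device and assuming that the 1248 steps reported in the training results are the total number of optimizer steps. The function names are hypothetical. The schedule follows the usual linear-warmup, linear-decay shape and is not taken from the original training script.

```python
import math

LEARNING_RATE = 3e-05
TRAIN_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
WARMUP_RATIO = 0.1
TOTAL_STEPS = 1248  # assumption: total optimizer steps, from the results table


def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device * accumulation_steps * num_devices


def linear_schedule_lr(
    step: int,
    total_steps: int = TOTAL_STEPS,
    base_lr: float = LEARNING_RATE,
    warmup_ratio: float = WARMUP_RATIO,
) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    print(effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
    for step in (0, 62, 125, 686, 1248):
        print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate climbs from zero to 3e-05 over the first 125 steps and then decays linearly to zero at step 1248.
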
### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.002         | 1.0   | 1248 | 0.0061          | 0.9993 |

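For reference, an F1 score like the one in the table can be computed from true and predicted class ids as sketched below. The card does not state which averaging mode was used, so this binary F1 with class 1 as the positive class is an assumption, and `binary_f1` is a hypothetical helper rather than the metric code used in training.

```python
from typing import Sequence


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for one positive class.

    Returns 0.0 when there are no true or predicted positives.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    labels = [1, 1, 0, 0]       # toy ground truth
    predictions = [1, 0, 0, 0]  # toy model output
    print(round(binary_f1(labels, predictions), 4))
```
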
### Framework versions

- Transformers 4.28.0
- Pytorch 2.0.0+cu118
- Tokenizers 0.13.3
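
To compare a local environment against the versions listed above, a small standard-library check can help. This is a sketch: `CARD_VERSIONS` and `compare_versions` are hypothetical names, and the distribution names `transformers`, `torch`, and `tokenizers` are the usual pip package names, assumed here rather than stated in the card.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Versions from the "Framework versions" list; keys are assumed pip distribution names
CARD_VERSIONS = {
    "transformers": "4.28.0",
    "torch": "2.0.0+cu118",
    "tokenizers": "0.13.3",
}


def compare_versions(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package to (expected version, installed version or None)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed)
    return report


if __name__ == "__main__":
    for package, (wanted, installed) in compare_versions(CARD_VERSIONS).items():
        status = "ok" if installed == wanted else "differs"
        print(f"{package}: card={wanted} installed={installed} ({status})")
```

A version mismatch does not necessarily mean the model will fail to load. The check only shows where an environment differs from the one used for training.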