---
language: en
datasets:
- msp-podcast
inference: true
tags:
- speech
- audio
- wav2vec2
- audio-classification
- emotion-recognition
license: cc-by-nc-sa-4.0
pipeline_tag: audio-classification
---

# Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0

Please note that this model is for research purposes only.
A commercial license for a model
that has been trained on much more data
can be acquired from [audEERING](https://www.audeering.com/products/devaice/).
The model expects a raw audio signal as input
and outputs predictions for arousal, dominance, and valence in a range of approximately 0 to 1.
In addition,
it provides the pooled states of the last transformer layer.
The model was created by fine-tuning
[Wav2Vec2-Large-Robust](https://huggingface.co/facebook/wav2vec2-large-robust)
on [MSP-Podcast](https://ecs.utdallas.edu/research/researchlabs/msp-lab/MSP-Podcast.html) (v1.7).
The model was pruned from 24 to 12 transformer layers before fine-tuning.
An [ONNX](https://onnx.ai/) export of the model is available from [doi:10.5281/zenodo.6221127](https://zenodo.org/record/6221127).
Further details are given in the associated [paper](https://arxiv.org/abs/2203.07378) and [tutorial](https://github.com/audeering/w2v2-how-to).
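
The ONNX export can be run with [ONNX Runtime](https://onnxruntime.ai/).
The snippet below is a minimal sketch only:
it assumes the file downloaded from Zenodo is stored locally as `model.onnx`
and reads the input name from the session instead of assuming one;
the raw signal should be normalized in the same way as in the usage example below.

```python
import numpy as np
import onnxruntime

# open the exported model (the path is an assumption, adjust to your download)
session = onnxruntime.InferenceSession(
    'model.onnx',
    providers=['CPUExecutionProvider'],
)

# discover the input name instead of hard-coding it
input_name = session.get_inputs()[0].name

# one second of silence at 16 kHz as a dummy input
signal = np.zeros((1, 16000), dtype=np.float32)

# request all outputs (the PyTorch model returns pooled hidden states and logits)
outputs = session.run(None, {input_name: signal})
print([o.shape for o in outputs])
```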

# Usage

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)


class RegressionHead(nn.Module):
    r"""Regression head."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class EmotionModel(Wav2Vec2PreTrainedModel):
    r"""Speech emotion classifier."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = RegressionHead(config)
        self.init_weights()

    def forward(self, input_values):
        # average the hidden states of the last transformer layer over time
        # and predict arousal, dominance, and valence from the pooled vector
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        hidden_states = torch.mean(hidden_states, dim=1)
        logits = self.classifier(hidden_states)
        return hidden_states, logits


# load model from hub
device = 'cpu'
model_name = 'audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim'
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = EmotionModel.from_pretrained(model_name).to(device)

# dummy signal
sampling_rate = 16000
signal = np.zeros((1, sampling_rate), dtype=np.float32)


def process_func(
    x: np.ndarray,
    sampling_rate: int,
    embeddings: bool = False,
) -> np.ndarray:
    r"""Predict emotions or extract embeddings from raw audio signal."""

    # run through processor to normalize signal
    # always returns a batch, so we just get the first entry
    # then we put it on the device
    y = processor(x, sampling_rate=sampling_rate)
    y = y['input_values'][0]
    y = y.reshape(1, -1)
    y = torch.from_numpy(y).to(device)

    # run through model
    with torch.no_grad():
        y = model(y)[0 if embeddings else 1]

    # convert to numpy
    y = y.detach().cpu().numpy()

    return y


print(process_func(signal, sampling_rate))
#  Arousal    dominance  valence
# [[0.5460754  0.6062266  0.40431657]]

print(process_func(signal, sampling_rate, embeddings=True))
# Pooled hidden states of last transformer layer
# [[-0.00752167  0.0065819  -0.00746342 ...  0.00663632  0.00848748
#    0.00599211]]
```
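
To run the model on real recordings instead of the dummy signal,
the audio needs to be available as a mono signal sampled at 16 kHz.
The following sketch continues from the example above;
it assumes [librosa](https://librosa.org/) is installed
and uses a hypothetical file path `speech.wav`,
but any loader that returns a `float32` array at 16 kHz works just as well.

```python
import librosa

# load a file and resample it to the 16 kHz mono input expected by the model
# ('speech.wav' is a placeholder path)
signal, _ = librosa.load('speech.wav', sr=16000, mono=True)
signal = signal.reshape(1, -1).astype(np.float32)

# arousal, dominance, valence
print(process_func(signal, 16000))

# pooled hidden states of the last transformer layer
print(process_func(signal, 16000, embeddings=True))
```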