---
license: cc-by-nc-4.0
inference: false
tags:
- music
pipeline_tag: audio-classification
---

# Introduction to our model series

The development log of our Music Audio Pre-training (m-a-p) model family:

- 02/06/2023: [arXiv pre-print](https://arxiv.org/abs/2306.00107) and training [code](https://github.com/yizhilll/MERT) released.
- 17/03/2023: we released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
- 14/03/2023: we retrained the MERT-v0 model on an open-source-only music dataset and released it as [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
- 29/12/2022: a music understanding model, [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), trained with the **MLM** paradigm, which performs better on downstream tasks.
- 29/10/2022: a pre-trained MIR model, [music2vec](https://huggingface.co/m-a-p/music2vec-v1), trained with the **BYOL** paradigm.

Here is a table for quick model selection:

| Name                                                          | Pre-train Paradigm | Training Data (hours) | Pre-train Context (seconds) | Model Size | Transformer Layer-Dimension | Feature Rate | Sample Rate | Release Date |
| ------------------------------------------------------------- | ------------------ | --------------------- | --------------------------- | ---------- | --------------------------- | ------------ | ----------- | ------------ |
| [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M)     | MLM                | 160K                  | 5                           | 330M       | 24-1024                     | 75 Hz        | 24 kHz      | 17/03/2023   |
| [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M)       | MLM                | 20K                   | 5                           | 95M        | 12-768                      | 75 Hz        | 24 kHz      | 17/03/2023   |
| [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public) | MLM                | 900                   | 5                           | 95M        | 12-768                      | 50 Hz        | 16 kHz      | 14/03/2023   |
| [MERT-v0](https://huggingface.co/m-a-p/MERT-v0)               | MLM                | 1000                  | 5                           | 95M        | 12-768                      | 50 Hz        | 16 kHz      | 29/12/2022   |
| [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1)     | BYOL               | 1000                  | 30                          | 95M        | 12-768                      | 50 Hz        | 16 kHz      | 30/10/2022   |

## Explanation

The m-a-p models share a similar architecture; the most distinguishing difference is the pre-training paradigm. Beyond that, there are several technical configuration details you should know before using them:

- **Model Size**: the number of parameters that will be loaded into memory. Please select a size that fits your hardware.
- **Transformer Layer-Dimension**: the number of transformer layers and the corresponding feature dimension that the model outputs. This is highlighted because features extracted from **different layers can perform differently depending on the task**.
- **Feature Rate**: the number of features the model outputs for 1 second of audio input.
- **Sample Rate**: the sampling rate of the audio the model was trained on.
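
To make the Feature Rate and Model Size columns concrete, the following sketch works through the arithmetic. The helper names `expected_frames` and `approx_weight_memory_gb` are hypothetical and not part of any m-a-p API. The frame count is an estimate (the exact number can differ slightly depending on how the convolutional front end handles clip boundaries), and the memory figure assumes 4-byte float32 weights and ignores activations.

```python
def expected_frames(duration_seconds: float, feature_rate_hz: float) -> int:
    """Approximate number of feature frames for a clip of the given length."""
    return int(duration_seconds * feature_rate_hz)


def approx_weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 4) -> float:
    """Rough memory needed to hold the weights alone (assumes float32 by default)."""
    return num_parameters * bytes_per_parameter / 1e9


# MERT-v1 models: 75 Hz feature rate, so a 5-second clip gives about 375 frames.
print(expected_frames(5, 75))
# MERT-v0 models: 50 Hz feature rate, so the same clip gives about 250 frames.
print(expected_frames(5, 50))
# 330M parameters in float32: roughly 1.32 GB for the weights alone.
print(approx_weight_memory_gb(330e6))
```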

# Introduction to MERT-v1

Compared to MERT-v0, we introduce several changes in MERT-v1 pre-training:

- Change the pseudo labels to 8 codebooks from [encodec](https://github.com/facebookresearch/encodec), which potentially have higher quality and enable our model to support music generation.
- MLM prediction with in-batch noise mixture.
- Train with a higher audio sample rate (24 kHz).
- Train with more audio data (up to 160 thousand hours).
- More available model sizes: 95M and 330M.
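
The "in-batch noise mixture" item can be pictured with a small sketch. This is only one plausible reading of the idea, not the authors' training code: the function name `mix_in_batch_noise`, the mixing probability, and the noise gain are all illustrative assumptions. With some probability, each waveform in a batch is overlaid with another waveform drawn from the same batch; the masked-prediction targets would presumably still be derived from the clean audio.

```python
from typing import Optional

import torch


def mix_in_batch_noise(
    waveforms: torch.Tensor,
    mix_prob: float = 0.5,    # assumed value, for illustration only
    noise_gain: float = 0.3,  # assumed value, for illustration only
    generator: Optional[torch.Generator] = None,
) -> torch.Tensor:
    """Overlay some waveforms in a [batch, samples] tensor with other batch items."""
    batch_size = waveforms.shape[0]
    # Pick a random partner for every item by shuffling the batch indices.
    # A partner may coincide with the item itself; a fuller version would exclude that.
    partners = torch.randperm(batch_size, generator=generator)
    # Decide independently for each item whether it receives noise.
    apply = torch.rand(batch_size, generator=generator) < mix_prob
    # Items that are not selected get a gain of zero and stay unchanged.
    gains = apply.to(waveforms.dtype).unsqueeze(1) * noise_gain
    return waveforms + gains * waveforms[partners]


batch = torch.randn(4, 24000)  # four 1-second clips at 24 kHz (random stand-ins)
noisy = mix_in_batch_noise(batch)
print(noisy.shape)  # same shape as the input batch
```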

More details are available in our [arXiv pre-print](https://arxiv.org/abs/2306.00107).

# Model Usage

```python
# from transformers import Wav2Vec2Processor
from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset

# load the model weights
model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)
# load the corresponding preprocessor config
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True)

# load demo audio and set up the processor
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
# make sure the sample rates are aligned
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate)
else:
    resampler = None

# audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    # cast to float32 so the dtype matches the resampling kernel
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]).float())

inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# take a look at the output shape; there are 25 layers of representation
# each layer performs differently on different downstream tasks, so choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [25 layer, Time steps, 1024 feature_dim]

# for utterance-level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [25, 1024]

# you can even use a learnable weighted average representation
aggregator = nn.Conv1d(in_channels=25, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
print(weighted_avg_hidden_states.shape) # [1024]
```
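
Because the models were pre-trained on 5-second contexts (see the table above), one option for longer recordings is to process them in 5-second chunks and aggregate the per-chunk features. Whether chunking actually helps for a given task is an assumption to verify empirically. The helper `chunk_bounds` below is hypothetical; it only computes sample index ranges, so it works with NumPy arrays and tensors alike.

```python
from typing import List, Tuple


def chunk_bounds(num_samples: int,
                 sample_rate: int = 24000,
                 chunk_seconds: float = 5.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices that split a waveform into fixed-length chunks."""
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size <= 0:
        raise ValueError("chunk_seconds * sample_rate must be positive")
    # The last chunk is shorter when the clip length is not a multiple of chunk_size.
    return [(start, min(start + chunk_size, num_samples))
            for start in range(0, num_samples, chunk_size)]


# A 12-second clip at 24 kHz: two full 5-second chunks and a 2-second remainder.
for start, end in chunk_bounds(12 * 24000):
    print(start, end)
```

Each range can then be used to slice the resampled waveform, for example `input_audio[start:end]`, before passing the slice to the processor.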

# Citation

```bibtex
@misc{li2023mert,
      title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
      author={Yizhi Li and Ruibin Yuan and Ge Zhang and Yinghao Ma and Xingran Chen and Hanzhi Yin and Chenghua Lin and Anton Ragni and Emmanouil Benetos and Norbert Gyenge and Roger Dannenberg and Ruibo Liu and Wenhu Chen and Gus Xia and Yemin Shi and Wenhao Huang and Yike Guo and Jie Fu},
      year={2023},
      eprint={2306.00107},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```