---
license: mit
tags:
- mlx
- speaker-embedding
- speaker-verification
- speaker-diarization
- wespeaker
- resnet
- apple-silicon
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
library_name: mlx
pipeline_tag: audio-classification
---

# WeSpeaker ResNet34-LM — MLX

MLX-compatible weights for [WeSpeaker ResNet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM), converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.

## Model

WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M params) that produces 256-dimensional L2-normalized speaker embeddings from audio. It was trained on VoxCeleb for speaker verification and diarization.

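Because the embeddings are unit-length, cosine similarity between two speakers reduces to a plain dot product, e.g. in Python:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Embeddings are already L2-normalized, so cosine similarity is just a dot product.
    return float(np.dot(a, b))
```
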
**Architecture:**

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16kHz)

├─ Conv2d(1→32, k=3, p=1) + ReLU
├─ Layer1: 3× BasicBlock(32→32)
├─ Layer2: 4× BasicBlock(32→64, stride=2)
├─ Layer3: 6× BasicBlock(64→128, stride=2)
├─ Layer4: 3× BasicBlock(128→256, stride=2)

├─ Statistics Pooling: mean + std → [B, 5120]
├─ Linear(5120→256) → L2 normalize

Output: [B, 256] speaker embedding
```

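The pooling head flattens the Layer4 feature map, pools mean and standard deviation over time, and projects to the 256-dim embedding. A minimal MLX sketch of that step (the function name and the std epsilon are illustrative, not part of the library):

```python
import mlx.core as mx

def pool_and_embed(x: mx.array, w: mx.array, b: mx.array) -> mx.array:
    """Statistics pooling + embedding head; shapes follow the diagram above.
    x: [B, T, F, C] Layer4 output, e.g. [B, T', 10, 256] (channels-last).
    w: [256, 5120] embedding weight, b: [256] embedding bias."""
    B, T, F, C = x.shape
    x = x.reshape(B, T, F * C)                        # [B, T, 2560]
    mean = mx.mean(x, axis=1)                         # [B, 2560]
    std = mx.sqrt(mx.var(x, axis=1) + 1e-7)           # [B, 2560]
    stats = mx.concatenate([mean, std], axis=-1)      # [B, 5120]
    emb = stats @ w.T + b                             # Linear(5120 → 256)
    return emb / mx.linalg.norm(emb, axis=-1, keepdims=True)
```
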
BatchNorm is fused into Conv2d at conversion time — no BN layers in the MLX model.

## Usage (Swift / MLX)

```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```

Part of [speech-swift](https://github.com/soniqo/speech-swift).

## Conversion

```bash
python3 scripts/convert_wespeaker.py --upload
```

Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:

- **Fuse BatchNorm** into Conv2d: `w_fused = w × γ/√(σ²+ε)`, `b_fused = β − μ×γ/√(σ²+ε)` (see the sketch after this list)
- **Transpose Conv2d** weights: `[O, I, H, W]` → `[O, H, W, I]` for MLX channels-last
- **Rename**: strip `resnet.` prefix, `seg_1` → `embedding`
- **Drop** `num_batches_tracked` keys

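In code, the fusion and layout transpose amount to a few NumPy lines. A sketch of the math above, not the script itself:

```python
import numpy as np

def fuse_and_transpose(w, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into a bias-free Conv2d, then convert to MLX layout.
    w: [O, I, H, W] PyTorch conv weight; gamma/beta/mean/var: [O] BN params."""
    scale = gamma / np.sqrt(var + eps)             # γ/√(σ²+ε)
    w_fused = w * scale[:, None, None, None]       # w × γ/√(σ²+ε), per output channel
    b_fused = beta - mean * scale                  # β − μ×γ/√(σ²+ε)
    return w_fused.transpose(0, 2, 3, 1), b_fused  # [O, I, H, W] → [O, H, W, I]
```
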
## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|-------------|---------|-------|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
| `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
| `resnet.seg_1.bias` | `embedding.bias` | [256] |

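The renames in the table reduce to a small mapping function; a hypothetical sketch (not the script's actual helper):

```python
def remap_key(key: str) -> str | None:
    """Map a PyTorch checkpoint key to its MLX name (None = dropped).
    BN keys are consumed by the fusion step and never reach this point."""
    if key.endswith("num_batches_tracked"):
        return None
    key = key.removeprefix("resnet.")
    key = key.replace("shortcut.0.", "shortcut.")  # fused shortcut conv keeps the bare name
    return key.replace("seg_1", "embedding")       # final linear layer
```
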
## License

The original WeSpeaker model is released under the [MIT License](https://github.com/wenet-e2e/wespeaker/blob/master/LICENSE).

---

- **Guide**: [soniqo.audio/guides/embed-speaker](https://soniqo.audio/guides/embed-speaker)
- **Docs**: [soniqo.audio](https://soniqo.audio)
- **GitHub**: [soniqo/speech-swift](https://github.com/soniqo/speech-swift)