---
license: mit
tags:
- mlx
- speaker-embedding
- speaker-verification
- speaker-diarization
- wespeaker
- resnet
- apple-silicon
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
library_name: mlx
pipeline_tag: audio-classification
---

# WeSpeaker ResNet34-LM — MLX

MLX-compatible weights for [WeSpeaker ResNet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM), converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.

## Model

WeSpeaker ResNet34-LM is a ~6.6M-parameter model that maps audio to 256-dimensional L2-normalized speaker embeddings. It was trained on VoxCeleb for speaker verification and diarization.
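
The model consumes 80-dimensional log-mel filterbank features at 16 kHz rather than raw waveforms. As a rough sketch of how matching features could be computed using torchaudio's Kaldi-compatible `fbank` (the exact parameters, e.g. dither and mean normalization, are assumptions here, not taken from this repo's pipeline):

```python
import torch
import torchaudio.compliance.kaldi as kaldi

def fbank_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: [1, num_samples] float tensor at 16 kHz -> [T, 80] log-mel features."""
    feats = kaldi.fbank(
        waveform,
        num_mel_bins=80,           # 80 filterbank channels, matching the model input
        frame_length=25.0,         # window length in ms
        frame_shift=10.0,          # hop in ms
        sample_frequency=16000.0,
        dither=0.0,
    )
    # Per-utterance mean normalization (common in WeSpeaker-style pipelines)
    return feats - feats.mean(dim=0, keepdim=True)
```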

**Architecture:**

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16kHz)
│
├─ Conv2d(1→32, k=3, p=1) + ReLU
├─ Layer1: 3× BasicBlock(32→32)
├─ Layer2: 4× BasicBlock(32→64, stride=2)
├─ Layer3: 6× BasicBlock(64→128, stride=2)
├─ Layer4: 3× BasicBlock(128→256, stride=2)
│
├─ Statistics Pooling: mean + std → [B, 5120]
├─ Linear(5120→256) → L2 normalize
│
Output: [B, 256] speaker embedding
```

BatchNorm is fused into Conv2d at conversion time — no BN layers in the MLX model.
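
The 5120-dimensional pooling input follows from the three stride-2 stages: the 80-bin frequency axis halves to 40, 20, then 10, so each frame carries 256 × 10 = 2560 channel-frequency features, and concatenating their temporal mean and standard deviation doubles that to 5120. A minimal numpy sketch of the pooling step (the channels-last layout is assumed from the diagram above, not verified against the implementation):

```python
import numpy as np

def statistics_pooling(x: np.ndarray) -> np.ndarray:
    """x: [B, T', 10, 256] feature map after Layer4 -> [B, 5120] pooled statistics."""
    b, t = x.shape[0], x.shape[1]
    x = x.reshape(b, t, -1)                     # [B, T', 2560]: flatten freq × channels
    mean = x.mean(axis=1)                       # [B, 2560] temporal mean
    std = x.std(axis=1)                         # [B, 2560] temporal std
    return np.concatenate([mean, std], axis=1)  # [B, 5120]
```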

## Usage (Swift / MLX)

```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```
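
Because the embeddings are already L2-normalized, cosine similarity reduces to a plain dot product. For quick experiments outside Swift, an equivalent check (illustrative numpy, not part of this library):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; for L2-normalized embeddings this equals np.dot(a, b)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```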

Part of [speech-swift](https://github.com/soniqo/speech-swift).

## Conversion

```bash
python3 scripts/convert_wespeaker.py --upload
```

Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:

- **Fuse BatchNorm** into Conv2d: `w_fused = w × γ/√(σ²+ε)`, `b_fused = β − μ×γ/√(σ²+ε)` (see the sketch after this list)
- **Transpose Conv2d** weights: `[O, I, H, W]` → `[O, H, W, I]` for MLX channels-last
- **Rename**: strip `resnet.` prefix, `seg_1` → `embedding`
- **Drop** `num_batches_tracked` keys
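
A numpy sketch of the first two transformations, writing the BatchNorm parameters as `gamma`, `beta`, running `mean`, and running `var` (illustrative only; the conversion script is the source of truth):

```python
import numpy as np

def fuse_and_transpose(w, gamma, beta, mean, var, eps=1e-5):
    """Fuse BatchNorm into a bias-free Conv2d and convert the weight to MLX layout.

    w: [O, I, H, W] conv weight; gamma/beta/mean/var: [O] BatchNorm parameters.
    Returns (w_fused [O, H, W, I], b_fused [O]).
    """
    scale = gamma / np.sqrt(var + eps)             # γ/√(σ²+ε), per output channel
    w_fused = w * scale[:, None, None, None]       # w × γ/√(σ²+ε)
    b_fused = beta - mean * scale                  # β − μ×γ/√(σ²+ε)
    return w_fused.transpose(0, 2, 3, 1), b_fused  # [O, I, H, W] → [O, H, W, I]
```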

## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|-------------|---------|-------|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
| `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
| `resnet.seg_1.bias` | `embedding.bias` | [256] |
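
A minimal illustration of the renaming rules in the table (a hypothetical helper, not the script's actual code; BatchNorm parameter keys are consumed by the fusion step rather than renamed):

```python
from typing import Optional

def rename_key(key: str) -> Optional[str]:
    """Map a PyTorch checkpoint key to its MLX name; None means the key is dropped."""
    if key.endswith("num_batches_tracked"):
        return None                             # dropped at conversion time
    key = key.removeprefix("resnet.")           # strip the resnet. prefix
    key = key.replace("seg_1", "embedding")     # seg_1 → embedding
    # shortcut.0 is the conv; the following BatchNorm (shortcut.1) is fused away
    return key.replace("shortcut.0", "shortcut")
```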

## License

The original WeSpeaker model is released under the [MIT License](https://github.com/wenet-e2e/wespeaker/blob/master/LICENSE).

---

- **Guide**: [soniqo.audio/guides/embed-speaker](https://soniqo.audio/guides/embed-speaker)
- **Docs**: [soniqo.audio](https://soniqo.audio)
- **GitHub**: [soniqo/speech-swift](https://github.com/soniqo/speech-swift)