README.md · s2-pro | QuantaMrkt


---
language:
- zh
- en
- ja
- ko
- es
- pt
- ar
- ru
- fr
- de
- sv
- it
- tr
- 'no'
- nl
- cy
- eu
- ca
- da
- gl
- ta
- hu
- fi
- pl
- et
- hi
- la
- ur
- th
- vi
- jw
- bn
- yo
- sl
- cs
- sw
- nn
- he
- ms
- uk
- id
- kk
- bg
- lv
- my
- tl
- sk
- ne
- fa
- af
- el
- bo
- hr
- ro
- sn
- mi
- yi
- am
- be
- km
- is
- az
- sd
- br
- sq
- ps
- mn
- ht
- ml
- sr
- sa
- te
- ka
- bs
- pa
- lt
- kn
- si
- hy
- mr
- as
- gu
- fo
license: other
license_name: fish-audio-research-license
license_link: LICENSE.md
pipeline_tag: text-to-speech
tags:
- text-to-speech
- instruction-following
- multilingual
inference: false
extra_gated_prompt: You agree to not use the model to generate contents that violate
  DMCA or local laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Fish Audio S2 Pro

<img src="overview.png" alt="Fish Audio S2 Pro overview: fine-grained control, multi-speaker multi-turn generation, low-latency streaming, and long-context inference." width="100%">

[Technical Report](https://huggingface.co/papers/2603.08823) | [GitHub](https://github.com/fishaudio/fish-speech) | [Playground](https://fish.audio)

Fish Audio S2 Pro is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio data across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

## Architecture

S2 Pro builds on a decoder-only transformer combined with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate) using a Dual-Autoregressive (Dual-AR) architecture:

- Slow AR (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
- Fast AR (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it inherits all LLM-native serving optimizations from SGLang, including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.

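The division of labor between the two decoders can be illustrated with a toy generation loop. This is only a sketch of the control flow, not the released implementation: `slow_ar_step` and `fast_ar_step` are hypothetical stand-ins that return random token IDs instead of calling a real model, and `CODEBOOK_SIZE` is an assumed value that this README does not state.

```python
import random

NUM_CODEBOOKS = 10    # from the text: 10 RVQ codebooks per frame
CODEBOOK_SIZE = 1024  # assumption: the per-codebook vocabulary size is not stated


def slow_ar_step(previous_frames, rng):
    """Hypothetical stand-in for the Slow AR: one semantic token per time step."""
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_token, residuals_so_far, rng):
    """Hypothetical stand-in for the Fast AR: next residual token inside a frame."""
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_frames, seed=0):
    """Outer loop walks the time axis; inner loop fills the 9 residual codebooks."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        semantic = slow_ar_step(frames, rng)
        residuals = []
        for _ in range(NUM_CODEBOOKS - 1):
            residuals.append(fast_ar_step(semantic, residuals, rng))
        frames.append([semantic] + residuals)
    return frames


frames = generate_frames(num_frames=21)  # roughly one second of audio at ~21 Hz
print(len(frames), len(frames[0]))       # prints "21 10"
```

In this sketch each frame costs one outer step and nine inner steps, so the inner decoder runs far more often than the outer one, which is consistent with the README pairing a large Slow AR with a much smaller Fast AR.
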
## Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using `[tag]` syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts free-form textual descriptions, such as `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`, allowing open-ended expression control at the word level.

Common tags (15,000+ unique tags supported):

`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`

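When reviewing a prompt before sending it to the model, it can help to list the bracketed instructions it contains. The helper below is a hypothetical client-side utility, not part of the Fish Audio codebase. It assumes tags never nest, and it says nothing about how the model itself tokenizes or interprets them.

```python
import re

# Matches one [tag]: an opening bracket, any run of non-bracket characters, a closing bracket.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_tags(text):
    """Return (tags, plain_text) for a prompt that uses inline [tag] syntax."""
    tags = [tag.strip() for tag in TAG_PATTERN.findall(text)]
    plain = TAG_PATTERN.sub("", text)
    plain = re.sub(r"\s+", " ", plain).strip()  # collapse gaps left by removed tags
    return tags, plain


prompt = "[whisper in small voice] I have a secret. [pause] Do not tell anyone. [laughing]"
tags, plain = split_inline_tags(prompt)
print(tags)   # ['whisper in small voice', 'pause', 'laughing']
print(plain)  # I have a secret. Do not tell anyone.
```
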
## Supported Languages

S2 Pro supports 80+ languages.

Tier 1: Japanese (ja), English (en), Chinese (zh)

Tier 2: Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

Other supported languages: sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

## Production Streaming Performance

On a single NVIDIA H200 GPU:

- Real-Time Factor (RTF): 0.195
- Time-to-first-audio: ~100 ms
- Throughput: 3,000+ acoustic tokens/s while maintaining RTF below 0.5

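To make the RTF figure concrete, the arithmetic below uses the usual definition of real-time factor (synthesis time divided by audio duration). It is a back-of-the-envelope sketch: the function names are illustrative, and the token count per second of audio assumes every frame carries one token from each of the 10 codebooks at roughly 21 Hz, which may not match how the throughput figure above counts acoustic tokens.

```python
RTF = 0.195          # from the list above
FRAME_RATE_HZ = 21   # approximate codec frame rate from the Architecture section
NUM_CODEBOOKS = 10   # from the Architecture section


def synthesis_seconds(audio_seconds, rtf=RTF):
    """Wall-clock time needed to synthesize the given amount of audio."""
    return audio_seconds * rtf


def realtime_speedup(rtf=RTF):
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


def codec_tokens_per_audio_second(frame_rate=FRAME_RATE_HZ, codebooks=NUM_CODEBOOKS):
    """Assumed token count per second of audio: one token per codebook per frame."""
    return frame_rate * codebooks


print(round(synthesis_seconds(60), 2))   # 11.7 seconds to synthesize one minute of audio
print(round(realtime_speedup(), 2))      # 5.13 times faster than real time
print(codec_tokens_per_audio_second())   # 210 tokens per second of audio (assumption)
```
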
## Links

- [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
- [Fish Audio Playground](https://fish.audio)
- [Blog & Tech Report](https://fish.audio/blog/fish-audio-open-sources-s2/)

## Technical Report

If you find our work useful, please consider citing our report:

```bibtex
@misc{liao2026fishaudios2technical,
  title={Fish Audio S2 Technical Report},
  author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
  year={2026},
  eprint={2603.08823},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2603.08823},
}
```

## License

This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use are permitted free of charge. Commercial use requires a separate license from Fish Audio; contact business@fish.audio.