---
language:
- zh
- en
- ja
- ko
- es
- pt
- ar
- ru
- fr
- de
- sv
- it
- tr
- 'no'
- nl
- cy
- eu
- ca
- da
- gl
- ta
- hu
- fi
- pl
- et
- hi
- la
- ur
- th
- vi
- jw
- bn
- yo
- sl
- cs
- sw
- nn
- he
- ms
- uk
- id
- kk
- bg
- lv
- my
- tl
- sk
- ne
- fa
- af
- el
- bo
- hr
- ro
- sn
- mi
- yi
- am
- be
- km
- is
- az
- sd
- br
- sq
- ps
- mn
- ht
- ml
- sr
- sa
- te
- ka
- bs
- pa
- lt
- kn
- si
- hy
- mr
- as
- gu
- fo
license: other
license_name: fish-audio-research-license
license_link: LICENSE.md
pipeline_tag: text-to-speech
tags:
- text-to-speech
- instruction-following
- multilingual
inference: false
extra_gated_prompt: You agree to not use the model to generate contents that violate
  DMCA or local laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox
---

# Fish Audio S2 Pro

<img src="overview.png" alt="Fish Audio S2 Pro overview: fine-grained control, multi-speaker multi-turn generation, low-latency streaming, and long-context inference." width="100%">

[**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio)

**Fish Audio S2 Pro** is a text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio across 80+ languages, it combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

## Architecture

S2 Pro pairs a decoder-only transformer with an RVQ-based audio codec (10 codebooks, ~21 Hz frame rate) in a **Dual-Autoregressive (Dual-AR)** architecture:

- **Slow AR** (4B parameters): Operates along the time axis and predicts the primary semantic codebook.
- **Fast AR** (400M parameters): Generates the remaining 9 residual codebooks at each time step, reconstructing fine-grained acoustic detail.

This asymmetric design keeps inference efficient while preserving audio fidelity. Because the Dual-AR architecture is structurally isomorphic to standard autoregressive LLMs, it inherits LLM-native serving optimizations from SGLang, including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.

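The two-level generation loop described above can be illustrated with a toy sketch. This is not the S2 Pro implementation: `slow_ar_step` and `fast_ar_step` are hypothetical stand-ins that draw random tokens, `CODEBOOK_SIZE` is an assumed value not stated in this README, and generating the residual codebooks one at a time is an assumption about how the Fast AR operates. Only the loop structure and the token arithmetic (10 codebooks at roughly 21 frames per second) come from the text.

```python
import random

NUM_CODEBOOKS = 10    # from the text: 10 RVQ codebooks
FRAME_RATE_HZ = 21    # from the text: ~21 Hz frame rate
CODEBOOK_SIZE = 1024  # assumption: codebook size is not given in this README


def slow_ar_step(history, rng):
    """Hypothetical stand-in for the Slow AR: one semantic token per time step."""
    # A real model would condition on `history`; this toy version samples randomly.
    return rng.randrange(CODEBOOK_SIZE)


def fast_ar_step(semantic_token, rng):
    """Hypothetical stand-in for the Fast AR: fill in the 9 residual codebooks."""
    frame = [semantic_token]
    for _ in range(NUM_CODEBOOKS - 1):
        # Assumed: each residual token is produced in turn for the current frame.
        frame.append(rng.randrange(CODEBOOK_SIZE))
    return frame


def generate(num_frames, seed=0):
    """Outer loop over time (Slow AR), inner loop over codebooks (Fast AR)."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        semantic = slow_ar_step(frames, rng)
        frames.append(fast_ar_step(semantic, rng))
    return frames


frames = generate(num_frames=FRAME_RATE_HZ)  # roughly one second of audio
tokens_per_second = NUM_CODEBOOKS * FRAME_RATE_HZ
print(len(frames), "frames,", tokens_per_second, "codec tokens per second of audio")
```

In this sketch the large model runs once per frame and the small model handles the nine remaining tokens, which is the asymmetry the section refers to.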
## Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly in the text using `[tag]` syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts **free-form textual descriptions**, such as `[whisper in small voice]`, `[professional broadcast tone]`, or `[pitch up]`, allowing open-ended expression control at the word level.

**Common tags (15,000+ unique tags supported):**

`[pause]` `[emphasis]` `[laughing]` `[inhale]` `[chuckle]` `[tsk]` `[singing]` `[excited]` `[laughing tone]` `[interrupting]` `[chuckling]` `[excited tone]` `[volume up]` `[echo]` `[angry]` `[low volume]` `[sigh]` `[low voice]` `[whisper]` `[screaming]` `[shouting]` `[loud]` `[surprised]` `[short pause]` `[exhale]` `[delight]` `[panting]` `[audience laughter]` `[with strong accent]` `[volume down]` `[clearing throat]` `[sad]` `[moaning]` `[shocked]`

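To check how a prompt mixes tags and spoken text before sending it to the model, a small client-side helper can split the input on the `[tag]` syntax. This is a minimal sketch, not part of the Fish Speech codebase: `split_inline_tags` is a hypothetical name, and the model itself may tokenize tags differently.

```python
import re

# Matches a bracketed instruction such as [pause] or [whisper in small voice].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_inline_tags(text):
    """Return ("tag", ...) and ("text", ...) segments in their original order."""
    segments = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        segments.append(("tag", match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


example = "[whisper] I have a secret. [pause] Do not tell anyone! [laughing]"
segments = split_inline_tags(example)
for kind, value in segments:
    print(f"{kind:>4}: {value}")
```

Because tags are free-form, the helper does not validate them against a fixed list; it only shows where each instruction sits relative to the words around it.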
## Supported Languages

S2 Pro supports 80+ languages.

**Tier 1:** Japanese (ja), English (en), Chinese (zh)

**Tier 2:** Korean (ko), Spanish (es), Portuguese (pt), Arabic (ar), Russian (ru), French (fr), German (de)

**Other supported languages:** sv, it, tr, no, nl, cy, eu, ca, da, gl, ta, hu, fi, pl, et, hi, la, ur, th, vi, jw, bn, yo, sl, cs, sw, nn, he, ms, uk, id, kk, bg, lv, my, tl, sk, ne, fa, af, el, bo, hr, ro, sn, mi, yi, am, be, km, is, az, sd, br, sq, ps, mn, ht, ml, sr, sa, te, ka, bs, pa, lt, kn, si, hy, mr, as, gu, fo, and more.

## Production Streaming Performance

On a single NVIDIA H200 GPU:

- **Real-Time Factor (RTF):** 0.195
- **Time-to-first-audio:** ~100 ms
- **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5

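To put the RTF figure in concrete terms, the sketch below assumes the usual definition of real-time factor (generation time divided by the duration of the audio produced), which this README does not spell out. The helper names are illustrative only.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF under the assumed definition: compute time / audio duration."""
    return generation_seconds / audio_seconds


def generation_time(audio_seconds, rtf):
    """Compute time needed to synthesize `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


REPORTED_RTF = 0.195  # from the figures above (single NVIDIA H200)

one_minute = generation_time(60.0, REPORTED_RTF)
speedup = 1.0 / REPORTED_RTF

print(f"60 s of audio takes about {one_minute:.1f} s to generate")
print(f"That is roughly {speedup:.1f}x faster than real time")
```

Under that definition, any RTF below 1.0 means audio is generated faster than it plays back, which is the condition streaming needs to avoid stalls.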
## Links

- [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
- [Fish Audio Playground](https://fish.audio)
- [Blog & Tech Report](https://fish.audio/blog/fish-audio-open-sources-s2/)

## Technical Report

If you find our work useful, please consider citing our report:

```bibtex
@misc{liao2026fishaudios2technical,
      title={Fish Audio S2 Technical Report},
      author={Shijia Liao and Yuxuan Wang and Songting Liu and Yifan Cheng and Ruoyi Zhang and Tianyu Li and Shidong Li and Yisheng Zheng and Xingwei Liu and Qingzheng Wang and Zhizhuo Zhou and Jiahua Liu and Xin Chen and Dawei Han},
      year={2026},
      eprint={2603.08823},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2603.08823},
}
```

## License

This model is licensed under the [Fish Audio Research License](LICENSE.md). Research and non-commercial use are permitted free of charge. Commercial use requires a separate license from Fish Audio; contact business@fish.audio.