---
license: apache-2.0
tags:
- text-to-speech
language:
- zh
- en
- de
- es
- fr
- ja
- it
- he
- ko
- ru
- fa
- ar
- pl
- pt
- cs
- da
- sv
- hu
- el
- tr
---

# MOSS-TTS Family

<br>

<p align="center">
  <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/openmoss_x_mosi" height="50" align="middle" />
</p>

<div align="center">
  <a href="https://github.com/OpenMOSS/MOSS-TTS/tree/main"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
  <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
  <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
  <a href="https://arxiv.org/abs/2603.18090"><img src="https://img.shields.io/badge/Arxiv-2603.18090-red?logo=Arxiv&amp"></a>
  <a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
  <a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
  <a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a>
  <a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a>
</div>

## Overview

MOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high‑fidelity**, **high‑expressiveness**, and **complex real‑world scenarios**, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.

## Introduction

<p align="center">
  <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_family_arch.jpeg" width="85%" />
</p>

When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.

- **MOSS‑TTS**: The flagship production model, offering high fidelity and strong zero-shot voice cloning. It supports **long-speech generation**, **fine-grained control over Pinyin, phonemes, and duration**, as well as **multilingual/code-switched synthesis**.

- **MOSS‑TTSD**: A spoken dialogue generation model for expressive, multi-speaker, and ultra-long dialogues. The new **v1.0** release achieves **industry-leading performance on objective metrics** and **outperformed top closed-source models like Doubao and Gemini 2.5-pro** in subjective evaluations. See the [MOSS-TTSD repository](https://github.com/OpenMOSS/MOSS-TTSD) for details.

- **MOSS‑VoiceGenerator**: An open-source voice design model capable of generating diverse voices and styles directly from text prompts, **without any reference speech**. It unifies voice design, style control, and synthesis, functioning independently or as a design layer for downstream TTS. Its performance **surpasses other top-tier voice design models in arena ratings**.

- **MOSS‑TTS‑Realtime**: A multi-turn context-aware model for real-time voice agents. It uses incremental synthesis to ensure natural and coherent replies, making it **ideal for building low-latency voice agents when paired with text models**. The TTFB (Time To First Byte) of MOSS-TTS-Realtime reaches 180 ms, and the $T_{\text{LLM-first-sentence}} + T_{\text{MOSS-TTS-Realtime-TTFB}}$ is 377 ms.

- **MOSS‑SoundEffect**: A content creation model specialized in **sound effect generation** with wide category coverage and controllable duration. It generates audio for natural environments, urban scenes, biological sounds, human actions, and musical fragments, suitable for film, games, and interactive experiences.
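
For the MOSS-TTS-Realtime latency figures quoted in the list above, the share attributable to the text model follows by subtraction. This assumes the 377 ms figure is exactly the sum of the two terms as written, with no other overhead included:

$$
T_{\text{LLM-first-sentence}} = 377\,\text{ms} - T_{\text{MOSS-TTS-Realtime-TTFB}} = 377\,\text{ms} - 180\,\text{ms} = 197\,\text{ms}
$$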

## Model Architecture

We train **MossTTSDelay** and **MossTTSLocal** as complementary baselines under one training/evaluation setup: **Delay** emphasizes long-context stability, inference speed, and production readiness, while **Local** emphasizes lightweight flexibility and strong objective performance for streaming-oriented systems. Together they provide reproducible references for deployment and research.

**MossTTSRealtime** is not a third comparison baseline; it is a capability-driven design for voice agents. By modeling multi-turn context from both prior text and user acoustics, it delivers low-latency streaming speech that stays coherent and voice-consistent across turns.

| Architecture | Core Mechanism | Arch Details |
|---|---|---|
| `MossTTSDelay` | Multi‑head parallel RVQ prediction with delay‑pattern scheduling | [![Arch Details](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/README.md) |
| `MossTTSLocal` | Time‑synchronous RVQ blocks with a depth transformer | [![Arch Details](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_local/README.md) |
| `MossTTSRealtime` | Hierarchical text–audio inputs for realtime synthesis | [![Arch Details](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_realtime/README.md) |

## Released Models

| Model | Architecture | Size | Model Card | Hugging Face | ModelScope |
|---|---|---:|---|---|---|
| **MOSS-TTS** | `MossTTSDelay` | 8B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTS) |
|  | `MossTTSLocal` | 1.7B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTS-Local-Transformer) |
| **MOSS‑TTSD‑V1.0** | `MossTTSDelay` | 8B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_ttsd_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTSD-v1.0) |
| **MOSS‑VoiceGenerator** | `MossTTSDelay` | 1.7B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_voice_generator_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-VoiceGenerator) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-VoiceGenerator) |
| **MOSS‑SoundEffect** | `MossTTSDelay` | 8B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_sound_effect_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-SoundEffect) |
| **MOSS‑TTS‑Realtime** | `MossTTSRealtime` | 1.7B | [![Model Card](https://img.shields.io/badge/Model%20Card-View-blue?logo=markdown)](https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md) | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) | [![ModelScope](https://img.shields.io/badge/ModelScope-Model-lightgrey?logo=modelscope)](https://modelscope.cn/models/openmoss/MOSS-TTS-Realtime) |

## Supported Languages

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime currently support **20 languages**:

| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |
|---|---|---|---|---|---|---|---|---|
| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |
| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |
| Italian | it | 🇮🇹 | Hebrew | he | 🇮🇱 | Korean | ko | 🇰🇷 |
| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |
| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |
| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Hungarian | hu | 🇭🇺 |
| Greek | el | 🇬🇷 | Turkish | tr | 🇹🇷 | | | |

# MOSS-TTS

**MOSS-TTS** is a next-generation, production-grade TTS foundation model focused on **voice cloning**, **ultra-long stable speech generation**, **token-level duration control**, **multilingual & code-switched synthesis**, and **fine-grained Pinyin/phoneme-level pronunciation control**. It is built on a clean autoregressive discrete-token recipe that emphasizes high-quality audio tokenization, large-scale diverse pre-training data, and efficient discrete token modeling.

## 1. Overview

### 1.1 TTS Family Positioning

MOSS-TTS is the **flagship base model** in our open-source **TTS Family**. It is designed as a production-ready synthesis backbone that can serve as the primary high-quality engine for scalable voice applications, and as a strong research baseline for controllable TTS and discrete audio token modeling.

**Design goals**

- **Production readiness**: robust voice cloning with stable, on-brand speaker identity at scale
- **Controllability**: duration and pronunciation controls that integrate into real workflows
- **Long-form stability**: consistent identity and delivery for extended narration
- **Multilingual coverage**: multilingual and code-switched synthesis as first-class capabilities

### 1.2 Key Capabilities

MOSS-TTS delivers state-of-the-art quality while providing the fine-grained controllability and long-form stability required for production-grade voice applications, from zero-shot cloning and hour-long narration to token- and phoneme-level control across multilingual and code-switched speech.

* **State-of-the-art evaluation performance** — top-tier objective and subjective results across standard TTS benchmarks and in-house human preference testing, validating both fidelity and naturalness.

* **Zero-shot Voice Cloning (Voice Clone)** — clone a target speaker’s timbre (and part of speaking style) from short reference audio, without speaker-specific fine-tuning.

* **Ultra-long Speech Generation (up to 1 hour)** — support continuous long-form speech generation for up to one hour in a single run, designed for extended narration and long-session content creation.

* **Token-level Duration Control** — control pacing, rhythm, pauses, and speaking rate at token resolution for precise alignment and expressive delivery.

* **Phoneme-level Pronunciation Control** — supports:

  * pure **Pinyin** input
  * pure **IPA** phoneme input
  * mixed **Chinese / English / Pinyin / IPA** input in any combination

* **Multilingual support** — high-quality multilingual synthesis with robust generalization across languages and accents.

* **Code-switching** — natural mixed-language generation within a single utterance (e.g., Chinese–English), with smooth transitions, consistent speaker identity, and pronunciation-aware rendering on both sides of the switch.

### 1.3 Model Architecture

MOSS-TTS includes **two complementary architectures**, both trained and released to explore different performance/latency tradeoffs and to support downstream research.

**Architecture A: Delay Pattern (MossTTSDelay)**

- Single Transformer backbone with **(n_vq + 1) heads**.

- Uses **delay scheduling** for multi-codebook audio tokens.

- Strong long-context stability, efficient inference, and production-friendly behavior.
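
As a rough illustration of delay scheduling, the sketch below applies the generic delay pattern often used for multi-codebook (RVQ) token streams: codebook `k` is shifted right by `k` steps, so a single decoding step can predict one token per codebook in parallel. The function names, the padding value `PAD`, and the one-step-per-codebook offset are assumptions made for illustration; the actual layout used by MossTTSDelay may differ.

```python
from typing import List

PAD = -1  # hypothetical padding id, chosen only for this illustration


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps.

    codes[k][t] is the token of codebook k at frame t. Each returned row has
    length T + K - 1, padded so that all rows stay the same length.
    """
    n_vq = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (n_vq - 1 - k)
        for k, row in enumerate(codes)
    ]


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern and recover the frame-aligned codes."""
    n_vq = len(delayed)
    n_frames = len(delayed[0]) - (n_vq - 1)
    return [row[k : k + n_frames] for k, row in enumerate(delayed)]


codes = [
    [1, 2, 3],  # codebook 0
    [4, 5, 6],  # codebook 1
    [7, 8, 9],  # codebook 2
]
delayed = apply_delay_pattern(codes)
for row in delayed:
    print(row)
# [1, 2, 3, -1, -1]
# [-1, 4, 5, 6, -1]
# [-1, -1, 7, 8, 9]
```

Reading the delayed grid column by column gives what one decoding step would emit in this toy layout; `undo_delay_pattern(delayed)` restores the original frame-aligned codes before they are passed to an audio decoder.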

**Architecture B: Global Latent + Local Transformer (MossTTSLocal)**

- Backbone produces a **global latent** per time step.

- A lightweight **Local Transformer** emits a token block per step.

- **Streaming-friendly** with simpler alignment (no delay scheduling).
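
The control flow described above can be sketched as a toy loop: for each time step a backbone produces one global latent, and a local model then emits the whole token block for that step before the next step begins. Everything here is a stand-in for illustration; `generate_time_synchronous`, `toy_backbone`, and `toy_local_step` are hypothetical names, and the real MossTTSLocal components are neural networks, not the arithmetic stubs used below.

```python
from typing import Callable, List

Frames = List[List[int]]


def generate_time_synchronous(
    backbone: Callable[[Frames], int],
    local_step: Callable[[int, List[int]], int],
    n_steps: int,
    n_vq: int,
) -> Frames:
    """Emit n_vq tokens per time step, one complete block at a time."""
    frames: Frames = []
    for _ in range(n_steps):
        # One global latent per time step, conditioned on all previous frames.
        latent = backbone(frames)
        block: List[int] = []
        for _ in range(n_vq):
            # The local model sees the latent plus the tokens already
            # emitted inside the current block.
            block.append(local_step(latent, block))
        frames.append(block)
    return frames


def toy_backbone(frames: Frames) -> int:
    # Stand-in "latent": just the index of the current time step.
    return len(frames)


def toy_local_step(latent: int, block: List[int]) -> int:
    # Stand-in token: encodes (time step, position inside the block).
    return latent * 10 + len(block)


print(generate_time_synchronous(toy_backbone, toy_local_step, n_steps=2, n_vq=3))
# [[0, 1, 2], [10, 11, 12]]
```

Because every block is complete as soon as its time step finishes, frames can be handed to a decoder immediately, which is one way to read the "streaming-friendly, no delay scheduling" point above.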

**Why train both?**

- **Exploration of architectural potential** and validation across multiple generation paradigms.

- **Different tradeoffs**: Delay pattern tends to be faster and more stable for long-form synthesis; Local is smaller and excels on objective benchmarks.

- **Open-source value**: two strong baselines for research, ablation, and downstream innovation.

For full details, see:

- **[moss_tts_delay/README.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_delay/README.md)**

- **[moss_tts_local/README.md](https://github.com/OpenMOSS/MOSS-TTS/tree/main/moss_tts_local)**

### 1.4 Released Models

| Model | Description |
|---|---|
| **MossTTSDelay-8B** | **Recommended for production**. Faster inference, stronger long-context stability, and robust voice cloning quality. Best for large-scale deployment and long-form narration. |
| **MossTTSLocal-1.7B** | **Recommended for evaluation and research**. Smaller model size with SOTA objective metrics. Great for quick experiments, ablations, and academic studies. |

**Recommended decoding hyperparameters (per model)**

| Model | Temperature | Top-p | Top-k | Repetition penalty |
|---|---:|---:|---:|---:|
| **MossTTSDelay-8B** | 1.7 | 0.8 | 25 | 1.0 |
| **MossTTSLocal-1.7B** | 1.0 | 0.95 | 50 | 1.1 |

## 2. Quick Start

### Environment Setup

We recommend a clean, isolated Python environment with **Transformers 5.0.0** to avoid dependency conflicts.

```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```

Install all required dependencies:

```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```

#### (Optional) Install FlashAttention 2

For better speed and lower GPU memory usage, you can install FlashAttention 2 if your hardware supports it.

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]"
```

If your machine has limited RAM and many CPU cores, you can cap build parallelism:

```bash
MAX_JOBS=4 pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[flash-attn]"
```

Notes:

- Dependencies are managed in `pyproject.toml`, which currently pins `torch==2.9.1+cu128` and `torchaudio==2.9.1+cu128`.

- If FlashAttention 2 fails to build on your machine, you can skip it and use the default attention backend.

- FlashAttention 2 is only available on supported GPUs and is typically used with `torch.float16` or `torch.bfloat16`.
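
After installation, a quick sanity check can confirm that the environment matches the versions mentioned in this section. The sketch below is an illustrative helper, not part of the MOSS-TTS repository: `installed_version` and `check_versions` are hypothetical names, and the expected version prefixes are taken from the notes above (`torch` and `torchaudio` 2.9.1, Transformers 5.0.0). It only reads package metadata, so it does not import the packages themselves.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional, Tuple

# Version prefixes taken from this README; adjust them if pyproject.toml changes.
EXPECTED_VERSIONS = {
    "torch": "2.9.1",
    "torchaudio": "2.9.1",
    "transformers": "5.0.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected: Dict[str, str]) -> Dict[str, Tuple[Optional[str], bool]]:
    """Map each package to (installed version, whether it starts with the expected prefix)."""
    report = {}
    for name, prefix in expected.items():
        found = installed_version(name)
        report[name] = (found, found is not None and found.startswith(prefix))
    return report


for name, (found, ok) in check_versions(EXPECTED_VERSIONS).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={found} expected={EXPECTED_VERSIONS[name]}* [{status}]")

# FlashAttention 2 is optional; this only reports whether the module can be found.
print("flash_attn available:", importlib.util.find_spec("flash_attn") is not None)
```

A prefix match is used so that local build tags such as `2.9.1+cu128` still count as the expected version.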

### Basic Usage

> Tip: For production usage, prioritize **MossTTSDelay-8B**. The examples below use this model; **MossTTSLocal-1.7B** supports the same API, and a practical walkthrough is available in [moss_tts_local/README.md](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer).

MOSS-TTS provides a convenient `generate` interface for rapid usage. The examples below cover:

1. Direct generation (Chinese / English / Pinyin / IPA)
2. Voice cloning
3. Duration control

```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32


def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_1 = "亲爱的你，\n你好呀。\n\n今天，我想用最认真、最温柔的声音，对你说一些重要的话。\n这些话，像一颗小小的星星，希望能在你的心里慢慢发光。\n\n首先，我想祝你——\n每天都能平平安安、快快乐乐。\n\n希望你早上醒来的时候，\n窗外有光，屋子里很安静，\n你的心是轻轻的，没有着急，也没有害怕。\n\n希望你吃饭的时候胃口很好，\n走路的时候脚步稳稳，\n晚上睡觉的时候，能做一个又一个甜甜的梦。\n\n我希望你能一直保持好奇心。\n对世界充满问题，\n对天空、星星、花草、书本和故事感兴趣。\n当你问“为什么”的时候，\n希望总有人愿意认真地听你说话。\n\n我也希望你学会温柔。\n温柔地对待朋友，\n温柔地对待小动物，\n也温柔地对待自己。\n\n如果有一天你犯了错，\n请不要太快责怪自己，\n因为每一个认真成长的人，\n都会在路上慢慢学会更好的方法。\n\n愿你拥有勇气。\n当你站在陌生的地方时，\n当你第一次举手发言时，\n当你遇到困难、感到害怕的时候，\n希望你能轻轻地告诉自己：\n“我可以试一试。”\n\n就算没有一次成功，也没有关系。\n失败不是坏事，\n它只是告诉你，你正在努力。\n\n我希望你学会分享快乐。\n把开心的事情告诉别人，\n把笑声送给身边的人，\n因为快乐被分享的时候，\n会变得更大、更亮。\n\n如果有一天你感到难过，\n我希望你知道——\n难过并不丢脸，\n哭泣也不是软弱。\n\n愿你能找到一个安全的地方，\n慢慢把心里的话说出来，\n然后再一次抬起头，看见希望。\n\n我还希望你能拥有梦想。\n这个梦想也许很大，\n也许很小，\n也许现在还说不清楚。\n\n没关系。\n梦想会和你一起长大，\n在时间里慢慢变得清楚。\n\n最后，我想送你一个最最重要的祝福：\n\n愿你被世界温柔对待，\n也愿你成为一个温柔的人。\n\n愿你的每一天，\n都值得被记住，\n都值得被珍惜。\n\n亲爱的你，\n请记住，\n你是独一无二的，\n你已经很棒了，\n而你的未来，\n一定会慢慢变得闪闪发光。\n\n祝你健康、勇敢、幸福，\n祝你永远带着笑容向前走。"

text_2 = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision. It has learned to see, hear, speak, and think, and is beginning to become an extension of human capabilities. AI is not about replacing humans, but about amplifying human creativity, making knowledge more equitable, more efficient, and allowing imagination to reach further. A new era, jointly shaped by humans and intelligent systems, has arrived."

text_3 = "nin2 hao3，qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4？"
text_4 = "nin2 hao3，qing4 wen3 nin2 lai2 zi4 na4 zuo3 cheng4 shi3？"
text_5 = "您好，请问您来自哪 zuo4 cheng2 shi4？"
text_6 = "/həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/"

# To avoid downloading from the cloud, use the local audio files in ./assets/audio instead.
ref_audio_1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Direct TTS (no reference)
    [processor.build_user_message(text=text_1)],
    [processor.build_user_message(text=text_2)],
    # Pinyin or IPA input
    [processor.build_user_message(text=text_3)],
    [processor.build_user_message(text=text_4)],
    [processor.build_user_message(text=text_5)],
    [processor.build_user_message(text=text_6)],
    # Voice cloning (with reference)
    [processor.build_user_message(text=text_1, reference=[ref_audio_1])],
    [processor.build_user_message(text=text_2, reference=[ref_audio_2])],
    # Duration control
    [processor.build_user_message(text=text_2, tokens=325)],
    [processor.build_user_message(text=text_2, tokens=600)],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    # If FlashAttention 2 is installed, you can set attn_implementation="flash_attention_2"
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="generation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        # Decode each generated message and save it as a WAV file.
        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
```

### Continuation + Voice Cloning (Prefix Audio + Text)

MOSS-TTS supports continuation-based cloning: provide a prefix audio clip in the assistant message, and make sure the **prefix transcript** is included in the text. The model continues in the same speaker identity and style.

```python
from pathlib import Path
import importlib.util
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

# Disable the broken cuDNN SDPA backend
torch.backends.cuda.enable_cudnn_sdp(False)
# Keep these enabled as fallbacks
torch.backends.cuda.enable_flash_sdp(True)
torch.backends.cuda.enable_mem_efficient_sdp(True)
torch.backends.cuda.enable_math_sdp(True)


pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTS"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32


def resolve_attn_implementation() -> str:
    # Prefer FlashAttention 2 when package + device conditions are met.
    if (
        device == "cuda"
        and importlib.util.find_spec("flash_attn") is not None
        and dtype in {torch.float16, torch.bfloat16}
    ):
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"

    # CUDA fallback: use PyTorch SDPA kernels.
    if device == "cuda":
        return "sdpa"

    # CPU fallback.
    return "eager"


attn_implementation = resolve_attn_implementation()
print(f"[INFO] Using attn_implementation={attn_implementation}")

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

text_1 = "亲爱的你，\n你好呀。\n\n今天，我想用最认真、最温柔的声音，对你说一些重要的话。\n这些话，像一颗小小的星星，希望能在你的心里慢慢发光。\n\n首先，我想祝你——\n每天都能平平安安、快快乐乐。\n\n希望你早上醒来的时候，\n窗外有光，屋子里很安静，\n你的心是轻轻的，没有着急，也没有害怕。\n\n希望你吃饭的时候胃口很好，\n走路的时候脚步稳稳，\n晚上睡觉的时候，能做一个又一个甜甜的梦。\n\n我希望你能一直保持好奇心。\n对世界充满问题，\n对天空、星星、花草、书本和故事感兴趣。\n当你问“为什么”的时候，\n希望总有人愿意认真地听你说话。\n\n我也希望你学会温柔。\n温柔地对待朋友，\n温柔地对待小动物，\n也温柔地对待自己。\n\n如果有一天你犯了错，\n请不要太快责怪自己，\n因为每一个认真成长的人，\n都会在路上慢慢学会更好的方法。\n\n愿你拥有勇气。\n当你站在陌生的地方时，\n当你第一次举手发言时，\n当你遇到困难、感到害怕的时候，\n希望你能轻轻地告诉自己：\n“我可以试一试。”\n\n就算没有一次成功，也没有关系。\n失败不是坏事，\n它只是告诉你，你正在努力。\n\n我希望你学会分享快乐。\n把开心的事情告诉别人，\n把笑声送给身边的人，\n因为快乐被分享的时候，\n会变得更大、更亮。\n\n如果有一天你感到难过，\n我希望你知道——\n难过并不丢脸，\n哭泣也不是软弱。\n\n愿你能找到一个安全的地方，\n慢慢把心里的话说出来，\n然后再一次抬起头，看见希望。\n\n我还希望你能拥有梦想。\n这个梦想也许很大，\n也许很小，\n也许现在还说不清楚。\n\n没关系。\n梦想会和你一起长大，\n在时间里慢慢变得清楚。\n\n最后，我想送你一个最最重要的祝福：\n\n愿你被世界温柔对待，\n也愿你成为一个温柔的人。\n\n愿你的每一天，\n都值得被记住，\n都值得被珍惜。\n\n亲爱的你，\n请记住，\n你是独一无二的，\n你已经很棒了，\n而你的未来，\n一定会慢慢变得闪闪发光。\n\n祝你健康、勇敢、幸福，\n祝你永远带着笑容向前走。"

text_2 = "We stand on the threshold of the AI era.\nArtificial intelligence is no longer just a concept in laboratories, but is entering every industry, every creative endeavor, and every decision. It has learned to see, hear, speak, and think, and is beginning to become an extension of human capabilities. AI is not about replacing humans, but about amplifying human creativity, making knowledge more equitable, more efficient, and allowing imagination to reach further. A new era, jointly shaped by humans and intelligent systems, has arrived."

# Transcripts of the prefix (reference) audio clips.
ref_text_1 = "太阳系八大行星之一。"
ref_text_2 = "But I really can't complain about not having a normal college experience to you."
# To avoid downloading from the cloud, use the local audio files in ./assets/audio instead.
ref_audio_1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_zh.wav"
ref_audio_2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_en.m4a"

conversations = [
    # Continuation only
    [
        processor.build_user_message(text=ref_text_1 + text_1),
        processor.build_assistant_message(audio_codes_list=[ref_audio_1]),
    ],
    # Continuation with voice cloning
    [
        processor.build_user_message(text=ref_text_2 + text_2, reference=[ref_audio_2]),
        processor.build_assistant_message(audio_codes_list=[ref_audio_2]),
    ],
]

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    # If FlashAttention 2 is installed, you can set attn_implementation="flash_attention_2"
    attn_implementation=attn_implementation,
    torch_dtype=dtype,
).to(device)
model.eval()

batch_size = 1

save_dir = Path("inference_root")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="continuation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=4096,
        )

        # Decode each generated message and save it as a WAV file.
        for message in processor.decode(outputs):
            audio = message.audio_codes_list[0]
            out_path = save_dir / f"sample{sample_idx}.wav"
            sample_idx += 1
            torchaudio.save(out_path, audio.unsqueeze(0), processor.model_config.sampling_rate)
```

### Input Types

**UserMessage**

| Field | Type | Required | Description |
|---|---|---:|---|
| `text` | `str` | Yes | Text to synthesize. Supports Chinese, English, German, French, Spanish, Japanese, Korean, etc. Can mix raw text with Pinyin or IPA for pronunciation control. |
| `reference` | `List[str]` | No | Reference audio for voice cloning. For current MOSS-TTS, **one audio** is expected in the list. |
| `tokens` | `int` | No | Expected number of audio tokens. **1s ≈ 12.5 tokens**. |

**AssistantMessage**

| Field | Type | Required | Description |
|---|---|---:|---|
| `audio_codes_list` | `List[str]` | Only for continuation | Prefix audio for continuation-based cloning. Use audio file paths or URLs. |

### Generation Hyperparameters

| Parameter | Type | Default | Description |
|---|---|---:|---|
| `max_new_tokens` | `int` | - | Controls total generated audio tokens. Use duration rule: **1s ≈ 12.5 tokens**. |
| `audio_top_k` | `int` | 25 | Top-K sampling. Lower values tighten sampling space. |

> Note: MOSS-TTS is a pretrained base model and is **sensitive to decoding hyperparameters**. See **Released Models** for recommended defaults.
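
The duration rule of thumb used for `tokens` and `max_new_tokens` (1 s ≈ 12.5 tokens) is easy to wrap in a pair of helpers. The sketch below is only a convenience illustration: `seconds_to_tokens`, `tokens_to_seconds`, and `TOKENS_PER_SECOND` are hypothetical names that are not part of the MOSS-TTS API, and the conversion is approximate by the document's own wording.

```python
import math

# Rule of thumb from this README: 1 second of audio is roughly 12.5 tokens.
TOKENS_PER_SECOND = 12.5


def seconds_to_tokens(seconds: float) -> int:
    """Approximate token budget for a target duration, rounded up."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return math.ceil(seconds * TOKENS_PER_SECOND)


def tokens_to_seconds(tokens: int) -> float:
    """Approximate audio duration in seconds for a token budget."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / TOKENS_PER_SECOND


print(seconds_to_tokens(26))    # 325
print(tokens_to_seconds(600))   # 48.0
print(tokens_to_seconds(4096))  # 327.68
print(seconds_to_tokens(3600))  # 45000
```

Under this rule, the `tokens=325` and `tokens=600` duration-control examples correspond to roughly 26 s and 48 s of audio, the `max_new_tokens=4096` setting used in the examples caps output at about 5.5 minutes, and an hour-long generation would need a budget of about 45,000 tokens.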

### Pinyin Input

Use tone-numbered Pinyin such as `ni3 hao3 wo3 men1`. You can convert Chinese text with [pypinyin](https://github.com/mozillazg/python-pinyin), then adjust tones for pronunciation control.

```python
import re
from pypinyin import pinyin, Style

CN_PUNCT = r"，。！？；：、（）“”‘’"


def fix_punctuation_spacing(s: str) -> str:
    # Remove whitespace before and after Chinese punctuation.
    s = re.sub(rf"\s+([{CN_PUNCT}])", r"\1", s)
    s = re.sub(rf"([{CN_PUNCT}])\s+", r"\1", s)
    return s


def zh_to_pinyin_tone3(text: str, strict: bool = True) -> str:
    result = pinyin(
        text,
        style=Style.TONE3,
        heteronym=False,
        strict=strict,
        errors="default",
    )

    s = " ".join(item[0] for item in result)
    return fix_punctuation_spacing(s)


text = zh_to_pinyin_tone3("您好，请问您来自哪座城市？")
print(text)

# Expected: nin2 hao3，qing3 wen4 nin2 lai2 zi4 na3 zuo4 cheng2 shi4？
# Try: nin2 hao3，qing4 wen3 nin2 lai2 zi4 na4 zuo3 cheng4 shi3？
```

### IPA Input

Use `/.../` to wrap IPA sequences so they are distinct from normal text. You can use [DeepPhonemizer](https://github.com/spring-media/DeepPhonemizer) to convert English paragraphs or words into IPA sequences.

```python
from dp.phonemizer import Phonemizer

# Download a phonemizer checkpoint from https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_ipa_forward.pt
model_path = "<path-to-phonemizer-checkpoint>"
phonemizer = Phonemizer.from_checkpoint(model_path)

english_texts = "Hello, may I ask which city you are from?"
phoneme_outputs = phonemizer(
    english_texts,
    lang="en_us",
    batch_size=8,
)
# Wrap the IPA sequence in /.../ so the model treats it as phoneme input.
model_input_text = f"/{phoneme_outputs}/"
print(model_input_text)

# Expected: /həloʊ, meɪ aɪ æsk wɪtʃ sɪti juː ɑːr frʌm?/
```

## 3. Evaluation

MOSS-TTS achieved state-of-the-art results on the open-source zero-shot TTS benchmark Seed-TTS-eval, not only surpassing all open-source models but also rivaling the most powerful closed-source models.

| Model | Params | Open-source | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---:|:---:|---:|---:|---:|---:|
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 |
| FishAudio‑S1 | 4B | ❌ | 1.72 | 62.57 | 1.22 | 72.1 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72 | 1.12 | 78.1 |
| Seed‑TTS | | ❌ | 2.25 | 76.2 | 1.12 | 79.6 |
| MiniMax‑Speech | | ❌ | 1.65 | 69.2 | 0.83 | 78.3 |
| | | | | | | |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 |
| CosyVoice3 | 0.5B | ✅ | 2.02 | 71.8 | 1.16 | 78 |
| F5‑TTS | 0.3B | ✅ | 2 | 67 | 1.53 | 76 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66 |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46 | 1.51 | 63.5 |
| FireRedTTS‑2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 |
| Qwen2.5‑Omni | 7B | ✅ | 2.72 | 63.2 | 1.7 | 75.2 |
| FishAudio‑S1‑mini | 0.5B | ✅ | 1.94 | 55 | 1.18 | 68.5 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 |
| HiggsAudio‑v2 | 3B | ✅ | 2.44 | 67.7 | 1.5 | 74 |
| GLM-TTS | 1.5B | ✅ | 2.23 | 67.2 | 1.03 | 76.1 |
| GLM-TTS-RL | 1.5B | ✅ | 1.91 | 68.1 | **0.89** | 76.4 |
| VoxCPM | 0.5B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 |
| Qwen3‑TTS | 0.6B | ✅ | 1.68 | 70.39 | 1.23 | 76.4 |
| Qwen3‑TTS | 1.7B | ✅ | **1.5** | 71.45 | 1.33 | 76.72 |
| | | | | | | |
| **MossTTSDelay** | **8B** | ✅ | 1.84 | 70.86 | 1.37 | 76.98 |
| **MossTTSLocal** | **1.7B** | ✅ | 1.93 | **73.28** | 1.44 | **79.62** |