---
license: cc-by-nc-4.0
tags:
- mms
- vits
pipeline_tag: text-to-speech
---

# Massively Multilingual Speech (MMS): Haitian Creole Text-to-Speech

This repository contains the **Haitian Creole (hat)** language text-to-speech (TTS) model checkpoint.

This model is part of Facebook's [Massively Multilingual Speech](https://arxiv.org/abs/2305.13516) project, which aims to
provide speech technology for a diverse range of languages. You can find more details about the supported languages
and their ISO 639-3 codes in the [MMS Language Coverage Overview](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html),
and see all MMS-TTS checkpoints on the Hugging Face Hub: [facebook/mms-tts](https://huggingface.co/models?sort=trending&search=facebook%2Fmms-tts).

MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards.

## Model Details

VITS (**V**ariational **I**nference with adversarial learning for end-to-end **T**ext-to-**S**peech) is an end-to-end
speech synthesis model that predicts a speech waveform conditional on an input text sequence. It is a conditional variational
autoencoder (VAE) comprising a posterior encoder, a decoder, and a conditional prior.

A set of spectrogram-based acoustic features is predicted by the flow-based module, which is formed of a Transformer-based
text encoder and multiple coupling layers. The spectrogram is decoded using a stack of transposed convolutional layers,
much in the same style as the HiFi-GAN vocoder. Motivated by the one-to-many nature of the TTS problem, whereby the same text
input can be spoken in multiple ways, the model also includes a stochastic duration predictor, which allows it to
synthesise speech with different rhythms from the same input text.
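
In the 🤗 Transformers implementation, these components are exposed as submodules of `VitsModel`. The snippet below is a
minimal inspection sketch; the attribute names (`text_encoder`, `flow`, `decoder`, `posterior_encoder`,
`duration_predictor`) are taken from the Transformers VITS code and should be checked against your installed version:

```python
from transformers import VitsModel

model = VitsModel.from_pretrained("facebook/mms-tts-hat")

# Conditional-VAE components described above (attribute names assumed from
# the 🤗 Transformers VITS implementation):
print(type(model.text_encoder).__name__)        # Transformer-based text encoder
print(type(model.flow).__name__)                # normalizing-flow coupling layers
print(type(model.decoder).__name__)             # HiFi-GAN-style transposed-conv decoder
print(type(model.posterior_encoder).__name__)   # posterior encoder (used during training)
print(type(model.duration_predictor).__name__)  # stochastic duration predictor
```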

The model is trained end-to-end with a combination of losses derived from the variational lower bound and adversarial training.
To improve the expressiveness of the model, normalizing flows are applied to the conditional prior distribution. During
inference, the text encodings are up-sampled based on the duration prediction module, and then mapped into the
waveform using a cascade of the flow module and HiFi-GAN decoder. Due to the stochastic nature of the duration predictor,
the model is non-deterministic, and thus requires a fixed seed to generate the same speech waveform.
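
As a minimal reproducibility sketch, the random state can be fixed with the `set_seed` utility from 🤗 Transformers
before calling the model (this reuses the `model` and `inputs` objects defined in the Usage section below):

```python
import torch
from transformers import set_seed

set_seed(555)  # fix the random state so repeated runs yield the same waveform

with torch.no_grad():
    output = model(**inputs).waveform
```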

For the MMS project, a separate VITS checkpoint is trained for each language.

## Usage

MMS-TTS is available in the 🤗 Transformers library from version 4.33 onwards. To use this checkpoint,
first install the latest version of the library:

```
pip install --upgrade transformers accelerate
```

Then, run inference with the following code snippet:

```python
from transformers import VitsModel, AutoTokenizer
import torch

# Load the Haitian Creole (hat) checkpoint and its matching tokenizer
model = VitsModel.from_pretrained("facebook/mms-tts-hat")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-hat")

# "Hello, how are you?" in Haitian Creole
text = "Bonjou, kijan ou ye?"
inputs = tokenizer(text, return_tensors="pt")

# Generate the speech waveform (no gradients needed for inference)
with torch.no_grad():
    output = model(**inputs).waveform
```
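
Because the duration predictor is stochastic, each call can produce a slightly different rhythm. As a hedged sketch, the
pace and variability of the generated speech can be adjusted through the `speaking_rate` and `noise_scale` attributes;
these names are taken from the 🤗 Transformers VITS configuration and should be verified against your installed version:

```python
# Optional generation controls (attribute names assumed from the
# 🤗 Transformers VITS implementation):
model.speaking_rate = 1.2  # > 1.0 is faster speech, < 1.0 is slower
model.noise_scale = 0.8    # lower values give flatter, more stable prosody

with torch.no_grad():
    output = model(**inputs).waveform
```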

The resulting waveform can be saved as a `.wav` file:

```python
import scipy.io.wavfile

# scipy expects a NumPy array, so squeeze out the batch dimension first
scipy.io.wavfile.write("synthesized_speech.wav", rate=model.config.sampling_rate, data=output.squeeze().numpy())
```

Or displayed in a Jupyter Notebook / Google Colab:

```python
from IPython.display import Audio

# Pass a 1-D array for mono playback
Audio(output.squeeze().numpy(), rate=model.config.sampling_rate)
```

## BibTeX citation

This model was developed by Vineel Pratap et al. from Meta AI. If you use the model, consider citing the MMS paper:

```
@article{pratap2023mms,
    title={Scaling Speech Technology to 1,000+ Languages},
    author={Vineel Pratap and Andros Tjandra and Bowen Shi and Paden Tomasello and Arun Babu and Sayani Kundu and Ali Elkahky and Zhaoheng Ni and Apoorv Vyas and Maryam Fazel-Zarandi and Alexei Baevski and Yossi Adi and Xiaohui Zhang and Wei-Ning Hsu and Alexis Conneau and Michael Auli},
    journal={arXiv},
    year={2023}
}
```

## License

The model is licensed as **CC-BY-NC 4.0**.
| 100 | |