---
license: mit
tags:
- audio
- text-to-speech
datasets:
- libritts
---

# SpeechT5 (TTS task)

SpeechT5 model fine-tuned for speech synthesis (text-to-speech) on LibriTTS.

This model was introduced in [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, and Furu Wei.

SpeechT5 was first released in [this repository](https://github.com/microsoft/SpeechT5/), with the [original weights](https://huggingface.co/mechanicalsea/speecht5-tts) available on the Hugging Face Hub. It is released under the [MIT license](https://github.com/microsoft/SpeechT5/blob/main/LICENSE).

## Model Description

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.

Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder.

Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

- **Developed by:** Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
- **Shared by [optional]:** [Matthijs Hollemans](https://huggingface.co/Matthijs)
- **Model type:** text-to-speech
- **Language(s) (NLP):** [More Information Needed]
- **License:** [MIT](https://github.com/microsoft/SpeechT5/blob/main/LICENSE)
- **Finetuned from model [optional]:** [More Information Needed]


## Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [https://github.com/microsoft/SpeechT5/](https://github.com/microsoft/SpeechT5/)
- **Paper:** [https://arxiv.org/pdf/2110.07205.pdf](https://arxiv.org/pdf/2110.07205.pdf)
- **Blog Post:** [https://huggingface.co/blog/speecht5](https://huggingface.co/blog/speecht5)
- **Demo:** [https://huggingface.co/spaces/Matthijs/speecht5-tts-demo](https://huggingface.co/spaces/Matthijs/speecht5-tts-demo)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

## 🤗 Transformers Usage

You can run SpeechT5 TTS locally with the 🤗 Transformers library.

1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers), sentencepiece, soundfile, and (optionally) datasets:

```
pip install --upgrade pip
pip install --upgrade transformers sentencepiece soundfile datasets[audio]
```

2. Run inference via the `"text-to-speech"` (TTS) pipeline. You can access the SpeechT5 model via the TTS pipeline in just a few lines of code!

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import pipeline

synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")

embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
# You can replace this embedding with your own as well (see the sketch below).

speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"speaker_embeddings": speaker_embedding})

sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
```

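Instead of taking an embedding from the dataset above, you can derive one from your own recording. A minimal sketch, assuming the `speechbrain` package and its `speechbrain/spkrec-xvect-voxceleb` x-vector model (the model used to extract the `cmu-arctic-xvectors` embeddings); the file path is a placeholder:

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

# "my_voice.wav" is a hypothetical mono 16 kHz recording of the target speaker.
waveform, sampling_rate = torchaudio.load("my_voice.wav")

with torch.no_grad():
    embedding = classifier.encode_batch(waveform)               # shape (1, 1, 512)
    embedding = torch.nn.functional.normalize(embedding, dim=2)

speaker_embedding = embedding.squeeze(1)                        # (1, 512), as SpeechT5 expects
```
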
3. Run inference via the Transformers modeling code. For more fine-grained control, you can use the processor together with `generate_speech` to convert text into a mono 16 kHz speech waveform.

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

# Load an xvector containing the speaker's voice characteristics from a dataset.
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

sf.write("speech.wav", speech.numpy(), samplerate=16000)
```

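Continuing from the snippet above, you can also split generation into two stages: in the Transformers implementation, `generate_speech` returns the mel spectrogram when no vocoder is passed, which you can then feed through the vocoder yourself.

```python
# Two-stage generation: mel spectrogram first, vocoder second.
spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)

with torch.no_grad():
    speech = vocoder(spectrogram)

sf.write("speech_two_stage.wav", speech.numpy(), samplerate=16000)
```
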
### Fine-tuning the Model

Refer to [this Colab notebook](https://colab.research.google.com/drive/1i7I5pzBcU3WDFarDnzweIj4-sVVoIUFJ) for an example of how to fine-tune SpeechT5 for TTS on a different dataset or a new language. The sketch below shows how a single training example is prepared.

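A minimal sketch of preparing one training example, assuming a 16 kHz target recording and its transcript (`waveform` is a hypothetical 1-D float array; see the notebook for the full recipe):

```python
from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

# Tokenize the input text and convert the target audio into log-mel labels.
example = processor(
    text="Hello, my dog is cute.",
    audio_target=waveform,   # hypothetical 1-D float array sampled at 16 kHz
    sampling_rate=16000,
)
input_ids = example["input_ids"]  # token ids for the text encoder pre-net
labels = example["labels"]        # log-mel spectrogram frames for the decoder
```
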

## Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

You can use this model for speech synthesis. See the [model hub](https://huggingface.co/models?search=speecht5) to look for fine-tuned versions on a task that interests you.

## Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

[More Information Needed]

## Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

[More Information Needed]

# Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

[More Information Needed]

## Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

[LibriTTS](https://www.openslr.org/60/)

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing [optional]

Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text.

### Training hyperparameters
- **Precision:** [More Information Needed] <!--fp16, bf16, fp8, fp32 -->
- **Regime:** [More Information Needed] <!--mixed precision or not -->

### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

## Testing Data, Factors & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

[More Information Needed]

### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

## Results

[More Information Needed]

### Summary

[More Information Needed]

# Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

# Technical Specifications [optional]

## Model Architecture and Objective

The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.

After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.

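To see how these pieces map onto the 🤗 Transformers port (module names follow that implementation, not the original fairseq code), you can simply print the model:

```python
from transformers import SpeechT5ForTextToSpeech

model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# Shows the shared encoder-decoder plus the TTS-specific pre/post-nets.
print(model)
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```
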
## Compute Infrastructure

[More Information Needed]

### Hardware

[More Information Needed]

### Software

[More Information Needed]

# Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

```bibtex
@inproceedings{ao-etal-2022-speecht5,
    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month = {May},
    year = {2022},
    pages = {5723--5738},
}
```

# Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- **text-to-speech:** synthesizing an audible speech waveform from written text

# More Information [optional]

[More Information Needed]

# Model Card Authors [optional]

Disclaimer: The team releasing SpeechT5 did not write a model card for this model, so this model card has been written by the Hugging Face team.

# Model Card Contact

[More Information Needed]