---
license: apache-2.0
library_name: transformers.js
language:
- en
base_model:
- hexgrad/Kokoro-82M
pipeline_tag: text-to-speech
---

# Kokoro TTS

Kokoro is a frontier text-to-speech (TTS) model for its size: 82 million parameters, text in and audio out.

## Table of contents

- [Usage](#usage)
  - [JavaScript](#javascript)
  - [Python](#python)
- [Voices/Samples](#voicessamples)
- [Quantizations](#quantizations)


## Usage

### JavaScript

First, install the `kokoro-js` library from [NPM](https://npmjs.com/package/kokoro-js) using:
```bash
npm i kokoro-js
```

You can then generate speech as follows:

```js
import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");
```


### Python

```python
import os
import numpy as np
from onnxruntime import InferenceSession

# You can generate token ids as follows:
#   1. Convert input text to phonemes using https://github.com/hexgrad/misaki
#   2. Map phonemes to ids using https://huggingface.co/hexgrad/Kokoro-82M/blob/785407d1adfa7ae8fbef8ffd85f34ca127da3039/config.json#L34-L148
tokens = [50, 157, 43, 135, 16, 53, 135, 46, 16, 43, 102, 16, 56, 156, 57, 135, 6, 16, 102, 62, 61, 16, 70, 56, 16, 138, 56, 156, 72, 56, 61, 85, 123, 83, 44, 83, 54, 16, 53, 65, 156, 86, 61, 62, 131, 83, 56, 4, 16, 54, 156, 43, 102, 53, 16, 156, 72, 61, 53, 102, 112, 16, 70, 56, 16, 138, 56, 44, 156, 76, 158, 123, 56, 16, 62, 131, 156, 43, 102, 54, 46, 16, 102, 48, 16, 81, 47, 102, 54, 16, 54, 156, 51, 158, 46, 16, 70, 16, 92, 156, 135, 46, 16, 54, 156, 43, 102, 48, 4, 16, 81, 47, 102, 16, 50, 156, 72, 64, 83, 56, 62, 16, 156, 51, 158, 64, 83, 56, 16, 44, 157, 102, 56, 16, 44, 156, 76, 158, 123, 56, 4]

# Context length is 512, but leave room for the pad token 0 at the start & end
assert len(tokens) <= 510, len(tokens)

# Style vector based on len(tokens), ref_s has shape (1, 256)
voices = np.fromfile('./voices/af.bin', dtype=np.float32).reshape(-1, 1, 256)
ref_s = voices[len(tokens)]

# Add the pad ids, and reshape tokens, should now have shape (1, <=512)
tokens = [[0, *tokens, 0]]

# Options: model.onnx, model_fp16.onnx, model_quantized.onnx, model_q8f16.onnx,
#          model_uint8.onnx, model_uint8f16.onnx, model_q4.onnx, model_q4f16.onnx
model_name = 'model.onnx'
sess = InferenceSession(os.path.join('onnx', model_name))

audio = sess.run(None, dict(
    input_ids=tokens,
    style=ref_s,
    speed=np.ones(1, dtype=np.float32),
))[0]
```

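The input-preparation steps described in the comments of the Python example (phonemes to ids, then the length check, style lookup, and padding) can be sketched as two small helpers. This is a minimal sketch, not part of any library: `phonemes_to_ids`, `prepare_inputs`, `toy_vocab`, and `fake_voices` are hypothetical names. The three-entry vocabulary only stands in for the full mapping in `config.json`, dropping unknown symbols is just one possible policy, and the `int64` dtype for `input_ids` is an assumption about what the exported model expects.

```python
import numpy as np

MAX_TOKENS = 510  # context length 512 minus the two pad tokens
PAD_ID = 0        # pad token id used at the start and end


def phonemes_to_ids(phonemes, vocab):
    """Map each phoneme symbol to its id; symbols missing from vocab are dropped."""
    return [vocab[p] for p in phonemes if p in vocab]


def prepare_inputs(tokens, voices):
    """Return (input_ids, style) for a token list and a (N, 1, 256) voice array."""
    if len(tokens) > MAX_TOKENS:
        raise ValueError(f"too many tokens: {len(tokens)} > {MAX_TOKENS}")
    style = voices[len(tokens)]  # style vector chosen by token count, shape (1, 256)
    # int64 is an assumption about the dtype the model expects for input_ids
    input_ids = np.array([[PAD_ID, *tokens, PAD_ID]], dtype=np.int64)
    return input_ids, style


# Hypothetical three-symbol vocabulary, for illustration only
toy_vocab = {"a": 43, "b": 44, " ": 16}
ids = phonemes_to_ids("ab a", toy_vocab)

# Stand-in for the array loaded from a voices/*.bin file
fake_voices = np.zeros((511, 1, 256), dtype=np.float32)
input_ids, style = prepare_inputs(ids, fake_voices)
print(input_ids.shape, style.shape)
```

With real data, `ids` would come from the misaki phonemes and the full vocabulary, and `input_ids` and `style` would be passed to `sess.run` in place of `tokens` and `ref_s`.
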
Optionally, save the `audio` array returned by `sess.run` to a 24 kHz WAV file:
```py
import scipy.io.wavfile as wavfile
wavfile.write('audio.wav', 24000, audio[0])  # 24000 is the sample rate in Hz
```


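If a 16-bit PCM file is preferred over the float WAV that SciPy writes for float arrays, the conversion can be sketched with the standard-library `wave` module. This is a hedged sketch: `float_to_pcm16` and `save_pcm16` are hypothetical helpers, the code assumes the model output is float audio in roughly the [-1, 1] range (out-of-range values are clipped), and a synthetic tone stands in for the real `audio[0]` array.

```python
import os
import tempfile
import wave

import numpy as np

SAMPLE_RATE = 24000  # sample rate used when saving audio in this README


def float_to_pcm16(samples):
    """Clip float samples to [-1, 1] and scale them to little-endian int16."""
    clipped = np.clip(np.asarray(samples, dtype=np.float32), -1.0, 1.0)
    return (clipped * 32767.0).astype("<i2")


def save_pcm16(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono 16-bit PCM to path and return the duration in seconds."""
    pcm = float_to_pcm16(samples)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes per sample = 16 bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate


# Stand-in for audio[0]: one second of a 440 Hz tone
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

out_path = os.path.join(tempfile.gettempdir(), "kokoro_tone_pcm16.wav")
duration = save_pcm16(out_path, tone)
print(f"wrote {out_path} ({duration:.2f} s)")
```

Dividing the sample count by the sample rate, as `save_pcm16` does, is also a quick way to check how long a generated clip is.
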
## Voices/Samples


> Life is like a box of chocolates. You never know what you're gonna get.


| Name         | Nationality | Gender | Sample                                                                                                                                  |
| ------------ | ----------- | ------ | --------------------------------------------------------------------------------------------------------------------------------------- |
| **af_heart** | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/S_9tkA75BT_QHKOzSX6S-.wav"></audio> |
| af_alloy     | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/wiZ3gvlL--p5pRItO4YRE.wav"></audio> |
| af_aoede     | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/Nv1xMwzjTdF9MR8v0oEEJ.wav"></audio> |
| af_bella     | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/sWN0rnKU6TlLsVdGqRktF.wav"></audio> |
| af_jessica   | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/2Oa4wITWAmiCXJ_Q97-7R.wav"></audio> |
| af_kore      | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/AOIgyspzZWDGpn7oQgwtu.wav"></audio> |
| af_nicole    | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/EY_V2OGr-hzmtTGrTCTyf.wav"></audio> |
| af_nova      | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/X-xdEkx3GPlQG5DK8Gsqd.wav"></audio> |
| af_river     | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/ZqaV2-xGUZdBQmZAF1Xqy.wav"></audio> |
| af_sarah     | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/xzoJBl1HCvkE8Fl8Xu2R4.wav"></audio> |
| af_sky       | American    | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/ubebYQoaseyQk-jDLeWX7.wav"></audio> |
| am_adam      | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/tvauhDVRGvGK98I-4wv3H.wav"></audio> |
| am_echo      | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/qy_KuUB0hXsu-u8XaJJ_Z.wav"></audio> |
| am_eric      | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/JhqPjbpMhraUv5nTSPpwD.wav"></audio> |
| am_fenrir    | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/c0R9caBdBiNjGUUalI_DQ.wav"></audio> |
| am_liam      | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/DFHvulaLeOjXIDKecvNG3.wav"></audio> |
| am_michael   | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/IPKhsnjq1tPh3JmHH8nEg.wav"></audio> |
| am_onyx      | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/ov0pFDfE8NNKZ80LqW6Di.wav"></audio> |
| am_puck      | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/MOC654sLMHWI64g8HWesV.wav"></audio> |
| am_santa     | American    | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/LzA6JmHBvQlhOviy8qVfJ.wav"></audio> |
| bf_alice     | British     | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/9mnYZ3JWq7f6U12plXilA.wav"></audio> |
| bf_emma      | British     | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/_fvGtKMttRI0cZVGqxMh8.wav"></audio> |
| bf_isabella  | British     | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/VzlcJpqGEND_Q3duYnhiu.wav"></audio> |
| bf_lily      | British     | Female | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/qZCoartohiRlVamY8Xpok.wav"></audio> |
| bm_daniel    | British     | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/Eb0TLnLXHDRYOA3TJQKq3.wav"></audio> |
| bm_fable     | British     | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/NT9XkmvlezQ0FJ6Th5hoZ.wav"></audio> |
| bm_george    | British     | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/y6VJbCESszLZGupPoqNkF.wav"></audio> |
| bm_lewis     | British     | Male   | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/RlB5BRvLt-IFvTjzQNxCh.wav"></audio> |


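Every voice id in the table follows the same pattern: the first letter matches the nationality column (`a` or `b`) and the second matches the gender column (`f` or `m`). A minimal sketch of decoding an id under that pattern is shown below. The convention is inferred from the table rows rather than documented here, and `describe_voice` is a hypothetical helper, not a function of any library.

```python
# Prefix letters as they appear in the voices table (inferred convention)
NATIONALITY = {"a": "American", "b": "British"}
GENDER = {"f": "Female", "m": "Male"}


def describe_voice(voice_id):
    """Split an id such as 'af_bella' into name, nationality, and gender."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice id format: {voice_id!r}")
    if prefix[0] not in NATIONALITY or prefix[1] not in GENDER:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return {
        "name": name,
        "nationality": NATIONALITY[prefix[0]],
        "gender": GENDER[prefix[1]],
    }


for voice in ("af_bella", "bm_george"):
    print(voice, describe_voice(voice))
```
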
## Quantizations

The model is resilient to quantization, enabling efficient high-quality speech synthesis at a fraction of the original model size.

> How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born.


| Model                                          | Size (MB) | Sample                                                                                                                                  |
|------------------------------------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------|
| model.onnx (fp32)                              | 326       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/njexBuqPzfYUvWgs9eQ-_.wav"></audio> |
| model_fp16.onnx (fp16)                         | 163       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/8Ebl44hMQonZs4MlykExt.wav"></audio> |
| model_quantized.onnx (8-bit)                   | 92.4      | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/9SLOt6ETclZ4yRdlJ0VIj.wav"></audio> |
| model_q8f16.onnx (Mixed precision)             | 86        | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/gNDMqb33YEmYMbAIv_Grx.wav"></audio> |
| model_uint8.onnx (8-bit & mixed precision)     | 177       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/tpOWRHIWwEb0PJX46dCWQ.wav"></audio> |
| model_uint8f16.onnx (Mixed precision)          | 114       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/vtZhABzjP0pvGD7dRb5Vr.wav"></audio> |
| model_q4.onnx (4-bit matmul)                   | 305       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/8FVn0IJIUfccEBWq8Fnw_.wav"></audio> |
| model_q4f16.onnx (4-bit matmul & fp16 weights) | 154       | <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/61b253b7ac5ecaae3d1efe0c/7DrgWC_1q00s-wUJuG44X.wav"></audio> |

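To make "a fraction of the original model size" concrete, the sizes in the table can be compared against the fp32 baseline. The snippet below is a small illustrative sketch: the numbers are copied from the table, while `relative_sizes` and `smallest_variant` are hypothetical helpers. It says nothing about audio quality, which is best judged from the samples.

```python
# File sizes in MB, copied from the quantization table
SIZES_MB = {
    "model.onnx": 326,
    "model_fp16.onnx": 163,
    "model_quantized.onnx": 92.4,
    "model_q8f16.onnx": 86,
    "model_uint8.onnx": 177,
    "model_uint8f16.onnx": 114,
    "model_q4.onnx": 305,
    "model_q4f16.onnx": 154,
}


def relative_sizes(sizes, baseline="model.onnx"):
    """Return each file's size as a fraction of the baseline file's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


def smallest_variant(sizes):
    """Return the name of the smallest file."""
    return min(sizes, key=sizes.get)


ratios = relative_sizes(SIZES_MB)
for name in sorted(ratios, key=ratios.get):
    print(f"{name:<22} {SIZES_MB[name]:>6} MB  {ratios[name]:6.1%}")
print("smallest:", smallest_variant(SIZES_MB))
```

By these numbers the smallest file, `model_q8f16.onnx`, is about 26% of the fp32 size, while `model_q4.onnx` is still about 94% of it.
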