# llama.cpp/example/tts
This example demonstrates the Text To Speech feature. It uses a
[model](https://www.outeai.com/blog/outetts-0.2-500m) from
[outeai](https://www.outeai.com/).

## Quickstart
If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
For details about the models and how to convert them to the required format,
see the following sections.

### Model conversion
Check out or download the model that contains the LLM:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
  --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.
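As a quick sanity check of a conversion, every GGUF file begins with the 4-byte magic `GGUF` followed by a little-endian 32-bit format version. A minimal sketch (not part of llama.cpp) that reads just the header:

```python
import struct

def gguf_version(path):
    """Return the GGUF format version if the file has a valid header, else None."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":  # all GGUF files start with this magic
            return None
        (version,) = struct.unpack("<I", f.read(4))  # little-endian uint32
        return version

# Example: gguf_version("models/outetts-0.2-0.5B-f16.gguf")
# Recent converters write version 3.
```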

We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
  models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.
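For intuition, Q8_0 stores weights in blocks of 32, each block holding one shared scale plus 32 signed 8-bit integers, which is why the file shrinks to roughly half the f16 size. A rough, self-contained sketch of the idea (not llama.cpp's actual implementation, which stores the scale as fp16 in a packed C struct):

```python
def quantize_q8_0(block):
    """Quantize one block of 32 floats to a shared scale plus 32 int8 values."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 0.0
    qs = [round(x / scale) if scale else 0 for x in block]
    return scale, qs

def dequantize_q8_0(scale, qs):
    """Reconstruct approximate float weights from a quantized block."""
    return [scale * q for q in qs]

weights = [(i - 16) / 10.0 for i in range(32)]
scale, qs = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, qs)
# Round-trip error per weight is bounded by half the quantization step (scale / 2).
```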

Next we do something similar for the audio decoder. First download or check out
the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it
to Hugging Face format:
```console
(venv) python examples/tts/convert_pt_to_hf.py \
  models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75tokenconfig.json
```
Then we can convert the Hugging Face format to gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
  --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```

### Running the example

With both models generated, the LLM model and the voice decoder model, we can
run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
  -mv ./models/wavtokenizer-large-75-f16.gguf \
  -p "Hello world"
...
main: audio written to file 'output.wav'
```
The output.wav file will contain the audio for the prompt, and can be played
with any media player. On Linux the following command will play it:
```console
$ aplay output.wav
```
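On platforms without `aplay`, any WAV-capable player works, and the file can also be inspected with Python's standard library `wave` module. A small sketch (the exact sample rate depends on the vocoder; the checkpoint name `wavtokenizer_large_speech_320_24k` suggests 24 kHz output):

```python
import wave

def wav_info(path):
    """Read basic properties of a WAV file using only the standard library."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": rate,
            "sample_width_bytes": w.getsampwidth(),
            "duration_s": w.getnframes() / rate,
        }

# Example: wav_info("output.wav")
```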