
Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

-### 🚧 Incoming breaking change + refactoring:
+### Hot topics

-See PR https://github.com/ggerganov/llama.cpp/pull/2398 for more info.
+A new file format has been introduced: [GGUF](https://github.com/ggerganov/llama.cpp/pull/2398)

-To devs: avoid making big changes to `llama.h` / `llama.cpp` until merged
+Last revision compatible with the old format: [dadbed9](https://github.com/ggerganov/llama.cpp/commit/dadbed99e65252d79f81101a392d0d6497b86caa)
+
+### Current `master` should be considered in Beta - expect some issues for a few days!
+
+### Be prepared to re-convert and / or re-quantize your GGUF models while this notice is up!
+
+### Issues with non-GGUF models will be considered with low priority!

----

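Since the notice above asks users to re-convert and/or re-quantize, here is a rough sketch of what that might look like, reusing the `convert.py` and `quantize` commands documented later in this README (file names follow the quantize example further down; adjust paths to your own setup). Alternatively, the last compatible revision linked above can be checked out if you need to keep using old-format models:

```bash
# Option 1: regenerate GGUF files on current master
# (same convert.py / quantize flow shown further down in this README)
python3 convert.py models/7B/
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

# Option 2: keep using old-format models by building the last compatible revision
git checkout dadbed99e65252d79f81101a392d0d6497b86caa
make
```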
@@ -291,7 +297,7 @@ When built with Metal support, you can enable GPU inference with the `--gpu-laye
Any value larger than 0 will offload the computation to the GPU. For example:

```bash
-./main -m ./models/7B/ggml-model-q4_0.bin -n 128 -ngl 1
+./main -m ./models/7B/ggml-model-q4_0.gguf -n 128 -ngl 1
```

### MPI Build
@@ -330,7 +336,7 @@ The above will distribute the computation across 2 processes on the first host a
Finally, you're ready to run a computation using `mpirun`:

```bash
-mpirun -hostfile hostfile -n 3 ./main -m ./models/7B/ggml-model-q4_0.bin -n 128
+mpirun -hostfile hostfile -n 3 ./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
```

### BLAS Build
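The `-hostfile hostfile` argument above refers to a hostfile created earlier in the README (outside this hunk). As a minimal sketch only, assuming OpenMPI's `slots=` hostfile syntax and placeholder hostnames, a file matching the 2-plus-1 process split mentioned in the hunk header could be written like this:

```bash
# Hypothetical hostfile: 2 MPI ranks on the first host, 1 on the second
# (hostnames are placeholders - replace with your own machines)
cat > hostfile << 'EOF'
192.168.0.1 slots=2
192.168.0.2 slots=1
EOF
```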
@@ -513,10 +519,10 @@ python3 convert.py models/7B/
python convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
-./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
+./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

# run the inference
-./main -m ./models/7B/ggml-model-q4_0.bin -n 128
+./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
```

When running the larger models, make sure you have enough disk space to store all the intermediate files.
@@ -572,7 +578,7 @@ Here is an example of a few-shot interaction, invoked with the command
./examples/chat-13B.sh

# custom arguments using a 13B model
-./main -m ./models/13B/ggml-model-q4_0.bin -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
+./main -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
```

Note the use of `--color` to distinguish between user input and generated text. Other parameters are explained in more detail in the [README](examples/main/README.md) for the `main` example program.
@@ -635,6 +641,8 @@ OpenLLaMA is an openly licensed reproduction of Meta's original LLaMA model. It

### Using [GPT4All](https://github.com/nomic-ai/gpt4all)

+*Note: these instructions are likely obsoleted by the GGUF update*
+
- Obtain the `tokenizer.model` file from LLaMA model and put it to `models`
- Obtain the `added_tokens.json` file from Alpaca model and put it to `models`
- Obtain the `gpt4all-lora-quantized.bin` file from GPT4All model and put it to `models/gpt4all-7B`
@@ -710,7 +718,7 @@ If your issue is with model generation quality, then please at least scan the fo
#### How to run

1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
-2. Run `./perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw`
+2. Run `./perplexity -m models/7B/ggml-model-q4_0.gguf -f wiki.test.raw`
3. Output:
```
perplexity : calculating perplexity over 655 chunks
@@ -809,13 +817,13 @@ docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --all-in-
On completion, you are ready to play!

```bash
-docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:full --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

or with a light image:

```bash
-docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512
+docker run -v /path/to/models:/models ghcr.io/ggerganov/llama.cpp:light -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512
```

### Docker With CUDA
@@ -846,8 +854,8 @@ The resulting images, are essentially the same as the non-CUDA images:
After building locally, Usage is similar to the non-CUDA examples, but you'll need to add the `--gpus` flag. You will also want to use the `--n-gpu-layers` flag.

```bash
-docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
-docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
+docker run --gpus all -v /path/to/models:/models local/llama.cpp:full-cuda --run -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
+docker run --gpus all -v /path/to/models:/models local/llama.cpp:light-cuda -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
```

### Contributing