
Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

+## ⚠️ TEMPORARY NOTICE ABOUT UPCOMING BREAKING CHANGE ⚠️
+
+**The quantization formats will soon be updated: https://github.com/ggerganov/llama.cpp/pull/1305**
+
+**`ggml` model files in the old format will no longer work with the latest `llama.cpp` code once that change is merged**
+
+---
+
**Hot topics:**

- [Roadmap May 2023](https://github.com/ggerganov/llama.cpp/discussions/1220)
- [New quantization methods](https://github.com/ggerganov/llama.cpp#quantization)

+<details>
+  <summary>Table of Contents</summary>
+  <ol>
+    <li>
+      <a href="#description">Description</a>
+    </li>
+    <li>
+      <a href="#usage">Usage</a>
+      <ul>
+        <li><a href="#get-the-code">Get the Code</a></li>
+        <li><a href="#build">Build</a></li>
+        <li><a href="#blas-build">BLAS Build</a></li>
+        <li><a href="#prepare-data--run">Prepare Data & Run</a></li>
+        <li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
+        <li><a href="#quantization">Quantization</a></li>
+        <li><a href="#interactive-mode">Interactive mode</a></li>
+        <li><a href="#instruction-mode-with-alpaca">Instruction mode with Alpaca</a></li>
+        <li><a href="#using-gpt4all">Using GPT4All</a></li>
+        <li><a href="#using-pygmalion-7b--metharme-7b">Using Pygmalion 7B & Metharme 7B</a></li>
+        <li><a href="#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data">Obtaining the Facebook LLaMA original model and Stanford Alpaca model data</a></li>
+        <li><a href="#verifying-the-model-files">Verifying the model files</a></li>
+        <li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
+        <li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
+        <li><a href="#android">Android</a></li>
+        <li><a href="#docker">Docker</a></li>
+      </ul>
+    </li>
+    <li><a href="#contributing">Contributing</a></li>
+    <li><a href="#coding-guidelines">Coding guidelines</a></li>
+    <li><a href="#docs">Docs</a></li>
+  </ol>
+</details>
+
## Description

The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook

@@ -46,6 +87,7 @@ as the main playground for developing new features for the [ggml](https://github
- [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
- [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
- [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
+- [X] [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)

**Bindings:**

@@ -257,6 +299,8 @@ Building the program with BLAS support may lead to some performance improvements
cmake --build . --config Release
```

+Note: Because llama.cpp uses multiple CUDA streams for matrix multiplication, results [are not guaranteed to be reproducible](https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility). If you need reproducibility, set `GGML_CUDA_MAX_STREAMS` in the file `ggml-cuda.cu` to 1.
+
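A minimal sketch of one way to apply that change from the shell, assuming `GGML_CUDA_MAX_STREAMS` is a plain `#define` in `ggml-cuda.cu` (verify the exact definition in your checkout first; on macOS/BSD `sed`, use `-i ''`):

```bash
# Force a single CUDA stream so cuBLAS results are reproducible, then rebuild as shown above.
sed -i 's/#define GGML_CUDA_MAX_STREAMS .*/#define GGML_CUDA_MAX_STREAMS 1/' ggml-cuda.cu
```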
### Prepare Data & Run

```bash
@@ -296,17 +340,25 @@ Several quantization methods are supported. They differ in the resulting model d

| Model | Measure | F16 | Q4_0 | Q4_1 | Q4_2 | Q5_0 | Q5_1 | Q8_0 |
|------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|-------:|
-| 7B | perplexity | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
+| 7B | perplexity | 5.9066 | 6.1620 | 6.0910 | 6.1466 | 5.9862 | 5.9481 | 5.9069 |
| 7B | file size | 13.0G | 4.0G | 4.8G | 4.0G | 4.4G | 4.8G | 7.1G |
| 7B | ms/tok @ 4th | 128 | 56 | 61 | 84 | 91 | 95 | 75 |
| 7B | ms/tok @ 8th | 128 | 47 | 55 | 48 | 53 | 59 | 75 |
| 7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |
-| 13B | perplexity | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
+| 13B | perplexity | 5.2543 | 5.3863 | 5.3607 | 5.3513 | 5.2856 | 5.2706 | 5.2548 |
| 13B | file size | 25.0G | 7.6G | 9.1G | 7.6G | 8.4G | 9.1G | 14G |
| 13B | ms/tok @ 4th | 239 | 104 | 113 | 160 | 176 | 185 | 141 |
| 13B | ms/tok @ 8th | 240 | 85 | 99 | 97 | 108 | 117 | 147 |
| 13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.0 | 5.5 | 6.0 | 9.0 |

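For reference, each quantized file compared in this table is produced with the `quantize` tool. A sketch, assuming the usual `<f16 input> <output> <type>` invocation described in the Quantization instructions (paths are placeholders):

```bash
# Produce a 4-bit (q4_0) ggml file from the f16 model (illustrative paths).
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
```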
+### Perplexity (measuring model quality)
+
+You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
+For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
+
+The perplexity measurements in the table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2) with a context length of 512.
+The time per token is measured on a MacBook M1 Pro with 32GB of RAM, using 4 and 8 threads.
+
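A rough end-to-end example of reproducing a perplexity number from the table; the model path and extracted directory name are assumptions, and the "How to run" steps further down give the authoritative commands:

```bash
# Fetch the wikitext-2 raw test set and evaluate a quantized 7B model on it.
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
./perplexity -m ./models/7B/ggml-model-q4_0.bin -f wikitext-2-raw/wiki.test.raw
```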
### Interactive mode

If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
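As a rough example (the flags, model path, and prompt file here are illustrative, not the only way to start a session):

```bash
# Start an interactive chat session; generation pauses and returns control whenever "User:" appears.
./main -m ./models/7B/ggml-model-q4_0.bin -n 256 --color -i -r "User:" -f prompts/chat-with-bob.txt
```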

@@ -373,6 +425,19 @@ python3 convert.py models/gpt4all-7B/gpt4all-lora-quantized.bin

- The newer GPT4All-J model is not yet supported!

+### Using Pygmalion 7B & Metharme 7B
+
+- Obtain the [LLaMA weights](#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data)
+- Obtain the [Pygmalion 7B](https://huggingface.co/PygmalionAI/pygmalion-7b/) or [Metharme 7B](https://huggingface.co/PygmalionAI/metharme-7b) XOR encoded weights
+- Convert the LLaMA model with [the latest HF convert script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)
+- Merge the XOR files with the converted LLaMA weights by running the [xor_codec](https://huggingface.co/PygmalionAI/pygmalion-7b/blob/main/xor_codec.py) script
+- Convert to `ggml` format using the `convert.py` script in this repo:
+```bash
+python3 convert.py pygmalion-7b/ --outtype q4_1
+```
+> The Pygmalion 7B & Metharme 7B weights are saved in [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) precision. If you wish to convert to `ggml` without quantizing, specify `--outtype f32` instead of `f16`.
+
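An end-to-end sketch of the steps above, using hypothetical directory names. The `convert_llama_weights_to_hf.py` flags follow that script's documented usage, and the `xor_codec.py` argument order (destination, XOR weights, HF base model) is an assumption based on the Pygmalion model card, so verify both against the linked pages before running:

```bash
# 1. Convert the original LLaMA 7B checkpoint to the Hugging Face format.
python3 convert_llama_weights_to_hf.py --input_dir ./LLaMA --model_size 7B --output_dir ./llama-7b-hf

# 2. Merge the Pygmalion 7B XOR files onto the converted base model.
python3 xor_codec.py ./pygmalion-7b ./pygmalion-7b-xor ./llama-7b-hf

# 3. Convert the merged model to a quantized ggml file for llama.cpp.
python3 convert.py pygmalion-7b/ --outtype q4_1
```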
### Obtaining the Facebook LLaMA original model and Stanford Alpaca model data

- **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**

@@ -405,26 +470,6 @@ If your issue is with model generation quality, then please at least scan the fo
- [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
- [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)

-### Perplexity (measuring model quality)
-
-You can use the `perplexity` example to measure perplexity over the given prompt. For more background, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity). However, in general, lower perplexity is better for LLMs.
-
-#### Latest measurements
-
-The latest perplexity scores for the various model sizes and quantizations are being tracked in [discussion #406](https://github.com/ggerganov/llama.cpp/discussions/406). `llama.cpp` is measuring very well compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
-13B at q4_0 beats the 7B f16 model by a significant amount.
-
-All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512 length context).
-Note that changing the context length will have a significant impact on perplexity (longer context = better perplexity).
-```
-Perplexity - model options
-5.5985 - 13B, q4_0
-5.9565 - 7B, f16
-6.3001 - 7B, q4_1
-6.5949 - 7B, q4_0
-6.5995 - 7B, q4_0, --memory_f16
-```
-
#### How to run

1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research