Commit 127f68e

Merge 'origin/master' into hipblas

2 parents 070cbcc + b608b55

File tree

14 files changed: +585 -160 lines

.gitignore (+1)

@@ -43,5 +43,6 @@ zig-out/
 zig-cache/
 
 ppl-*.txt
+qnt-*.txt
 
 examples/jeopardy/results.txt

README.md (+67 -22)
@@ -7,11 +7,52 @@
 
 Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
 
+## ⚠️ TEMPORARY NOTICE ABOUT UPCOMING BREAKING CHANGE ⚠️
+
+**The quantization formats will soon be updated: https://github.com/ggerganov/llama.cpp/pull/1305**
+
+**All `ggml` model files using the old format will not work with the latest `llama.cpp` code after that change is merged**
+
+---
+
 **Hot topics:**
 
 - [Roadmap May 2023](https://github.com/ggerganov/llama.cpp/discussions/1220)
 - [New quantization methods](https://github.com/ggerganov/llama.cpp#quantization)
 
+<details>
+<summary>Table of Contents</summary>
+<ol>
+<li>
+<a href="#description">Description</a>
+</li>
+<li>
+<a href="#usage">Usage</a>
+<ul>
+<li><a href="#get-the-code">Get the Code</a></li>
+<li><a href="#build">Build</a></li>
+<li><a href="#blas-build">BLAS Build</a></li>
+<li><a href="#prepare-data--run">Prepare Data & Run</a></li>
+<li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
+<li><a href="#quantization">Quantization</a></li>
+<li><a href="#interactive-mode">Interactive mode</a></li>
+<li><a href="#instruction-mode-with-alpaca">Instruction mode with Alpaca</a></li>
+<li><a href="#using-gpt4all">Using GPT4All</a></li>
+<li><a href="#using-pygmalion-7b--metharme-7b">Using Pygmalion 7B & Metharme 7B</a></li>
+<li><a href="#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data">Obtaining the Facebook LLaMA original model and Stanford Alpaca model data</a></li>
+<li><a href="#verifying-the-model-files">Verifying the model files</a></li>
+<li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
+<li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
+<li><a href="#android">Android</a></li>
+<li><a href="#docker">Docker</a></li>
+</ul>
+</li>
+<li><a href="#contributing">Contributing</a></li>
+<li><a href="#coding-guidelines">Coding guidelines</a></li>
+<li><a href="#docs">Docs</a></li>
+</ol>
+</details>
+
 ## Description
 
 The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook
@@ -46,6 +87,7 @@ as the main playground for developing new features for the [ggml](https://github
 - [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
 - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
 - [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
+- [X] [Pygmalion 7B / Metharme 7B](#using-pygmalion-7b--metharme-7b)
 
 **Bindings:**
 
@@ -257,6 +299,8 @@ Building the program with BLAS support may lead to some performance improvements
 cmake --build . --config Release
 ```
 
+Note: Because llama.cpp uses multiple CUDA streams for matrix multiplication, results [are not guaranteed to be reproducible](https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility). If you need reproducibility, set `GGML_CUDA_MAX_STREAMS` in the file `ggml-cuda.cu` to 1.
+
 ### Prepare Data & Run
 
 ```bash
@@ -296,17 +340,25 @@ Several quantization methods are supported. They differ in the resulting model d
 
 | Model | Measure      | F16    | Q4_0   | Q4_1   | Q4_2   | Q5_0   | Q5_1   | Q8_0   |
 |------:|--------------|-------:|-------:|-------:|-------:|-------:|-------:|-------:|
-| 7B    | perplexity   | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0139 | 5.9934 | 5.9571 |
+| 7B    | perplexity   | 5.9066 | 6.1620 | 6.0910 | 6.1466 | 5.9862 | 5.9481 | 5.9069 |
 | 7B    | file size    | 13.0G  | 4.0G   | 4.8G   | 4.0G   | 4.4G   | 4.8G   | 7.1G   |
 | 7B    | ms/tok @ 4th | 128    | 56     | 61     | 84     | 91     | 95     | 75     |
 | 7B    | ms/tok @ 8th | 128    | 47     | 55     | 48     | 53     | 59     | 75     |
 | 7B    | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
-| 13B   | perplexity   | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.2768 | 5.2582 | 5.2458 |
+| 13B   | perplexity   | 5.2543 | 5.3863 | 5.3607 | 5.3513 | 5.2856 | 5.2706 | 5.2548 |
 | 13B   | file size    | 25.0G  | 7.6G   | 9.1G   | 7.6G   | 8.4G   | 9.1G   | 14G    |
 | 13B   | ms/tok @ 4th | 239    | 104    | 113    | 160    | 176    | 185    | 141    |
 | 13B   | ms/tok @ 8th | 240    | 85     | 99     | 97     | 108    | 117    | 147    |
 | 13B   | bits/weight  | 16.0   | 5.0    | 6.0    | 5.0    | 5.5    | 6.0    | 9.0    |
 
+### Perplexity (measuring model quality)
+
+You can use the `perplexity` example to measure perplexity over a given prompt (lower perplexity is better).
+For more information, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity).
+
+The perplexity measurements in the table above are done against the `wikitext2` test dataset (https://paperswithcode.com/dataset/wikitext-2), with a context length of 512.
+The time per token is measured on a MacBook M1 Pro with 32GB RAM, using 4 and 8 threads.
+
 ### Interactive mode
 
 If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
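
Perplexity, as reported in the table above, is the exponential of the mean negative log-likelihood per token over the evaluation text. A minimal illustrative sketch of that relationship (a hypothetical helper, not code from `llama.cpp`):

```python
import math
from typing import Iterable

def perplexity(token_logprobs: Iterable[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_logprobs` are the natural-log probabilities a model assigned to the
    observed tokens, e.g. over the wikitext2 test set with a 512-token context.
    Lower is better; roughly, a perplexity of 6 means the model is on average
    as uncertain as a uniform choice among 6 tokens.
    """
    logprobs = list(token_logprobs)
    mean_nll = -sum(logprobs) / len(logprobs)  # average negative log-likelihood
    return math.exp(mean_nll)

# Hypothetical usage with made-up log-probabilities:
# print(perplexity([-1.7, -2.3, -0.9, -1.4]))  # ≈ 4.8
```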
@@ -373,6 +425,19 @@ python3 convert.py models/gpt4all-7B/gpt4all-lora-quantized.bin
 
 - The newer GPT4All-J model is not yet supported!
 
+### Using Pygmalion 7B & Metharme 7B
+
+- Obtain the [LLaMA weights](#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data)
+- Obtain the [Pygmalion 7B](https://huggingface.co/PygmalionAI/pygmalion-7b/) or [Metharme 7B](https://huggingface.co/PygmalionAI/metharme-7b) XOR encoded weights
+- Convert the LLaMA model with [the latest HF convert script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)
+- Merge the XOR files with the converted LLaMA weights by running the [xor_codec](https://huggingface.co/PygmalionAI/pygmalion-7b/blob/main/xor_codec.py) script
+- Convert to `ggml` format using the `convert.py` script in this repo:
+```bash
+python3 convert.py pygmalion-7b/ --outtype q4_1
+```
+> The Pygmalion 7B & Metharme 7B weights are saved in [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) precision. If you wish to convert to `ggml` without quantizing, please specify `--outtype` as `f32` instead of `f16`.
+
+
 ### Obtaining the Facebook LLaMA original model and Stanford Alpaca model data
 
 - **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**
@@ -405,26 +470,6 @@ If your issue is with model generation quality, then please at least scan the fo
 - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
 - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
 
-### Perplexity (measuring model quality)
-
-You can use the `perplexity` example to measure perplexity over the given prompt. For more background, see [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity). However, in general, lower perplexity is better for LLMs.
-
-#### Latest measurements
-
-The latest perplexity scores for the various model sizes and quantizations are being tracked in [discussion #406](https://github.com/ggerganov/llama.cpp/discussions/406). `llama.cpp` is measuring very well compared to the baseline implementations. Quantization has a small negative impact on quality, but, as you can see, running
-13B at q4_0 beats the 7B f16 model by a significant amount.
-
-All measurements are done against the wikitext2 test dataset (https://paperswithcode.com/dataset/wikitext-2), with default options (512 length context).
-Note that changing the context length will have a significant impact on perplexity (longer context = better perplexity).
-```
-Perplexity - model options
-5.5985 - 13B, q4_0
-5.9565 - 7B, f16
-6.3001 - 7B, q4_1
-6.5949 - 7B, q4_0
-6.5995 - 7B, q4_0, --memory_f16
-```
-
 #### How to run
 
 1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research

convert.py (+7 -3)
@@ -766,7 +766,7 @@ def load() -> UnquantizedTensor:
             return UnquantizedTensor(np.frombuffer(buf, dtype=numpy_dtype).reshape(shape))
         description = f'safetensors begin={begin} end={end} type={data_type} path={path}'
         return LazyTensor(load, shape, data_type, description)
-    model = {name: convert(info) for (name, info) in header.items()}
+    model = {name: convert(info) for (name, info) in header.items() if name != '__metadata__'}
     return ModelPlus(model=model, paths=[path], format='safetensors', vocab=None)
 
 
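For context on the `__metadata__` filter above: a safetensors file starts with an 8-byte little-endian length, followed by a JSON header mapping tensor names to their dtype, shape, and byte offsets; the header may also carry an optional `__metadata__` entry of free-form strings, which is not a tensor and must be skipped. A minimal sketch of that layout (hypothetical helper, not code from `convert.py`):

```python
import json
import struct

def read_safetensors_tensor_entries(path: str) -> dict:
    """Return only the tensor entries from a safetensors JSON header."""
    with open(path, "rb") as f:
        header_size, = struct.unpack("<Q", f.read(8))  # 8-byte little-endian header length
        header = json.loads(f.read(header_size))       # JSON: name -> {dtype, shape, data_offsets}
    # '__metadata__' is an optional map of free-form strings, not a tensor,
    # so drop it before treating every entry as tensor metadata.
    return {name: info for name, info in header.items() if name != "__metadata__"}

# Hypothetical usage:
# for name, info in read_safetensors_tensor_entries("model-00001-of-00002.safetensors").items():
#     print(name, info["dtype"], info["shape"], info["data_offsets"])
```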

@@ -1051,8 +1051,12 @@ def load_some_model(path: Path) -> ModelPlus:
     '''Load a model of any supported format.'''
     # Be extra-friendly and accept either a file or a directory:
     if path.is_dir():
-        globs = ["consolidated.00.pth", "pytorch_model-00001-of-*.bin", "*.pt"]
-        files = [file for glob in globs for file in path.glob(glob)]
+        # Check if it's a set of safetensors files first
+        files = list(path.glob("model-00001-of-*.safetensors"))
+        if not files:
+            # Try the PyTorch patterns too, with lower priority
+            globs = ["consolidated.00.pth", "pytorch_model-00001-of-*.bin", "*.pt"]
+            files = [file for glob in globs for file in path.glob(glob)]
         if not files:
             # Try GGML too, but with lower priority, since if both a non-GGML
             # model and a GGML model exist in the same directory, we assume the
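
The hunk above gives directory loading a simple priority order: sharded safetensors files first, then the PyTorch checkpoint patterns, with GGML as a last resort further down in the function. A standalone sketch of that lookup order (hypothetical helper mirroring the diff, not code from `convert.py`):

```python
from pathlib import Path
from typing import List

def candidate_model_files(model_dir: Path) -> List[Path]:
    """Return model files from a directory, preferring safetensors over PyTorch."""
    pattern_groups = [
        ["model-00001-of-*.safetensors"],                                  # HF safetensors shards
        ["consolidated.00.pth", "pytorch_model-00001-of-*.bin", "*.pt"],   # PyTorch checkpoints
    ]
    for patterns in pattern_groups:
        files = [f for pattern in patterns for f in model_dir.glob(pattern)]
        if files:
            return files
    return []  # empty: caller falls back to GGML files in the same directory

# Hypothetical usage:
# print(candidate_model_files(Path("models/7B")))
```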
