Shape Error When Running Inference after Converting OpenLlama 3B to GGML #1709

Closed
Emm9625 opened this issue Jun 6, 2023 · 8 comments · Fixed by #1958

Comments


Emm9625 commented Jun 6, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Model loads successfully and inference can be run.

Current Behavior

Model fails to load with shape error.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 5315Y CPU @ 3.20GHz
Stepping:                        6
CPU MHz:                         3200.011
BogoMIPS:                        6400.05
Hypervisor vendor:               Xen
Virtualization type:             full
L1d cache:                       384 KiB
L1i cache:                       256 KiB
L2 cache:                        10 MiB
L3 cache:                        96 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Vulnerable
Vulnerability L1tf:              Mitigation; PTE Inversion
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault pti ibpb fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves umip rdpid
  • Operating System, e.g. for Linux:

$ uname -a

Linux nwlujxf2ho 5.4.0-122-generic #138~18.04.1-Ubuntu SMP Fri Jun 24 14:14:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • SDK version, e.g. for Linux:
Python 3.9.16

 GNU Make 4.2.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 


Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Run the conversion: python convert.py ../open_llama_3b_600bt_preview/
  2. Run inference: ./main -m ../open_llama_3b_600bt_preview/ggml-model-f16.bin

Failure Logs

CONVERSION:

python convert.py ../open_llama_3b_600bt_preview/
Loading model file ../open_llama_3b_600bt_preview/pytorch_model.bin
Loading vocab file ../open_llama_3b_600bt_preview/tokenizer.model
Writing vocab...

INFERENCE:

root@nwlujxf2ho:/notebooks/llama.cpp# ./main -m ../open_llama_3b_600bt_preview/ggml-model-f16.bin
main: build = 1 (f4c55d3)
main: seed  = 1686014332
llama.cpp: loading model from ../open_llama_3b_600bt_preview/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 3200
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 25
llama_model_load_internal: n_layer    = 26
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 8704
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 3B
llama_model_load_internal: ggml ctx size =    0.06 MB
error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected  3200 x  8704, got  3200 x  8640
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../open_llama_3b_600bt_preview/ggml-model-f16.bin'
main: error: unable to load model

ThioJoe commented Jun 7, 2023

I'm also seeing this error after trying to use the latest WizardLM-7B-uncensored.ggml.q8_0.bin.

Actually, I realize I was loading the wrong model, which was using the old format. I downloaded one in "ggjt v3" format and the error went away, though I'm now getting a different error: #1732

@BrickBee

error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected 3200 x 8704, got 3200 x 8640

Same for me. It is also broken in the original commit (ffb06a3), tested with the 600bt version.

The error can be fixed by applying the hack in #1588; quantized models will also work fine then. I don't see either the original hack or a suitable replacement having been merged as part of the original PR, @SlyEcho.

Something else has broken since it was added, though: quantized models output garbage in the current version (92f20d9). The converted fp16 model still works fine (with the hack).

SlyEcho (Collaborator) commented Jun 17, 2023

@BrickBee, which quantization format is broken for you? I can confirm that 3B Q4_0 and Q5_1 are working with the current master, build = 701 (4f9c43e).

I have the files up on https://huggingface.co/SlyEcho/open_llama_3b_ggml, and if you want to create them yourself, the Makefile and diff file can create all the models and checksums from scratch.

@BrickBee

I can confirm that the quantized files you've linked work fine with the release version you've linked. The quantized versions I created at the time of the PR also still work correctly with the current version.
Yet when I use the current version to convert (with the patch) and quantize the source model again, the resulting quantized model outputs garbage. The resulting files also differ in size:
Yours: open-llama-3b-q4_0.bin: 1,928,446,208 bytes
Mine: open_llama_3b_q4_0.ggml: 1,954,846,208 bytes

SlyEcho (Collaborator) commented Jun 18, 2023

OK, I can trace it back to PR #1807, which for some reason quantizes a single tensor using Q6_K regardless of the user's chosen format, making those models broken when k-quants are not compiled in (they are optional) or not supported.

This was actually reverted temporarily in #1711, but it was added back in.

What was the thinking behind this change, @ikawrakow?

@ikawrakow (Contributor)

What was the thinking behind this change, @ikawrakow?

Clearly, there wasn't enough thinking here ;-)

More seriously, the decision to bring it back was based on a discussion with @ggerganov that we should use the more accurate Q6_K quantization for the output weights once k-quants are implemented for all ggml-supported architectures (CPU, GPU via CUDA and OpenCL, and Metal for the Apple GPU). Using Q6_K for output.weight does improve generation quality at a nearly negligible increase in model size. What we missed in the decision-making process is that in the meantime there are models other than Meta's LLaMA in use, with tensor sizes that are not a multiple of the k-quants super-block size of 256. This is now taken care of by the last missing check in PR #1932, so llama.cpp can be built without the hassle of explicitly disabling k-quants at compile time.
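
As a quick illustration of the divisibility issue, here is a minimal standalone sketch (it assumes only the n_embd = 3200 and n_ff = 8640 values from the logs earlier in this thread; QK_K is just a local name for the 256-element super-block size mentioned above):

#include <cstdio>

int main() {
    const int QK_K = 256;              // k-quants super-block size discussed above
    const int dims[] = { 3200, 8640 }; // OpenLLaMA 3B: n_embd and actual n_ff
    for (int d : dims) {
        // A non-zero remainder means a row of this length cannot be split
        // into whole 256-element super-blocks.
        printf("%d %% %d = %d\n", d, QK_K, d % QK_K);
    }
    return 0;
}

It prints remainders of 128 and 192, so neither dimension lines up with the super-block size.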

On that note, I wonder how the OpenLLaMA 3B model is being used. I downloaded the fp16 model from Hugging Face and used the convert.py script to convert to ggml format. But the model wouldn't load, because the feed-forward network size is mispredicted as 8704 instead of the actual 8640 by this line in llama.cpp:

 const uint32_t n_ff = ((2*(4*hparams.n_embd)/3 + hparams.n_mult - 1)/hparams.n_mult)*hparams.n_mult;
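
For reference, the same integer arithmetic can be checked in isolation (a minimal standalone sketch, using only the n_embd = 3200 and n_mult = 256 values from the load log above):

#include <cstdint>
#include <cstdio>

int main() {
    // Values from the llama_model_load_internal log above.
    const uint32_t n_embd = 3200;
    const uint32_t n_mult = 256;
    // Same expression as the llama.cpp line quoted above (integer division).
    const uint32_t n_ff = ((2*(4*n_embd)/3 + n_mult - 1)/n_mult)*n_mult;
    printf("predicted n_ff = %u\n", n_ff); // prints 8704; the checkpoint's actual n_ff is 8640
    return 0;
}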

If I fix this misprediction so that I'm able to load the model and run a perplexity calculation, I get wild values in excess of 2000. What am I missing? Is it because the tokenization is different and, if so, how do you use the 3B model? I would like to use it to work on adapting k-quants to non-256-divisible model sizes, so any help is appreciated.

@BrickBee

Conversion and fp16 inference work after applying this diff.
That was, by the way, the original point of this issue: the 3B model can't be used with the current code unless a pre-converted version is available (or the code is patched).

SlyEcho (Collaborator) commented Jun 19, 2023

On that note, I wonder how the OpenLLaMA 3B model is being used. I downloaded the fp16 model from Hugging Face and used the convert.py script to convert to ggml format.

convert.py is still broken, and we didn't want to commit the crude hacks. But since the model has a free license, the files are up for download.

Check my HF repo for the converted files and also the full Makefile to run it yourself.
