
Only use Q6_K for output weights if tensor size is multiple of 256 #1932


Merged: 2 commits, merged on Jun 19, 2023.
The diff below shows the changes from 1 commit.

llama.cpp (8 changes: 6 additions and 2 deletions)
@@ -2495,7 +2495,7 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
     if (quantized_type == GGML_TYPE_Q2_K || quantized_type == GGML_TYPE_Q3_K || quantized_type == GGML_TYPE_Q4_K ||
         quantized_type == GGML_TYPE_Q5_K || quantized_type == GGML_TYPE_Q6_K) {
         int nx = tensor.ne.at(0);
-        int ny = tensor.ne.at(0);
+        int ny = tensor.ne.at(1);
         if (nx % QK_K != 0 || ny % QK_K != 0) {
             fprintf(stderr, "\n\n========================= Tensor sizes %d x %d are not divisible by %d\n",nx,ny,QK_K);
             fprintf(stderr, "This is required to be able to use k-quants for now!\n");
@@ -2504,7 +2504,11 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
         }
     }
     if (tensor.name == "output.weight") {
-        new_type = GGML_TYPE_Q6_K;
+        int nx = tensor.ne.at(0);
+        int ny = tensor.ne.at(1);
+        if (nx % QK_K == 0 || ny % QK_K == 0) {

Collaborator:

Ahh, I don't think I'm really qualified to review your pulls!

The only thing I'd say is that doing if ((nx * ny) % QK_K == 0) { in both places might be clearer for people with bad math knowledge (like me) who don't necessarily know that if one of the dimensions is divisible, then the number of elements in the tensor necessarily is.

There's currently no case for LLaMA models where the tensor sizes wouldn't be divisible. Correct?

Contributor Author:

Thanks for spotting this! It must be nx % QK_K == 0 && ny % QK_K == 0
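
For reference, a minimal standalone sketch of the corrected condition described here (illustrative only: the helper name can_use_q6_k is made up, the constant 256 is the QK_K super-block size, and the example shapes are just a width divisible by 256 versus one, like 4544, that is not):

```cpp
#include <cstdio>

static const int QK_K = 256; // k-quants super-block size

// Both dimensions must be multiples of QK_K before output.weight is
// switched to Q6_K, i.e. '&&' rather than '||'.
static bool can_use_q6_k(int nx, int ny) {
    return nx % QK_K == 0 && ny % QK_K == 0;
}

int main() {
    std::printf("%d\n", can_use_q6_k(4096, 32000)); // 1: both dimensions divisible by 256
    std::printf("%d\n", can_use_q6_k(4544, 65024)); // 0: 4544 is not divisible by 256
    return 0;
}
```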

Collaborator (@KerfuffleV2), Jun 19, 2023:

Now I'm really confused. Was && actually necessary here?

I assumed your code was correct initially, and did a little verification in the Python REPL just to prove it to myself:

>>> x = 256
>>> y = 256
>>> x * y % 256
0
>>> x = 13
>>> x * y % 256
0
>>> x = 256
>>> y = 13
>>> x * y % 256
0

Obviously that's not very scientific, but it seems like if one of the dimensions is divisible then the number of elements will be too (at least for 2D tensors).

Contributor Author:

It is not about the total number of elements being a multiple of 256, but about the number of columns being a multiple of 256. As a tensor can get used in both left and right multiplications of a vector, it is best to check that rows and columns are both a multiple of 256 (the current k-quants super-block size).
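
A small illustration of that distinction, assuming (as the k-quants code does) that quantization works on one row at a time in super-blocks of QK_K elements; the 13 x 256 shape echoes the REPL example above:

```cpp
#include <cstdio>

static const int QK_K = 256; // k-quants super-block size

int main() {
    // 13 columns x 256 rows: the total element count is a multiple of 256 ...
    const int n_cols = 13;
    const int n_rows = 256;
    const long long n_total = (long long) n_cols * n_rows;
    std::printf("total %% QK_K = %lld\n", n_total % QK_K); // prints 0

    // ... but a single row has only 13 elements, so it cannot be split
    // into 256-element super-blocks, which is what row-wise k-quantization
    // (and the matmul kernels) assume. Hence both dimensions are checked.
    std::printf("cols  %% QK_K = %d\n", n_cols % QK_K);    // prints 13
    return 0;
}
```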

Collaborator:

Just want to make sure I understand correctly:

If the number of elements % QK_K is 0 then quantizing/dequantizing is safe (but performing actual operations like matmul may not be).

If that's true, are asserts like:

https://github.com/ggerganov/llama.cpp/blob/16b9cd193965769089881bb8ec012fccca7b37b6/k_quants.c#L1504-L1505

actually enough to ensure Bad Stuff doesn't occur? It seems like that's only checking the number of elements and not that the rows/columns conform. This might be out of scope for the current pull but there should be something ensuring that the parameters are valid for operations like that (and maybe I'm misunderstanding and that's already the case).

Contributor Author:

This function is always called with one row of a tensor multiplying a vector of embeddings, so n is the number of columns. Because the implementation does not handle the case where n (the number of columns in the quantized tensor) is not a multiple of QK_K, there is the assert.
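
To make the shape of that argument concrete, the guarded kernels look roughly like the sketch below; this is a paraphrase with a simplified name and signature, not a verbatim copy of k_quants.c:

```cpp
#include <cassert>

#define QK_K 256

// Sketch: n is the row length (number of columns) of the quantized
// tensor, and the row is processed in super-blocks of QK_K elements,
// so the kernel rejects anything that is not a multiple of QK_K.
static void vec_dot_q6_K_sketch(int n, float * s, const void * vx, const void * vy) {
    assert(n % QK_K == 0);
    const int nb = n / QK_K; // number of super-blocks in one row

    // ... per-super-block dot-product accumulation would go here ...
    (void) s; (void) vx; (void) vy; (void) nb;
}
```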

Collaborator:

I guess the question I'm asking in a roundabout way is: if some random code that uses GGML quantizes a tensor as Q6_K, Q4_K, or whatever without respecting those invariants and then uses it in a matmul operation, what happens? Do we hit an assert and refuse to continue? It seems like the answer is "no", since Falcon 7B models appeared to work but didn't produce correct output.

Basically, adding an assert to llama.cpp in that function does fix the problem for llama.cpp, but it doesn't prevent other consumers of GGML + k-quants from shooting themselves in the foot unless there are asserts guarding the lower-level functions.
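
One way to do that (purely a hypothetical sketch; quantize_row_q6_K_guarded is not an existing GGML function) would be to put the same guard at the row-quantization entry point, so misuse fails loudly instead of producing a silently broken tensor:

```cpp
#include <cassert>

#define QK_K 256

// Hypothetical guard at the row-quantization entry point: reject row
// lengths that are not a multiple of the super-block size instead of
// quantizing data that later misbehaves in matmul.
static void quantize_row_q6_K_guarded(const float * x, void * y, int k) {
    assert(k % QK_K == 0 && "row length must be a multiple of QK_K");
    (void) x; (void) y;
    // ... the actual per-block quantization would follow here ...
}
```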

+            new_type = GGML_TYPE_Q6_K;
+        }
     } else if (tensor.name.find("attention.wv.weight") != std::string::npos) {
         if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q2_K) new_type = GGML_TYPE_Q4_K;
         else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) new_type = GGML_TYPE_Q5_K;