Only use Q6_K for output weights if tensor size is multiple of 256 #1932
Merged
Ahh, I don't think I'm really qualified to review your pulls!
The only thing I'd say is that using
if ((nx * ny) % QK_K == 0) {
in both places might be clearer for people with bad math knowledge (like me) who don't necessarily know that if one of the dimensions is divisible, then the number of elements in the tensor necessarily is. There's currently no case for LLaMA models where the tensor sizes wouldn't be divisible, correct?
Thanks for spotting this! It must be `nx % QK_K == 0 && ny % QK_K == 0`.
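As a concrete sketch of where that corrected guard would sit (variable names besides `nx`, `ny`, and `QK_K` are assumed for illustration, not taken from the patch):

```c
// Sketch only: use Q6_K for the output weights tensor only when both
// dimensions split evenly into QK_K-wide (256-wide) super-blocks;
// nx is the number of columns, ny the number of rows.
if (nx % QK_K == 0 && ny % QK_K == 0) {
    new_type = GGML_TYPE_Q6_K;
} else {
    // keep the originally requested quantization type instead
}
```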
Now I'm really confused. Was `&&` actually necessary here? I assumed your code was correct initially, and did a little verification in the Python REPL just to prove it to myself:
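In that spirit, a hypothetical standalone C version of such a brute-force check (not the original REPL session):

```c
#include <stdio.h>

#define QK_K 256

int main(void) {
    // Brute force over small 2D shapes: does "one dimension divisible
    // by QK_K" ever fail to imply "total element count divisible"?
    for (int nx = 1; nx <= 2048; ++nx) {
        for (int ny = 1; ny <= 2048; ++ny) {
            int one_dim_ok = (nx % QK_K == 0) || (ny % QK_K == 0);
            int total_ok   = ((nx * ny) % QK_K == 0);
            if (one_dim_ok && !total_ok) {
                printf("counterexample: nx=%d ny=%d\n", nx, ny);
                return 1;
            }
        }
    }
    // Always reached: if QK_K divides nx, it divides nx*ny as well.
    printf("no counterexamples found\n");
    return 0;
}
```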
Obviously that's not very scientific, but it seems like if one of the dimensions is divisible then the number of elements will be too (at least for 2D tensors).
It is not about the total number of elements being a multiple of 256, but the number of columns being a multiple of 256. As a tensor can get used in left and right multiplications of a vector, it is best to check that rows and columns are a multiple of 256 (the current k-quants super-block size).
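To make the column constraint concrete, here is a float-only sketch of a row-times-vector dot product that walks the row in QK_K-sized super-blocks; it is illustrative, not the actual kernel:

```c
#include <assert.h>

#define QK_K 256   // current k-quants super-block size

// Illustrative only: the real kernels decode quantized super-blocks,
// but the blocking is the same. A row of n elements is consumed as
// n/QK_K super-blocks, so n must be a multiple of QK_K. When the same
// tensor multiplies from the other side, ny plays the role of n,
// which is why both dimensions get checked.
static float row_dot_sketch(int n, const float * x, const float * y) {
    assert(n % QK_K == 0);
    float sum = 0.0f;
    for (int ib = 0; ib < n / QK_K; ++ib) {    // one super-block at a time
        for (int i = 0; i < QK_K; ++i) {
            sum += x[ib*QK_K + i] * y[ib*QK_K + i];
        }
    }
    return sum;
}
```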
Just want to make sure I understand correctly: if the number of elements % `QK_K` is 0, then quantizing/dequantizing is safe (but performing actual operations like matmul may not be). If that's true, are asserts like
https://github.com/ggerganov/llama.cpp/blob/16b9cd193965769089881bb8ec012fccca7b37b6/k_quants.c#L1504-L1505
actually enough to ensure Bad Stuff doesn't occur? It seems like that's only checking the number of elements and not that the rows/columns conform. This might be out of scope for the current pull, but there should be something ensuring that the parameters are valid for operations like that (and maybe I'm misunderstanding and that's already the case).
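For readers without the link handy, the referenced guard has roughly this shape (paraphrased, not quoted from that revision):

```c
#include <assert.h>

#define QK_K 256

// Paraphrased shape of the guard behind the link above: the dot-product
// kernel validates only the row length n (the column count of the
// quantized tensor) and knows nothing about the tensor's other dimension.
void ggml_vec_dot_q6_K_q8_K_sketch(const int n, float * s,
                                   const void * vx, const void * vy) {
    assert(n % QK_K == 0);        // the only shape invariant enforced here
    (void)s; (void)vx; (void)vy;  // actual dot product elided in this sketch
}
```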
This function call is always triggered with 1 row in a tensor multiplying a vector of embeddings, so `n` is the number of columns. Because the implementation does not handle the case where `n` (the number of columns in the quantized tensor) is not a multiple of `QK_K`, there is the assert.
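A simplified picture of that call pattern (helper names are hypothetical; the real dispatch lives in ggml's matmul code):

```c
#include <stddef.h>

// Hypothetical kernel signature standing in for the real ggml_vec_dot_*
// functions: computes one output element from one quantized row of
// length n and the input vector.
void vec_dot_sketch(int n, float * s, const void * row, const void * vec);

// An (ny x nx) quantized matrix times an nx-long vector calls the
// kernel once per row, so the kernel's n parameter is always nx,
// the column count of the tensor.
void mat_vec_sketch(int nx, int ny, size_t row_size,
                    const char * w, const void * vec, float * dst) {
    for (int ir = 0; ir < ny; ++ir) {
        vec_dot_sketch(nx, &dst[ir], w + (size_t)ir*row_size, vec);
    }
}
```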
I guess the question I'm asking in a roundabout way is: if some random code that uses GGML quantizes a tensor as Q6_K, Q4_K, whatever, without respecting those invariants, and then uses it in a matmul operation, what happens? Do we hit an assert and refuse to continue? It seems like the answer would be "no", since Falcon 7B models appeared to work but didn't produce correct output.
Basically, adding an assert to `llama.cpp` in that function does fix the problem for `llama.cpp`, but it doesn't prevent other consumers of GGML + k-quants from shooting themselves in the foot unless there are asserts guarding the lower-level functions.
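One way to read that suggestion is to push the invariant down into the k-quants entry points themselves, so that any GGML consumer trips it immediately. A hypothetical sketch (function name and placement are illustrative, not the actual patch):

```c
#include <assert.h>
#include <stddef.h>

#define QK_K 256

// Hypothetical: guard the low-level quantization entry point itself,
// so that any GGML consumer (not just llama.cpp) fails loudly instead
// of producing a silently broken model, as happened with Falcon 7B.
size_t quantize_q6_K_guarded(const float * src, void * dst, int n) {
    assert(n % QK_K == 0 && "k-quants require sizes divisible by QK_K");
    (void)src; (void)dst;   // real quantization elided in this sketch
    return 0;               // would return the number of bytes written
}
```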