clip : refactor clip_init, add tests #12757
Conversation
Very useful!
Does the Qwen2-VL test fail for you too? It segfaults on my mac:
...
0.01.596.379 I llama_context: CPU output buffer size = 0.58 MiB
0.01.596.383 I init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
0.01.607.493 I init: Metal KV buffer size = 112.00 MiB
0.01.607.497 I llama_context: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
0.01.620.985 I llama_context: Metal compute buffer size = 299.75 MiB
0.01.620.986 I llama_context: CPU compute buffer size = 11.51 MiB
0.01.620.986 I llama_context: graph nodes = 1042
0.01.620.987 I llama_context: graph splits = 114
Segmentation fault: 11
examples/llava/clip.h (outdated)
enum clip_log_level {
    CLIP_LOG_NONE = 0,
    CLIP_LOG_ERROR = 1,
    CLIP_LOG_WARNING = 2,
    CLIP_LOG_INFO = 3,
    CLIP_LOG_DEBUG = 4,
};
Suggested change:

enum clip_log_level {
    CLIP_LOG_LEVEL_NONE = 0,
    CLIP_LOG_LEVEL_ERROR = 1,
    CLIP_LOG_LEVEL_WARNING = 2,
    CLIP_LOG_LEVEL_INFO = 3,
    CLIP_LOG_LEVEL_DEBUG = 4,
};
Also align the values with the existing ggml_log_level enum, or even use it directly.
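A minimal sketch of the second option, reusing the ggml enum instead of defining a parallel one (the clip_context_params struct and its fields here are illustrative, not the actual API):

// ggml.h already defines ggml_log_level with ERROR/WARN/INFO/DEBUG values,
// so clip could accept it directly in its init parameters.
#include "ggml.h"

// hypothetical parameter struct, for illustration only
struct clip_context_params {
    bool                use_gpu;
    enum ggml_log_level verbosity;   // e.g. GGML_LOG_LEVEL_INFO
};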
I also refactored the logging logic in 88aec68 (most of the code is copied from common/log.h).
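For readers not familiar with common/log.h, the general pattern is a printf-style helper gated by a verbosity threshold. A simplified sketch follows; the names and details are illustrative rather than the exact code from the commit, and it assumes the current ggml ordering where DEBUG < INFO < WARN < ERROR:

#include <cstdarg>
#include <cstdio>

#include "ggml.h"

// current verbosity threshold; anything below it is suppressed
static enum ggml_log_level g_clip_log_threshold = GGML_LOG_LEVEL_INFO;

static void clip_log(enum ggml_log_level level, const char * fmt, ...) {
    if (level < g_clip_log_threshold) {
        return; // e.g. DEBUG messages are dropped at the default INFO threshold
    }
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
}

// thin printf-style wrappers, similar in spirit to LOG_ERR/LOG_INF/LOG_DBG
#define LOG_ERR(...) clip_log(GGML_LOG_LEVEL_ERROR, __VA_ARGS__)
#define LOG_INF(...) clip_log(GGML_LOG_LEVEL_INFO,  __VA_ARGS__)
#define LOG_DBG(...) clip_log(GGML_LOG_LEVEL_DEBUG, __VA_ARGS__)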
examples/llava/tests.sh (outdated)
add_test "llama-gemma3-cli" "ggml-org/gemma-3-4b-it-GGUF" | ||
add_test "llama-llava-cli" "guinmoon/MobileVLM-3B-GGUF" | ||
add_test "llama-llava-cli" "THUDM/glm-edge-v-5b-gguf" | ||
add_test "llama-llava-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K" | ||
add_test "llama-llava-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K" | ||
add_test "llama-llava-cli" "ibm-research/granite-vision-3.2-2b-GGUF" | ||
add_test "llama-minicpmv-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted | ||
add_test "llama-minicpmv-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K" | ||
add_test "llama-qwen2vl-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF" |
At some point we have to source all of these models from ggml-org, for 2 main reasons:
- Stability (i.e. we know they won't disappear)
- Safety (i.e. they cannot be replaced with malicious versions)
Yes, I completely agree with this.
Also FYI, I ran this test script on an A10G space on HF and all tests passed. My space was an ipynb, but I think it would be nice if we could have a Gradio space where we can simply enter the PR number or commit SHA to be tested.
No it doesn't. The command that I used is:
If it still fails, could you try gdb or lldb to see the stack trace?
The problem is for some reason the Metal crashes in the
I think the Metal is not happy to have 2 different buffers point to the same data? It crashes M1 Pro, M2 Ultra and M4 Max. Which chip do you have? In any case, this patch fixes it:

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 1399a29b6..dd9afc6b0 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -465,7 +465,7 @@ static ggml_cgraph * clip_image_build_graph_siglip(clip_ctx * ctx, const clip_im
V = ggml_cont(ctx0, ggml_permute(ctx0, V, 1, 2, 0, 3));
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
- KQ = ggml_scale_inplace(ctx0, KQ, 1.0f / sqrtf((float)d_head));
+ KQ = ggml_scale(ctx0, KQ, 1.0f / sqrtf((float)d_head));
KQ = ggml_soft_max_inplace(ctx0, KQ);
struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ);
@@ -721,7 +721,7 @@ static ggml_cgraph * clip_image_build_graph_legacy(clip_ctx * ctx, const clip_im
ctx0, Q, positions, nullptr,
d_head/2, mrope_sections, GGML_ROPE_TYPE_VISION, 32768, 10000, 1, 0, 1, 32, 1);
}
- Q = ggml_scale_inplace(ctx0, Q, 1.0f / sqrt((float)d_head));
+ Q = ggml_scale(ctx0, Q, 1.0f / sqrt((float)d_head));
Q = ggml_cont(ctx0, ggml_permute(ctx0, Q, 0, 2, 1, 3));
Q = ggml_reshape_3d(ctx0, Q, d_head, num_positions, n_head * batch_size);
@@ -1033,7 +1033,7 @@ static ggml_cgraph * clip_image_build_graph_legacy(clip_ctx * ctx, const clip_im
}
struct ggml_tensor * Q = ggml_add(ctx0, ggml_mul_mat(ctx0, model.mm_model_attn_q_w, q), model.mm_model_attn_q_b);
- Q = ggml_scale_inplace(ctx0, Q, 1.0f / sqrt((float)d_head));
+ Q = ggml_scale(ctx0, Q, 1.0f / sqrt((float)d_head));
struct ggml_tensor * K = ggml_add(ctx0, ggml_mul_mat(ctx0, model.mm_model_attn_k_w, k), model.mm_model_attn_k_b);
struct ggml_tensor * V = ggml_add(ctx0, ggml_mul_mat(ctx0, model.mm_model_attn_v_w, v), model.mm_model_attn_v_b);
     // permute
I'm using M3 Max (a bit funny, but how do you have 1, 2, 4 but skip 3 😂). Can you also give a try with:

KQ = ggml_soft_max_ext(ctx0, KQ, nullptr, 1.0f / sqrtf((float)d_head), 0.0f);
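For context, a sketch of what that fused call replaces in the attention code from the diff above (reusing ctx0, KQ and d_head from that context; illustrative only):

// before: separate scale op followed by soft-max
//   KQ = ggml_scale_inplace(ctx0, KQ, 1.0f / sqrtf((float)d_head));
//   KQ = ggml_soft_max_inplace(ctx0, KQ);
// after: soft-max with the scale folded in (no mask, no ALiBi bias)
KQ = ggml_soft_max_ext(ctx0, KQ, nullptr, 1.0f / sqrtf((float)d_head), 0.0f);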
I did some more debugging - it's not related to Metal, there is actually a legitimate bug somewhere. The reason is that the multi-rope is not supported by the Metal backend, so it is offloaded to the CPU. When the next op is:

llama.cpp/examples/llava/clip.cpp, lines 712 to 727 in 376f80a
Here is the problematic part of the generated graph. The node that crashes is # 42:
Running with a debugger, the problem is that the buffer of the
Simply changing the operation in the code above from
@slaren Do you have any guess what could be the root cause for this?
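To illustrate the difference being discussed, here is a sketch reusing ctx0, Q and d_head from the diff above; the buffer-sharing explanation is a hypothesis from this thread, not a confirmed root cause:

// ggml_scale allocates a new tensor with its own data, while ggml_scale_inplace
// returns a view that shares the data of its input. When part of the graph is
// offloaded to a different backend, that shared buffer is what appears to cause
// trouble here.
struct ggml_tensor * scaled     = ggml_scale        (ctx0, Q, 1.0f / sqrtf((float)d_head)); // separate tensor
struct ggml_tensor * scaled_inp = ggml_scale_inplace(ctx0, Q, 1.0f / sqrtf((float)d_head)); // view of Q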
Got it. @ngxson I think this is good to merge.
M3 seemed like too minor an upgrade. Not that M4 was really that significant either, but my M1 laptop was getting old and I needed a new one.
Thanks for reviewing and testing this. I'll merge once CI is green.

In the last commit, I also fixed an issue with the Yi-VL model. Although the model passes the "NY times" image test, it doesn't seem to be able to describe more complex scenes. I think the model is quite old anyway, and judging by the number of downloads, I doubt anyone is actually using it: https://huggingface.co/cmp-nct/Yi-VL-6B-GGUF (also leaving a link to the original PR here, for reference: #5093)

Anyway, it's truly a surprise to see how many models are supported by this clip/llava infrastructure. We currently have 11 different model archs in the
add_test "llama-gemma3-cli" "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M" | ||
add_test "llama-llava-cli" "cmp-nct/Yi-VL-6B-GGUF:Q5_K" | ||
add_test "llama-llava-cli" "guinmoon/MobileVLM-3B-GGUF:Q4_K_M" | ||
add_test "llama-llava-cli" "THUDM/glm-edge-v-5b-gguf:Q4_K_M" | ||
add_test "llama-llava-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K" | ||
add_test "llama-llava-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K" | ||
add_test "llama-llava-cli" "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M" | ||
add_test "llama-minicpmv-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted | ||
add_test "llama-minicpmv-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K" | ||
add_test "llama-minicpmv-cli" "openbmb/MiniCPM-o-2_6-gguf:Q4_0" | ||
add_test "llama-qwen2vl-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M" |
Btw @bartowski1182 do you have any other models to add to the list?
I don't think so, I can't think of any other vision models off the top of my head, but I can take a closer look.
@ngxson I think this PR might have broken clip quantization; https://github.com/ggml-org/llama.cpp/blob/master/examples/llava/clip-quantize-cli.cpp no longer works after this (determined by bisecting).
* refactor clip_init
* fix loading file
* fix style
* test ok
* better test with report
* add missing headers
* clarify
* add KEY_MM_PATCH_MERGE_TYPE
* remove bool has_* pattern
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* use ggml_soft_max_ext
* refactor logging system
* add minicpm-v-o 2.6 for testing
* use nullptr everywhere
* fix Yi-VL model

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Well, I found out why clip-quantize-cli was broken, since in #12869
A very ugly hack to keep
Don't know if such a band-aid fix would be accepted here, but I'd be happy to PR it if desired.
Tbh I don't really like the code of
And also, the quantization code can be completely outside of
Cont #12322
In this PR:
- clip_model_loader
- llava/tests.sh script, which allows testing multiple models in one go

Smaller changes:
- patch_merge_type, so that we no longer need to do strcmp(const char)
- bool has_(tensor name) pattern

Tests can be run via the ./examples/llava/tests.sh script; you may need ~20GB to download the model weights.

Result: