clip : refactor clip_init, add tests #12757

Merged: 16 commits into ggml-org:master on Apr 5, 2025

Conversation

Collaborator

@ngxson ngxson commented Apr 4, 2025

Cont #12322

In this PR:

  • Add clip_model_loader
  • Add llava/tests.sh script, which allows testing multiple models in one go

Smaller changes:

  • Add enum patch_merge_type, so that we no longer need to do strcmp(const char *) comparisons (a short sketch follows this list)
  • Remove the bool has_(tensor name) pattern
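
A minimal sketch of the patch_merge_type idea (names assumed, not necessarily the exact clip.cpp definitions): parse the GGUF metadata string once into an enum at load time and branch on that, instead of calling strcmp everywhere the value is needed.

#include <cstring>

// sketch only: assumed names, not the exact clip.cpp code
enum patch_merge_type {
    PATCH_MERGE_FLAT,
    PATCH_MERGE_SPATIAL_UNPAD,
};

static patch_merge_type parse_patch_merge_type(const char * s) {
    // map the metadata string to the enum once, at load time
    if (std::strcmp(s, "spatial_unpad") == 0) {
        return PATCH_MERGE_SPATIAL_UNPAD;
    }
    return PATCH_MERGE_FLAT; // default
}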

Tests can be run via the ./examples/llava/tests.sh script; you may need ~20 GB to download the model weights.

Result:

OK:   llama-gemma3-cli ggml-org/gemma-3-4b-it-GGUF
OK:   llama-llava-cli guinmoon/MobileVLM-3B-GGUF
OK:   llama-llava-cli THUDM/glm-edge-v-5b-gguf
OK:   llama-llava-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
OK:   llama-llava-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K
OK:   llama-llava-cli ibm-research/granite-vision-3.2-2b-GGUF
OK:   llama-minicpmv-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
OK:   llama-minicpmv-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
OK:   llama-qwen2vl-cli bartowski/Qwen2-VL-2B-Instruct-GGUF

@ngxson ngxson marked this pull request as ready for review April 4, 2025 20:39
@ngxson ngxson requested a review from ggerganov April 4, 2025 20:39
Member

@ggerganov ggerganov left a comment

Very useful!

Does the Qwen2-VL test fail for you too? It segfaults on my mac:

...
0.01.596.379 I llama_context:        CPU  output buffer size =     0.58 MiB
0.01.596.383 I init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
0.01.607.493 I init:      Metal KV buffer size =   112.00 MiB
0.01.607.497 I llama_context: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
0.01.620.985 I llama_context:      Metal compute buffer size =   299.75 MiB
0.01.620.986 I llama_context:        CPU compute buffer size =    11.51 MiB
0.01.620.986 I llama_context: graph nodes  = 1042
0.01.620.987 I llama_context: graph splits = 114
Segmentation fault: 11

Comment on lines 27 to 34
enum clip_log_level {
    CLIP_LOG_NONE = 0,
    CLIP_LOG_ERROR = 1,
    CLIP_LOG_WARNING = 2,
    CLIP_LOG_INFO = 3,
    CLIP_LOG_DEBUG = 4,
};

Member

Suggested change
-enum clip_log_level {
-    CLIP_LOG_NONE = 0,
-    CLIP_LOG_ERROR = 1,
-    CLIP_LOG_WARNING = 2,
-    CLIP_LOG_INFO = 3,
-    CLIP_LOG_DEBUG = 4,
-};
+enum clip_log_level {
+    CLIP_LOG_LEVEL_NONE = 0,
+    CLIP_LOG_LEVEL_ERROR = 1,
+    CLIP_LOG_LEVEL_WARNING = 2,
+    CLIP_LOG_LEVEL_INFO = 3,
+    CLIP_LOG_LEVEL_DEBUG = 4,
+};

Also align the values with the existing ggml_log_level enum or even use it directly.
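
For reference, a minimal sketch of what using ggml_log_level directly could look like (assumed shape; only the ggml.h types are real, the state struct and function names are illustrative):

#include "ggml.h"

#include <cstdio>

// illustrative logger state: a verbosity threshold plus the standard ggml callback
struct clip_logger_state {
    enum ggml_log_level verbosity_thold; // messages below this level are dropped
    ggml_log_callback   cb;
    void *              cb_user_data;
};

static void clip_log_callback_default(enum ggml_log_level level, const char * text, void * user_data) {
    (void) level;
    (void) user_data;
    fputs(text, stderr); // default sink: print everything to stderr
}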

Collaborator Author

I also refactored the logging logic in 88aec68

(Most of the code copied from common/log.h)

Comment on lines 28 to 36
add_test "llama-gemma3-cli" "ggml-org/gemma-3-4b-it-GGUF"
add_test "llama-llava-cli" "guinmoon/MobileVLM-3B-GGUF"
add_test "llama-llava-cli" "THUDM/glm-edge-v-5b-gguf"
add_test "llama-llava-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K"
add_test "llama-llava-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K"
add_test "llama-llava-cli" "ibm-research/granite-vision-3.2-2b-GGUF"
add_test "llama-minicpmv-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test "llama-minicpmv-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test "llama-qwen2vl-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF"
Member

At some point we have to source all of these models from ggml-org, for 2 main reasons:

  • Stability (i.e. we know they won't disappear)
  • Safety (i.e. cannot be replaced with malicious versions)

Collaborator Author

Yes I completely agree with this.

Also FYI, I ran this test script on an A10G space on HF and they all passed. My space was an ipynb, but I think it would be nice if we could have a Gradio space where we could simply enter the PR number or commit SHA to be tested.

@ngxson
Collaborator Author

ngxson commented Apr 5, 2025

Does the Qwen2-VL test fail for you too? It segfaults on my mac

No, it doesn't. The command that I used is: llama-qwen2vl-cli -hf bartowski/Qwen2-VL-2B-Instruct-GGUF --image ... -p "what do you see"

If it still fails, could you try gdb or lldb to see the stack trace?

ngxson and others added 2 commits April 5, 2025 14:35
Co-authored-by: Georgi Gerganov <[email protected]>
@ggerganov
Member

The problem is that, for some reason, Metal crashes in the ggml_scale_inplace op:

Target 0: (llama-qwen2vl-cli) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x30000002b)
  * frame #0: 0x00000001031346c0 AGXMetalG14X`void AGX::ComputeContext<AGX::G14X::Encoders, AGX::G14X::Classes, AGX::G14X::ObjClasses, AGX::G14X::EncoderComputeServiceClasses>::setBuffer_impl<AGXBuffer>(AGXBuffer const*, unsigned long, unsigned int, unsigned long) + 64
    frame #1: 0x0000000100ecc288 libggml-metal.dylib`ggml_metal_encode_node(backend=0x0000600000020c00, idx=0, encoder=0x0000600001c70f30) at ggml-metal.m:1826:17
    frame #2: 0x0000000100eca890 libggml-metal.dylib`__ggml_backend_metal_set_n_cb_block_invoke(.block_descriptor=0x0000600003f8dfb0, iter=1) at ggml-metal.m:4962:25
    frame #3: 0x0000000100eca214 libggml-metal.dylib`ggml_metal_graph_compute(backend=0x0000600000020c00, gf=0x00000001300d8348) at ggml-metal.m:4539:13
    frame #4: 0x0000000100ec9eac libggml-metal.dylib`ggml_backend_metal_graph_compute(backend=0x0000600000020c00, cgraph=0x00000001300d8348) at ggml-metal.m:4942:12
    frame #5: 0x00000001010b7edc libggml-base.dylib`ggml_backend_graph_compute_async(backend=0x0000600000020c00, cgraph=0x00000001300d8348) at ggml-backend.cpp:334:12
    frame #6: 0x00000001010bb8f8 libggml-base.dylib`ggml_backend_sched_compute_splits(sched=0x000000012b310a00) at ggml-backend.cpp:1399:35
    frame #7: 0x00000001010bb588 libggml-base.dylib`ggml_backend_sched_graph_compute_async(sched=0x000000012b310a00, graph=0x0000000130008020) at ggml-backend.cpp:1590:12
    frame #8: 0x00000001010bb4f0 libggml-base.dylib`ggml_backend_sched_graph_compute(sched=0x000000012b310a00, graph=0x0000000130008020) at ggml-backend.cpp:1574:28
    frame #9: 0x000000010002d770 llama-qwen2vl-cli`clip_image_batch_encode(ctx=0x000000012a609a10, n_threads=16, imgs=0x000000016fdfd218, vec=0x0000000132388000) at clip.cpp:2651:19
    frame #10: 0x000000010002c688 llama-qwen2vl-cli`clip_image_encode(ctx=0x000000012a609a10, n_threads=16, img=0x00006000038727d0, vec=0x0000000132388000) at clip.cpp:2457:12
    frame #11: 0x0000000100017a04 llama-qwen2vl-cli`encode_image_with_clip(ctx_clip=0x000000012a609a10, n_threads=16, img=0x0000600002ef03c0, image_embd=0x0000000131c70000, n_img_pos=0x000000016fdfd5fc) at llava.cpp:277:27
    frame #12: 0x0000000100017694 llama-qwen2vl-cli`llava_image_embed_make_with_clip_img(ctx_clip=0x000000012a609a10, n_threads=16, img=0x0000600002ef03c0, image_embd_out=0x000000016fdfd668, n_img_pos_out=0x000000016fdfd664) at llava.cpp:430:10
    frame #13: 0x00000001000189d4 llama-qwen2vl-cli`llava_image_embed_make_with_bytes(ctx_clip=0x000000012a609a10, n_threads=16, image_bytes="\xff\xd8\xff\xe0", image_bytes_length=124071) at llava.cpp:503:31
    frame #14: 0x0000000100018b0c llama-qwen2vl-cli`llava_image_embed_make_with_filename(ctx_clip=0x000000012a609a10, n_threads=16, image_path="/Users/ggerganov/development/github/llama.cpp/examples/llava/test-1.jpeg") at llava.cpp:565:32
    frame #15: 0x0000000100004ccc llama-qwen2vl-cli`load_image(ctx_llava=0x0000600002ef03a0, params=0x000000016fdfd8e8, fname="/Users/ggerganov/development/github/llama.cpp/examples/llava/test-1.jpeg") at qwen2vl-cli.cpp:228:17
    frame #16: 0x0000000100004498 llama-qwen2vl-cli`main(argc=9, argv=0x000000016fdfed40) at qwen2vl-cli.cpp:565:34
    frame #17: 0x000000018bc04274 dyld`start + 2840

I think Metal is not happy having 2 different buffers point to the same data?

It crashes on M1 Pro, M2 Ultra and M4 Max. Which chip do you have?

In any case, this patch fixes it:

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 1399a29b6..dd9afc6b0 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -465,7 +465,7 @@ static ggml_cgraph * clip_image_build_graph_siglip(clip_ctx * ctx, const clip_im
             V = ggml_cont(ctx0, ggml_permute(ctx0, V, 1, 2, 0, 3));
 
             struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
-            KQ = ggml_scale_inplace(ctx0, KQ, 1.0f / sqrtf((float)d_head));
+            KQ = ggml_scale(ctx0, KQ, 1.0f / sqrtf((float)d_head));
             KQ = ggml_soft_max_inplace(ctx0, KQ);
 
             struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V, KQ);
@@ -721,7 +721,7 @@ static ggml_cgraph * clip_image_build_graph_legacy(clip_ctx * ctx, const clip_im
                     ctx0, Q, positions, nullptr,
                     d_head/2, mrope_sections, GGML_ROPE_TYPE_VISION, 32768, 10000, 1, 0, 1, 32, 1);
             }
-            Q = ggml_scale_inplace(ctx0, Q, 1.0f / sqrt((float)d_head));
+            Q = ggml_scale(ctx0, Q, 1.0f / sqrt((float)d_head));
             Q = ggml_cont(ctx0, ggml_permute(ctx0, Q, 0, 2, 1, 3));
             Q = ggml_reshape_3d(ctx0, Q, d_head, num_positions, n_head * batch_size);
 
@@ -1033,7 +1033,7 @@ static ggml_cgraph * clip_image_build_graph_legacy(clip_ctx * ctx, const clip_im
                 }
 
                 struct ggml_tensor * Q = ggml_add(ctx0, ggml_mul_mat(ctx0, model.mm_model_attn_q_w, q), model.mm_model_attn_q_b);
-                Q = ggml_scale_inplace(ctx0, Q, 1.0f / sqrt((float)d_head));
+                Q = ggml_scale(ctx0, Q, 1.0f / sqrt((float)d_head));
                 struct ggml_tensor * K = ggml_add(ctx0, ggml_mul_mat(ctx0, model.mm_model_attn_k_w, k), model.mm_model_attn_k_b);
                 struct ggml_tensor * V = ggml_add(ctx0, ggml_mul_mat(ctx0, model.mm_model_attn_v_w, v), model.mm_model_attn_v_b);
                 // permute

@ngxson
Collaborator Author

ngxson commented Apr 5, 2025

It crashes on M1 Pro, M2 Ultra and M4 Max. Which chip do you have?

I'm using an M3 Max (a bit funny, but how do you have 1, 2, 4 but skip 3 😂)

Can you also give ggml_soft_max_ext a try? Something like this:

KQ = ggml_soft_max_ext(ctx0, KQ, nullptr, 1.0f / sqrtf((float)d_head), 0.0f);

@ggerganov
Member

I did some more debugging - it's not related to Metal; there is actually a legitimate bug somewhere. The reason is that the multi-rope op is not supported by the Metal backend, so it is offloaded to the CPU. When the next op is ggml_scale_inplace, something goes wrong, though I am not 100% sure what exactly. This is the specific code that triggers it:

// self-attention
{
    struct ggml_tensor * Q =
        ggml_add(ctx0, ggml_mul_mat(ctx0, model.layers[il].q_w, cur), model.layers[il].q_b);
    Q = ggml_reshape_4d(ctx0, Q, d_head, n_head, num_positions, batch_size);
    if (ctx->has_qwen2vl_merger) {
        Q = ggml_rope_multi(
            ctx0, Q, positions, nullptr,
            d_head/2, mrope_sections, GGML_ROPE_TYPE_VISION, 32768, 10000, 1, 0, 1, 32, 1);
    }
    Q = ggml_scale_inplace(ctx0, Q, 1.0f / sqrt((float)d_head));
    Q = ggml_cont(ctx0, ggml_permute(ctx0, Q, 0, 2, 1, 3));
    Q = ggml_reshape_3d(ctx0, Q, d_head, num_positions, n_head * batch_size);

Here is the problematic part of the generated graph. The node that crashes is # 42:

llama_context:      Metal compute buffer size =   299.75 MiB
llama_context:        CPU compute buffer size =    11.51 MiB
llama_context: graph nodes  = 1042
llama_context: graph splits = 114
## SPLIT #0: Metal # 1 inputs: [inp_raw (   3M)] 
node #  0 (    IM2COL):               node_0 (   1M) [Metal         ]:  v.patch_embd.weight (   1M) [Metal         ]      Metal#inp_raw#0 (   3M) [ NULL         ]
node #  3 (   MUL_MAT):               node_3 (   8M) [Metal         ]:           (reshaped) (   1M) [Metal         ] v.patch_embd.weight  (   1M) [Metal         ]
node #  6 (      CONT):  (reshaped) (permute (   8M) [Metal         ]:  (reshaped) (permute (   8M) [Metal         ]
node #  7 (    IM2COL):               node_7 (   1M) [Metal         ]: v.patch_embd.weight. (   1M) [Metal         ]      Metal#inp_raw#0 (   3M) [ NULL         ]
node # 10 (   MUL_MAT):              node_10 (   8M) [Metal         ]:           (reshaped) (   1M) [Metal         ] v.patch_embd.weight. (   1M) [Metal         ]
node # 13 (      CONT):  (reshaped) (permute (   8M) [Metal         ]:  (reshaped) (permute (   8M) [Metal         ]
node # 14 (       ADD):              node_14 (   8M) [Metal         ]:  (reshaped) (permute (   8M) [Metal         ]  (reshaped) (permute (   8M) [Metal         ]
node # 16 (      CONT):    (permuted) (cont) (   8M) [Metal         ]:           (permuted) (   8M) [Metal         ]
node # 20 (      CONT):  (permuted) (cont) ( (   8M) [Metal         ]:  (permuted) (cont) ( (   8M) [Metal         ]
node # 22 (      NORM):              node_22 (   8M) [Metal         ]:  (permuted) (cont) ( (   8M) [Metal         ]
node # 23 (       MUL):              node_23 (   8M) [Metal         ]:              node_22 (   8M) [Metal         ]   v.blk.0.ln1.weight (   5K) [Metal         ]
node # 24 (       ADD):              node_24 (   8M) [Metal         ]:              node_23 (   8M) [Metal         ]     v.blk.0.ln1.bias (   5K) [Metal         ]
node # 25 (   MUL_MAT):              node_25 (   8M) [Metal         ]: v.blk.0.attn_v.weigh (   3M) [Metal         ]              node_24 (   8M) [Metal         ]
node # 26 (       ADD):              node_26 (   8M) [Metal         ]:              node_25 (   8M) [Metal         ]  v.blk.0.attn_v.bias (   5K) [Metal         ]
node # 29 (      CONT):  (reshaped) (permute (   8M) [Metal         ]:  (reshaped) (permute (   8M) [Metal         ]
node # 31 (   MUL_MAT):              node_31 (   8M) [Metal         ]: v.blk.0.attn_k.weigh (   3M) [Metal         ]              node_24 (   8M) [Metal         ]
node # 32 (       ADD):              node_32 (   8M) [Metal         ]:              node_31 (   8M) [Metal         ]  v.blk.0.attn_k.bias (   5K) [Metal         ]
## SPLIT #1: CPU # 0 inputs
node # 34 (      ROPE):              node_34 (   8M) [  CPU         ]:           (reshaped) (   8M) [Metal         ]            positions (  25K) [  CPU         ]
## SPLIT #2: Metal # 1 inputs: [ (permuted) (   8M)] 
node # 36 (      CONT):    (permuted) (cont) (   8M) [Metal         ]:  Metal# (permuted)#0 (   8M) [ NULL         ]
node # 38 (   MUL_MAT):              node_38 (   8M) [Metal         ]: v.blk.0.attn_q.weigh (   3M) [Metal         ]              node_24 (   8M) [Metal         ]
node # 39 (       ADD):              node_39 (   8M) [Metal         ]:              node_38 (   8M) [Metal         ]  v.blk.0.attn_q.bias (   5K) [Metal         ]
## SPLIT #3: CPU # 0 inputs
node # 41 (      ROPE):              node_41 (   8M) [  CPU         ]:           (reshaped) (   8M) [Metal         ]            positions (  25K) [  CPU         ]
## SPLIT #4: Metal # 2 inputs: [node_41 (   8M)] [ (view) (permuted) (   8M)] 
node # 42 (     SCALE):               (view) (   8M) [Metal         ]:      Metal#node_41#0 (   8M) [ NULL         ]
node # 44 (      CONT):  (view) (permuted) ( (   8M) [Metal         ]: Metal# (view) (permu (   8M) [ NULL         ]
node # 46 (   MUL_MAT):              node_46 ( 167M) [Metal         ]:  (permuted) (cont) ( (   8M) [Metal         ]  (view) (permuted) ( (   8M) [Metal         ]
node # 47 (  SOFT_MAX):               (view) ( 167M) [Metal         ]:              node_46 ( 167M) [Metal         ]
node # 48 (   MUL_MAT):              node_48 (   8M) [Metal         ]:  (reshaped) (permute (   8M) [Metal         ]               (view) ( 167M) [Metal         ]
node # 51 (      CONT):  (reshaped) (permute (   8M) [Metal         ]:  (reshaped) (permute (   8M) [Metal         ]
...

Running with a debugger, the problem is that the buffer of the view_src of the scale node is bogus.

Simply changing the operation in the code above from ggml_scale_inplace to ggml_scale fixes the issue.

@slaren Do you have any guess what could be the root cause for this?

@slaren
Member

slaren commented Apr 5, 2025

ggml_backend_sched cannot deal with this situation correctly. The output from the CPU backend is never copied back to the Metal backend because it is using an inplace operation, so rather than using the output from the CPU it is using the original tensor. It is best if explicit in-place operations are only used when strictly necessary, e.g. to modify a static tensor such as the KV cache. ggml-alloc will automatically make operations in-place when possible.
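
A minimal sketch of that takeaway for graph-building code (illustrative, not the exact clip.cpp lines): write the plain op and let ggml-alloc decide whether it can run in place, or fold the scale into the softmax as a later commit of this PR does with ggml_soft_max_ext.

#include "ggml.h"

#include <cmath>

// illustrative helper: scaled-softmax attention scores without an explicit in-place op
static ggml_tensor * build_attn_scores(ggml_context * ctx0, ggml_tensor * K, ggml_tensor * Q, int d_head) {
    ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
    // plain scale: ggml-alloc makes it in-place on its own when that is safe,
    // and ggml_backend_sched can still copy a CPU-produced input back to Metal
    KQ = ggml_scale(ctx0, KQ, 1.0f / sqrtf((float) d_head));
    return ggml_soft_max(ctx0, KQ);
    // alternative: fuse the scale into the softmax
    // return ggml_soft_max_ext(ctx0, ggml_mul_mat(ctx0, K, Q), nullptr, 1.0f / sqrtf((float) d_head), 0.0f);
}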

@ggerganov
Member

Got it.

@ngxson I think this is good to merge.

but how do you have 1, 2, 4 but skip 3

The M3 seemed like too minor an upgrade. Not that the M4 was really that significant either, but my M1 laptop was getting old and I needed a new one.

@ngxson
Collaborator Author

ngxson commented Apr 5, 2025

Thanks for reviewing and testing this. I'll merge once CI is green.

In the last commit, I also fixed an issue with the Yi-VL model. Although the model passes the "NY Times" image test, it doesn't seem to be able to describe more complex scenes. I think the model is quite old anyway and, judging by the number of downloads, I doubt anyone is actually using it: https://huggingface.co/cmp-nct/Yi-VL-6B-GGUF

(Also leaving a link to the original PR here, for reference: #5093)

Anyway, it's truly a surprise to see how many models are supported by this clip/llava infrastructure. We currently have 11 different model archs in tests.sh.

Comment on lines +28 to +38
add_test "llama-gemma3-cli" "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
add_test "llama-llava-cli" "cmp-nct/Yi-VL-6B-GGUF:Q5_K"
add_test "llama-llava-cli" "guinmoon/MobileVLM-3B-GGUF:Q4_K_M"
add_test "llama-llava-cli" "THUDM/glm-edge-v-5b-gguf:Q4_K_M"
add_test "llama-llava-cli" "second-state/Llava-v1.5-7B-GGUF:Q2_K"
add_test "llama-llava-cli" "cjpais/llava-1.6-mistral-7b-gguf:Q3_K"
add_test "llama-llava-cli" "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M"
add_test "llama-minicpmv-cli" "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test "llama-minicpmv-cli" "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test "llama-minicpmv-cli" "openbmb/MiniCPM-o-2_6-gguf:Q4_0"
add_test "llama-qwen2vl-cli" "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
Collaborator Author

Btw @bartowski1182 do you have any other models to add to the list?

Contributor

I don't think so; I can't think of any other vision models off the top of my head, but I can take a closer look.

@ngxson ngxson merged commit 0364178 into ggml-org:master Apr 5, 2025
51 checks passed
@LostRuins
Collaborator

@ngxson I think this PR might have broken clip quantization: https://github.com/ggml-org/llama.cpp/blob/master/examples/llava/clip-quantize-cli.cpp no longer works after this (determined by bisecting).

colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 29, 2025
* refactor clip_init

* fix loading file

* fix style

* test ok

* better test with report

* add missing headers

* clarify

* add KEY_MM_PATCH_MERGE_TYPE

* remove bool has_* pattern

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <[email protected]>

* Update examples/llava/clip.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* use ggml_soft_max_ext

* refactor logging system

* add minicpm-v-o 2.6 for testing

* use nullptr everywhere

* fix Yi-VL model

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@LostRuins
Collaborator

Well, I found out why clip-quantize-cli was broken: in #12869, ctx_gguf was changed to a smart pointer. The way clip_init() is structured, the clip_model_loader goes out of scope before the rest of clip_model_quantize runs, and this PR also removes the new_clip->ctx_gguf = ctx; line.

A very ugly hack to keep clip_model_loader in scope while doing a clip_model_quantize: LostRuins@dbb6bbf

I don't know if such a band-aid fix would be accepted here, but I'd be happy to PR it if desired.
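
For illustration only (hypothetical names, not the real clip.cpp API): the underlying issue is a plain object-lifetime problem, i.e. whatever owns the GGUF context has to stay alive for as long as the quantization code reads it.

#include <memory>

// hypothetical stand-ins for gguf_context / clip_model_loader
struct gguf_ctx_stub {};

struct loader_stub {
    std::unique_ptr<gguf_ctx_stub> ctx_gguf; // destroyed together with the loader
};

static void quantize_stub(const loader_stub & loader) {
    (void) loader; // reads loader.ctx_gguf, so the loader must still be alive here
}

static void quantize_model() {
    loader_stub loader;                                  // keep the loader in scope ...
    loader.ctx_gguf = std::make_unique<gguf_ctx_stub>();
    quantize_stub(loader);                               // ... for the whole quantization
}   // loader (and its ctx_gguf) destroyed only afterwards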

@ngxson
Copy link
Collaborator Author

ngxson commented Apr 30, 2025

Tbh I don't really like the code of clip_model_quantize and it will be removed in the near future, to be replaced with something more manageable.

Also, the quantization code can live completely outside of clip.cpp.
