Skip to content

llava: Fix cpu-only clip image encoding sefault #12907

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 12, 2025

Conversation

mattjcly
Copy link
Contributor

@mattjcly mattjcly commented Apr 11, 2025

Overview

Properly manage/use backend unique_ptrs and the memory that they point to in clip.cpp when only using cpu backend. Note that I've observed CPU-only to perform better than Metal on my M3 Pro 36GB for image encoding in certain situations.

Issue

In clip.cpp, backend_cpu is moved from here:

backend = std::move(backend_cpu);

and then pushed to backend_ptrs here:

backend_ptrs.push_back(backend_cpu.get());

which causes a segfault in ggml-backend.cpp when that backend (NULL) is attempted to have its device member accessed:

ggml_backend_buffer_type_t ggml_backend_get_default_buffer_type(ggml_backend_t backend) {
    return ggml_backend_dev_buffer_type(backend->device);
}

This PR fixes that, and then also properly checks which backend unique_ptr is pointing to a cpu backend to set n_threads.

Repro

The issue can be observed by setting use_gpu to false in gemma3-cli.cpp.

Click to see segfault in output
-> % /Users/matt/Workspace/forks/llama.cpp/build/bin/llama-gemma3-cli -m /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf --mmproj /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/mmproj-model-f16.gguf --image /Users/matt/Documents/dice.jpg -p "what is this"
build: 5120 (e8a62631) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 444 tensors from /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 4b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  37:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_K:  204 tensors
llama_model_loader: - type q6_K:   35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 2.31 GiB (5.12 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma 3 4b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors: Metal_Mapped model buffer size =  2368.18 MiB
load_tensors:   CPU_Mapped model buffer size =   525.00 MiB
.................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     1.00 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init:      Metal KV buffer size =   544.00 MiB
llama_context: KV self size  =  544.00 MiB, K (f16):  272.00 MiB, V (f16):  272.00 MiB
llama_context:      Metal compute buffer size =   517.00 MiB
llama_context:        CPU compute buffer size =    21.01 MiB
llama_context: graph nodes  = 1435
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_ctx: CLIP using CPU backend
[1]    56301 segmentation fault  /Users/matt/Workspace/forks/llama.cpp/build/bin/llama-gemma3-cli -m  --mmproj

Fix

Output with these changes

Click to see correct output
-> % /Users/matt/Workspace/forks/llama.cpp/build/bin/llama-gemma3-cli -m /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf --mmproj /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/mmproj-model-f16.gguf --image /Users/matt/Documents/dice.jpg -p "what is this"
build: 5121 (a7fb802d) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.3.0
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 27647 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 444 tensors from /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 4b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 4B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 4b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 2560
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 34
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 10240
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 8
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 4
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  37:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  205 tensors
llama_model_loader: - type q4_K:  204 tensors
llama_model_loader: - type q6_K:   35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 2.31 GiB (5.12 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2560
print_info: n_layer          = 34
print_info: n_head           = 8
print_info: n_head_kv        = 4
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 10240
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 4B
print_info: model params     = 3.88 B
print_info: general.name     = Gemma 3 4b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 35/35 layers to GPU
load_tensors: Metal_Mapped model buffer size =  2368.18 MiB
load_tensors:   CPU_Mapped model buffer size =   525.00 MiB
.................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Pro
ggml_metal_init: picking default device: Apple M3 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = false
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 28991.03 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     1.00 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 34, can_shift = 1
init:      Metal KV buffer size =   544.00 MiB
llama_context: KV self size  =  544.00 MiB, K (f16):  272.00 MiB, V (f16):  272.00 MiB
llama_context:      Metal compute buffer size =   517.00 MiB
llama_context:        CPU compute buffer size =    21.01 MiB
llama_context: graph nodes  = 1435
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_ctx: CLIP using CPU backend
clip_model_loader: model name:   
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    439
clip_model_loader: n_kv:         16

load_hparams: text_encoder:       0
load_hparams: vision_encoder:     1
load_hparams: llava_projector:    0
load_hparams: minicpmv_projector: 0
load_hparams: minicpmv_version:   2
load_hparams: glm_projector:      0
load_hparams: model size:         811.79 MiB
load_hparams: metadata size:      0.15 MiB
alloc_compute_meta:        CPU compute buffer size =  1128.81 MiB
main: /Users/matt/.cache/lm-studio/models/ggml-org/gemma-3-4b-it-GGUF/gemma-3-4b-it-Q4_K_M.gguf
encoding image...
image encoded in 25698 ms
image decoded in 465 ms
Based on the image, those are **wooden dice**. 

You can see they have the standard six sides with dots representing the numbers 1 through 6. 

They're arranged on a dark, likely black, surface.
ggml_metal_free: deallocating

Copy link
Collaborator

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more careful examine, I think the backend_cpu is only used by ggml_backend_cpu_set_n_threads, so probably it's not that useful to use a smart pointer for it.

I'll simplify this by remove the smart ptr altogether for ggml_backend_t

@ngxson
Copy link
Collaborator

ngxson commented Apr 11, 2025

Could you give this a try @mattjcly ? Thanks!!

@mattjcly
Copy link
Contributor Author

That caused a gpu enabled failure:

clip_ctx: CLIP using Metal backend
/Users/matt/Workspace/forks/llama.cpp/ggml/src/ggml-backend.cpp:1458: GGML_ASSERT(ggml_backend_dev_type(ggml_backend_get_device(backends[n_backends - 1])) == GGML_BACKEND_DEVICE_TYPE_CPU) failed
[1]    60661 abort      /Users/matt/Workspace/forks/llama.cpp/build/bin/llama-gemma3-cli -m  --mmproj

Put up the fix (tiny typo). @ngxson lmk if that looks correct to you.

Copy link
Collaborator

@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this looks perfect, thanks 🔥

(merging once the CI is green)

@ngxson ngxson merged commit e59ea53 into ggml-org:master Apr 12, 2025
51 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
* llava: Fix cpu-only clip image encoding

* clip : no smart ptr for ggml_backend_t

* Fix for backend_ptr push_back

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants