Feature Request: Support for Qwen2-VL #9246
Comments
+1 This would be another great addition!
This model is awesome.
I am looking forward to it very much.
+1 I am looking forward to it very much.
We can try llamafying it.
+1
7 similar comments
Any updates?
+1
5 similar comments
I cannot wait for it!
Maybe people should also express interest and ask the Qwen2-VL devs to implement it.
Looking forward to using llama.cpp for on-device inference.
Is anyone already working on this? If not, I would like to give it a try. |
+1
2 similar comments
Hi all, I currently build with cmake . -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=$(which nvcc) -DTCNN_CUDA_ARCHITECTURES=61. How do I build using -DGGML_SYCL=ON to get a build package like llama-b4218-bin-win-sycl-x64.zip? I'd appreciate any help, thanks!
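A rough sketch of a SYCL build, assuming the Intel oneAPI toolkit is installed (the setvars.sh path and the icx/icpx compiler names are assumptions based on the llama.cpp SYCL build docs; adjust them for your setup):

```shell
# Load the oneAPI environment (install path is an assumption; adjust as needed)
source /opt/intel/oneapi/setvars.sh

# Configure with the SYCL backend and the Intel oneAPI compilers
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# Build in Release mode
cmake --build build --config Release -j
```

The release zips like llama-b4218-bin-win-sycl-x64.zip are produced by the project's CI; building locally gives you the same binaries under build/bin.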
Thank you so much!
I have tried llama-qwen2vl-cli -m ~/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf --mmproj ~/Downloads/qwen2-vl-72b-instruct.f32.mmproj.gguf --image demos/images/03.jpg and got an error.
Same issue on M4 Max 128 GB.
Same on M3 Max 64 GB.
Same error on MBP M3 Max 128 GB.
Mac issues should be fixed with #10896 |
I'm getting an error when running images. UPD: setting a bigger context length seems to help.
Thanks! It now works on my M3 Max with #10896.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/10896/head:pr10896
git checkout pr10896
cmake -B build
cmake --build build --config Release -j
./build/bin/llama-qwen2vl-cli -m xxx.gguf --mmproj yyyy.gguf --image img.png -p "Describe the image."
I have tried the model.
I don't think it supports webp. Just convert to png or jpeg for now.
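A minimal conversion sketch, assuming Pillow is installed (the helper name and file paths are placeholders, not part of llama.cpp):

```python
from PIL import Image


def to_png(src_path: str, dst_path: str) -> None:
    """Convert an image (e.g. webp) to PNG, a format the CLI accepts."""
    # Convert to RGB first so palette/alpha images save cleanly as PNG.
    Image.open(src_path).convert("RGB").save(dst_path, format="PNG")
```

For example, run to_png("photo.webp", "photo.png") and then pass --image photo.png to llama-qwen2vl-cli.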
How do I merge the two GGUFs for Ollama? Merge the LLM GGUF and the vision encoder GGUF?
@gaussiangit Ollama doesn't support Qwen2-VL yet.
Any updates?
I'm able to successfully run llama-qwen2vl-cli to describe an image using the Qwen2-VL-7B model on Android (a Samsung S21+, to be specific). The operation takes a reasonable 3-4 minutes with quantization. I'll be looking to add Metal or Vulkan to further improve performance by using the GPU on phones, and to repeat this on iOS as well.
Hello @embedsri, could you please share more details on how you did that?
See this:
I did not quantize the mmproj model, but I tried quantizing the text model to q4_0; no difference.
Yes, this CLIP encoding is quite compute-intensive. Especially with the newest commits, where the GPU acceleration was deactivated (because it only ever worked on CUDA and everyone else started complaining), it takes some time. But I also think your image is quite large.
How did you set the context length? When the image already takes up 4070 tokens, maybe there is nothing left for the prompt and result. I'd first try downscaling the image and see what happens.
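A back-of-the-envelope sketch of why a large image eats so much context: Qwen2-VL's vision encoder uses 14-pixel patches and merges each 2x2 group of patches into one token, so an image costs roughly (width/28) * (height/28) tokens. This ignores the preprocessor's exact resizing rules (it snaps dimensions to multiples of 28 and caps total pixels), so treat it as an estimate:

```python
import math


def qwen2vl_image_tokens(width: int, height: int) -> int:
    """Estimate image token count: 14 px patches with 2x2 merging
    means each token covers a 28x28 pixel region."""
    return math.ceil(width / 28) * math.ceil(height / 28)
```

For a 2048x1536 photo this gives 4070 tokens, while downscaling to 1024x768 drops it to 1036, leaving far more of the context window for the prompt and the response.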
Hello, thank you. Choosing a smaller image helped.
@ggerganov it works from the command line, but not server-side via
see #8010 |
Hello, I encountered an issue while developing multi-batch inference: in multi-batch processing, the first query returns a correct answer, but the second one outputs garbled text. Does llama_batch support multi-batch inference for Qwen2-VL? @HimariO
When performing prefill on image tokens, the results from PyTorch and llama.cpp fail to align. Specifically, in the build_qwen2vl module, the query (q) and key (k) after applying ggml_rope_multi match those from PyTorch, but the kqv_out tensor generated by llm_build_kv cannot be aligned with PyTorch's results. @HimariO
The program does not crash, but the result is incorrect.
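For tracking down where the two runtimes diverge, a layer-by-layer comparison of dumped intermediate tensors is the usual approach: find the first tensor whose maximum absolute difference exceeds a tolerance. A minimal pure-Python sketch (the function names are illustrative; real dumps would come from PyTorch forward hooks and llama.cpp debug output, flattened to lists):

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat tensors."""
    assert len(a) == len(b), "shape mismatch"
    return max(abs(x - y) for x, y in zip(a, b))


def first_divergent_layer(ref_layers, test_layers, tol=1e-3):
    """Index of the first layer whose outputs differ beyond tol, or -1 if aligned."""
    for i, (ref, test) in enumerate(zip(ref_layers, test_layers)):
        if max_abs_diff(ref, test) > tol:
            return i
    return -1
```

If q and k match after ggml_rope_multi but kqv_out does not, comparing the intermediate attention scores and values this way narrows the mismatch to one specific op.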
Prerequisites
Feature Description
Qwen just released Qwen2-VL 2B & 7B under the Apache 2.0 License.
Motivation
SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
Possible Implementation
No response