Feature Request: Support for Qwen2-VL #9246
Comments
+1 This would be another great addition!
This model is awesome.
I am looking forward to it very much.
+1 I am looking forward to it very much.
We can try llamafying it.
+1
7 similar comments
Any updates?
+1
5 similar comments
I cannot wait for it!
Maybe people should also express interest and ask the Qwen2-VL devs to implement it.
Looking forward to using llama.cpp for on-device inference.
Is anyone already working on this? If not, I would like to give it a try. |
+1
2 similar comments
Hi all, I currently build with cmake . -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=$(which nvcc) -DTCNN_CUDA_ARCHITECTURES=61. How do I build using -DGGML_SYCL=ON to get a build package like llama-b4218-bin-win-sycl-x64.zip? I'd appreciate any help, thanks!
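A rough sketch of a SYCL build, assuming the Intel oneAPI toolkit is installed (the setvars.sh path and the icx/icpx compiler names are assumptions based on the llama.cpp SYCL build docs; adjust them for your setup):

```shell
# Load the oneAPI environment (install path is an assumption; adjust as needed)
source /opt/intel/oneapi/setvars.sh

# Configure with the SYCL backend and the Intel oneAPI compilers
cmake -B build -DGGML_SYCL=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# Build in Release mode
cmake --build build --config Release -j
```

The release zips like llama-b4218-bin-win-sycl-x64.zip are produced by the project's CI; building locally gives you the same binaries under build/bin.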
Thank you so much!
I have tried llama-qwen2vl-cli -m ~/Downloads/qwen2-vl-72b-instruct-q4_k_m.gguf --mmproj ~/Downloads/qwen2-vl-72b-instruct.f32.mmproj.gguf --image demos/images/03.jpg and got an error.
Same issue on M4 Max 128 GB.
Same on M3 Max 64 GB.
Same error on MBP M3 Max 128 GB.
Mac issues should be fixed with #10896 |
I'm getting an error when running images. UPD: setting a bigger context length seems to help.
Thanks! It now works on my M3 Max with #10896.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git fetch origin pull/10896/head:pr10896
git checkout pr10896
cmake -B build
cmake --build build --config Release -j
./build/bin/llama-qwen2vl-cli -m xxx.gguf --mmproj yyyy.gguf --image img.png -p "Describe the image."
I have tried the model.
I don't think it supports webp. Just convert to png or jpeg for now.
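A minimal conversion sketch, assuming Pillow is installed (the helper name and file paths are placeholders, not part of llama.cpp):

```python
from PIL import Image


def to_png(src_path: str, dst_path: str) -> None:
    """Convert an image (e.g. webp) to PNG, a format the CLI accepts."""
    # Convert to RGB first so palette/alpha images save cleanly as PNG.
    Image.open(src_path).convert("RGB").save(dst_path, format="PNG")
```

For example, run to_png("photo.webp", "photo.png") and then pass --image photo.png to llama-qwen2vl-cli.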
How do I merge the two GGUFs for Ollama? Merge the LLM GGUF and the vision encoder GGUF?
@gaussiangit Ollama doesn't support Qwen2-VL yet.
Any updates?
I'm able to successfully run llama-qwen2vl-cli to describe an image using the Qwen2-VL-7B model on Android (a Samsung S21+, to be specific). The operation takes a reasonable 3-4 minutes with quantization. I'll be looking to add Metal or Vulkan to further improve performance by using the GPU on phones, and to repeat this on iOS as well.
Hello @embedsri, could you please share more details on how you did that?
See this:
I did not quantize the mmproj model, but I tried quantizing the text model to q4_0; no difference.
Yes, this CLIP encoding is quite compute-intensive. Especially with the newest commits, where the GPU acceleration was deactivated (because it only ever worked on CUDA and everyone else started complaining), it takes some time. But I also think your image is quite large.
How did you set the context length? When the image already takes up 4070 tokens, maybe there is nothing left for the prompt and result. I'd first try downscaling the image and see what happens.
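A back-of-the-envelope sketch of why a large image eats so much context: Qwen2-VL's vision encoder uses 14-pixel patches and merges each 2x2 group of patches into one token, so an image costs roughly (width/28) * (height/28) tokens. This ignores the preprocessor's exact resizing rules (it snaps dimensions to multiples of 28 and caps total pixels), so treat it as an estimate:

```python
import math


def qwen2vl_image_tokens(width: int, height: int) -> int:
    """Estimate image token count: 14 px patches with 2x2 merging
    means each token covers a 28x28 pixel region."""
    return math.ceil(width / 28) * math.ceil(height / 28)
```

For a 2048x1536 photo this gives 4070 tokens, while downscaling to 1024x768 drops it to 1036, leaving far more of the context window for the prompt and the response.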
Hello, thank you. Choosing a smaller image helped.
@ggerganov it works from the command line, but not server-side via
see #8010 |
Hello, I encountered an issue while developing multi-batch inference: in multi-batch processing, the first query returns a correct answer, but the second one outputs garbled text. Does llama_batch support multi-batch inference for Qwen2-VL? @HimariO
When performing prefill on image tokens, the results from PyTorch and llama.cpp fail to align. Specifically, in the build_qwen2vl module, the query (q) and key (k) after applying ggml_rope_multi match those from PyTorch, but the kqv_out tensor generated by llm_build_kv cannot be aligned with PyTorch's results. @HimariO
The program does not crash, but the result is incorrect.
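For tracking down where the two runtimes diverge, a layer-by-layer comparison of dumped intermediate tensors is the usual approach: find the first tensor whose maximum absolute difference exceeds a tolerance. A minimal pure-Python sketch (the function names are illustrative; real dumps would come from PyTorch forward hooks and llama.cpp debug output, flattened to lists):

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat tensors."""
    assert len(a) == len(b), "shape mismatch"
    return max(abs(x - y) for x, y in zip(a, b))


def first_divergent_layer(ref_layers, test_layers, tol=1e-3):
    """Index of the first layer whose outputs differ beyond tol, or -1 if aligned."""
    for i, (ref, test) in enumerate(zip(ref_layers, test_layers)):
        if max_abs_diff(ref, test) > tol:
            return i
    return -1
```

If q and k match after ggml_rope_multi but kqv_out does not, comparing the intermediate attention scores and values this way narrows the mismatch to one specific op.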
Prerequisites
Feature Description
Qwen just released Qwen2-VL 2B & 7B under the Apache 2.0 License.
Motivation
SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
Possible Implementation
No response