[Bug]: When using the VLLM framework to load visual models, CPU memory overflow occurs while continuously processing data with images. #12973
@woshiwanlei1 Can you try specifying
I'll come back and give it a try, thank you.
Can you also try out #14336 and see if it alleviates the issue?
If I set disable_mm_preprocessor_cache=True, the memory is stable but inference is quite slow. So, I decided to try your fix and pulled the latest docker image.
I think this issue is not resolved yet. Loading was fine. Then I ran 3 tests with 100 images each time. These are the RAM usage results from the instance (an EC2 g5.xlarge):

- process 1: 15.9%
- 1st test (100 images): 20.1%
- 2nd test (same 100 images): 20.1%
- 3rd test (different 100 images): 40%, then it didn't finish; all RAM was used and the instance crashed

It doesn't seem to stabilize after some Nth number of calls, but rather grows with every unseen image. Or am I missing something? @woshiwanlei1 Could you please elaborate on this:
I assume you mean that vLLM may still crash if you don't set that option. If you are using the latest commit, #14805 has already been merged, so you can try setting VLLM_MM_INPUT_CACHE_GiB.
Thank you. With VLLM_MM_INPUT_CACHE_GiB, memory usage is more controllable.
A note for anyone else encountering something like this: the env var is actually
@cchadowitz / @ktobah, can you please let me know how I can set this environment variable (VLLM_MM_INPUT_CACHE_GIB)?
You just set it like any other environment variable. For example, to limit the cache size to 4 GiB when running vLLM:
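A sketch of one way to do this, assuming the model from this issue; `your_script.py` is a placeholder, and the variable casing follows the name used earlier in this thread, so verify it against `vllm/envs.py` for your installed version:

```bash
# Limit the multimodal input cache to 4 GiB for this run.
# your_script.py stands in for whatever launches vLLM on your side.
VLLM_MM_INPUT_CACHE_GiB=4 python your_script.py

# Or, when using the OpenAI-compatible server:
VLLM_MM_INPUT_CACHE_GiB=4 vllm serve Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4
```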
I have tried setting it this way (`os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "4"`), since I am using the vLLM wrapper, but it is still consuming high CPU RAM.
You should set it before importing vLLM. It's preferable to set it on the command line. If CPU usage is still high, you can try disable_mm_preprocessor_cache.
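For reference, a minimal sketch of that ordering, with the variable set before vLLM is imported (the variable name follows the earlier comments in this thread, and `disable_mm_preprocessor_cache` is the engine argument discussed above; verify both against your installed version):

```python
import os

# Set the cache limit before importing vLLM, as advised above.
os.environ["VLLM_MM_INPUT_CACHE_GiB"] = "4"

from vllm import LLM  # import vLLM only after the variable is set

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
    # Optional fallback: disable the multimodal preprocessor cache entirely,
    # trading preprocessing speed for flatter host memory usage.
    disable_mm_preprocessor_cache=True,
)
```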
I have tried disable-mm-preprocessor-cache; it is still consuming high RAM with each additional request.
I have also tried setting the variable on the command line before execution: export VLLM_MM_INPUT_CACHE_GIB=4
cc @ywang96 I think you're working on this issue with the frontend?
Hello @kuladeephx! This issue has been fixed on the main branch. Please see a post-mortem/analysis in the comment here.
@ywang96, I have tried with the latest main branch and I am still facing the same issue.
I have tried with vllm serve (OpenAI-server based) and did not face the memory issue when using disable-mm-preprocessor-cache.
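For reference, a sketch of that invocation (the model name is taken from this issue, and the flag mirrors the `disable_mm_preprocessor_cache` engine argument):

```bash
# Serve the model with the multimodal preprocessor cache disabled.
vllm serve Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 --disable-mm-preprocessor-cache
```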
@Mitix-EPI, are you using the OpenAI-server-based approach or the vLLM wrapper?
Have you solved the problem? I'm using 0.7.x and facing a similar issue.
@kuladeephx The vLLM wrapper with the V1 API.
The problem I encountered
After deploying Qwen2-VL-7B-Instruct-GPTQ-Int4 with vLLM, continuous requests from clients cause CPU memory to keep rising. Is it because some memory has not been reclaimed?
My specific usage scenario is:
I have two GPUs. When I use the Ray framework for distributed deployment, CPU memory usage grows as more VL requests are processed, eventually leading to actor crashes in Ray.
I have tested the native loading method for Qwen2-VL-7B-Instruct-GPTQ-Int4 and it does not cause CPU memory overflow. Once the vLLM framework is used for loading, CPU memory keeps growing until it overflows.
[Special note]: When you test, be sure to change the image each time so that you can clearly see the CPU memory growth. If only the same image is used, it will only leak once, making the memory overflow appear inconspicuous.
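One way to make this growth visible per request (not part of the original report) is to log the process's resident memory after each generation, for example with psutil:

```python
import os
import psutil

proc = psutil.Process(os.getpid())

def log_rss(tag: str) -> None:
    # Print this process's resident set size in GiB.
    print(f"[{tag}] RSS = {proc.memory_info().rss / 1024**3:.2f} GiB")
```

Calling log_rss() before and after each request makes it easy to see whether host memory stabilizes or keeps growing with every new image.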
My code and environment
Here is my code
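A minimal sketch of the kind of loop described above, not the author's original script; the prompt template, image paths, sampling parameters, and tensor_parallel_size=2 are assumptions:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumption: split the model across the two GPUs mentioned above.
llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4",
    tensor_parallel_size=2,
)
sampling_params = SamplingParams(max_tokens=256)

# Assumed Qwen2-VL chat template with a single image placeholder.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)

# Use a *different* image on each request, as noted above, so that any
# host-memory growth is clearly visible.
for path in ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]:
    image = Image.open(path).convert("RGB")
    outputs = llm.generate(
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        sampling_params,
    )
    print(outputs[0].outputs[0].text)
```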
This is the vLLM version information
Name: vllm
Version: 0.7.2
This is my GPU info
This is the memory leak information