[Bug]: Stuck When Launching Llama-4-Maverick-17B-128E-Instruct-FP8 #16152
Comments
Same for me.
@HermitSun can you try seeing if it works without
For me, it still happens.
After waiting for over 10 mins, it did work fine. It seems the model loading step took quite a while.
@HermitSun @sarckk Could you please specify the version of runai-model-streamer you're using? If that is the problem, version 0.13.0 should fix it.
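For reference, one way to check which streamer version is installed locally (a minimal sketch; the distribution names tried here are assumptions and may not match your environment):

```python
# Minimal sketch: print the installed streamer version. The distribution
# names tried below are assumptions and may differ in your environment.
from importlib.metadata import PackageNotFoundError, version

for name in ("runai-model-streamer", "runai_model_streamer"):
    try:
        print(name, version(name))
        break
    except PackageNotFoundError:
        continue
else:
    print("streamer package not found")
```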
I'm on version 0.13.0.
Over 10 minutes is definitely not normal. I'm not able to reproduce a load time this long on my end.
Could this be related to the file system I'm using? I chose runai_streamer because GPFS has performance issues when loading safetensors.
Hey @HermitSun, the reason I suspect it takes this long is that the loader currently supports neither sharded-mode loading nor multi-GPU loading. What actually happens is that every rank (8 in your case) reads the entire model (~400 GB) and keeps only a small portion of the files. The solution is to make sure the data is read once and delivered to the appropriate rank. The following PR adds support for sharded mode with the RunAI Model Streamer: #16317. We are also working on a more sophisticated mechanism for reading to multiple GPUs in an optimized way, without the need to shard the model first. Thanks, Omer
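To make the read amplification concrete, here is a rough back-of-the-envelope sketch (the rank count and checkpoint size come from this thread; this is not the streamer's actual code):

```python
# Rough back-of-the-envelope sketch of the read volume, not the loader's code.
# Assumes 8 tensor-parallel ranks and a ~400 GB checkpoint, as in this issue.
NUM_RANKS = 8
CHECKPOINT_GB = 400

# Non-sharded loading: every rank scans the full checkpoint and keeps only
# the tensors it owns, so the files are read once per rank.
non_sharded_read_gb = NUM_RANKS * CHECKPOINT_GB   # 3200 GB pulled from storage

# Sharded loading (what the PR referenced above enables): each rank reads
# only the files belonging to its shard, so the checkpoint is read ~once.
sharded_read_gb = CHECKPOINT_GB                   # ~400 GB pulled from storage

print(f"non-sharded: ~{non_sharded_read_gb} GB read")
print(f"sharded:     ~{sharded_read_gb} GB read")
```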
@HermitSun if the safetensors files in GPFS have a direct I/O policy set, then page caching is disabled. With direct I/O, the files are read directly from storage 8 times instead of being served from the page cache.
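For illustration, a minimal sketch of the difference on Linux (the path is a placeholder, not from this issue): a buffered read populates the page cache for later readers, while an `O_DIRECT` read bypasses it, so every rank goes back to storage:

```python
# Sketch only (Linux): contrasts a buffered read with an O_DIRECT read.
# "/models/some.safetensors" is a placeholder path, not taken from the issue.
import mmap
import os

path = "/models/some.safetensors"
block = 1 << 20  # O_DIRECT needs block-aligned sizes; 1 MiB is a safe multiple

# Buffered read: goes through the page cache, so another process reading the
# same file shortly afterwards can be served from RAM instead of from GPFS.
with open(path, "rb") as f:
    f.read(block)

# Direct read: bypasses the page cache, so each of the 8 ranks that opens the
# file this way pulls the bytes from storage again.
fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
try:
    buf = mmap.mmap(-1, block)  # anonymous mmap gives a page-aligned buffer,
    os.readv(fd, [buf])         # which O_DIRECT requires
finally:
    os.close(fd)
```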
Also notice that in the command you run, the model is referenced by its Hugging Face name rather than a local path. That means the process will download the model from the internet if it doesn't already exist in the cache, and only then load it to the GPU. That's definitely not the recommended way to get the best performance. In your timing, you don't take the download process into account, right?
The model has already been downloaded to GPFS; I simulated the download behavior by linking the cache from GPFS to the required paths. To rule out any potential impact from symlinks (or anything else), I used the following command to read from the absolute path on GPFS:
vllm serve /models/preset/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -tp 8 --max-model-len 128000 --override-generation-config='{"attn_temperature_tuning": true}' --kv-cache-dtype fp8
Since I'm using Kubernetes, each start should theoretically be in a clean environment, and I'm able to consistently reproduce this result. In fact, loading the model weights wasn't slow; it seems some steps after the weights are read took quite a while. I'm wondering what happens after the weights are loaded 🤔:
The following result is from restarting the service within the same pod, where local cache should already exist — and it's noticeably faster:
I've noticed that safetensors uses mmap during loading — is this always done with direct I/O?
While mmap interacts with the page cache by default, it doesn't inherently enforce or prevent direct I/O.
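As a small illustration (the path is again a placeholder), mapping a safetensors file with Python's `mmap` faults pages in through the page cache; whether GPFS applies direct I/O is a filesystem policy, not something the mapping itself controls:

```python
# Sketch only: reading a safetensors header through mmap. The pages are faulted
# in via the kernel page cache; mmap has no O_DIRECT flag of its own, so any
# direct-I/O behaviour here would come from the filesystem policy, not mmap.
import mmap

path = "/models/some.safetensors"  # placeholder path

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        header_len = int.from_bytes(mm[:8], "little")  # safetensors: 8-byte header length prefix
        print("JSON header is", header_len, "bytes")
```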
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
When launching the vLLM server with the following command from the documentation, the process gets stuck after the model finishes loading; it has now been stuck for over ten hours.
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -tp 8 --max-model-len 128000 --load-format runai_streamer --override-generation-config='{"attn_temperature_tuning": true}' --kv-cache-dtype fp8
Logs: