[RFC]: copy pynvml code into vllm codebase #12977
Comments
The approach I take here is inspired by …
Actually, it's not a bad idea, especially since we only need to copy one file. PyTorch also leverages similar ideas to solve build or dependency issues, such as using miniz as its zip library.
This should have been done a long time ago.
Closing as completed by #12963.
This code is copied from nvidia-ml-py and as per vllm-project#12977 we will need to periodically sync the code to pick up bugfixes. Signed-off-by: Mark McLoughlin <[email protected]>
Motivation.
We have suffered a lot from the `pynvml` module recently; see #12847 for an example. `libnvml.so` is the library behind `nvidia-smi`, and `pynvml` is a Python wrapper around it. We use it to query GPU status without initializing a CUDA context in the current process.
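For context, here is a minimal sketch of that kind of NVML query; the device index and the printed fields are illustrative, not vLLM's actual call sites:

```python
# Minimal sketch: read GPU memory via NVML without creating a CUDA
# context in this process. Device index 0 is an illustrative assumption.
import pynvml  # the module provided by the official nvidia-ml-py package

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"free={mem.free} total={mem.total} bytes")
finally:
    pynvml.nvmlShutdown()
```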
Historically, there are two packages that provide a module named `pynvml`:

- `nvidia-ml-py` (https://pypi.org/project/nvidia-ml-py/): the official wrapper. It is a dependency of vLLM and is installed when users install vLLM. It provides a Python module named `pynvml`.
- `pynvml` (https://pypi.org/project/pynvml/): an unofficial wrapper. Prior to version 12.0, it also provides a Python module named `pynvml`, and therefore conflicts with the official one. What's worse, its module is a Python package, which takes priority over the official one, a standalone Python file. This causes errors when both are installed. Starting from version 12.0, it migrated to a new module named `pynvml_utils` to avoid the conflict.

To make vLLM work, we have to ensure that either no `pynvml` package is installed, or the installed `pynvml` package has version 12.0 or higher. However, neither is a doable solution:

- We cannot ask users to uninstall `pynvml` just to make vLLM work.
- If we pin `pynvml==12.0` as vLLM's dependency, vLLM works, but other libraries break. Notably, deepspeed depends on `pynvml==11.5.0`: https://github.com/ray-project/ray/blob/9e3ec5972cd952d2b50f3b20abc24ced5abb8b54/python/requirements_compiled.txt#L1611

The packaging situation is so confusing that many community libraries don't know `nvidia-ml-py` is the official one and depend on `pynvml` instead, e.g. https://github.com/Sygil-Dev/sygil-webui/blob/d88fa9e8c4d9cefbbfb0b445ad79d4ddb85c8e36/requirements.txt#L17. What's worse, even the official NVIDIA container `nvcr.io/nvidia/pytorch:25.01-py3` uses the unofficial `pynvml<12.0`.

To summarize, we are in dependency hell due to these historically confusing packages.
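To make the constraint above concrete, a hypothetical diagnostic (not part of this RFC; the function name is made up) could check for a conflicting `pynvml` distribution using only the standard library:

```python
# Hypothetical diagnostic: detect whether the unofficial "pynvml"
# distribution is installed at a version that shadows the official
# nvidia-ml-py module (any release before 12.0 does).
from importlib.metadata import PackageNotFoundError, version


def unofficial_pynvml_conflicts() -> bool:
    try:
        v = version("pynvml")  # queries the unofficial PyPI distribution
    except PackageNotFoundError:
        return False  # only nvidia-ml-py (if anything) provides the module
    major = int(v.split(".")[0])
    return major < 12  # pre-12.0 releases still ship a "pynvml" module


if __name__ == "__main__":
    print("conflicting pynvml installed:", unofficial_pynvml_conflicts())
```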
Proposed Change.
To solve the problem, I propose to copy the code from `nvidia-ml-py` into vLLM and import it as `vllm.third_party.pynvml`. See #12963 for the prototype.

The solution only rescues us from the dependency hell; we don't need to maintain the code. If there are bugfixes in `nvidia-ml-py` in the future, we can periodically sync the code.

This is the first time we copy a whole package into vLLM, so I'm creating a separate directory, `vllm/third_party`, to hold the code.

This RFC is kept for future reference, for whenever we need to copy code into `vllm/third_party`.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...