Skip to content

Avoid mistakenly picking Gaudi/HPU if XPU is requested. #11018

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

janimo
Copy link
Contributor

@janimo janimo commented Dec 9, 2024

In setup.py _is_hpu() will return true when /dev/accel/accel0 is present but that can happen with XPU devices also (Intel iGPU).

This change allows VLLM_TARGET_DEVICE="xpu" to override that and proceed with an XPU install.

Maybe a simpler is_hpu() that solely relies on VLLM_TARGET_DEVICE="hpu" would be cleaner but the current setup is probably needed for some existing setups.

Copy link

github-actions bot commented Dec 9, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@jikunshang
Copy link
Contributor

nice catch! I think you are using some latest Intel CPU with GPU and NPU, and /dev/accel/accel0 is actually a NPU device. when you want to install vllm-XPU, is_hpu method also returns True, right.
cc @kzawora-intel , any better approach to determine whether HPU (rather NPU or other accelerator device) exists on host?

@janimo
Copy link
Contributor Author

janimo commented Dec 13, 2024

@jikunshang updated the change to be simpler and more generic. On the same computer with that NPU present, openvino, xpu and cpu are all valid target names and are all preceded by the is_hpu() check in setup.py and thus cannot be selected.

@jikunshang
Copy link
Contributor

@jikunshang updated the change to be simpler and more generic. On the same computer with that NPU present, openvino, xpu and cpu are all valid target names and are all preceded by the is_hpu() check in setup.py and thus cannot be selected.

I feel the best fix is making is_hpu_available correct. Your fix works abolutely, but user need add extra VLLM_TARGET_DEVICE=hpu on hpu device with this fix. Any comments @kzawora-intel

@janimo
Copy link
Contributor Author

janimo commented Dec 15, 2024

@jikunshang you're right because VLLM_TARGET_DEVICE is always set, defaulting to cuda :(

I think explicitly requiring setting it for HPU as it is done with most other targets would be best. If there are setups where vllm on gaudi is being tested relying only on hw detection, they would need to be updated and that may be too complicated (if CI/production systems not keeping up closely with changes in vllm)

Other ways of fixing this:

  1. have target be empty if not set, but this will probably break many more setups that just assume cuda.

  2. return false if target is cpu/xpu/openvino which are situations where hpu can be misdetected. This would be a good localized fixed probably.

  3. put the checks for is_hpu at the end, after cpu/xpu/opnevino in the two places in setup.py where such tests are run, with a comment explaining that order is important.

I still find the cleanest would be every target being selected explicitly. I feel the heuristics for the hw check may get even more complex with different types of hw and naming of devices in the future unless there is a single good indicator of Gaudi hw being present (like a line in dmesg)

@kzawora-intel
Copy link
Contributor

Hi, I opened a PR #12046 fixing this issue - it definitely is a bug in _is_hpu that it returns True for non-HPU platforms, it should not do that. To prevent any future misdetections, I skipped autodetection if different platform is requested explicitly with VLLM_TARGET_DEVICE. and if autodetection continues, it no longer uses conditions that could pass on different platforms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants