[fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet #16048

Open, wants to merge 3 commits into main

Shafi-Hussain
Contributor

@Shafi-Hussain Shafi-Hussain commented Apr 4, 2025

Changes:

  • bumped opencv-python to 4.11.0.86 (in sync with x86)
  • (temporary) added an opencv patch for 4.11.0.86 (already fixed on the main branch; will be resolved in the next release)
  • building and installing hf-xet from source


github-actions bot commented Apr 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the ci/build label Apr 4, 2025
@Shafi-Hussain Shafi-Hussain force-pushed the ppc64le-docker-dependencies branch from 20c8e34 to 0fc52a2 Compare April 4, 2025 05:29
@Shafi-Hussain Shafi-Hussain marked this pull request as ready for review April 7, 2025 09:45
@Shafi-Hussain
Contributor Author

@DarkLight1337 could you please enable ppc64le specific tests for this PR to have them verified?

@DarkLight1337 DarkLight1337 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 7, 2025
@DarkLight1337
Member

Hmm. @khluu I think you need to enable this over at https://github.com/vllm-project/buildkite-ci

@Shafi-Hussain
Contributor Author

@khluu @DarkLight1337 is the build/test for ppc64le enabled for this PR?
For reference, I wanted these jobs to run: #15402

@DarkLight1337
Member

Unblocking ppc64le test. Sorry for the delay!

@DarkLight1337
Member

cc @khluu why are the scripts missing from the CI?

@khluu
Collaborator

khluu commented Apr 10, 2025

I think this branch is stale (it was created before the script was moved). Can you rebase this branch onto main?

@Shafi-Hussain Shafi-Hussain force-pushed the ppc64le-docker-dependencies branch from b2e8362 to 97264b7 Compare April 10, 2025 12:05
@hmellor
Member

hmellor commented Apr 10, 2025

I've unblocked the ppc64le test

Member

@hmellor hmellor left a comment

LGTM if the test passes

@hmellor
Member

hmellor commented Apr 10, 2025

It failed, but because of Triton this time, rather than because of hf_xet.

@Shafi-Hussain do you want to fix that in this PR or in a follow-up?

@Shafi-Hussain
Contributor Author

Hi @hmellor, I'd prefer to fix it in this PR itself. Could we please hold off on merging for a while? Also, if you could share exactly which step failed, that would help with debugging. Triton should not have been built for ppc64le.

Member

@hmellor hmellor left a comment

Sure, I'll change the review so that the merge button cannot be pressed.

@hmellor
Member

hmellor commented Apr 10, 2025

You can see the logs in https://buildkite.com/vllm/ci/builds/17352/steps?jid=01961f98-719f-47a7-b20a-04d69d7ad280

 => ERROR [vllmcache-builder 5/5] RUN --mount=type=cache,target=/root/.cache/uv     --mount=type=bind,from=torch-builder,source=/torchwheels/,target=/t  111.1s
------
 > [vllmcache-builder 5/5] RUN --mount=type=cache,target=/root/.cache/uv     --mount=type=bind,from=torch-builder,source=/torchwheels/,target=/torchwheels/,ro     --mount=type=bind,from=arrow-builder,source=/arrowwheels/,target=/arrowwheels/,ro     --mount=type=bind,from=cv-builder,source=/opencvwheels/,target=/opencvwheels/,ro     --mount=type=bind,src=.,dst=/src/,rw     source /opt/rh/gcc-toolset-13/enable &&     uv pip install /opencvwheels/*.whl /arrowwheels/*.whl /torchwheels/*.whl &&     sed -i -e 's/.*torch.*//g' /src/pyproject.toml /src/requirements/*.txt &&     uv pip install pandas pythran pybind11 /hf_wheels/*.whl &&     export PKG_CONFIG_PATH=$(find / -type d -name "pkgconfig" 2>/dev/null | tr '\n' ':') &&     uv pip install -r /src/requirements/common.txt -r /src/requirements/cpu.txt -r /src/requirements/build.txt --no-build-isolation &&     cd /src/ &&     uv build --wheel --out-dir /vllmwheel/ --no-build-isolation &&     uv pip install /vllmwheel/*.whl:
0.278 Using Python 3.12.9 environment at: /opt/vllm
0.965 Resolved 16 packages in 683ms
2.510    Building pillow==11.1.0
23.53       Built pillow==11.1.0
23.55 Prepared 6 packages in 22.58s
24.39 Installed 15 packages in 837ms
24.39  + filelock==3.18.0
24.39  + fsspec==2025.3.2
24.39  + jinja2==3.1.6
24.39  + markupsafe==3.0.2
24.39  + mpmath==1.3.0
24.39  + networkx==3.4.2
24.39  + numpy==2.2.4
24.39  + opencv-python-headless==4.11.0.86 (from file:///opencvwheels/opencv_python_headless-4.11.0.86-cp312-cp312-linux_ppc64le.whl)
24.39  + pillow==11.1.0
24.39  + pyarrow==19.0.1 (from file:///arrowwheels/pyarrow-19.0.1-cp312-cp312-linux_ppc64le.whl)
24.39  + sympy==1.13.1
24.39  + torch==2.6.0 (from file:///torchwheels/torch-2.6.0-cp312-cp312-linux_ppc64le.whl)
24.39  + torchaudio==2.6.0 (from file:///torchwheels/torchaudio-2.6.0-cp312-cp312-linux_ppc64le.whl)
24.39  + torchvision==0.21.0 (from file:///torchwheels/torchvision-0.21.0-cp312-cp312-linux_ppc64le.whl)
24.39  + typing-extensions==4.13.1
24.41 Using Python 3.12.9 environment at: /opt/vllm
25.16 Resolved 13 packages in 748ms
25.18 Downloading pythran (4.1MiB)
26.07    Building pandas==2.2.3
26.95  Downloaded pythran
108.6       Built pandas==2.2.3
108.6 Prepared 10 packages in 1m 23s
108.9 Installed 11 packages in 294ms
108.9  + beniget==0.4.2.post1
108.9  + gast==0.6.0
108.9  + hf-xet==1.0.3 (from file:///hf_wheels/hf_xet-1.0.3-cp37-abi3-linux_ppc64le.whl)
108.9  + pandas==2.2.3
108.9  + ply==3.11
108.9  + pybind11==2.13.6
108.9  + python-dateutil==2.9.0.post0
108.9  + pythran==0.17.0
108.9  + pytz==2025.2
108.9  + six==1.17.0
108.9  + tzdata==2025.2
109.5 Using Python 3.12.9 environment at: /opt/vllm
110.2   × No solution found when resolving dependencies:
110.2   ╰─▶ Because triton==3.2.0 has no wheels with a matching platform tag (e.g.,
110.2       `manylinux_2_34_ppc64le`) and you require triton==3.2.0, we can conclude
110.2       that your requirements are unsatisfiable.
110.2
110.2       hint: Wheels are available for `triton` (v3.2.0) on the following
110.2       platforms: `manylinux_2_17_x86_64`, `manylinux2014_x86_64`
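
The resolver error above comes down to platform tags: the only published `triton` 3.2.0 wheels are tagged for x86_64, so nothing can match a ppc64le host. As a rough illustration (not uv's actual resolution logic, which also checks Python, ABI, and glibc versions), the following Python sketch extracts platform tags from wheel filenames and checks whether any of them targets a given architecture. The function names are hypothetical, and the triton filename is an assumed example assembled from the tags listed in the hint; the torch filename is taken from the log.

```python
def wheel_platform_tags(filename: str) -> list[str]:
    """Return the platform tags encoded in a wheel filename.

    Wheel filenames follow
    {name}-{version}(-{build})?-{python}-{abi}-{platform}.whl,
    and the platform field may hold several tags joined by dots.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    stem = filename[: -len(".whl")]
    platform_field = stem.split("-")[-1]  # last dash-separated field
    return platform_field.split(".")


def has_wheel_for_arch(filenames: list[str], arch: str) -> bool:
    """True if any wheel carries a platform tag for `arch` (or is pure-Python `any`).

    Simplification: only the architecture suffix is compared.
    """
    return any(
        tag == "any" or tag.endswith("_" + arch)
        for name in filenames
        for tag in wheel_platform_tags(name)
    )


# Assumed filename, built from the tags in the resolver hint.
triton_wheels = [
    "triton-3.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
]
# Filename as it appears in the build log.
torch_wheels = ["torch-2.6.0-cp312-cp312-linux_ppc64le.whl"]

print(has_wheel_for_arch(triton_wheels, "ppc64le"))  # no ppc64le tag present
print(has_wheel_for_arch(torch_wheels, "ppc64le"))   # locally built ppc64le wheel
```

Under these assumptions, the locally built torch wheel matches ppc64le while the triton wheel does not, which is why pinning `triton==3.2.0` makes the requirements unsatisfiable on this platform.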

@Shafi-Hussain
Contributor Author

@hmellor I created an issue for Triton:
#16413

@Shafi-Hussain
Contributor Author

Shafi-Hussain commented Apr 11, 2025

@hmellor can we have this PR merged, since the Triton conflicts have been resolved?

@hmellor
Member

hmellor commented Apr 11, 2025

Can you rebase so that we can run the test again to verify that this fix works?

Removing pinned setuptools version

Signed-off-by: Md. Shafi Hussain <[email protected]>
@Shafi-Hussain Shafi-Hussain force-pushed the ppc64le-docker-dependencies branch from 97264b7 to 7172e72 Compare April 11, 2025 11:06
@hmellor
Member

hmellor commented Apr 11, 2025

Test unblocked and running here

@hmellor
Member

hmellor commented Apr 11, 2025

Test failed, this time seemingly because of opentelemetry

Signed-off-by: Md. Shafi Hussain <[email protected]>
@Shafi-Hussain
Contributor Author

@hmellor I fixed the errors. It doesn't look like the ppc64le build got triggered. Could you please check and confirm?

@hmellor
Member

hmellor commented Apr 11, 2025

Labels: ci/build, ready
4 participants