[RHOAIENG-9004] Adjust existing test and workflow for GPU testing #575

Merged: 1 commit merged on Jul 2, 2024

sutaakar (Contributor) commented Jun 20, 2024

Issue link

RHOAIENG-9004

What changes have been made

Adjust the GitHub e2e workflow to set up and install the NVIDIA operator.
Convert the Ray tests to use the MNIST fashion test case on GPU. Adjust TestMNISTPyTorchAppWrapper to use GPU.
As a workaround for having just one GPU, the tests are executed sequentially.
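
To illustrate why sequential execution is needed with a single GPU, here is a minimal Python sketch. It is not how this PR implements it; `gpu_lock`, `run_on_gpu`, and `run_all` are hypothetical names, and the "tests" are simulated. The lock stands in for the one GPU, so even tests started concurrently end up running one at a time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the single GPU available on the cluster.
gpu_lock = threading.Lock()


def run_on_gpu(name, events):
    """Run one simulated GPU test while holding the GPU lock."""
    with gpu_lock:
        events.append(("start", name))
        # A real test would submit its workload and wait for completion here.
        events.append(("end", name))


def run_all(names):
    """Start all tests at once; the lock forces them to run one at a time."""
    events = []
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        for name in names:
            pool.submit(run_on_gpu, name, events)
    # Leaving the `with` block waits for every submitted test to finish.
    return events
```

Every "start" event is immediately followed by the matching "end" event, which is the property a single-GPU setup needs.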

I also had to change how the CFO image is uploaded: it is now loaded into the cluster with the KinD CLI.
The reason is that the GPU image uses Docker and doesn't interact well with the locally created registry.
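
As a rough sketch of what loading an image with the KinD CLI looks like, the helper below builds a `kind load docker-image` invocation. The function name, the image tag, and the cluster name are assumptions for illustration, not values taken from this PR's workflow.

```python
import subprocess


def kind_load_command(image, cluster_name="kind"):
    """Build the KinD CLI call that copies a local Docker image into the cluster nodes."""
    return ["kind", "load", "docker-image", image, "--name", cluster_name]


if __name__ == "__main__":
    # Hypothetical image tag; replace it with the locally built operator image.
    cmd = kind_load_command("example.local/codeflare-operator:dev")
    subprocess.run(cmd, check=True)  # requires the kind CLI and a running cluster
```

Loading the image this way bypasses the local registry entirely, since the image is copied straight onto the cluster nodes.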

Edit:
I'm wondering whether we should add GPU tests alongside the existing non-GPU ones or replace them. Having both increases the number of tests; on the other hand, non-GPU tests are faster and can be run anywhere.

Verification steps

Run the e2e PR check: a GPU is made available on the KinD cluster and the tests use it.
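
One possible way to confirm that a GPU is exposed to the cluster is to sum the `nvidia.com/gpu` allocatable resource across nodes. The sketch below parses the output of `kubectl get nodes -o json`; the function name is hypothetical, and this check is not part of the PR itself.

```python
import json

# Extended resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def allocatable_gpus(nodes_json):
    """Sum allocatable GPUs over all nodes in `kubectl get nodes -o json` output."""
    nodes = json.loads(nodes_json)
    total = 0
    for node in nodes.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are serialized as strings; nodes without a GPU omit the key.
        total += int(allocatable.get(GPU_RESOURCE, "0"))
    return total
```

A result of at least 1 would indicate that the scheduler can place GPU-requesting pods on the cluster.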

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

sutaakar force-pushed the gpu branch 6 times, most recently from bb04f3e to 1916b7f, July 1, 2024 14:37
sutaakar marked this pull request as draft July 1, 2024 14:38
sutaakar force-pushed the gpu branch 7 times, most recently from 532edca to 4467bd3, July 2, 2024 06:13
sutaakar marked this pull request as ready for review July 2, 2024 07:59
openshift-ci bot requested review from anishasthana and tedhtchang July 2, 2024 07:59
sutaakar requested a review from astefanutti July 2, 2024 07:59

sutaakar (Contributor, Author) commented Jul 2, 2024

@astefanutti Ready for a second round of review.
Right now the GH action runs both CPU and GPU tests. I'm wondering whether we should run only the GPU tests in the PR check and leave the CPU tests for local execution, which would speed up the PR check.
I dropped the MNIST fashion example; all tests now use "plain" MNIST to keep the changes smaller. They can be ported to some other example later if needed.

astefanutti (Contributor) left a comment

Great!

Yes, CPU is more for local development. It's OK to run only the GPU tests on CI for now.

Also, if we are to refactor the MNIST test, I think it'd be good to remove PyTorch Lightning and use plain PyTorch, which should lower the maintenance effort.

openshift-ci bot commented Jul 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti


openshift-ci bot added the approved label Jul 2, 2024
openshift-merge-bot bot merged commit 223021f into project-codeflare:main Jul 2, 2024
8 checks passed
sutaakar deleted the gpu branch July 2, 2024 10:44