Skip to content

[no-relnote] Update E2E test suite #1048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 14, 2025

Conversation

ArangoGutierrez
Copy link
Collaborator

No description provided.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the E2E test suite for the NVIDIA Container Toolkit. Key changes include:

  • Updating license headers to use SPDX identifiers.
  • Refactoring test files to use a global “runner” variable in place of a locally scoped “r” and adding log messages before running container commands.
  • Changing environment variable names for the container image from TOOLKIT_IMAGE to E2E_IMAGE_REPO and introducing E2E_IMAGE_TAG, with corresponding updates in the GitHub workflow.

Reviewed Changes

Copilot reviewed 46 out of 48 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/e2e/runner.go Updated license header to use SPDX identifiers.
tests/e2e/nvidia-container-toolkit_test.go Replaced local runner “r” with global “runner” along with added log messages.
tests/e2e/installer.go Updated license header to use SPDX identifiers.
tests/e2e/e2e_test.go Refactored env variable usage, introduced ImageRepo/ImageTag, and streamlined test env setup.
tests/e2e/README.md New file documenting the E2E suite but the env variable names may need updating.
.github/workflows/e2e.yaml Adjusted environment variables for image repository and tag accordingly.
Files not reviewed (2)
  • tests/e2e/Makefile: Language not supported
  • tests/go.mod: Language not supported
Comments suppressed due to low confidence (2)

tests/e2e/nvidia-container-toolkit_test.go:59

  • The log message duplicates the '--gpus' flag; consider revising it to accurately reflect the command arguments (e.g. 'By("Running docker run with --runtime=nvidia --gpus=all")').
By("Running docker run with --gpus=all --runtime=nvidia --gpus all")

tests/e2e/README.md:62

  • The code now uses 'E2E_IMAGE_REPO' and 'E2E_IMAGE_TAG' for the container image instead of 'TOOLKIT_IMAGE'; please update the documentation accordingly.
| `TOOLKIT_IMAGE` | ✔ | `nvcr.io/nvidia/cuda:12.4.0-runtime-ubi9` | Image that will be pulled & executed. |

@ArangoGutierrez ArangoGutierrez force-pushed the updated_e2e branch 2 times, most recently from bf24666 to f5c3347 Compare April 24, 2025 19:13
@ArangoGutierrez ArangoGutierrez self-assigned this Apr 24, 2025
@ArangoGutierrez ArangoGutierrez force-pushed the updated_e2e branch 4 times, most recently from 5280561 to b372840 Compare April 25, 2025 09:55
@ArangoGutierrez
Copy link
Collaborator Author

PR is ready for review @elezar

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the end‑to‑end test suite by refactoring test runners, updating copyright headers, and standardizing environment variable usage.

  • Updated SPDX headers across test files.
  • Replaced inline runner instances with a global runner variable and refactored test commands to use formatted image references.
  • Added new test environment setup logic and improved the GitHub Actions workflow.

Reviewed Changes

Copilot reviewed 46 out of 48 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/e2e/runner.go Updated header information with SPDX details.
tests/e2e/nvidia-container-toolkit_test.go Refactored test cases: replaced local runner with global 'runner' and added logging messages.
tests/e2e/installer.go Updated header information with SPDX details.
tests/e2e/e2e_test.go Introduced comprehensive environment setup and refactored flag usage.
tests/e2e/README.md Added documentation for the E2E test suite.
.github/workflows/e2e.yaml Updated environment variable usage and test command targets.
Files not reviewed (2)
  • tests/e2e/Makefile: Language not supported
  • tests/go.mod: Language not supported
Comments suppressed due to low confidence (1)

tests/e2e/e2e_test.go:149

  • [nitpick] The function name getIntEnvVar suggests an integer return type, but it returns a string; consider renaming it for clarity.
func getIntEnvVar(key string, defaultValue int) string {

@tariq1890
Copy link
Contributor

Looks like this PR has unresolved merge conflicts

@ArangoGutierrez
Copy link
Collaborator Author

Looks like this PR has unresolved merge conflicts

Rebased

@ArangoGutierrez ArangoGutierrez force-pushed the updated_e2e branch 2 times, most recently from ca67999 to f9792b7 Compare May 5, 2025 09:27
@@ -51,3 +51,4 @@ jobs:
uses: ./.github/workflows/e2e.yaml
with:
version: ${{ needs.variables.outputs.version }}
distribution: ubuntu20.04
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we're working on #602 I think we should just set the distribution in the test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -82,8 +85,15 @@ jobs:
chmod 600 "$e2e_ssh_key"
export E2E_SSH_KEY="$e2e_ssh_key"

make -f tests/e2e/Makefile test
make -f tests/e2e/Makefile test-e2e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We're already in the e2e folder. I think just test is fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -12,34 +13,14 @@
# See the License for the specific language governing permissions and
# limitations under the License.

GO_CMD ?= go
.PHONY: test-e2e ginkgo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.PHONY: test-e2e ginkgo
.PHONY: test ginkgo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


E2E_RUNTIME ?= docker
ginkgo:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be:

Suggested change
ginkgo:
ginkgo: $(CURDIR)/bin/ginkgo

so that we don't ALWAYS install ginkgo?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

-ssh-user=$(E2E_SSH_USER) \
-remote-host=$(E2E_SSH_HOST) \
-remote-port=$(E2E_SSH_PORT)
test-e2e: ginkgo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
test-e2e: ginkgo
test: ginkgo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

---

## 1 Scope & Goals
This repository contains a **Ginkgo v2 / Gomega** test harness that exercises an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be folder and not repository?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agree

Comment on lines 41 to 42
* Environment discovery happens once in `TestMain` → `getTestEnv()`; parameters
are therefore immutable for the duration of the run.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the kind of thing that can get out of sync with the actual implementation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getTestEnv is called at Main, not during a BeforeSuite, so it is only run once.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR updates the E2E test suite by refactoring the test harness settings, updating license headers to use SPDX tags, and enhancing workflow steps. Key changes include renaming and consolidating environment variables used for runner configuration, refactoring test commands to use a global runner instead of local instances, and updating the GitHub workflow to archive Ginkgo logs.

Reviewed Changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/e2e/runner.go Updated license header to SPDX format
tests/e2e/nvidia-container-toolkit_test.go Refactored runner usage from a local variable to a shared global runner
tests/e2e/installer.go Updated license header to SPDX format
tests/e2e/e2e_test.go Replaced flag parsing with environment variable validation and added new global configurations
tests/e2e/README.md Added detailed documentation for the updated test suite
.github/workflows/e2e.yaml Updated environment variable names and added a step to archive logs
Files not reviewed (2)
  • tests/e2e/Makefile: Language not supported
  • tests/go.mod: Language not supported
Comments suppressed due to low confidence (1)

tests/e2e/e2e_test.go:155

  • [nitpick] The function name getIntEnvVar implies an integer return type; consider renaming it to reflect that it returns a string, or update it to return an int for clarity.
func getIntEnvVar(key string, defaultValue int) string {

* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's not make these adjustments if they're not required. We don't have consistnt SPDX usage anyway and if we update it in this file, that should be as a single commit / PR that is out of scope fo this one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Expect(err).ToNot(HaveOccurred())
Expect(ldconfigOut).To(ContainSubstring("/usr/local/cuda/compat"))
})

It("should NOT work with nvidia-container-runtime-hook", func(ctx context.Context) {
ldconfigOut, _, err := r.Run("docker run --rm -i -e NVIDIA_DISABLE_REQUIRE=true --runtime=runc --gpus all nvcr.io/nvidia/cuda:12.8.0-base-ubi8 bash -c \"ldconfig -p | grep libcuda.so.1\"")
By("Running docker run with -e NVIDIA_DISABLE_REQUIRE=true --gpus all")
ldconfigOut, _, err := runner.Run(fmt.Sprintf("docker run --rm -i -e NVIDIA_DISABLE_REQUIRE=true --runtime=runc --gpus all %s bash -c \"ldconfig -p | grep libcuda.so.1\"", cudaImage))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how I feel about injecting the cudaImage into the string. This makes the actual command we're running a log more difficult to read. This change also seems orthogonal to what we're trying to acheive in this PR -- which is to align the MECHANICS of the tests with those of the other projects we maintain.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, I'll return to the full string as before.

@@ -198,19 +174,22 @@ var _ = Describe("docker", Ordered, ContinueOnFailure, func() {
})

It("should work with the nvidia runtime in legacy mode", func(ctx context.Context) {
ldconfigOut, _, err := r.Run("docker run --rm -i -e NVIDIA_DISABLE_REQUIRE=true --runtime=nvidia --gpus all nvcr.io/nvidia/cuda:12.8.0-base-ubi8 bash -c \"ldconfig -p | grep libcuda.so.1\"")
By("Running docker run with -e NVIDIA_DISABLE_REQUIRE=true --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we introducing a new context here? The It() statement above is decribing what we're testing and all this By statement is doing is repeating the command that we're running below. What value does this add?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea of the by is to be more verbose during each step

------------------------------
docker when Running the cuda-deviceQuery sample should support automatic CDI spec generation
/home/runner/_work/nvidia-container-toolkit/nvidia-container-toolkit/tests/e2e/nvidia-container-toolkit_test.go:130
  STEP: Running docker run with --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=all @ 05/06/25 05:40:14.795
• [7.565 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support the --gpus flag using the nvidia-container-runtime
/home/runner/_work/nvidia-container-toolkit/nvidia-container-toolkit/tests/e2e/nvidia-container-toolkit_test.go:137
  STEP: Running docker run with --runtime=nvidia --gpus all @ 05/06/25 05:40:22.36
• [6.087 seconds]
------------------------------
docker when Running the cuda-deviceQuery sample should support the --gpus flag using the nvidia-container-runtime-hook
/home/runner/_work/nvidia-container-toolkit/nvidia-container-toolkit/tests/e2e/nvidia-container-toolkit_test.go:144
  STEP: Running docker run with --gpus all @ 05/06/25 05:40:28.448
• [6.062 seconds]

the STEP line is the reflect of the By, I think it makes reading the test logs much easier

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the ginkgo documentation on By() Generally you should try to keep your Its short and to the point. This is not always possible, however, especially in the context of integration tests that capture complex or lengthy workflows. I decided to remove this as per our discussion

* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment to runner.go.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, will return to regular LICENSE header

return strconv.Itoa(defaultValue)
}

return strconv.Itoa(intValue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're calling Itoa three times here -- twice with the same argument -- is there not a better way to implement this. What about splitting this into two functions:

  1. getEnvVarAs[T](string) (T, error)
  2. getEnvVarOrDefault[T](string, T) T

we could then call getEnvVarAs from getEnvVarOrDefault and return the default value if we return an error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

})

// getTestEnv gets the test environment variables
func getTestEnv() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that the envvars here match the ones in the README (or remove the readme section).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 120 to 123
// TODO (@ArangoGutierrez):
// once https://github.com/NVIDIA/nvidia-container-toolkit/pull/602
// is merged, remove this
ImageTag = fmt.Sprintf("%s-ubuntu20.04", ImageTag)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is NOT what I meant by my comment. We should append this suffix in the YAML file and not here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Comment on lines 114 to 118
ImageRepo = os.Getenv("E2E_IMAGE_REPO")
Expect(ImageRepo).NotTo(BeEmpty(), "E2E_IMAGE_REPO environment variable must be set")

ImageTag = os.Getenv("E2E_IMAGE_TAG")
Expect(ImageTag).NotTo(BeEmpty(), "E2E_IMAGE_TAG environment variable must be set")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are only relevant if E2E_INSTALL_CTK == true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 92 to 101
_, _, err := runner.Run("docker pull ubuntu")
Expect(err).ToNot(HaveOccurred())

_, _, err = runner.Run(fmt.Sprintf("docker pull %s", vectorAddImage))
Expect(err).ToNot(HaveOccurred())

_, _, err = runner.Run(fmt.Sprintf("docker pull %s", deviceQueryImage))
Expect(err).ToNot(HaveOccurred())
_, _, err = runner.Run(fmt.Sprintf("docker pull %s", cudaImage))
Expect(err).ToNot(HaveOccurred())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a user is only running a subset of tests? The pull of the images should be LOCAL to the tests that use the images. Do you have a motivation for collecting them in one place?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe as a general comment: When a developer is writing tests, we want to avoid having them have to remember that they also have to add the pull checks here when adding tests for new functionality.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

.gitignore Outdated
@@ -11,3 +11,4 @@
/nvidia-ctk
/shared-*
/release-*
/bin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: No newline

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

Comment on lines 100 to 101
REMOTE_HOST=10.0.0.15 \
REMOTE_PORT=22 \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not consistent with the actual implementation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this section

Copy link
Member

@elezar elezar May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this file from the repo. The Makefile is currently simple enough to act as an example for how to run the tests. (ok that's no longer true, but I don't see value in updating things in two places when we add functionality).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplified the README

Comment on lines 45 to 46
cwd string
packagePath string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe these are used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm yeah, that was used in a previos iteration but I ended up removing the need for it, thanks, I have removed it

)
sshKey string
sshUser string
host string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope for this PR, but we could consider using sshHost here instead for consistency.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArangoGutierrez please see my proposed changes as a commit on top.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
@ArangoGutierrez ArangoGutierrez merged commit 241881f into NVIDIA:main May 14, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants