Add nvidia-cdi-refresh service #1076

Open: ArangoGutierrez wants to merge 1 commit into main

ArangoGutierrez
Collaborator

The NVIDIA Container Toolkit ships the command nvidia-ctk cdi generate, which produces a Container Device Interface (CDI) specification file describing every visible NVIDIA GPU on the host and the libraries a container needs in order to use them. A previously generated specification becomes stale when the driver is upgraded, GPUs are added to or removed from the node, or MIG partitions are modified.

This pull request introduces a new NVIDIA CDI refresh service to automatically update the NVIDIA Container Device Interface (CDI) specification when relevant system events occur, such as driver installation or module changes. The changes include adding systemd and udev configurations, packaging updates for both Debian and RPM-based distributions, and integration into Docker build processes.

New NVIDIA CDI Refresh Service:

  • Systemd Unit Files:

    • Added nvidia-cdi-refresh.service to refresh the NVIDIA CDI specification file using nvidia-ctk upon execution.
    • Added nvidia-cdi-refresh.path to monitor changes to /lib/modules/%v/modules.dep.bin and trigger the refresh service.
  • Udev Rules:

    • Added 60-nvidia-cdi-refresh.rules to invoke the refresh service on NVIDIA kernel module events or GPU PCI function changes.

Packaging Updates:

  • Debian Packaging:

    • Added a new package nvidia-container-toolkit-cdi-refresh with installation and post-installation scripts to enable the service and reload system configurations.
  • RPM Packaging:

    • Integrated the service and udev rules into the RPM spec file, ensuring proper installation and activation during package installation.
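
As a rough illustration of the staleness problem described above, the following sketch compares modification times of the generated spec and the module dependency file. It is not part of this PR: the function name cdi_spec_is_stale is hypothetical, the default paths are taken from the unit files discussed in this PR, and an mtime comparison only catches driver (un)installs, not GPU hot-plug or MIG changes.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): report whether the CDI spec
# looks older than the kernel module dependency file.
# Returns 0 (stale) when the spec is missing or older than the reference file.
cdi_spec_is_stale() {
    spec=${1:-/var/run/cdi/nvidia.yaml}
    ref=${2:-/lib/modules/$(uname -r)/modules.dep.bin}

    # A missing spec is always considered stale.
    [ -e "$spec" ] || return 0

    # find prints the reference file only if it is newer than the spec.
    [ -n "$(find "$ref" -newer "$spec" 2>/dev/null)" ]
}
```

A caller could run `cdi_spec_is_stale && nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml` by hand; the units added in this PR aim to make that manual step unnecessary.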

ArangoGutierrez requested review from elezar and Copilot on May 12, 2025 13:28
ArangoGutierrez self-assigned this on May 12, 2025
Copilot AI left a comment

Pull Request Overview

This PR introduces an NVIDIA CDI refresh service to automatically update the NVIDIA CDI specification when system events occur, alongside packaging and integration changes for Debian, RPM, and Docker build processes.

  • Added new systemd unit files and udev rules to trigger the CDI refresh service.
  • Updated packaging scripts for both Debian and RPM distributions and modified Dockerfiles to include new deployment artifacts.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| packaging/rpm/SPECS/nvidia-container-toolkit.spec | Added Source entries and installation steps for the new service components. |
| packaging/debian/nvidia-container-toolkit-cdi-refresh.postinst | Introduced post-installation steps for Debian packaging. |
| packaging/debian/nvidia-container-toolkit-cdi-refresh.install | Listed new deployment files for Debian. |
| packaging/debian/control | Added a new control entry for the CDI refresh service package. |
| docker/Dockerfile.* | Updated Dockerfiles to copy new systemd and udev files. |
| deployments/udev/60-nvidia-cdi-refresh.rules | New udev rules to trigger service on NVIDIA events. |
| deployments/systemd/nvidia-cdi-refresh.service | Defined the one-shot systemd service to refresh the CDI spec. |
| deployments/systemd/nvidia-cdi-refresh.path | Defined the systemd path unit to monitor module changes. |

Comments suppressed due to low confidence (1)

packaging/debian/control:33

  • The package name 'nvidia-container-toolkit-cdi-refresh service' contains a space which might cause issues; consider renaming it to 'nvidia-container-toolkit-cdi-refresh-service' for consistency.
Package: nvidia-container-toolkit-cdi-refresh service

ArangoGutierrez force-pushed the refresh_cdi branch 2 times, most recently from 09aee84 to c366578 on May 12, 2025 13:47

[Path]
# depmod rewrites these exactly once per (un)install
PathChanged=/lib/modules/%v/modules.dep.bin
Member

Should we match an nvidia pattern here so as to not trigger this for other modules?

Collaborator Author

I thought so too, but pattern matching is not a valid option for systemd.path units, as seen at https://www.freedesktop.org/software/systemd/man/latest/systemd.path.html#Options . Watching this file is the most reliable approach I found during testing. We could also watch /lib/modules/%v/modules.dep, but that file doesn't get updated during module unloads.
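
Since the path unit cannot filter by module name, one way to approximate the reviewer's idea would be to filter inside the service instead. The sketch below is only an illustration and is not what this PR implements: the function name has_nvidia_module and the regular expression are assumptions, and it reads the text file modules.dep because modules.dep.bin is binary.

```shell
#!/bin/sh
# Hypothetical filter (not part of this PR): succeed only when the given
# modules.dep lists an NVIDIA driver module such as nvidia.ko or nvidia-uvm.ko.
# The pattern deliberately does not match the in-tree nvidiafb.ko.
has_nvidia_module() {
    dep_file=${1:-/lib/modules/$(uname -r)/modules.dep}
    grep -Eq '(^|/)nvidia(-[a-z]+)?\.ko' "$dep_file"
}
```

A check like this could be wired into the service through systemd's ExecCondition= setting. Note the limitation: after a driver uninstall no NVIDIA module remains listed, so this filter would skip exactly the refresh that removes the stale spec, which is one reason to keep the unconditional trigger.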

# limitations under the License.

[Unit]
Description=Refresh NVIDIA OCI CDI specification file
Member

Suggested change
- Description=Refresh NVIDIA OCI CDI specification file
+ Description=Refresh NVIDIA CDI specification file

Collaborator Author

Done

[Service]
Type=oneshot
# The 30-second delay ensures that dependent services or resources are fully initialized.
# If the rationale for this delay is unclear, consider evaluating whether a shorter delay is sufficient.
Member

Suggested change
- # If the rationale for this delay is unclear, consider evaluating whether a shorter delay is sufficient.

Collaborator Author

Done

ArangoGutierrez force-pushed the refresh_cdi branch 2 times, most recently from d2bb23d to 077bd25 on May 12, 2025 14:05
ArangoGutierrez requested a review from elezar on May 12, 2025 14:05

[Service]
Type=oneshot
# The 30-second delay ensures that dependent services or resources are fully initialized.
ExecStartPre=/bin/sleep 30
Member

Does this mean that we have a 30 second sleep after EVERY event?

Collaborator Author

@ArangoGutierrez May 12, 2025

Yes. The only event that actually needs this is install/uninstall.
During an apt install cuda-drivers, it can take from 10 to 30 seconds to get from the first line of these logs to the last:

                                                                                                                                     
depmod....
Setting up libnvidia-decode-575:amd64 (575.51.03-0ubuntu1) ...
Setting up libnvidia-compute-575:amd64 (575.51.03-0ubuntu1) ...
Setting up libnvidia-encode-575:amd64 (575.51.03-0ubuntu1) ...
Setting up nvidia-utils-575 (575.51.03-0ubuntu1) ...
Setting up nvidia-compute-utils-575 (575.51.03-0ubuntu1) ...
Setting up libnvidia-gl-575:amd64 (575.51.03-0ubuntu1) ...
Setting up nvidia-driver-575 (575.51.03-0ubuntu1) ...
Setting up cuda-drivers-575 (575.51.03-0ubuntu1) ...
Setting up cuda-drivers (575.51.03-0ubuntu1) ...
Processing triggers for mailcap (3.70+nmu1ubuntu1) ...
Processing triggers for desktop-file-utils (0.26-1ubuntu3) ...
Processing triggers for initramfs-tools (0.140ubuntu13.4) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-136-generic
W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast
Processing triggers for gnome-menus (3.36.0-1ubuntu3) ...
Processing triggers for libc-bin (2.35-0ubuntu3.9) ...
Processing triggers for man-db (2.10.2-1) ...
Processing triggers for dbus (1.12.20-2ubuntu4.1) ...
Scanning processes...
Scanning processor microcode...
Scanning linux images...

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.

So 30 seconds is a safe amount of time to wait for the full DEB install to complete.

Member

Why do we not have an additional package on rpm-based systems?

Collaborator Author

You mean adding the service install as part of the regular RPM install script?

Member

In the Debian packages you added an nvidia-container-toolkit-cdi-refresh package that includes the systemd unit and udev rules. This seems to be missing from the RPM packages.

Collaborator Author

I see your question. On the DEB side, I noticed we have separated specific components into individual packages.
On the RPM side we only have one file (the RPM spec file), so I followed that structure and added the installation of the three new files to the same spec file:

packaging
├── debian
│   ├── changelog.old
│   ├── compat
│   ├── control
│   ├── copyright
│   ├── nvidia-container-toolkit-base.install
│   ├── nvidia-container-toolkit-base.postinst
│   ├── nvidia-container-toolkit-cdi-refresh.install
│   ├── nvidia-container-toolkit-cdi-refresh.postinst
│   ├── nvidia-container-toolkit-operator-extensions.install
│   ├── nvidia-container-toolkit.install
│   ├── nvidia-container-toolkit.lintian-overrides
│   ├── nvidia-container-toolkit.postinst
│   ├── nvidia-container-toolkit.postrm
│   ├── prepare
│   └── rules
└── rpm
    ├── SOURCES
    │   └── LICENSE
    └── SPECS
        └── nvidia-container-toolkit.spec

Type=oneshot
# The 30-second delay ensures that dependent services or resources are fully initialized.
ExecStartPre=/bin/sleep 30
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
Member

Question: Should this path depend on the installation locations?

Collaborator Author

Good point, yes. Let me think about how to modify this.

Collaborator Author

After checking, we have /usr/bin set as the default for both deb and rpm, and we don't provide macros to change the install path. So having this hardcoded here looks OK. WDYT?

Member

The install command installs to:

install -m 755 -t %{buildroot}%{_bindir} nvidia-ctk

If this is always /usr/bin then we don't need to update this.

Collaborator Author

According to the documentation, we can leave it as is: https://docs.fedoraproject.org/en-US/packaging-guidelines/RPMMacros/#macros_installation

ArangoGutierrez requested a review from elezar on May 12, 2025 14:24
ArangoGutierrez force-pushed the refresh_cdi branch 4 times, most recently from e9a46ed to a641d38 on May 13, 2025 09:16

Automatic regeneration of /var/run/cdi/nvidia.yaml

New units:
  • nvidia-cdi-refresh.service: one-shot wrapper for nvidia-ctk cdi generate (adds sleep + required caps).
  • nvidia-cdi-refresh.path: fires on driver install/upgrade via modules.dep.bin changes.
  • 60-nvidia-cdi-refresh.rules: udev triggers for module add/remove, PCI bind/unbind/change, and MIG /dev/nvidia-caps* char-device events.

Packaging:
  • RPM %post reloads udev/systemd and enables the path unit on fresh installs.
  • DEB postinst does the same (configure, skip on upgrade).

Result: the CDI spec is always up to date.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
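
The DEB postinst behaviour summarized in the commit message (act on configure, skip on upgrade) might look roughly like the sketch below. The function name configure_cdi_refresh and the exact reload commands are assumptions for illustration; the script shipped in this PR may differ.

```shell
#!/bin/sh
# Sketch of a Debian postinst for the refresh units. The exact commands are
# assumptions; the script in this PR may differ.
set -e

configure_cdi_refresh() {
    action=$1
    previous_version=$2

    # dpkg runs "postinst configure <previous-version>"; the version argument
    # is empty on a fresh install and set on an upgrade.
    [ "$action" = "configure" ] || return 0
    [ -z "$previous_version" ] || return 0

    systemctl daemon-reload                         # pick up the new unit files
    udevadm control --reload                        # pick up the new udev rules
    systemctl enable --now nvidia-cdi-refresh.path  # start watching modules.dep.bin
}
```

A real maintainer script would end by calling `configure_cdi_refresh "$@"` so that dpkg's arguments decide whether anything happens. Enabling the path unit rather than the service keeps the refresh event-driven.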

2 participants