
nvidia-driver-installer addon fails to start (driver fails to install in the container) #6912


Closed
Nick-Harvey opened this issue Mar 6, 2020 · 24 comments · Fixed by #13972
Labels
area/gpu GPU related items co/kvm2-driver KVM2 driver related issues kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@Nick-Harvey

Nick-Harvey commented Mar 6, 2020

The exact command to reproduce the issue:
Following the instructions on https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/

Host: Fedora 31
Kernel: 5.5.7-200.fc31.x86_64
Cuda Drivers Host: 10.2
Nvidia Driver Host: Driver Version: 440.64

GPUs:

  • Titan V (vfio-pci driver assigned)
  • Geforce 1080TI (host GPU nvidia driver)

Minikube start:
minikube start --vm-driver kvm2 --kvm-gpu --cpus=12 --memory=25480

Minikube Addons:
minikube addons enable nvidia-gpu-device-plugin
This will fail initially; just edit dc/nvidia-gpu-device-plugin and increase the memory to 100Mi, and it will start fine.
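
For reference, one way to apply that bump, as a sketch assuming the plugin is managed by the nvidia-gpu-device-plugin DaemonSet in kube-system (the pod descriptions later in this thread show that name):

kubectl -n kube-system edit daemonset nvidia-gpu-device-plugin
# raise the container's memory request/limit from 10Mi to 100Mi and let the pod restart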

The full output of the command that failed:


minikube addons enable nvidia-driver-installer

The container fails to start, and once you fetch the logs, it will return:

Configuring kernel sources... DONE
Running Nvidia installer...
/usr/local/nvidia /
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 390.67................................................................

ERROR: An error occurred while performing the step: "Building kernel
       modules". See /var/log/nvidia-installer.log for details.


ERROR: An error occurred while performing the step: "Checking to see
       whether the nvidia-drm kernel module was successfully built". See
       /var/log/nvidia-installer.log for details.


ERROR: The nvidia-drm kernel module was not created.


ERROR: The nvidia-drm kernel module failed to build. This kernel module is
       required for the proper operation of DRM-KMS. If you do not need to
       use DRM-KMS, you can try to install this driver package again with
       the '--no-drm' option.


ERROR: Installation has failed.  Please see the file
       '/usr/local/nvidia/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.

The output of the minikube logs command:

The minikube logs don't say much, but here they are:
minikube-logs.txt

Logs from the nvidia-installer itself:
nvidia-driver-installer-logs.txt

The operating system version:

@Nick-Harvey
Author

Nick-Harvey commented Mar 6, 2020

Additional Info...

➜  ~ nvidia-smi
Fri Mar  6 13:54:00 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 18%   31C    P8    12W / 280W |    166MiB / 11175MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7478      G   /usr/libexec/Xorg                             39MiB |
|    0      7509      G   /usr/bin/gnome-shell                         124MiB |
+-----------------------------------------------------------------------------+

Titan V

23:00.0 0300: 10de:1d81 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: 10de:1218
        Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 37
        Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at e000 [size=128]
        Expansion ROM at f5000000 [disabled] [size=512K]
        Capabilities: <access denied>
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau, nvidia_drm, nvidia

23:00.1 0403: 10de:10f2 (rev a1)
        Subsystem: 10de:1218
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 39
        Region 0: Memory at f5080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

Geforce 1080ti

2d:00.0 0300: 10de:1b06 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: 3842:6696
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 106
        Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at d0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at f000 [size=128]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia

2d:00.1 0403: 10de:10ef (rev a1)
        Subsystem: 3842:6696
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin B routed to IRQ 102
        Region 0: Memory at f7080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

@Nick-Harvey
Author

Just attempted the process again with minikube v1.8.1 and no luck :(

I didn't see anything in the change notes that would have helped, but it was worth a shot :)

@afbjorklund
Collaborator

Can you attach the nvidia-installer.log ?

ERROR: An error occurred while performing the step: "Building kernel
       modules". See /var/log/nvidia-installer.log for details.

@afbjorklund afbjorklund added co/kvm2-driver KVM2 driver related issues triage/needs-information Indicates an issue needs more information in order to work on it. labels Mar 16, 2020
@Nick-Harvey
Author

Unfortunately, I can't find a way to get the /var/log/nvidia-installer.log file. I've tried:

➜  ~ kubectl exec -it nvidia-driver-installer-w6f97 -n kube-system /bin/bash
error: unable to upgrade connection: container not found ("pause")

long-shot

➜  ~ kubectl logs nvidia-driver-installer-w6f97 -n kube-system              
Error from server (BadRequest): container "pause" in pod "nvidia-driver-installer-w6f97" is waiting to start: PodInitializing

Any idea how to get access to nvidia-installer.log? I would love to know myself what Nvidia is saying. I have a sneaking suspicion that it has to do with where the nvidia drivers are installed on the host, but that's just a guess.
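
One thing that may help here, assuming the installer runs as an init container named nvidia-driver-installer and the pause container is the only regular container (the describe output later in this thread shows that layout), is to target the init container explicitly:

kubectl logs nvidia-driver-installer-w6f97 -n kube-system -c nvidia-driver-installer
# exec only works while the init container is still running:
kubectl exec -it nvidia-driver-installer-w6f97 -n kube-system -c nvidia-driver-installer -- cat /var/log/nvidia-installer.log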

@cmxela

cmxela commented Mar 23, 2020

I'm having exactly the same problem with Fedora 31. In my case I have:

Host: Fedora 31
Kernel: 5.5.7-200.fc31.x86_64
Cuda Drivers Host: 10.2
Nvidia Driver Host: Driver Version: 440.64

GPUs:

GeForce GT 710 (host GPU nvidia driver)
2 Geforce 1080TI (vfio-pci driver assigned)

@sharifelgamal sharifelgamal added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Mar 25, 2020
@priyawadhwa

Hey @Nick-Harvey, does the nvidia pod ever have status Running?

I'm wondering because of this output from when you ran:

$ kubectl logs nvidia-driver-installer-w6f97 -n kube-system       
> pod "nvidia-driver-installer-w6f97" is waiting to start: PodInitializing

The output of kubectl describe po nvidia-xx-xx -n kube-system might be helpful here.

If the pod is Running, could you try:

kubectl exec nvidia-driver-installer-w6f97 -n kube-system -- cat /var/log/nvidia-installer.log

@priyawadhwa priyawadhwa added the kind/support Categorizes issue or PR as a support question. label Apr 8, 2020
@fvdnabee

fvdnabee commented May 26, 2020

I'm setting up minikube with the kvm2 driver and gpu passthrough as per https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/.
GPU passthrough is available to the VM (the PCI-E devices are available as hardware to the minikube VM), but I'm failing to install the nvidia drivers inside the VM via

minikube addons enable nvidia-gpu-device-plugin
minikube addons enable nvidia-driver-installer

The nvidia-gpu-device-plugin pod is listed as running, whereas the second pod is stuck on Init (paused). I'm guessing this is the same issue @Nick-Harvey encountered.

kubectl get pods -n kube-system

NAME                               READY   STATUS     RESTARTS   AGE
coredns-66bff467f8-562mh           1/1     Running    0          29m
coredns-66bff467f8-kcxhg           1/1     Running    0          29m
etcd-minikube                      1/1     Running    0          28m
kube-apiserver-minikube            1/1     Running    0          28m
kube-controller-manager-minikube   1/1     Running    0          28m
kube-proxy-mb5lf                   1/1     Running    0          29m
kube-scheduler-minikube            1/1     Running    0          28m
nvidia-driver-installer-c5bvd      0/1     Init:0/1   2          3m22s
nvidia-gpu-device-plugin-57n2l     1/1     Running    0          28m
storage-provisioner                1/1     Running    0          28m

kubectl logs nvidia-gpu-device-plugin-57n2l -n kube-system

2020/05/26 14:26:03 Failed to initialize NVML: could not load NVML library.
2020/05/26 14:26:03 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2020/05/26 14:26:03 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2020/05/26 14:26:03 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

Some additional info on the pods:

`kubectl describe po nvidia-gpu-device-plugin-57n2l -n kube-system`
Name:                 nvidia-gpu-device-plugin-57n2l
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 minikube/192.168.39.201
Start Time:           Tue, 26 May 2020 16:25:57 +0200
Labels:               controller-revision-hash=7f89b4b55b
                      k8s-app=nvidia-gpu-device-plugin
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   172.17.0.4
IPs:
  IP:           172.17.0.4
Controlled By:  DaemonSet/nvidia-gpu-device-plugin
Containers:
  nvidia-gpu-device-plugin:
    Container ID:  docker://4babe01395d832acec7db0c6b449da25b5c8dd1114ed0e91c7d80f1255510f44
    Image:         nvidia/k8s-device-plugin:1.0.0-beta4
    Image ID:      docker-pullable://nvidia/k8s-device-plugin@sha256:94d46bf513cbc43c4d77a364e4bbd409d32d89c8e686e12551cc3eb27c259b90
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/nvidia-device-plugin
      -logtostderr
    State:          Running
      Started:      Tue, 26 May 2020 16:26:03 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        50m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pvcvh (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  default-token-pvcvh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pvcvh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     :NoExecute
                 :NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  34m   default-scheduler  Successfully assigned kube-system/nvidia-gpu-device-plugin-57n2l to minikube
  Normal  Pulling    34m   kubelet, minikube  Pulling image "nvidia/k8s-device-plugin:1.0.0-beta4"
  Normal  Pulled     34m   kubelet, minikube  Successfully pulled image "nvidia/k8s-device-plugin:1.0.0-beta4"
  Normal  Created    34m   kubelet, minikube  Created container nvidia-gpu-device-plugin
  Normal  Started    34m   kubelet, minikube  Started container nvidia-gpu-device-plugin
`kubectl describe po nvidia-driver-installer-c5bvd -n kube-system`
Name:         nvidia-driver-installer-c5bvd
Namespace:    kube-system
Priority:     0
Node:         minikube/192.168.39.201
Start Time:   Tue, 26 May 2020 16:51:20 +0200
Labels:       controller-revision-hash=db985bcbc
              k8s-app=nvidia-driver-installer
              pod-template-generation=1
Annotations:  <none>
Status:       Pending
IP:           172.17.0.7
IPs:
  IP:           172.17.0.7
Controlled By:  DaemonSet/nvidia-driver-installer
Init Containers:
  nvidia-driver-installer:
    Container ID:   docker://6e972a48ef014a50b13a12a60a6d0179512d191c24170a5034a04f213eeb8809
    Image:          k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342
    Image ID:       docker-pullable://k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 26 May 2020 16:53:47 +0200
      Finished:     Tue, 26 May 2020 16:54:47 +0200
    Ready:          False
    Restart Count:  2
    Requests:
      cpu:  150m
    Environment:
      NVIDIA_INSTALL_DIR_HOST:       /home/kubernetes/bin/nvidia
      NVIDIA_INSTALL_DIR_CONTAINER:  /usr/local/nvidia
      ROOT_MOUNT_DIR:                /root
    Mounts:
      /dev from dev (rw)
      /root from root-mount (rw)
      /usr/local/nvidia from nvidia-install-dir-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pvcvh (ro)
Containers:
  pause:
    Container ID:   
    Image:          k8s.gcr.io/pause:2.0
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-pvcvh (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  dev:
    Type:          HostPath (bare host directory volume)
    Path:          /dev
    HostPathType:  
  nvidia-install-dir-host:
    Type:          HostPath (bare host directory volume)
    Path:          /home/kubernetes/bin/nvidia
    HostPathType:  
  root-mount:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  default-token-pvcvh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-pvcvh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  4m12s                default-scheduler  Successfully assigned kube-system/nvidia-driver-installer-c5bvd to minikube
  Normal   Pulling    4m11s                kubelet, minikube  Pulling image "k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342"
  Normal   Pulled     4m4s                 kubelet, minikube  Successfully pulled image "k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342"
  Normal   Created    105s (x3 over 4m4s)  kubelet, minikube  Created container nvidia-driver-installer
  Normal   Started    105s (x3 over 4m4s)  kubelet, minikube  Started container nvidia-driver-installer
  Normal   Pulled     105s (x2 over 3m2s)  kubelet, minikube  Container image "k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342" already present on machine
  Warning  BackOff    42s (x4 over 2m)     kubelet, minikube  Back-off restarting failed container

Any suggestions on how to debug this further?

@ysow

ysow commented Jun 7, 2020

Hello @fvdnabee, did you find a solution? I'm having the same issue as you.

@fvdnabee

fvdnabee commented Jun 9, 2020

@ysow In my experience, GPU passthrough does not work with minikube's kvm2 driver. I managed to get GPU support working in minikube by using the none driver, as explained here: https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/#using-the-none-driver
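
For anyone following the same route, a rough sketch of the none-driver setup, assuming the host already has the NVIDIA driver installed and Docker's default runtime set to nvidia, as the tutorial describes:

sudo -E minikube start --driver=none
minikube addons enable nvidia-gpu-device-plugin
kubectl describe node | grep -i 'nvidia.com/gpu'   # the GPU should show up as an allocatable resource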

@Gsantomaggio

Gsantomaggio commented Jun 18, 2020

Same problem here. I can start minikube:

minikube start --driver kvm2 --kvm-gpu

but I get the same errors as above.

Minikube version 1.11.0

With Docker it works:

 docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Thu Jun 18 21:03:32 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |

GRUB configuration:

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=0
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on iommu=pt pci-stub.ids=10de:1cb1"
GRUB_CMDLINE_LINUX=""

@fvdnabee I can't get it running, not even with the none driver :( <<<---- SOLVED!

I can run it using the none driver.

@priyawadhwa priyawadhwa removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Jul 8, 2020
@priyawadhwa

Hey @Nick-Harvey, does this work with the none driver for you?

@sharifelgamal sharifelgamal added area/gpu GPU related items kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. kind/support Categorizes issue or PR as a support question. labels Sep 30, 2020
@sharifelgamal
Collaborator

The none driver workaround is just that, a workaround. I suspect this should also work with the docker and kvm2 drivers.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 28, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@skol101

skol101 commented Dec 22, 2021

Hello,
I've got a similar issue: it looks like the addon uses a very old NVIDIA GPU driver (390.67), which doesn't support the RTX 3090/3080.
How can I install the addon with a newer NVIDIA driver?

@k8s-ci-robot
Contributor

@skol101: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sharifelgamal
Collaborator

We don't currently have the bandwidth to properly maintain the nvidia addon, but would be happy to accept PRs to fix it.

Reopening for visibility.

@sharifelgamal sharifelgamal reopened this Dec 22, 2021
@sharifelgamal sharifelgamal added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Dec 22, 2021
@skol101

skol101 commented Dec 22, 2021

@skol101

skol101 commented Dec 23, 2021

I got it: the issue is that the RTX 30x0 cards aren't supported in vGPU mode: https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-generic-linux-kvm/index.html#updates-release-update-1

Bloody nvidia.

@ckjoris

ckjoris commented Mar 16, 2022

Hey @skol101, how did you figure out it's due to GRID/vGPU being unsupported? I know for sure that the RTX 30 series supports passthrough to a single VM, e.g. with KVM.
I'm looking at nvidia-installer.log; it seems the driver is simply failing to compile...
By the way, one can find nvidia-installer.log by running minikube ssh, then less /home/kubernetes/bin/nvidia/nvidia-installer.log
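
A non-interactive variant that should also work from the host, assuming the same install path:

minikube ssh -- sudo cat /home/kubernetes/bin/nvidia/nvidia-installer.log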

@joaquinfdez

Could it be because vGPU is not supported for the GeForce series?

@joysn71

joysn71 commented Aug 26, 2023

The issue is in the nvidia-driver-installer Docker image. I found it using the following command:
minikube kubectl -- cluster-info dump

When a container runs from this image, it goes through the driver install process, and that process does not find python3 in the container. See also the following issue: #15123

From cluster-info dump:

==== START logs for container nvidia-driver-installer of pod kube-system/nvidia-driver-installer-wftvf ====
+ NVIDIA_DRIVER_VERSION=510.60.02
+ NVIDIA_DRIVER_DOWNLOAD_URL_DEFAULT=https://us.download.nvidia.com/XFree86/Linux-x86_64/510.60.02/NVIDIA-Linux-x86_64-510.60.02.run
+ NVIDIA_DRIVER_DOWNLOAD_URL=https://us.download.nvidia.com/XFree86/Linux-x86_64/510.60.02/NVIDIA-Linux-x86_64-510.60.02.run
+ NVIDIA_INSTALL_DIR_HOST=/home/kubernetes/bin/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
++ basename https://us.download.nvidia.com/XFree86/Linux-x86_64/510.60.02/NVIDIA-Linux-x86_64-510.60.02.run
+ NVIDIA_INSTALLER_RUNFILE=NVIDIA-Linux-x86_64-510.60.02.run
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
++ uname -r
+ KERNEL_VERSION=5.10.57
++ cut -d . -f 1
+++ uname -r
++ echo 5.10.57
+ MAJOR_KERNEL_VERSION=5
+ set +x
KERNEL_VERSION: 5.10.57
Checking cached version
Cache file /usr/local/nvidia/.cache not found.
Downloading kernel sources...
/usr/src /
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0^M  6  111M    6 7455k    0     0  11.1M      0  0:00:09 --:--:--  0:00:09 11.0M^M 32  111M   32 35.8M    0  >
/
Downloading kernel sources... DONE.
Configuring installation directories...
/usr/local/nvidia /
Updating container's ld cache...
Updating container's ld cache... DONE.
/
Configuring installation directories... DONE.
Downloading Nvidia installer...
/usr/local/nvidia /
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
^M  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0^M  4  313M    4 14.7M    0     0  17.5M      0  0:00:17 --:--:--  0:00:17 17.5M^M 14  313M   14 45.3M    0  >
/
Downloading Nvidia installer... DONE.
Configuring kernel sources...
/usr/src/linux-5.10.57 /
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX     scripts/kconfig/lexer.lex.c
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
#
# configuration written to .config
#
  SYSTBL  arch/x86/include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/include/generated/asm/unistd_32_ia32.h
  SYSHDR  arch/x86/include/generated/asm/unistd_64_x32.h
  SYSTBL  arch/x86/include/generated/asm/syscalls_64.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_32.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_x32.h
  HOSTCC  arch/x86/tools/relocs_32.o
  HOSTCC  arch/x86/tools/relocs_64.o
  HOSTCC  arch/x86/tools/relocs_common.o
  HOSTLD  arch/x86/tools/relocs
  HOSTCC  scripts/genksyms/genksyms.o
  YACC    scripts/genksyms/parse.tab.[ch]
  HOSTCC  scripts/genksyms/parse.tab.o
  LEX     scripts/genksyms/lex.lex.c
  HOSTCC  scripts/genksyms/lex.lex.o
  HOSTLD  scripts/genksyms/genksyms
  HOSTCC  scripts/selinux/genheaders/genheaders
  HOSTCC  scripts/selinux/mdp/mdp
  HOSTCC  scripts/kallsyms
  HOSTCC  scripts/recordmcount
  HOSTCC  scripts/sorttable
  HOSTCC  scripts/asn1_compiler
  HOSTCC  scripts/extract-cert
  WRAP    arch/x86/include/generated/uapi/asm/bpf_perf_event.h
  WRAP    arch/x86/include/generated/uapi/asm/errno.h
  WRAP    arch/x86/include/generated/uapi/asm/fcntl.h
  WRAP    arch/x86/include/generated/uapi/asm/ioctl.h
  WRAP    arch/x86/include/generated/uapi/asm/ioctls.h
  WRAP    arch/x86/include/generated/uapi/asm/ipcbuf.h
  WRAP    arch/x86/include/generated/uapi/asm/param.h
  WRAP    arch/x86/include/generated/uapi/asm/poll.h
  WRAP    arch/x86/include/generated/uapi/asm/resource.h
...
  LD       /usr/src/linux-5.10.57/tools/objtool/objtool-in.o
  LINK     /usr/src/linux-5.10.57/tools/objtool/objtool
  DESCEND  bpf/resolve_btfids
  MKDIR     /usr/src/linux-5.10.57/tools/bpf/resolve_btfids//libbpf

Auto-detecting system features:
...                        libelf: [ on ]
...                          zlib: [ on ]
...                           bpf: [ on ]

  GEN      /usr/src/linux-5.10.57/tools/bpf/resolve_btfids/libbpf/bpf_helper_defs.h
/usr/bin/env: 'python3': No such file or directory
Makefile:182: recipe for target '/usr/src/linux-5.10.57/tools/bpf/resolve_btfids/libbpf/bpf_helper_defs.h' failed
make[3]: *** [/usr/src/linux-5.10.57/tools/bpf/resolve_btfids/libbpf/bpf_helper_defs.h] Error 127
make[3]: *** Deleting file '/usr/src/linux-5.10.57/tools/bpf/resolve_btfids/libbpf/bpf_helper_defs.h'
Makefile:44: recipe for target '/usr/src/linux-5.10.57/tools/bpf/resolve_btfids//libbpf/libbpf.a' failed
make[2]: *** [/usr/src/linux-5.10.57/tools/bpf/resolve_btfids//libbpf/libbpf.a] Error 2
Makefile:71: recipe for target 'bpf/resolve_btfids' failed
make[1]: *** [bpf/resolve_btfids] Error 2
make: *** [tools/bpf/resolve_btfids] Error 2
Makefile:1947: recipe for target 'tools/bpf/resolve_btfids' failed
==== END logs for container nvidia-driver-installer of pod kube-system/nvidia-driver-installer-wftvf ====
==== START logs for container pause of pod kube-system/nvidia-driver-installer-wftvf ====
Request log error: the server rejected our request for an unknown reason (get pods nvidia-driver-installer-wftvf)
==== END logs for container pause of pod kube-system/nvidia-driver-installer-wftvf ====

I installed the python3.8 package using apt inside a container created from the image, and symlinked the python3.8 binary to python3 in /usr/bin, but this did not help. I am also not sure how to "update" the modified image so that minikube picks it up.
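
For what it's worth, a sketch of how that swap might be done; the FROM line reuses the installer image digest quoted earlier in this thread, so treat it as a placeholder for whatever image your addon version actually pulls:

cat > Dockerfile <<'EOF'
# base: the installer image referenced earlier in this thread; adjust to the image your addon uses
FROM k8s.gcr.io/minikube-nvidia-driver-installer@sha256:492d46f2bc768d6610ec5940b6c3c33c75e03e201cc8786e04cc488659fd6342
RUN apt-get update && apt-get install -y --no-install-recommends python3 && rm -rf /var/lib/apt/lists/*
EOF
docker build -t nvidia-driver-installer:python3 .
minikube image load nvidia-driver-installer:python3
# then point the DaemonSet's init container at the custom tag:
kubectl -n kube-system edit daemonset nvidia-driver-installer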

UPDATE: I was finally able to update the nvidia-driver-installer image, and minikube used it for pod creation. The following can be seen:

Configuring kernel sources... DONE
Running Nvidia installer...
/usr/local/nvidia /
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.60.02..................................................................................................................................................................>

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pk>

/
Running Nvidia installer... DONE.
Updated cached version as:
CACHE_KERNEL_VERSION=5.10.57
CACHE_NVIDIA_DRIVER_VERSION=510.60.02
Verifying Nvidia installation...
No devices were found
==== END logs for container nvidia-driver-installer of pod kube-system/nvidia-driver-installer-2j8pf ====

Meaning: nvidia-smi does not see a GPU.

$ lspci
00:08.0 Class 00ff: 1af4:1002
00:01.2 Class 0c03: 8086:7020
00:01.0 Class 0601: 8086:7000
00:04.0 Class 0200: 1af4:1000
00:07.0 Class 0300: 10de:13b1
00:00.0 Class 0600: 8086:1237
00:01.3 Class 0680: 8086:7113
00:03.0 Class 0200: 1af4:1000
00:01.1 Class 0101: 8086:7010
00:06.0 Class 0100: 1af4:1001
00:09.0 Class 00ff: 1af4:1005
00:02.0 Class 0300: 1013:00b8
00:05.0 Class 0100: 1000:0012

Device 10de:13b1 is the Nvidia GPU.
Verified on the host:

$ lspci -nn | grep -i nvidia
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GLM [Quadro M1000M] [10de:13b1] (rev a2)
01:00.1 Audio device [0403]: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] [10de:0fbc] (rev a1)

Drivers are not loaded on the host; I "isolated" the GPU:

$ driverctl set-override 0000:01:00.0 vfio-pci
$ driverctl set-override 0000:01:00.1 vfio-pci

$ lsmod | grep nvi
$ 

The minikube kernel reports this:

[    0.031444] Kernel command line: BOOT_IMAGE=/boot/bzImage root=/dev/sr0 loglevel=3 console=ttyS0 noembed nomodeset norestore waitusb=10 random.trust_cpu=on hw_rng_model=virtio systemd.legacy_systemd_cgroup_controller=yes initrd=/boot/i
[    0.031533] You have booted with nomodeset. This means your GPU drivers are DISABLED
[    0.031533] Any video related functionality will be severely degraded, and you may not even be able to suspend the system properly
[    0.031534] Unless you actually understand what nomodeset does, you should reboot without enabling it

[ 2730.152938] nvidia: loading out-of-tree module taints kernel.
[ 2730.152949] nvidia: module license 'NVIDIA' taints kernel.
[ 2730.152952] Disabling lock debugging due to kernel taint
[ 2730.182825] nvidia-nvlink: Nvlink Core is being initialized, major device number 246

[ 2730.205911] nvidia 0000:00:07.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 2730.408610] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.60.02  Wed Mar 16 11:24:05 UTC 2022
[ 2730.446742] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 2730.460034] nvidia-uvm: Loaded the UVM driver, major device number 244.
[ 2730.465089] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  510.60.02  Wed Mar 16 11:17:28 UTC 2022
[ 2730.468473] [drm] [nvidia-drm] [GPU ID 0x00000007] Loading driver
[ 2730.468486] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:07.0 on minor 0
[ 2730.472409] [drm] [nvidia-drm] [GPU ID 0x00000007] Unloading driver
[ 2730.480904] nvidia-modeset: Unloading
[ 2730.493549] nvidia-uvm: Unloaded the UVM driver.
[ 2730.527836] nvidia-nvlink: Unregistered the Nvlink Core, major device number 246
[ 2733.985980] nvidia-nvlink: Nvlink Core is being initialized, major device number 246

[ 2734.005647] nvidia 0000:00:07.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[ 2734.206592] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  510.60.02  Wed Mar 16 11:24:05 UTC 2022
[ 2734.216055] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  510.60.02  Wed Mar 16 11:17:28 UTC 2022
[ 2734.223876] [drm] [nvidia-drm] [GPU ID 0x00000007] Loading driver
[ 2734.223882] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:07.0 on minor 0
[ 2734.442251] NVRM: GPU 0000:00:07.0: Failed to copy vbios to system memory.
[ 2734.442436] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x30:0xffff:963)
[ 2734.442784] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
[ 2734.656248] NVRM: GPU 0000:00:07.0: Failed to copy vbios to system memory.
[ 2734.656611] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x30:0xffff:963)
[ 2734.658200] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
[ 2735.703378] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 2735.714448] nvidia-uvm: Loaded the UVM driver, major device number 244.
[ 2735.747514] NVRM: GPU 0000:00:07.0: Failed to copy vbios to system memory.
[ 2735.747692] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x30:0xffff:963)
[ 2735.747986] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
[ 2735.973516] NVRM: GPU 0000:00:07.0: Failed to copy vbios to system memory.
[ 2735.973885] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x30:0xffff:963)
[ 2735.975086] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
[ 2753.431768] NVRM: GPU 0000:00:07.0: Failed to copy vbios to system memory.
[ 2753.431897] NVRM: GPU 0000:00:07.0: RmInitAdapter failed! (0x30:0xffff:963)
[ 2753.432260] NVRM: GPU 0000:00:07.0: rm_init_adapter failed, device minor number 0
[ 2753.663460] NVRM: GPU 0000:00:07.0: Failed to copy vbios to system memory.

So it seems KVM2 does provide the GPU, but the kernel drivers cannot use it.
