nvidia-driver-installer addon fails to start (driver fails to install in the container) #6912
Comments
Additional Info...
Titan V
Geforce 1080ti
|
Just attempted the process again with minikube v 1.8.1 and no luck :( I didn't see anything in the change notes that would have helped, but it was worth a shot :) |
Can you attach the nvidia-installer.log ?
|
Unfortunately, I can't find a way to get the
long-shot
Any idea how to get access to the nvidia-installer.log ? I would love to know myself what Nvidia is saying. I have a sneaking suspicion that it has to do with where the nvidia drivers are installed to on the host, but that's just a guess. |
I'm having exactly the same problem with Fedora 31. In my case I have Host: Fedora 31, GPUs: GeForce GT 710 (host GPU nvidia driver) |
Hey @Nick-Harvey does the nvidia pod ever have status I'm wondering because of this log when you tried to run:
The output of If the pod is
|
I'm setting up minikube with the kvm2 driver and gpu passthrough as per https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/.
The nvidia-gpu-device-plugin pod is listed as running, whereas the second pod is stuck on Init (paused). I'm guessing this is the same issue @Nick-Harvey encountered.
Some additional info on the pods: `kubectl describe po nvidia-gpu-device-plugin-57n2l -n kube-system`
`kubectl describe po nvidia-driver-installer-c5bvd -n kube-system`
Any suggestions on how to debug this further? |
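One way to dig further into a pod stuck on Init is to read the logs of its init containers directly. The sketch below is only an illustration: `init_logs` is a hypothetical helper name, and it assumes the pod lives in `kube-system` as in the `kubectl describe` commands above. It relies only on the standard `kubectl get -o jsonpath` and `kubectl logs -c` options.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of minikube or kubectl):
# print the logs of every init container of a pod.
# Usage: init_logs <pod-name> [namespace]
init_logs() {
  local pod="$1" ns="${2:-kube-system}" name
  # List the init container names declared in the pod spec.
  for name in $(kubectl get pod "$pod" -n "$ns" \
      -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "=== init container: $name ==="
    # A container that has not started yet has no logs; keep going.
    kubectl logs "$pod" -n "$ns" -c "$name" || true
  done
}
```

For the pod named above, this would be called as `init_logs nvidia-driver-installer-c5bvd`. Whether the installer's own log file is reachable this way depends on what the image writes to stdout, which is not confirmed in this thread.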
Hello @fvdnabee, did you find a solution? I'm having the same issue as you. |
@ysow In my experience, GPU passthrough does not work with minikube's kvm2 driver. I managed to get GPU support working in minikube by using the none driver, as explained here: https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/#using-the-none-driver |
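Whichever driver is used, a quick way to confirm that the cluster actually sees a GPU is to look at the `nvidia.com/gpu` resource each node reports as allocatable. This is a minimal sketch: `gpu_allocatable` is a hypothetical helper name, and it assumes the device plugin is already running and that the cluster was started along the lines of the linked tutorial.

```shell
#!/usr/bin/env bash
# Assumption: the cluster was started beforehand, for example with
#   sudo minikube start --driver=none
# on a host that already has a working NVIDIA driver.

# Hypothetical helper: show how many GPUs each node reports as allocatable.
# Nodes without the device plugin show <none> in the GPU column.
gpu_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

A `<none>` in the GPU column would suggest the device plugin has not registered the GPU with the kubelet yet.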
Same problem here, I can start minikube:
but I get the same errors as above. Minikube version 1.11.0. With docker it works:
configuration:
@fvdnabee I can't get it running, not even with the none driver :( <<<---- SOLVED! I can run it using the none driver. |
Hey @Nick-Harvey does this work with the none driver for you? |
The none driver workaround is just that, a workaround. I suspect this should also work with the docker and kvm2 drivers. |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hello, |
@skol101: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
We don't currently have the bandwidth to properly maintain the nvidia addon, but would be happy to accept PRs to fix it. Reopening for visibility. |
How can I submit a PR if the source Docker is located in a different repo? The driver version is set in this file https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/nvidia-driver-installer/minikube/entrypoint.sh |
I got it, the issue is that RTX30X0 aren't supported in vGPU mode https://docs.nvidia.com/grid/latest/grid-vgpu-release-notes-generic-linux-kvm/index.html#updates-release-update-1 Bloody nvidia. |
Hey @skol101, how did you figure out it's due to grid/vGPU being unsupported? I know for sure that the RTX 30 series supports passthrough for a single VM, like for KVM. |
Could it be because vGPU is not supported for the GeForce series? |
The issue is in the nvidia-driver-installer Docker image. Found using the following command: When a container is created from the image, it runs through the driver install process, and this process does not find python3 in the container. See also the following issue: #15123 From cluster-info dump:
I installed the python3.8 package using apt inside a container created from the image, and symlinked the python3.8 binary to python3 in /usr/bin, but this did not help. I was also not sure how to "update" the modified image so that minikube would pick it up. UPDATE: I was finally able to update the nvidia-driver-installer image, and minikube used it for Pod creation.
Meaning:
Device
Drivers are not loaded on host, I "isolated" the GPU:
Kernel of Minikube tells this:
So, KVM2 provides the GPU it seems, but the kernel drivers can not use it. |
The exact command to reproduce the issue:
Following the instructions on https://minikube.sigs.k8s.io/docs/tutorials/nvidia_gpu/
Host: Fedora 31
Kernel: 5.5.7-200.fc31.x86_64
Cuda Drivers Host: 10.2
Nvidia Driver Host: Driver Version: 440.64
GPUs:
Minikube start:
minikube start --vm-driver kvm2 --kvm-gpu --cpus=12 --memory=25480
Minikube Addons:
minikube addons enable nvidia-gpu-device-plugin
This will fail initially; just edit the `dc/nvidia-gpu-device-plugin` and increase the memory to 100Mi, and it'll start fine.
The full output of the command that failed:
minikube addons enable nvidia-driver-installer
The container fails to start, and once you fetch the logs, it will return:
The output of the `minikube logs` command:
The minikube logs don't say much, but here they are:
minikube-logs.txt
Logs from the nvidia-installer itself:
nvidia-driver-installer-logs.txt
The operating system version: