error code CUDA driver version is insufficient for CUDA runtime version in v22.9.0 #415
Comments
Images list
|
@carlwang87 For running the cuda-vectorAdd sample, use this image |
I followed the toolkit configuration in the document, but it still failed.
The content is not the same as shown in the documentation at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd
|
The k3s 1.24 I used hit the same thing: the config.toml file is not modified. However, when I change CONTAINERD_CONFIG to config.toml.tmpl, it works; that file is created and modified. But after restarting k3s, this file overwrites config.toml, so many necessary k3s configurations are lost. As far as I know, config.toml is generated by k3s itself; if you need to customize it, you have to modify config.toml.tmpl. By the way, k3s here is a single node. Did I do something wrong? |
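For reference, here is a minimal sketch of the gpu-operator Helm values that point the container-toolkit at k3s's own containerd files rather than /etc/containerd/config.toml. The paths are assumptions for a stock k3s layout, and whether to target config.toml or config.toml.tmpl is exactly the trade-off described in the comment above:

```yaml
# Sketch only: assumed paths for a default k3s install; adjust for your node.
toolkit:
  env:
    - name: CONTAINERD_CONFIG
      # Some setups point this at config.toml.tmpl instead, with the
      # overwrite caveat discussed in the comment above.
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
```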
Without a pre-installed NVIDIA Container Toolkit or GPU driver, I followed the gpu-operator (v22.9.0) installation guide on k3s (v1.24.3+k3s1) and deployed the GPU operator successfully. But when I ran the samples from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#running-sample-gpu-applications, they failed. I had to add runtimeClassName: nvidia to the pod spec, so I wonder how these samples could run successfully without runtimeClassName: nvidia on a k3s cluster. Have you tested these samples on a k3s cluster? |
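As context for the runtimeClassName question: that field only resolves if a matching RuntimeClass object exists in the cluster. The operator's container-toolkit normally creates one roughly like this minimal sketch (assuming the default handler name nvidia):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # must match the runtime entry the toolkit registers in containerd's config
```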
@carlwang87 There seems to be a typo in the documentation. The Boolean value for
|
@shivamerla
The same issue is discussed here: |
Not sure why you still had to specify |
@shivamerla First of all, thank you for still helping me.
|
@carlwang87 It is set by the container-toolkit component deployed with gpu-operator. Can you describe the |
@shivamerla Logs from the pod:
From the logs, I find there is a warning that Support for |
What version of k3s does the gpu-operator team use when testing this? The 1.24 version we use has this problem. In addition, when I do not set CONTAINERD_CONFIG, /etc/containerd/config.toml is changed by default. The change itself is correct, but it does not meet our expectations, because the k3s containerd configuration file is at /var/lib/rancher/k3s/agent/etc/containerd/config.toml |
@shivamerla K3s version: GPU Operator: So, could you please help me find the root cause in this environment?
GPU Pod:
You can do anything in this environment; please help me, thank you very much. |
@carlwang87 It looks like the v1 containerd config format is used in this config. We need to pass the env
|
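The truncated suggestion above is presumably about forcing the legacy (v1) containerd config schema. Below is a hedged sketch of how such an env var could be passed through Helm values; the variable name CONTAINERD_USE_LEGACY_CONFIG is taken from a later comment in this thread, and the value is an assumption:

```yaml
# Sketch only: ask the container-toolkit to write containerd's legacy (v1) config schema.
# Variable name taken from a later comment in this thread; value assumed.
toolkit:
  env:
    - name: CONTAINERD_USE_LEGACY_CONFIG
      value: "true"
```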
I had the same error with the GPU Operator example, but if I try the following example everything works fine:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: compute,utility
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
```
|
With k3s you need to update the |
Thank you, I was missing |
@shivamerla are the env vars required when using k3s? I was having the same issue (k3s v1.27.2+k3s1) until I added them:

```yaml
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: all
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: all
```
|
Same issue with gpu-operator v23.3.2 |
The same issue persists on k3s v1.30.4+k3s1 with gpu-operator v24.6.2 and the GPU driver manually installed on the host first. I've tried adding NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES to the test pod, and adding CONTAINERD_USE_LEGACY_CONFIG to the gpu-operator HR. Here is my setup (I followed https://docs.k3s.io/advanced#nvidia-container-runtime-support):
HR:
TEST POD:
Only manually adding runtimeClassName works. |
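If the goal is for plain pods to work without runtimeClassName, the usual approach is to have the toolkit register the nvidia runtime and mark it as containerd's default. A hedged values sketch follows; the env var names are the container-toolkit's knobs, and the values are assumptions for this setup:

```yaml
# Sketch only: register the "nvidia" runtime entry and make it containerd's default,
# so unmodified pods use it without a runtimeClassName.
toolkit:
  env:
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
```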
The issue is still reproduced in gpu-operator v22.9.0.
kubectl --kubeconfig -n gpu logs cuda-vectoradd
Environment information
config.toml content
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
On the host, /etc/nvidia-container-runtime/host-files-for-container.d is not found.
cuda-vectoradd pod yaml
When I add runtimeClassName: nvidia to the pod spec, it works.
Related issue: #408
Does the gpu-operator support a k3s cluster environment?
@shivamerla @cdesiniotis Could you please help me out? Thank you very much.