
error code CUDA driver version is insufficient for CUDA runtime version in v22.9.0 #415

Closed
carlwang87 opened this issue Oct 10, 2022 · 20 comments



carlwang87 commented Oct 10, 2022

The issue still reproduces with gpu-operator v22.9.0.

kubectl --kubeconfig -n gpu logs cuda-vectoradd

Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]

Environment information

OS Version: Red Hat Enterprise Linux release 8.4
kernel: 4.18.0-305.el8.x86_64

K3S Version: v1.24.3+k3s1

GPU Operator Version: v22.9.0
CUDA Version: 11.7.1-base-ubi8

Driver Pre-installed: No
Driver Version: 515.65.01-rhel8.4

Container-Toolkit Pre-installed: No
Container-Toolkit Version: v1.11.0-ubi8

GPU Type: Tesla P100
cuda-sample: cuda-sample:vectoradd-cuda11.7.1-ubi8

config.toml content
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["docker-runc", "runc"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

On the host, /etc/nvidia-container-runtime/host-files-for-container.d does not exist.

cuda-vectoradd pod yaml

cat << EOF | kubectl --kubeconfig /work/k3s.yaml create -n hsc-gpu -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia   # <<< added
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8"
    resources:
      limits:
         nvidia.com/gpu: 1
EOF

When I add runtimeClassName: nvidia to the Pod spec, it works.
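For context, that field refers to a RuntimeClass object along these lines (a minimal sketch only; the GPU Operator's container-toolkit normally creates an equivalent object when CONTAINERD_RUNTIME_CLASS is set to nvidia):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# handler must match the runtime name registered in containerd's config.toml
handler: nvidia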

Related issue: #408

Does gpu-operator support running on a k3s cluster?

@shivamerla @cdesiniotis Could you please help me out? Thank you very much.

@carlwang87 (Author)

Image list:

                "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0",
                "nvcr.io/nvidia/gpu-operator:v22.9.0",
                "nvcr.io/nvidia/cuda:11.7.1-base-ubi8",
                "nvcr.io/nvidia/driver:515.65.01-rhel8.4",
                "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.4.2",
                "nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubi8",
                "nvcr.io/nvidia/k8s-device-plugin:v0.12.3-ubi8",
                "nvcr.io/nvidia/cloud-native/dcgm:3.0.4-1-ubi8",
                "nvcr.io/nvidia/k8s/dcgm-exporter:3.0.4-3.0.0-ubi8",
                "nvcr.io/nvidia/gpu-feature-discovery:v0.6.2-ubi8",
                "nvcr.io/nvidia/cloud-native/k8s-mig-manager:v0.5.0-ubi8",
                "nvcr.io/nvidia/cloud-native/vgpu-device-manager:v0.2.0",
                "nvcr.io/nvidia/kubevirt-gpu-device-plugin:v1.2.1",
                "k8s.gcr.io/nfd/node-feature-discovery:v0.10.1"


shivamerla commented Oct 10, 2022

@carlwang87 For running the cuda-vectorAdd sample, use this image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.0. This is the image we have fixed (and use) with the GPU Operator to run the sample. K3s needs a custom containerd config set up via the container-toolkit. Please find more details in this section for K3s: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd


carlwang87 commented Oct 11, 2022

Following the toolkit configuration from the document, it still failed.

cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

[plugins.cri.containerd.runtimes."nvidia-experimental"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia-experimental".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

The content is not the same as in the document at https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-configuration-for-runtime-containerd:

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

@shivamerla

@SakuraAxy

With the k3s 1.24 I used, I also ran into this: config.toml was not modified. However, when I point CONTAINERD_CONFIG at config.toml.tmpl, it works; that file is created and modified. But after restarting k3s, this file overwrites config.toml, and the result then lacks many configurations that k3s itself needs.

As far as I know, k3s generates config.toml itself; if you need to customize it, you have to modify config.toml.tmpl (see the k3s docs on config.toml.tmpl).

By the way, this k3s is a single-node setup.

Did I do something wrong?

@shivamerla

@carlwang87 (Author)

@shivamerla

Without a pre-installed NVIDIA Container Toolkit or GPU driver, I followed the gpu-operator (v22.9.0) installation guide on k3s (v1.24.3+k3s1) and deployed the GPU Operator successfully, but when I ran the samples from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#running-sample-gpu-applications, they failed. I have to add runtimeClassName: nvidia to the pod spec, so I wonder how these samples ran successfully without runtimeClassName: nvidia on a k3s cluster. Have you tested these samples on a k3s cluster?

@shivamerla (Contributor)

@carlwang87 There seems to be a typo in the documentation. The boolean value for CONTAINERD_SET_AS_DEFAULT needs to be quoted, as shown here. Can you double-check whether this was done in your install?

helm install -n gpu-operator --create-namespace \
  nvidia/gpu-operator $HELM_OPTIONS \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml \
    --set toolkit.env[1].name=CONTAINERD_SOCKET \
    --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
    --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
    --set toolkit.env[2].value=nvidia \
    --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
    --set-string toolkit.env[3].value="true"


carlwang87 commented Oct 19, 2022

@shivamerla
I set it through values.yaml as below:

  toolkit:
    enabled: true
    repository: nvcr.io/nvidia/k8s
    image: container-toolkit
    version: v1.11.0-ubi8
    env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

The same issue is discussed here:
k3s-io/k3s#4391

@shivamerla (Contributor)

Not sure why you still had to specify runtimeClassName in the sample pod. With CONTAINERD_SET_AS_DEFAULT enabled, we set default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml. Can you confirm that is set in your config.toml?
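A quick way to check (sketch, using the path above):

grep -n 'default_runtime_name' /var/lib/rancher/k3s/agent/etc/containerd/config.toml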

@carlwang87 (Author)

@shivamerla First of all, thank you for still helping me.

With CONTAINERD_SET_AS_DEFAULT enabled, we set default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml

Is default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml set manually or by the GPU Operator? I find it is not set by the GPU Operator. Do you mean we need to set it manually in config.toml?

@shivamerla (Contributor)

@carlwang87 It is set by the container-toolkit component deployed with the GPU Operator. Can you describe the container-toolkit pod to confirm these env vars are applied correctly? Also, please share the logs from that pod.


carlwang87 commented Oct 21, 2022

@shivamerla logs from pod nvidia-container-toolkit-daemonset-cwp28:

env: (screenshot of the container-toolkit pod environment variables, not reproduced here)

logs:

time="2022-10-20T12:36:25Z" level=info msg="Starting nvidia-toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Parsing arguments"
time="2022-10-20T12:36:25Z" level=info msg="Verifying Flags"
time="2022-10-20T12:36:25Z" level=info msg=Initializing
time="2022-10-20T12:36:25Z" level=info msg="Installing toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2022-10-20T12:36:25Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2022-10-20T12:36:25Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2022-10-20T12:36:25Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2022-10-20T12:36:25Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2022-10-20T12:36:25Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2022-10-20T12:36:25Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2022-10-20T12:36:25Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2022-10-20T12:36:25Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container-go.so.1' => '/usr/lib64/libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/lib64/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/lib64/libnvidia-container-go.so.1.11.0' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.11.0'"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2022-10-20T12:36:25Z" level=info msg="Finding library libnvidia-ml.so (root=/run/nvidia/driver)"
time="2022-10-20T12:36:25Z" level=info msg="Checking library candidate '/run/nvidia/driver/usr/lib64/libnvidia-ml.so'"
time="2022-10-20T12:36:25Z" level=info msg="Resolved link: '/run/nvidia/driver/usr/lib64/libnvidia-ml.so' => '/run/nvidia/driver/usr/lib64/libnvidia-ml.so.515.65.01'"
time="2022-10-20T12:36:25Z" level=info msg="Using library root /run/nvidia/driver/usr/lib64"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable 'nvidia-container-runtime.experimental' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing 'nvidia-container-runtime.experimental' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.experimental'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2022-10-20T12:36:25Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2022-10-20T12:36:25Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2022-10-20T12:36:25Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2022-10-20T12:36:25Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2022-10-20T12:36:25Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["docker-runc", "runc"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
time="2022-10-20T12:36:25Z" level=info msg="Setting up runtime"
time="2022-10-20T12:36:25Z" level=info msg="Starting 'setup' for containerd"
time="2022-10-20T12:36:25Z" level=info msg="Parsing arguments: [/usr/local/nvidia/toolkit]"
time="2022-10-20T12:36:25Z" level=info msg="Successfully parsed arguments"
time="2022-10-20T12:36:25Z" level=info msg="Loading config: /runtime/config-dir/config.toml"
time="2022-10-20T12:36:25Z" level=info msg="Successfully loaded config"
time="2022-10-20T12:36:25Z" level=info msg="Config version: 1"
time="2022-10-20T12:36:25Z" level=warning msg="Support for containerd config version 1 is deprecated"
time="2022-10-20T12:36:25Z" level=info msg="Updating config"
time="2022-10-20T12:36:25Z" level=info msg="Successfully updated config"
time="2022-10-20T12:36:25Z" level=info msg="Flushing config"
time="2022-10-20T12:36:25Z" level=info msg="Successfully flushed config"
time="2022-10-20T12:36:25Z" level=info msg="Sending SIGHUP signal to containerd"
time="2022-10-20T12:36:25Z" level=info msg="Successfully signaled containerd"
time="2022-10-20T12:36:25Z" level=info msg="Completed 'setup' for containerd"
time="2022-10-20T12:36:25Z" level=info msg="Waiting for signal"
[root@localhost ~]#

From the logs, I see there is a warning: "Support for containerd config version 1 is deprecated".
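A quick check of which schema the file declares (sketch; containerd treats a missing version field as v1):

# expect "version = 2"; "version = 1" or no version line means the deprecated v1 schema
head -n 5 /var/lib/rancher/k3s/agent/etc/containerd/config.toml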

logs from nfd worker: (screenshot not reproduced here)

@SakuraAxy

What version of k3s does the GPU Operator team use when testing this? The 1.24 version we use has this problem. In addition, when I do not set CONTAINERD_CONFIG, /etc/containerd/config.toml is changed by default. That change is correct in itself, but it does not meet our expectations, because the k3s containerd configuration file is at /var/lib/rancher/k3s/agent/etc/containerd/config.toml.

@shivamerla

@carlwang87 (Author)

@shivamerla
I have set up a k3s cluster environment with a GPU. The issue reproduces in this cluster: there is no default_runtime_name=nvidia in /var/lib/rancher/k3s/agent/etc/containerd/config.toml.

K3S Version:
v1.25.3+k3s1

GPU Operator:
v22.9.0

So, could you please help me find the root cause in this environment?

ssh: 104.207.150.41
user: root
password: 8gF?cnkLnnh9gLh]

GPU Pod:

kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml  -n gpu-operator get pod

(screenshot of the pod list, not reproduced here)

You can do anything in this environment. Please help me, thank you very much.

@shivamerla (Contributor)

@carlwang87 It looks like this config uses the v1 containerd format. We need to pass the env CONTAINERD_USE_LEGACY_CONFIG as "true"; by default, container-toolkit assumes the v2 format. After setting it, I see that the default runtime is set to nvidia (a helm sketch follows the config below).

[root@vultr ~]# cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml
version = 1

[plugins]

  [plugins.cri]
    enable_selinux = false
    enable_unprivileged_icmp = false
    enable_unprivileged_ports = false
    sandbox_image = "rancher/mirrored-pause:3.6"
    stream_server_address = "127.0.0.1"
    stream_server_port = "10010"

    [plugins.cri.cni]
      bin_dir = "/var/lib/rancher/k3s/data/2ef87ff954adbb390309ce4dc07500f29c319f84feec1719bfb5059c8808ec6a/bin"
      conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"

    [plugins.cri.containerd]
      disable_snapshot_annotations = true
      snapshotter = "overlayfs"

      [plugins.cri.containerd.default_runtime]
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = "io.containerd.runc.v2"

        [plugins.cri.containerd.default_runtime.options]
          BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
          Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

      [plugins.cri.containerd.runtimes]

        [plugins.cri.containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
            Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
            SystemdCgroup = false

        [plugins.cri.containerd.runtimes.nvidia-experimental]
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
            Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
            SystemdCgroup = false

        [plugins.cri.containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.runc.options]
            SystemdCgroup = false

  [plugins.opt]
    path = "/var/lib/rancher/k3s/agent/containerd"
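A sketch of passing that env via helm, assuming a release named gpu-operator and the four k3s-related entries from the earlier command at indices 0-3:

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
  --reuse-values \
  --set toolkit.env[4].name=CONTAINERD_USE_LEGACY_CONFIG \
  --set-string toolkit.env[4].value="true"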

@Francesko90

I had the same error with the GPU Operator example, but if I try the following example, everything works fine:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: compute,utility
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1


klueska commented Jan 3, 2023

With k3s you need to update the config.toml.tmpl with the default runtime, not config.toml directly.
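For example (a sketch only; paths as used elsewhere in this thread, and the exact template handling depends on the k3s version):

# Seed the template from the generated file, then make nvidia the default runtime.
cd /var/lib/rancher/k3s/agent/etc/containerd
cp config.toml config.toml.tmpl
# Edit config.toml.tmpl: under [plugins."io.containerd.grpc.v1.cri".containerd]
# add:  default_runtime_name = "nvidia"
# Then restart k3s so containerd regenerates config.toml from the template:
systemctl restart k3s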

@xmolitann

(Quoting @Francesko90's example above.)

Thank you, I was missing runtimeClassName: nvidia in my case.

@mihirsamdarshi

@shivamerla are the env vars below required when using k3s? I was having the same issue (k3s v1.27.2+k3s1) until I added them:

      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

@gseidlerhpe

Same issue with gpu-operator v23.3.2

@grapemix

The same issue continues on k3s v1.30.4+k3s1 + gpu-operator v24.6.2, with the GPU driver manually installed on the host first.

I've tried adding NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES to the test pod, and adding CONTAINERD_USE_LEGACY_CONFIG to the gpu-operator HelmRelease (HR).

Here is my setup (I followed https://docs.k3s.io/advanced#nvidia-container-runtime-support):
k3s config.toml:

# File generated by k3s. DO NOT EDIT. Use config.toml.tmpl instead.
version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/var/lib/rancher/k3s/agent/etc/containerd/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  SystemdCgroup = true

HR:

  values:
    operator:
      cleanupCRD: true
      upgradeCRD: true
    nfd:
      enabled: false
    psa:
      enabled: true
    driver:
      enabled: false
      repository: registry.skysolutions.fi/library/nvidia
      version: 550.90.07
    toolkit:
      enabled: true
      env:
        - name: CONTAINERD_CONFIG
          value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
        - name: CONTAINERD_SOCKET
          value: /run/k3s/containerd/containerd.sock
        - name: CONTAINERD_RUNTIME_CLASS
          value: nvidia
        - name: CONTAINERD_SET_AS_DEFAULT
          value: "true"
        - name: CONTAINERD_USE_LEGACY_CONFIG
          value: "true"
    devicePlugin:
      config:
        create: true
        name: time-slicing-config
        default: any
        data:
          any: |-
            version: v1
            flags:
              migStrategy: none
            sharing:
              timeSlicing:
                renameByDefault: false
                failRequestsGreaterThanOne: false
                resources:
                  - name: nvidia.com/gpu
                    replicas: 4

TEST POD:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.present
              operator: In
              values:
                - "true"

  #runtimeClassName: nvidia
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04"
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    resources:
      limits:
        nvidia.com/gpu: 1

Only manually adding runtimeClassName works.
