
nvidia-driver-daemonset, nvidia-container-toolkit-daemonset and nvidia-device-plugin-daemonset not added. #401

nonpolarity opened this issue Aug 30, 2022 · 4 comments


nonpolarity commented Aug 30, 2022

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
    Yes
  • Are you running Kubernetes v1.13+?
    1.25.0-00
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
    No docker, but containerd
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
u18-1:~$ lsmod | grep ipmi_msghandler
ipmi_msghandler       102400  1 ipmi_devintf
u18-1:~$ lsmod | grep i2c
u18-1:~$
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
    Yes

1. Issue or feature description

The nvidia-driver-daemonset, nvidia-container-toolkit-daemonset, and nvidia-device-plugin-daemonset are not created.

2. Steps to reproduce the issue

Whether the installation is done online or offline, these daemonsets are never created:

helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v1.11.1.tgz
curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.11.1/deployments/gpu-operator/values.yaml
helm install --wait gpu-operator -n gpu-operator --create-namespace gpu-operator-v1.11.1.tgz -f values.yaml

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
u18-1:~$ kubectl get pods --all-namespaces
NAMESPACE      NAME                                                          READY   STATUS             RESTARTS        AGE
gpu-operator   gpu-operator-56c9cf6799-8qbdv                                 1/1     Running            2 (7m9s ago)    14m
gpu-operator   gpu-operator-node-feature-discovery-master-65c9bd48c4-ssqsm   1/1     Running            1 (7m10s ago)   14m
gpu-operator   gpu-operator-node-feature-discovery-worker-lnfpx              1/1     Running            1 (7m10s ago)   14m
gpu-operator   gpu-operator-node-feature-discovery-worker-x7jj5              0/1     CrashLoopBackOff   6 (2m16s ago)   14m
kube-system    calico-kube-controllers-58dbc876ff-drqqb                      1/1     Running            2 (7m10s ago)   79m
kube-system    calico-node-8nm5z                                             1/1     Running            2 (7m10s ago)   79m
kube-system    calico-node-wlcfd                                             1/1     Running            1 (52m ago)     71m
kube-system    coredns-565d847f94-bpgjm                                      1/1     Running            2 (7m10s ago)   79m
kube-system    coredns-565d847f94-gsrz2                                      1/1     Running            2 (7m10s ago)   79m
kube-system    etcd-wechen3-u18-1                                            1/1     Running            2 (7m10s ago)   79m
kube-system    kube-apiserver-wechen3-u18-1                                  1/1     Running            2 (7m10s ago)   79m
kube-system    kube-controller-manager-wechen3-u18-1                         1/1     Running            2 (7m10s ago)   79m
kube-system    kube-proxy-6z769                                              1/1     Running            2 (7m10s ago)   79m
kube-system    kube-proxy-w7m2q                                              1/1     Running            1 (52m ago)     71m
kube-system    kube-scheduler-wechen3-u18-1                                  1/1     Running            2 (7m10s ago)   79m
u18-1:~$
  • kubernetes daemonset status: kubectl get ds --all-namespaces
u18-1:~$ kubectl get ds --all-namespaces
NAMESPACE      NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
gpu-operator   gpu-operator-node-feature-discovery-worker   2         2         1       2            1           <none>                   19m
kube-system    calico-node                                  2         2         2       2            2           kubernetes.io/os=linux   83m
kube-system    kube-proxy                                   2         2         2       2            2           kubernetes.io/os=linux   84m
u18-1:~$
  • If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME
u18-1:~$ kubectl get ds --all-namespaces
NAMESPACE      NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
gpu-operator   gpu-operator-node-feature-discovery-worker   2         2         1       2            1           <none>                   15m
kube-system    calico-node                                  2         2         2       2            2           kubernetes.io/os=linux   80m
kube-system    kube-proxy                                   2         2         2       2            2           kubernetes.io/os=linux   81m
u18-1:~$
  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
u18-1:~$ kubectl describe pod -n gpu-operator gpu-operator-node-feature-discovery-worker-x7jj5
Name:             gpu-operator-node-feature-discovery-worker-x7jj5
Namespace:        gpu-operator
Priority:         0
Service Account:  node-feature-discovery
Node:             wechen3-u18-2/10.0.0.5
Start Time:       Tue, 30 Aug 2022 04:43:01 +0000
Labels:           app.kubernetes.io/instance=gpu-operator
                 app.kubernetes.io/name=node-feature-discovery
                 controller-revision-hash=555d8fdb59
                 pod-template-generation=1
                 role=worker
Annotations:      cni.projectcalico.org/containerID: 10a3392f8d7921ae4005079ae160438a2f309b7887350417ea3f172d33c40acb
                 cni.projectcalico.org/podIP: 192.168.75.89/32
                 cni.projectcalico.org/podIPs: 192.168.75.89/32
Status:           Running
IP:               192.168.75.89
IPs:
 IP:           192.168.75.89
Controlled By:  DaemonSet/gpu-operator-node-feature-discovery-worker
Containers:
 worker:
   Container ID:  containerd://c5c2e4cc79ff0d17db7de10d1870f6c9d69fc73f68b36cfd12c03c5767465d01
   Image:         k8s.gcr.io/nfd/node-feature-discovery:v0.10.1
   Image ID:      k8s.gcr.io/nfd/node-feature-discovery@sha256:4aebf17c8b72ee91cb468a6f21dd9f0312c1fcfdf8c86341f7aee0ec2d5991d7
   Port:          <none>
   Host Port:     <none>
   Command:
     nfd-worker
   Args:
     --server=gpu-operator-node-feature-discovery-master:8080
   State:          Waiting
     Reason:       CrashLoopBackOff
   Last State:     Terminated
     Reason:       Error
     Exit Code:    1
     Started:      Tue, 30 Aug 2022 04:54:36 +0000
     Finished:     Tue, 30 Aug 2022 04:55:36 +0000
   Ready:          False
   Restart Count:  6
   Environment:
     NODE_NAME:   (v1:spec.nodeName)
   Mounts:
     /etc/kubernetes/node-feature-discovery from nfd-worker-conf (ro)
     /etc/kubernetes/node-feature-discovery/features.d/ from features-d (ro)
     /etc/kubernetes/node-feature-discovery/source.d/ from source-d (ro)
     /host-boot from host-boot (ro)
     /host-etc/os-release from host-os-release (ro)
     /host-sys from host-sys (ro)
     /host-usr/lib from host-usr-lib (ro)
     /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-prwhl (ro)
Conditions:
 Type              Status
 Initialized       True
 Ready             False
 ContainersReady   False
 PodScheduled      True
Volumes:
 host-boot:
   Type:          HostPath (bare host directory volume)
   Path:          /boot
   HostPathType:
 host-os-release:
   Type:          HostPath (bare host directory volume)
   Path:          /etc/os-release
   HostPathType:
 host-sys:
   Type:          HostPath (bare host directory volume)
   Path:          /sys
   HostPathType:
 host-usr-lib:
   Type:          HostPath (bare host directory volume)
   Path:          /usr/lib
   HostPathType:
 source-d:
   Type:          HostPath (bare host directory volume)
   Path:          /etc/kubernetes/node-feature-discovery/source.d/
   HostPathType:
 features-d:
   Type:          HostPath (bare host directory volume)
   Path:          /etc/kubernetes/node-feature-discovery/features.d/
   HostPathType:
 nfd-worker-conf:
   Type:      ConfigMap (a volume populated by a ConfigMap)
   Name:      gpu-operator-node-feature-discovery-worker-conf
   Optional:  false
 kube-api-access-prwhl:
   Type:                    Projected (a volume that contains injected data from multiple sources)
   TokenExpirationSeconds:  3607
   ConfigMapName:           kube-root-ca.crt
   ConfigMapOptional:       <nil>
   DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                            node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                            node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                            node.kubernetes.io/not-ready:NoExecute op=Exists
                            node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                            node.kubernetes.io/unreachable:NoExecute op=Exists
                            node.kubernetes.io/unschedulable:NoSchedule op=Exists
                            nvidia.com/gpu=present:NoSchedule
Events:
 Type     Reason     Age                  From               Message
 ----     ------     ----                 ----               -------
 Normal   Scheduled  17m                  default-scheduler  Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-x7jj5 to wechen3-u18-2
 Normal   Pulled     11m (x5 over 17m)    kubelet            Container image "k8s.gcr.io/nfd/node-feature-discovery:v0.10.1" already present on machine
 Normal   Created    11m (x5 over 17m)    kubelet            Created container worker
 Normal   Started    11m (x5 over 17m)    kubelet            Started container worker
 Warning  BackOff    2m7s (x37 over 15m)  kubelet            Back-off restarting failed container
u18-1:~$
  • Output of running a container on the GPU machine: docker run -it alpine echo foo
    NA
  • Docker configuration file: cat /etc/docker/daemon.json
    NA
  • Docker runtime configuration: docker info | grep runtime
    NA
  • NVIDIA shared directory: ls -la /run/nvidia
    NA
  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
    NA
  • NVIDIA driver directory: ls -la /run/nvidia/driver
    NA
  • kubelet logs journalctl -u kubelet > kubelet.logs
@nonpolarity (Author) commented:

  • kubelet logs journalctl -u kubelet > kubelet.logs
    kubelet.log

@shivamerla (Contributor) commented:

@nonpolarity this looks like a CNI issue: the NFD worker pod is unable to communicate with the NFD master. The GPU Operator requires certain PCI labels applied by NFD before it will deploy its operands.

gpu-operator   gpu-operator-node-feature-discovery-worker-x7jj5              0/1     CrashLoopBackOff   6 (2m16s ago)   14m
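The worker→master connectivity failure can be confirmed with a few diagnostic commands. This is a sketch assuming a live cluster; the namespace, pod, and Service names are taken from the output earlier in this issue:

```shell
# Logs of the crashing NFD worker usually show the gRPC connection failure
kubectl -n gpu-operator logs gpu-operator-node-feature-discovery-worker-x7jj5

# The worker dials the master through this Service (see the --server arg above);
# verify the Service exists and has endpoints backing it
kubectl -n gpu-operator get svc gpu-operator-node-feature-discovery-master
kubectl -n gpu-operator get endpoints gpu-operator-node-feature-discovery-master

# Calico health on the affected node is a common culprit for cross-node pod traffic
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
```

If the endpoints list is empty or the worker logs show connection timeouts to the master's cluster IP, the problem is in the pod network rather than in the GPU Operator itself.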

@everflux commented:

Kubernetes 1.25.0 no longer serves the RuntimeClass API version node.k8s.io/v1beta1, as stated in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25

Solution: migrate manifests and API clients to the node.k8s.io/v1 API version, which has been available since v1.20 (roughly two years).
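The migration itself is mechanical, since RuntimeClass kept the same schema across both API versions; only the apiVersion field changes. A minimal sketch (the file path and class name here are hypothetical, not taken from the operator's actual manifests):

```shell
# Hypothetical RuntimeClass manifest still using the deprecated API version
cat > /tmp/runtimeclass.yaml <<'EOF'
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# RuntimeClass has the same fields in v1beta1 and v1, so the migration is
# a one-line apiVersion rewrite
sed -i 's#node.k8s.io/v1beta1#node.k8s.io/v1#' /tmp/runtimeclass.yaml

grep apiVersion /tmp/runtimeclass.yaml
# → apiVersion: node.k8s.io/v1
```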

@shivamerla (Contributor) commented:

@everflux note that the RuntimeClass issue is not related to the problem reported here, since none of the components were deployed in the first place. However, the RuntimeClass issue would have surfaced next on Kubernetes 1.25. The fix for the RuntimeClass API change is staged for the next release of the operator, due by the end of this month.
