
nvidia-driver-daemonset, nvidia-container-toolkit-daemonset and nvidia-device-plugin-daemonset not added. #401

nonpolarity opened this issue Aug 30, 2022 · 4 comments


nonpolarity commented Aug 30, 2022

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
    Yes
  • Are you running Kubernetes v1.13+?
    1.25.0-00
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
    No docker, but containerd
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
u18-1:~$ lsmod | grep ipmi_msghandler
ipmi_msghandler       102400  1 ipmi_devintf
u18-1:~$ lsmod | grep i2c
u18-1:~$
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
    Yes

1. Issue or feature description

The nvidia-driver-daemonset, nvidia-container-toolkit-daemonset, and nvidia-device-plugin-daemonset are not created.

2. Steps to reproduce the issue

Whether the installation is done online or offline, these daemonsets are never created:

helm fetch https://helm.ngc.nvidia.com/nvidia/charts/gpu-operator-v1.11.1.tgz
curl -sO https://raw.githubusercontent.com/NVIDIA/gpu-operator/v1.11.1/deployments/gpu-operator/values.yaml
helm install --wait gpu-operator -n gpu-operator --create-namespace gpu-operator-v1.11.1.tgz -f values.yaml

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
u18-1:~$ kubectl get pods --all-namespaces
NAMESPACE      NAME                                                          READY   STATUS             RESTARTS        AGE
gpu-operator   gpu-operator-56c9cf6799-8qbdv                                 1/1     Running            2 (7m9s ago)    14m
gpu-operator   gpu-operator-node-feature-discovery-master-65c9bd48c4-ssqsm   1/1     Running            1 (7m10s ago)   14m
gpu-operator   gpu-operator-node-feature-discovery-worker-lnfpx              1/1     Running            1 (7m10s ago)   14m
gpu-operator   gpu-operator-node-feature-discovery-worker-x7jj5              0/1     CrashLoopBackOff   6 (2m16s ago)   14m
kube-system    calico-kube-controllers-58dbc876ff-drqqb                      1/1     Running            2 (7m10s ago)   79m
kube-system    calico-node-8nm5z                                             1/1     Running            2 (7m10s ago)   79m
kube-system    calico-node-wlcfd                                             1/1     Running            1 (52m ago)     71m
kube-system    coredns-565d847f94-bpgjm                                      1/1     Running            2 (7m10s ago)   79m
kube-system    coredns-565d847f94-gsrz2                                      1/1     Running            2 (7m10s ago)   79m
kube-system    etcd-wechen3-u18-1                                            1/1     Running            2 (7m10s ago)   79m
kube-system    kube-apiserver-wechen3-u18-1                                  1/1     Running            2 (7m10s ago)   79m
kube-system    kube-controller-manager-wechen3-u18-1                         1/1     Running            2 (7m10s ago)   79m
kube-system    kube-proxy-6z769                                              1/1     Running            2 (7m10s ago)   79m
kube-system    kube-proxy-w7m2q                                              1/1     Running            1 (52m ago)     71m
kube-system    kube-scheduler-wechen3-u18-1                                  1/1     Running            2 (7m10s ago)   79m
u18-1:~$
  • kubernetes daemonset status: kubectl get ds --all-namespaces
u18-1:~$ kubectl get ds --all-namespaces
NAMESPACE      NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
gpu-operator   gpu-operator-node-feature-discovery-worker   2         2         1       2            1           <none>                   19m
kube-system    calico-node                                  2         2         2       2            2           kubernetes.io/os=linux   83m
kube-system    kube-proxy                                   2         2         2       2            2           kubernetes.io/os=linux   84m
u18-1:~$
  • If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME
u18-1:~$ kubectl get ds --all-namespaces
NAMESPACE      NAME                                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
gpu-operator   gpu-operator-node-feature-discovery-worker   2         2         1       2            1           <none>                   15m
kube-system    calico-node                                  2         2         2       2            2           kubernetes.io/os=linux   80m
kube-system    kube-proxy                                   2         2         2       2            2           kubernetes.io/os=linux   81m
u18-1:~$
  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME
u18-1:~$ kubectl describe pod -n gpu-operator gpu-operator-node-feature-discovery-worker-x7jj5
Name:             gpu-operator-node-feature-discovery-worker-x7jj5
Namespace:        gpu-operator
Priority:         0
Service Account:  node-feature-discovery
Node:             wechen3-u18-2/10.0.0.5
Start Time:       Tue, 30 Aug 2022 04:43:01 +0000
Labels:           app.kubernetes.io/instance=gpu-operator
                 app.kubernetes.io/name=node-feature-discovery
                 controller-revision-hash=555d8fdb59
                 pod-template-generation=1
                 role=worker
Annotations:      cni.projectcalico.org/containerID: 10a3392f8d7921ae4005079ae160438a2f309b7887350417ea3f172d33c40acb
                 cni.projectcalico.org/podIP: 192.168.75.89/32
                 cni.projectcalico.org/podIPs: 192.168.75.89/32
Status:           Running
IP:               192.168.75.89
IPs:
 IP:           192.168.75.89
Controlled By:  DaemonSet/gpu-operator-node-feature-discovery-worker
Containers:
 worker:
   Container ID:  containerd://c5c2e4cc79ff0d17db7de10d1870f6c9d69fc73f68b36cfd12c03c5767465d01
   Image:         k8s.gcr.io/nfd/node-feature-discovery:v0.10.1
   Image ID:      k8s.gcr.io/nfd/node-feature-discovery@sha256:4aebf17c8b72ee91cb468a6f21dd9f0312c1fcfdf8c86341f7aee0ec2d5991d7
   Port:          <none>
   Host Port:     <none>
   Command:
     nfd-worker
   Args:
     --server=gpu-operator-node-feature-discovery-master:8080
   State:          Waiting
     Reason:       CrashLoopBackOff
   Last State:     Terminated
     Reason:       Error
     Exit Code:    1
     Started:      Tue, 30 Aug 2022 04:54:36 +0000
     Finished:     Tue, 30 Aug 2022 04:55:36 +0000
   Ready:          False
   Restart Count:  6
   Environment:
     NODE_NAME:   (v1:spec.nodeName)
   Mounts:
     /etc/kubernetes/node-feature-discovery from nfd-worker-conf (ro)
     /etc/kubernetes/node-feature-discovery/features.d/ from features-d (ro)
     /etc/kubernetes/node-feature-discovery/source.d/ from source-d (ro)
     /host-boot from host-boot (ro)
     /host-etc/os-release from host-os-release (ro)
     /host-sys from host-sys (ro)
     /host-usr/lib from host-usr-lib (ro)
     /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-prwhl (ro)
Conditions:
 Type              Status
 Initialized       True
 Ready             False
 ContainersReady   False
 PodScheduled      True
Volumes:
 host-boot:
   Type:          HostPath (bare host directory volume)
   Path:          /boot
   HostPathType:
 host-os-release:
   Type:          HostPath (bare host directory volume)
   Path:          /etc/os-release
   HostPathType:
 host-sys:
   Type:          HostPath (bare host directory volume)
   Path:          /sys
   HostPathType:
 host-usr-lib:
   Type:          HostPath (bare host directory volume)
   Path:          /usr/lib
   HostPathType:
 source-d:
   Type:          HostPath (bare host directory volume)
   Path:          /etc/kubernetes/node-feature-discovery/source.d/
   HostPathType:
 features-d:
   Type:          HostPath (bare host directory volume)
   Path:          /etc/kubernetes/node-feature-discovery/features.d/
   HostPathType:
 nfd-worker-conf:
   Type:      ConfigMap (a volume populated by a ConfigMap)
   Name:      gpu-operator-node-feature-discovery-worker-conf
   Optional:  false
 kube-api-access-prwhl:
   Type:                    Projected (a volume that contains injected data from multiple sources)
   TokenExpirationSeconds:  3607
   ConfigMapName:           kube-root-ca.crt
   ConfigMapOptional:       <nil>
   DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/master:NoSchedule
                            node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                            node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                            node.kubernetes.io/not-ready:NoExecute op=Exists
                            node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                            node.kubernetes.io/unreachable:NoExecute op=Exists
                            node.kubernetes.io/unschedulable:NoSchedule op=Exists
                            nvidia.com/gpu=present:NoSchedule
Events:
 Type     Reason     Age                  From               Message
 ----     ------     ----                 ----               -------
 Normal   Scheduled  17m                  default-scheduler  Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-x7jj5 to wechen3-u18-2
 Normal   Pulled     11m (x5 over 17m)    kubelet            Container image "k8s.gcr.io/nfd/node-feature-discovery:v0.10.1" already present on machine
 Normal   Created    11m (x5 over 17m)    kubelet            Created container worker
 Normal   Started    11m (x5 over 17m)    kubelet            Started container worker
 Warning  BackOff    2m7s (x37 over 15m)  kubelet            Back-off restarting failed container
u18-1:~$
  • Output of running a container on the GPU machine: docker run -it alpine echo foo
    NA
  • Docker configuration file: cat /etc/docker/daemon.json
    NA
  • Docker runtime configuration: docker info | grep runtime
    NA
  • NVIDIA shared directory: ls -la /run/nvidia
    NA
  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit
    NA
  • NVIDIA driver directory: ls -la /run/nvidia/driver
    NA
  • kubelet logs journalctl -u kubelet > kubelet.logs
@nonpolarity (Author) commented:

  • kubelet logs journalctl -u kubelet > kubelet.logs
    kubelet.log

@shivamerla (Contributor) commented:

@nonpolarity this looks like a CNI issue: the NFD worker pod is unable to communicate with the NFD master. The GPU Operator requires certain PCI labels applied by NFD before it will deploy its operands.

gpu-operator   gpu-operator-node-feature-discovery-worker-x7jj5              0/1     CrashLoopBackOff   6 (2m16s ago)   14m
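The worker→master connectivity failure can be confirmed with a few diagnostic commands. This is a sketch assuming a live cluster; the namespace, pod, and Service names are taken from the output earlier in this issue:

```shell
# Logs of the crashing NFD worker usually show the gRPC connection failure
kubectl -n gpu-operator logs gpu-operator-node-feature-discovery-worker-x7jj5

# The worker dials the master through this Service (see the --server arg above);
# verify the Service exists and has endpoints backing it
kubectl -n gpu-operator get svc gpu-operator-node-feature-discovery-master
kubectl -n gpu-operator get endpoints gpu-operator-node-feature-discovery-master

# Calico health on the affected node is a common culprit for cross-node pod traffic
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
```

If the endpoints list is empty or the worker logs show connection timeouts to the master's cluster IP, the problem is in the pod network rather than in the GPU Operator itself.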

@everflux commented:

Kubernetes 1.25.0 no longer serves the RuntimeClass API version node.k8s.io/v1beta1, as stated in https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25

Solution: migrate manifests and API clients to the node.k8s.io/v1 API version, which has been available since v1.20 (roughly two years).
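The migration itself is mechanical, since RuntimeClass kept the same schema across both API versions; only the apiVersion field changes. A minimal sketch (the file path and class name here are hypothetical, not taken from the operator's actual manifests):

```shell
# Hypothetical RuntimeClass manifest still using the deprecated API version
cat > /tmp/runtimeclass.yaml <<'EOF'
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# RuntimeClass has the same fields in v1beta1 and v1, so the migration is
# a one-line apiVersion rewrite
sed -i 's#node.k8s.io/v1beta1#node.k8s.io/v1#' /tmp/runtimeclass.yaml

grep apiVersion /tmp/runtimeclass.yaml
# → apiVersion: node.k8s.io/v1
```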

@shivamerla (Contributor) commented:

@everflux note that the RuntimeClass issue is not related to the problem reported here, since none of the components were deployed in the first place. However, the RuntimeClass issue would have surfaced next on Kubernetes 1.25. The fix for the RuntimeClass API change is staged for the next release of the operator, due by the end of this month.
