
Helm install of GPU operator doesn't run daemonset containers and validator containers #434

Closed
premmotgi opened this issue Nov 7, 2022 · 6 comments

premmotgi commented Nov 7, 2022

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
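
A quick way to verify most of the checklist items on a node (a sketch; the version commands depend on your distribution and container runtime):

    # kernel modules listed in the checklist
    lsmod | grep -E 'i2c_core|ipmi_msghandler'

    # Kubernetes and runtime versions
    kubectl version
    docker version --format '{{.Server.Version}}'    # or: crictl version

    # confirm the ClusterPolicy instance is present
    kubectl describe clusterpolicies --all-namespaces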

1. Issue or feature description

Once the helm install command for the GPU operator is run, only the discovery pods are running; the gpu-operator daemonset pods and validator pods are not.

2. Steps to reproduce the issue

  1. Run the helm install command to install the latest gpu-operator on RHEL 8.4 nodes (a typical invocation is sketched below)
  2. Check the running pods using kubectl get pods -n gpu-operator
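
A typical invocation, for reference (a sketch following the chart's documented install steps; the exact chart version and values used in this report are not shown):

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install --wait --generate-name \
        -n gpu-operator --create-namespace \
        nvidia/gpu-operator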

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
    [root@control01 ~]# kubectl get pods -A
    NAMESPACE      NAME                                                           READY   STATUS             RESTARTS   AGE
    default        my-release-enterprise-steam-867df478d5-4x296                  1/1     Running            0          4d23h
    gpu-operator   gpu-operator-7878f5869-mfnzc                                  1/1     Running            0          4d3h
    gpu-operator   gpu-operator-node-feature-discovery-master-59b4b67f4f-nsgpk   1/1     Running            0          4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-7plcj              0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-8b9kq              0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-hh2zn              0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-r5jlv              0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-s8rlb              0/1     CrashLoopBackOff   974        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-sc9x2              0/1     CrashLoopBackOff   975        4d3h
    gpu-operator   gpu-operator-node-feature-discovery-worker-v9j7c              0/1     CrashLoopBackOff   975        4d3h

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error state or pending state: kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error state or pending state: kubectl logs -n NAMESPACE POD_NAME
    [root@control01 ~]# kubectl logs gpu-operator-node-feature-discovery-worker-v9j7c -n gpu-operator
    I1107 21:48:04.503798 1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
    I1107 21:48:04.503857 1 nfd-worker.go:156] NodeName: 'worker03.robin.ai.lab'
    I1107 21:48:04.504345 1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
    I1107 21:48:04.504407 1 nfd-worker.go:461] worker (re-)configuration successfully completed
    I1107 21:48:04.504441 1 base.go:126] connecting to nfd-master at gpu-operator-node-feature-discovery-master:8080 ...
    I1107 21:48:04.504475 1 component.go:36] [core]parsed scheme: ""
    I1107 21:48:04.504480 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
    I1107 21:48:04.504495 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{gpu-operator-node-feature-discovery-master:8080 0 }] }
    I1107 21:48:04.504503 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
    I1107 21:48:04.504507 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
    I1107 21:48:04.504527 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
    I1107 21:48:04.504551 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
    I1107 21:48:04.504594 1 component.go:36] [core]Channel Connectivity change to CONNECTING
    W1107 21:48:24.505843 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
    I1107 21:48:24.505867 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:24.505890 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:25.505942 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
    I1107 21:48:25.505963 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
    I1107 21:48:25.506042 1 component.go:36] [core]Channel Connectivity change to CONNECTING
    W1107 21:48:45.506557 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
    I1107 21:48:45.506586 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:45.506611 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
    I1107 21:48:47.031218 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
    I1107 21:48:47.031247 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
    I1107 21:48:47.031372 1 component.go:36] [core]Channel Connectivity change to CONNECTING
    I1107 21:49:04.505752 1 component.go:36] [core]Channel Connectivity change to SHUTDOWN
    I1107 21:49:04.505778 1 component.go:36] [core]Subchannel Connectivity change to SHUTDOWN
    F1107 21:49:04.505796 1 main.go:64] failed to connect: context deadline exceeded

[root@control01 ~]# kubectl logs gpu-operator-node-feature-discovery-worker-7plcj -n gpu-operator
I1107 21:59:22.763062 1 nfd-worker.go:155] Node Feature Discovery Worker v0.10.1
I1107 21:59:22.763113 1 nfd-worker.go:156] NodeName: 'worker08.robin.ai.lab'
I1107 21:59:22.763500 1 nfd-worker.go:423] configuration file "/etc/kubernetes/node-feature-discovery/nfd-worker.conf" parsed
I1107 21:59:22.763555 1 nfd-worker.go:461] worker (re-)configuration successfully completed
I1107 21:59:22.763586 1 base.go:126] connecting to nfd-master at gpu-operator-node-feature-discovery-master:8080 ...
I1107 21:59:22.763618 1 component.go:36] [core]parsed scheme: ""
I1107 21:59:22.763627 1 component.go:36] [core]scheme "" not registered, fallback to default scheme
I1107 21:59:22.763646 1 component.go:36] [core]ccResolverWrapper: sending update to cc: {[{gpu-operator-node-feature-discovery-master:8080 0 }] }
I1107 21:59:22.763662 1 component.go:36] [core]ClientConn switching balancer to "pick_first"
I1107 21:59:22.763666 1 component.go:36] [core]Channel switches to new LB policy "pick_first"
I1107 21:59:22.763682 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1107 21:59:22.763701 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1107 21:59:22.763784 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1107 21:59:42.765933 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
I1107 21:59:42.765964 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1107 21:59:42.766005 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1107 21:59:43.766068 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1107 21:59:43.766079 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1107 21:59:43.766123 1 component.go:36] [core]Channel Connectivity change to CONNECTING
W1107 22:00:03.766610 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...
I1107 22:00:03.766635 1 component.go:36] [core]Subchannel Connectivity change to TRANSIENT_FAILURE
I1107 22:00:03.766666 1 component.go:36] [core]Channel Connectivity change to TRANSIENT_FAILURE
I1107 22:00:05.644606 1 component.go:36] [core]Subchannel Connectivity change to CONNECTING
I1107 22:00:05.644626 1 component.go:36] [core]Subchannel picks a new address "gpu-operator-node-feature-discovery-master:8080" to connect
I1107 22:00:05.644723 1 component.go:36] [core]Channel Connectivity change to CONNECTING
I1107 22:00:22.765866 1 component.go:36] [core]Channel Connectivity change to SHUTDOWN
I1107 22:00:22.765903 1 component.go:36] [core]Subchannel Connectivity change to SHUTDOWN
F1107 22:00:22.765921 1 main.go:64] failed to connect: context deadline exceeded

  • Output of running a container on the GPU machine: docker run -it alpine echo foo
    [root@worker08 ~]# docker run -it alpine echo foo
    Unable to find image 'alpine:latest' locally
    latest: Pulling from library/alpine
    213ec9aee27d: Already exists
    Digest: sha256:bc41182d7ef5ffc53a40b044e725193bc10142a1243f395ee852a8d9730fc2ad
    Status: Downloaded newer image for alpine:latest
    foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs: journalctl -u kubelet > kubelet.logs

shivamerla (Contributor) commented

@premmotgi it looks like the NFD worker pods are not able to connect to the master pod.

W1107 21:59:42.765933 1 component.go:41] [core]grpc: addrConn.createTransport failed to connect to {gpu-operator-node-feature-discovery-master:8080 gpu-operator-node-feature-discovery-master:8080 0 }. Err: connection error: desc = "transport: Error while dialing dial tcp 172.19.100.15:8080: i/o timeout". Reconnecting...

This connection is required for the NFD workers to pass the GPU labels that get applied on each node, and the GPU Operator depends on these labels to create the additional operand pods. Which CNI are you using, and can you check for CNI errors that might be causing this?
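
In the meantime, one way to check whether the labels were ever applied and whether the nfd-master service is healthy (a sketch, assuming the release name shown in the pod listing above):

    # NFD normally applies PCI vendor labels such as feature.node.kubernetes.io/pci-10de.present=true on GPU nodes
    kubectl get node worker08.robin.ai.lab --show-labels | tr ',' '\n' | grep pci-10de

    # confirm the nfd-master service exists, has endpoints, and is not logging errors
    kubectl -n gpu-operator get svc gpu-operator-node-feature-discovery-master
    kubectl -n gpu-operator get endpoints gpu-operator-node-feature-discovery-master
    kubectl -n gpu-operator logs deploy/gpu-operator-node-feature-discovery-master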

premmotgi (Author) commented

@shivamerla Thanks for your quick response. I am using the Calico CNI. I checked whether it is working fine, and it seems the CNI doesn't have any issues. Below is the output:

[root@control01 ~]# kubectl create deployment pingtest --image=busybox --replicas=3 -- sleep infinity
deployment.apps/pingtest created
[root@control01 ~]# kubectl get pods --selector=app=pingtest --output=wide
NAME                        READY   STATUS    RESTARTS   AGE   IP              NODE                    NOMINATED NODE   READINESS GATES
pingtest-64f9cb6b84-4c9fc   1/1     Running   0          8s    172.21.78.225   worker08.robin.ai.lab
pingtest-64f9cb6b84-7wwpm   1/1     Running   0          8s    172.21.78.224   worker08.robin.ai.lab
pingtest-64f9cb6b84-brfng   1/1     Running   0          8s    172.21.78.229   worker08.robin.ai.lab
[root@control01 ~]# kubectl exec -ti pingtest-64f9cb6b84-4c9fc -- sh
/ # ping 172.21.78.224 -c 4
PING 172.21.78.224 (172.21.78.224): 56 data bytes
64 bytes from 172.21.78.224: seq=0 ttl=63 time=0.330 ms
64 bytes from 172.21.78.224: seq=1 ttl=63 time=0.053 ms
64 bytes from 172.21.78.224: seq=2 ttl=63 time=0.045 ms
64 bytes from 172.21.78.224: seq=3 ttl=63 time=0.066 ms

--- 172.21.78.224 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.045/0.123/0.330 ms
/ #
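
Note that the path failing in the NFD logs is pod-to-service (the ClusterIP 172.19.100.15 on port 8080), not pod-to-pod, so testing the service address itself may be more telling. A minimal sketch, assuming the default cluster domain:

    kubectl run nfd-conn-test --rm -it --restart=Never --image=busybox -- \
        nc -z -v -w 5 gpu-operator-node-feature-discovery-master.gpu-operator.svc.cluster.local 8080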

premmotgi (Author) commented

Previously I had GPU drivers installed on the nodes. I uninstalled them and purged all NVIDIA packages before installing the gpu-operator. Is there any issue with installing the gpu-operator after uninstalling the drivers?
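
A quick way to confirm the nodes are actually clean after the purge (a sketch for RHEL 8; adjust the package queries to your setup):

    # no nvidia kernel modules should still be loaded
    lsmod | grep -i nvidia

    # no leftover driver or container-toolkit packages
    rpm -qa | grep -i nvidia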

FischerLGLN commented
I am having the same problem: k8s 1.25.4 on Ubuntu 20.04, cluster-api with kubeadm, NVIDIA V100.

FischerLGLN commented
Looks Calico-related: #401 (comment)
I had no problem on a different cluster with NFD and a Flannel backend.
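
If the root cause is the Calico VXLAN checksum-offload problem discussed in the linked issue (an assumption, not confirmed here), the commonly cited workarounds are disabling tx checksum offload on Calico's VXLAN device on each node, or setting the equivalent Felix override:

    # interface name assumes Calico's default VXLAN device
    ethtool -K vxlan.calico tx-checksum-ip-generic off

    # persistent alternative; requires the projectcalico.org FelixConfiguration resource to be reachable via kubectl
    kubectl patch felixconfiguration default --type merge \
        -p '{"spec":{"featureDetectOverride":"ChecksumOffloadBroken=true"}}'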

premmotgi (Author) commented

This issue was solved after switching to dockershim.
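
(Assuming this means pointing the kubelet at dockershim/cri-dockerd instead of the previously configured CRI endpoint; on clusters where dockershim has been removed, that typically means installing cri-dockerd and setting a kubelet flag such as the one below, with the socket path depending on how cri-dockerd was installed.)

    --container-runtime-endpoint=unix:///var/run/cri-dockerd.sock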
