containerd getting restarted on gpu node reboots #594
Comments
Thanks for the inputs @anoopsinghnegi we will look into avoid
We are having the exact same issue. When changing the value in
We are also confused about this. When the node is restarted, the nvidia driver is uninstalled and then reinstalled, and so is the kernel module. Applications that are using the GPU on that node at that time are blocked until the nvidia driver is ready again. If the kernel module unload could be avoided on node restart, it would be much friendlier to applications. Is there any progress on this in the community?
1. Quick Debug Information
2. Issue or feature description
In our setup, whenever the GPU node gets restarted, the stateful workloads (pods consuming PVCs through a CSI driver) go into a CrashLoopBackOff state once the node becomes Ready again. We suspect that the containerd restart performed by the gpu-operator (container-toolkit pod) is causing this, and we are continuing our investigation.
We have observed that whenever the nvidia-container-toolkit pod starts, it configures the host's containerd config.toml with the nvidia runtime and then restarts the containerd service on the host.
Since there is no change to the containerd configuration on the node (for example, in the reboot scenario), why does the gpu-operator (container-toolkit pod) restart the containerd service when the nvidia runtime information has already been persisted?
Can't we add a check in the container-toolkit pod so that it does not restart the containerd service if the configuration is unchanged? That is, when the pod comes up it should inspect the configuration; if everything is already correct and no modification is needed, it should skip the restart and restart containerd only when actually required (see the sketch below).
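A minimal sketch of such an idempotency check, written in Go purely for illustration; the config path, the rendered config contents, and the systemctl-based restart are assumptions, not the toolkit's actual implementation:

```go
// Hypothetical sketch: only restart containerd when the rendered config
// actually differs from what is already on the host.
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// applyContainerdConfig writes desired to configPath and restarts containerd
// only if the on-disk contents differ from the desired contents.
func applyContainerdConfig(configPath string, desired []byte) error {
	current, err := os.ReadFile(configPath)
	if err == nil && bytes.Equal(current, desired) {
		fmt.Println("containerd config unchanged; skipping restart")
		return nil
	}

	if err := os.WriteFile(configPath, desired, 0o644); err != nil {
		return fmt.Errorf("writing %s: %w", configPath, err)
	}

	// Restart containerd only because the config actually changed.
	// (Assumes a systemd-managed containerd; adjust for other init systems.)
	cmd := exec.Command("systemctl", "restart", "containerd")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Placeholder for the config the toolkit would render, including the nvidia runtime.
	desired := []byte("# rendered config.toml including the nvidia runtime\n")
	if err := applyContainerdConfig("/etc/containerd/config.toml", desired); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```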
Also, the same behaviour exists for the nvidia-driver pod: when it comes up it cleans up the node, removes the driver and kernel modules, and re-installs them. I think this can also be avoided. Instead of always doing a cleanup, it should check for the necessary files, the driver version, the kernel version, etc., and the driver's health, and only perform the cleanup and driver install when needed. This would help bring up the gpu-operator in less time (a rough sketch of such a check follows below).
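A rough sketch of that driver-side check, again in Go and purely illustrative: it reads the standard /proc/driver/nvidia/version file to see whether the desired driver version is already loaded. The target version and the cleanup/install steps are placeholders.

```go
// Hypothetical sketch: inspect the already-loaded NVIDIA driver before
// deciding to clean up and reinstall.
package main

import (
	"fmt"
	"os"
	"strings"
)

// driverMatches reports whether the currently loaded NVIDIA kernel module
// advertises wantVersion, based on /proc/driver/nvidia/version.
func driverMatches(wantVersion string) bool {
	data, err := os.ReadFile("/proc/driver/nvidia/version")
	if err != nil {
		// No module loaded (or file unreadable): a (re)install is needed.
		return false
	}
	return strings.Contains(string(data), wantVersion)
}

func main() {
	const wantVersion = "535.104.05" // placeholder target driver version

	if driverMatches(wantVersion) {
		fmt.Println("driver already at desired version; skipping cleanup and reinstall")
		return
	}
	fmt.Println("driver missing or at a different version; performing cleanup and install")
	// ... cleanup and install steps would go here ...
}
```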