
containerd getting restarted on gpu node reboots #594

Open
anoopsinghnegi opened this issue Oct 6, 2023 · 3 comments

Comments

@anoopsinghnegi

1. Quick Debug Information

  • OS/Version: RHEL8.8
  • Kernel Version: 4.18.0-477.27.1.el8_8.x86_64
  • Container Runtime Type/Version: containerd
  • K8s Flavor/Version: v1.26.2
  • GPU Operator Version: 23.6.1

2. Issue or feature description

In our setup, whenever the GPU node is restarted, stateful workloads (pods consuming PVCs via a CSI driver) go into CrashLoopBackOff once the node becomes Ready again. We suspect the containerd restart performed by the gpu-operator (container-toolkit pod) is causing this, and we are continuing our investigation.

We have observed that whenever the nvidia-container-toolkit pod starts, it configures the host's containerd config.toml with the nvidia runtime and then restarts the containerd service on the host.

Since there are no changes to the containerd configuration on the node (for example in a plain reboot scenario), why does the gpu-operator (container-toolkit pod) restart the containerd service when the nvidia runtime information is already persisted?

Can't we add a check to the container-toolkit pod so that it does not restart the containerd service when the configuration is unchanged? That is, when the pod comes up it would inspect the configuration, and if everything is already correct and no modification is needed, it would skip the restart; containerd would be restarted only when actually required.
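Roughly, something like the following is what we have in mind. This is only a minimal Go sketch, not the actual container-toolkit code: the config path and the `desired` rendered config are placeholders, and `restartNeeded` is a made-up helper.

```go
// Hypothetical sketch: skip the containerd restart when the on-disk config
// already matches what the toolkit would write. Paths and the notion of a
// pre-rendered "desired" config are assumptions for illustration only.
package main

import (
	"bytes"
	"fmt"
	"os"
)

// restartNeeded reports whether the existing containerd config differs from
// the config the toolkit intends to write.
func restartNeeded(configPath string, desired []byte) (bool, error) {
	current, err := os.ReadFile(configPath)
	if err != nil {
		if os.IsNotExist(err) {
			return true, nil // no config yet, so a write (and restart) is needed
		}
		return false, err
	}
	return !bytes.Equal(bytes.TrimSpace(current), bytes.TrimSpace(desired)), nil
}

func main() {
	// Placeholder for the config the toolkit would render with the nvidia runtime.
	desired := []byte(`# rendered containerd config with the nvidia runtime (placeholder)`)

	need, err := restartNeeded("/etc/containerd/config.toml", desired)
	if err != nil {
		fmt.Fprintln(os.Stderr, "check failed:", err)
		os.Exit(1)
	}
	if need {
		fmt.Println("config changed: write it and restart containerd")
	} else {
		fmt.Println("config unchanged: skip the containerd restart")
	}
}
```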

The same behaviour exists in the nvidia-driver pod: when it comes up it cleans up the node, removes the driver and kernel modules, and re-installs them. I think this can also be avoided. Instead of always cleaning up, it could check the necessary files, the driver version, the kernel version, and the driver health, and only perform the cleanup and install when needed. This would help bring the gpu-operator up in less time.
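For the driver pod, the check could look something like this sketch. Again it is only an illustration, not the driver container's actual logic: `desiredVersion` is a placeholder, and reading /sys/module/nvidia/version is just one possible way to see which driver is already loaded.

```go
// Hypothetical sketch of the "check before cleanup" idea: read the version of
// the already-loaded nvidia module and only fall back to a full
// cleanup/reinstall when it is missing or does not match the desired version.
package main

import (
	"fmt"
	"os"
	"strings"
)

// loadedDriverVersion returns the version of the currently loaded nvidia
// kernel module, or "" if the module is not loaded.
func loadedDriverVersion() string {
	data, err := os.ReadFile("/sys/module/nvidia/version")
	if err != nil {
		return ""
	}
	return strings.TrimSpace(string(data))
}

func main() {
	desiredVersion := "535.104.05" // placeholder version for illustration

	if got := loadedDriverVersion(); got == desiredVersion {
		fmt.Println("driver", got, "already loaded: skip cleanup and reinstall")
		return
	}
	fmt.Println("driver missing or version mismatch: perform cleanup and install", desiredVersion)
}
```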

@anoopsinghnegi changed the title from "containerd getting restarted on gpu node restart" to "containerd getting restarted on gpu node reboots" on Oct 6, 2023
@shivamerla
Contributor

Thanks for the inputs @anoopsinghnegi, we will look into avoiding containerd restarts and driver unloads whenever they are not necessary. Since the driver container bind-mounts the container path /run/nvidia/driver onto the host, every time we restart the container that mount becomes stale and we have to re-install and mount again. But the kernel module unload we can potentially avoid.
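For illustration only (this is not the driver container's actual code), here is a minimal check for whether the /run/nvidia/driver bind mount is still present on the host. After a container restart it will not be, which is why the user-space re-install and re-mount is hard to avoid even if the kernel module unload is.

```go
// Minimal sketch: report whether /run/nvidia/driver still appears as a mount
// point in /proc/mounts. Used here only to illustrate the stale-mount point
// made above; it is not part of the gpu-operator.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isMounted reports whether target appears as a mount point in /proc/mounts.
func isMounted(target string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == target {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	mounted, err := isMounted("/run/nvidia/driver")
	if err != nil {
		fmt.Fprintln(os.Stderr, "mount check failed:", err)
		os.Exit(1)
	}
	fmt.Println("/run/nvidia/driver mounted:", mounted)
}
```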

@oneironautjack

oneironautjack commented Jan 3, 2024

We are having the exact same issue. When we change the value in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml to root = "/", it is reset back to root = "/run/nvidia/driver" whenever the driver is reloaded. That setting is required to work around this issue:
#625
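In case it helps anyone else, here is a naive sketch for spotting when the setting has been reset after a driver reload. It only scans the file mentioned above for a `root` line; a real check would use a TOML parser, and none of this is official tooling.

```go
// Naive sanity check: read the toolkit's nvidia-container-runtime config and
// report whether the root setting is still "/" or has been reset.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const configPath = "/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml"

	data, err := os.ReadFile(configPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read config:", err)
		os.Exit(1)
	}

	// Line-by-line scan for a key named "root"; a TOML parser would be more robust.
	for _, line := range strings.Split(string(data), "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "root") {
			fmt.Println("found:", trimmed)
			if strings.Contains(trimmed, `"/"`) {
				fmt.Println(`workaround still in place (root = "/")`)
			} else {
				fmt.Println("root was reset; the workaround needs to be re-applied")
			}
			return
		}
	}
	fmt.Println("no root setting found in", configPath)
}
```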

@wawa0210

wawa0210 commented Apr 3, 2024

> Thanks for the inputs @anoopsinghnegi, we will look into avoiding containerd restarts and driver unloads whenever they are not necessary. Since the driver container bind-mounts the container path /run/nvidia/driver onto the host, every time we restart the container that mount becomes stale and we have to re-install and mount again. But the kernel module unload we can potentially avoid.

We are also very confused by this. When the node is restarted, the nvidia driver is uninstalled and then reinstalled, and so is the kernel module.

Applications using the GPU on that node are blocked until the nvidia driver is ready again. If the kernel module unload could be avoided when the node is restarted, it would be much friendlier to those applications.

Is there any new progress on this in the community?
