containerd getting restarted on gpu node reboots #594
Comments
Thanks for the inputs @anoopsinghnegi we will look into avoid
We are having the exact same issue. When changing the value in
We are also confused about this. When the node is restarted, the nvidia driver is uninstalled and then reinstalled, and so is the kernel module. Applications that are using the GPU on that node at that time are blocked until the nvidia driver is ready again. If the kernel module unload could be avoided on node restart, it would be much friendlier to applications. Is there any progress on this in the community?
1. Quick Debug Information
2. Issue or feature description
In our setup, whenever the GPU node gets restarted, the stateful workloads (pods consuming PVCs through a CSI driver) go into a CrashLoopBackOff state once the node becomes Ready again. We suspect that the containerd restart performed by the gpu-operator (container-toolkit pod) is causing this, and we are continuing our investigation.
We have observed that whenever the nvidia-container-toolkit pod starts, it configures the host's containerd config.toml with the nvidia runtime and then restarts the containerd service on the host.
Since there is no change to the containerd configuration on the node (for example, in the reboot scenario), why does the gpu-operator (container-toolkit pod) restart the containerd service when the nvidia runtime information has already been persisted?
Can't we add a check in the container-toolkit pod so that it does not restart the containerd service if the configuration is unchanged? That is, when the pod comes up it should inspect the configuration; if everything is already correct and no modification is needed, it should skip the restart and restart containerd only when actually required (see the sketch below).
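A minimal sketch of such an idempotency check, written in Go purely for illustration; the config path, the rendered config contents, and the systemctl-based restart are assumptions, not the toolkit's actual implementation:

```go
// Hypothetical sketch: only restart containerd when the rendered config
// actually differs from what is already on the host.
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

// applyContainerdConfig writes desired to configPath and restarts containerd
// only if the on-disk contents differ from the desired contents.
func applyContainerdConfig(configPath string, desired []byte) error {
	current, err := os.ReadFile(configPath)
	if err == nil && bytes.Equal(current, desired) {
		fmt.Println("containerd config unchanged; skipping restart")
		return nil
	}

	if err := os.WriteFile(configPath, desired, 0o644); err != nil {
		return fmt.Errorf("writing %s: %w", configPath, err)
	}

	// Restart containerd only because the config actually changed.
	// (Assumes a systemd-managed containerd; adjust for other init systems.)
	cmd := exec.Command("systemctl", "restart", "containerd")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	// Placeholder for the config the toolkit would render, including the nvidia runtime.
	desired := []byte("# rendered config.toml including the nvidia runtime\n")
	if err := applyContainerdConfig("/etc/containerd/config.toml", desired); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```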
Also, the same behaviour exists for the nvidia-driver pod: when it comes up it cleans up the node, removes the driver and kernel modules, and re-installs them. I think this can also be avoided. Instead of always doing a cleanup, it should check for the necessary files, the driver version, the kernel version, etc., and the driver's health, and only perform the cleanup and driver install when needed. This would help bring up the gpu-operator in less time (a rough sketch of such a check follows below).
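A rough sketch of that driver-side check, again in Go and purely illustrative: it reads the standard /proc/driver/nvidia/version file to see whether the desired driver version is already loaded. The target version and the cleanup/install steps are placeholders.

```go
// Hypothetical sketch: inspect the already-loaded NVIDIA driver before
// deciding to clean up and reinstall.
package main

import (
	"fmt"
	"os"
	"strings"
)

// driverMatches reports whether the currently loaded NVIDIA kernel module
// advertises wantVersion, based on /proc/driver/nvidia/version.
func driverMatches(wantVersion string) bool {
	data, err := os.ReadFile("/proc/driver/nvidia/version")
	if err != nil {
		// No module loaded (or file unreadable): a (re)install is needed.
		return false
	}
	return strings.Contains(string(data), wantVersion)
}

func main() {
	const wantVersion = "535.104.05" // placeholder target driver version

	if driverMatches(wantVersion) {
		fmt.Println("driver already at desired version; skipping cleanup and reinstall")
		return
	}
	fmt.Println("driver missing or at a different version; performing cleanup and install")
	// ... cleanup and install steps would go here ...
}
```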