Failure to launch pods in EKS (containerd) #143
👋 hey there! Can you confirm that the deployment is privileged? If it is (which should be the default) -- I don't have any EKS access so not sure what else would be funky there, but assuming you have a privileged pod you "should" be able to do whatever, of course. Generally I'd try to kill the controller (so it lets you mess about without re-reconciling) -- also, there is a label you can add to the topology to tell the controller not to reconcile:
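For anyone following along, a rough sketch of that "kill the controller" approach, assuming the manager runs as an ordinary deployment; the deployment/namespace names and the skip label below are placeholders, not verified against any particular release (the real label key is elided above):

```shell
# Scale the clabernetes manager down so it stops re-reconciling while you debug
# (deployment and namespace names are assumptions -- adjust to your install).
kubectl -n clabernetes scale deployment clabernetes-manager --replicas=0

# Hypothetical label to tell the controller to skip a single topology; check
# the controller docs for the actual label key.
kubectl label topology my-topo example.clabernetes/skip-reconcile=true
```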
Hey, thanks! Got the privilege issue sorted (basically
Oh, this is what gives us pain... But before we get to the image pull, can you elaborate, for my understanding, on what the issue was with the privileged profile? As for the image move, we have been trying to chat with various people from cloud providers about an option to allow copying an image that was pulled to a worker, but it seems that unless a cloud provider lets you configure the containerd runtime, there is not much we can do... At the same time, the image "move", albeit super annoying, shouldn't stop the image from being pulled into a launcher; the launcher will just have to pull it again instead of copying it from the worker.
As in a kubectl flag, or something else? The deployment the manager spawns should be privileged by default, so there "shouldn't" be anything else you need to do unless there is some AWS/EKS-specific thing you had to do (and if so, please elaborate more for future me/folks :))
Not exactly, no. You can mount a Docker config which could include whatever auth you want. Otherwise this setup is intentional so that we never have pull secrets on the launcher and so that we can use the host nodes to cache the images. This whole layer nonsense really blows the spot, though. But... like Roman said, this shouldn't stop things (assuming the image is publicly pullable); it will just stop us from using the host nodes as the cache / letting k8s pull stuff work, and the launcher will then just fall back to pulling the image directly (from its Docker daemon, I mean). If you're looking for the docker daemon config, you set it like:
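The original snippet is elided above; as a rough, unverified sketch, mounting a Docker daemon config might look something like the following, where the secret name, key, and the spec.imagePull.dockerDaemonConfig field are assumptions to double-check against the clabernetes docs:

```shell
# Put a daemon.json (e.g. registry mirror settings) into a secret in the
# topology's namespace (all names here are placeholders).
kubectl -n clab create secret generic docker-daemon-config \
  --from-file=daemon.json=./daemon.json

# Point the topology at it; the field name is an assumption.
kubectl -n clab patch topology my-topo --type merge \
  -p '{"spec":{"imagePull":{"dockerDaemonConfig":"docker-daemon-config"}}}'
```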
Quickly replying here:

Privileged profile: I had been fiddling with pod security standards as per https://kubernetes.io/docs/concepts/security/pod-security-standards/. I ended up applying the privileged profile to the clab namespace. Also when debugging with

Pull secrets: The registry is private (it has to be in this case). The puller works fine with the pull secrets provided; it's the handover to Docker from the launcher that fails. I found
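For reference, a minimal sketch of the namespace labeling described above, using the standard Pod Security Standards labels (the clab namespace name is specific to this setup):

```shell
# Apply the "privileged" Pod Security Standards profile to the namespace the
# topology runs in, so launcher pods are allowed to run privileged.
kubectl label namespace clab \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/warn=privileged
```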
Re priv -- OK, so you had to set PSS stuff, which probably has some tighter defaults for EKS I assume; but after this, the launcher/docker should run fine, right? (assuming so since the pod should be privileged), yeah.

Regarding secrets/daemon config: blah, yeah, probably this is my bad since I was thinking you could stuff auth into the daemon config but probably never tested it since I just use the pull-through thing anyway. I can add spec.imagePull.dockerConfig and have it be basically the same setup as the daemon config -- then you can set auth stuff that way. In the meantime, could you try to pop onto the launcher and do a docker login and pull your image, just to make sure all that part works? Thanks!
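A sketch of that manual check; the launcher pod name, registry, and image below are placeholders for the real ones:

```shell
# Exec into the launcher pod and verify registry auth + pull work end to end.
kubectl -n clab exec -it <launcher-pod> -- docker login registry.example.com
kubectl -n clab exec -it <launcher-pod> -- docker pull registry.example.com/vendor/router:1.2.3
```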
Privileged profile: yes, that's correct. All working fine now on that front.
Now I'm facing a different problem when the launcher tries to bring up VXLAN tunnels. I think what's happening is that if the payload container (i.e., the node that the launcher spins up) fails the first time, the launcher doesn't clean up after itself, and I deleted the leftover interface before the launcher got to that step (e.g.,
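The manual cleanup mentioned above might look roughly like this; the interface and pod names are placeholders, since the actual interface name is elided in the comment:

```shell
# List VXLAN interfaces inside the launcher pod, then delete the stale one
# before letting the launcher retry.
kubectl -n clab exec -it <launcher-pod> -- ip -d link show type vxlan
kubectl -n clab exec -it <launcher-pod> -- ip link delete <stale-vxlan-interface>
```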
Interesting, thanks for the detailed report. I've never seen anything remotely similar to what you're hitting, but maybe that's because I haven't had topologies that didn't run to completion. Typically, what I do when a topo is all messed up is delete the whole namespace where the topology was deployed. That is, I deploy my topo in its own namespace, and that makes all c9s resources namespaced.
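That cleanup approach boils down to something like the following (namespace name is a placeholder):

```shell
# Deploy each topology in its own namespace, then delete the namespace to
# remove every namespaced clabernetes resource when things get into a bad state.
kubectl delete namespace clab-topo-1
```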
nice, copy, thanks!
continues to be the bane of our existence, and now yours too maybe 🤣
Cool, will try to work that up over the next week or so. Not hard, just need some time to do it! If you are up for a PR that would be cool too -- basically it would just follow the same pattern as the daemon config, so it shouldn't be too terrible.

I'm not 100% sure I'm tracking on the remaining bit, but if/when the launcher fails, that container should restart, so I don't think there should be any cruft left. If you are talking about the services (vx- blah), then those should be reconciled every time the topology is reconciled (and deleted via owner ref when the topo is deleted), so those shouldn't have cruft either. I guess the general thing here is: the launcher should crash if it can't continue, and that crash should give us a new container w/ fresh stuff. If that's not the case we can fix it, but I think/hope that should work. By deleting the pod you are doing the same thing that the launcher would do when it crashes, so I think/hope that will be OK. Maybe the pod didn't restart because you changed the entrypoint or something while troubleshooting?

But yeah, if you can document it and get another issue rolling that would be cool! Then we can keep this open to track the docker config bits! Thanks @flaminidavid!
I'm running into some more EKS fun given the particularities of my setup. I'll report back when those are sorted (and open another issue for the interface cleanup on failure; it's still happening, although I believe the container failures are caused by something else). Thank you!
@flaminidavid fyi the docker config stuff is in v0.1.0 now -- so that should be one less problem for ya at least 🙃
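A sketch of wiring registry auth through the new field, based on the spec.imagePull.dockerConfig name mentioned earlier in the thread; the secret key and exact wiring are assumptions, so check the v0.1.0 docs:

```shell
# Store a docker config.json (with auths) in a secret in the topology namespace.
kubectl -n clab create secret generic clab-docker-config \
  --from-file=config.json=$HOME/.docker/config.json

# Reference it from the topology; the field path is an assumption.
kubectl -n clab patch topology my-topo --type merge \
  -p '{"spec":{"imagePull":{"dockerConfig":"clab-docker-config"}}}'
```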
Thanks a million. I got my setup to work consistently in EKS, still using a local registry for now. The last tranche of launcher failures was caused by a background process that identified some router container processes as a threat (can't talk specifics here). Just in case anyone else runs into this and rabbit-holes into debugging container boot, container processes, and VXLAN setup: it wasn't any of that. The cleanup probably works well in any other situation. I think we can close this one.
Good to know @flaminidavid |
I'd say so, yes. Aside from the
Sweet, ty. I will have to try to onboard a simple Nokia SR Linux topology there to make sure I capture the docs steps, so others have a smoother, documented experience. I will close this now.
Just confirming: overriding Docker config via |
Looks like the launcher fails to run Docker due to missing privileges. Oddly enough, the capabilities as reported by containerd look fine. $UID says 0.
Originally tried a few clabernetes versions (as I was suspecting some capabilities issue) and legacy iptables, no luck. I'm attaching a debug session showing where it fails and the capabilities applied to the containers. Any pointers appreciated.
docker.txt
node.txt
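For anyone reproducing this, the kind of checks captured in the attachments can be gathered roughly like this (the pod name is a placeholder, and capsh is only available if libcap tools are present in the image):

```shell
# Confirm the launcher is running as root and inspect its effective capabilities.
kubectl -n clab exec -it <launcher-pod> -- id -u
kubectl -n clab exec -it <launcher-pod> -- capsh --print
# Check whether the Docker daemon inside the launcher is actually up.
kubectl -n clab exec -it <launcher-pod> -- docker info
```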