Failure to launch pods in EKS (containerd) #143

Closed
flaminidavid opened this issue Apr 30, 2024 · 16 comments

Comments

@flaminidavid

Looks like the launcher fails to run Docker due to missing privileges. Oddly enough the capabilities as reported by containerd look fine. $UID says 0.

Originally tried a few clabernetes versions (as I was suspecting some capabilities issue) and legacy iptables, no luck. I'm attaching a debug session showing where it fails and the capabilities applied to the containers. Any pointers appreciated.

docker.txt
node.txt

@carlmontanari
Contributor

👋

hey there! can you confirm that the deployment is privileged? if it is (which should be the default if you are on latest/greatest... nvm, I can read. also I don't remember why 0.0.30 is a pre-release, but I guess that shouldn't matter!) then I suspect I have nothing useful to add!

I don't have any EKS access, so I'm not sure what else would be funky there, but assuming you have a privileged pod you "should" be able to do whatever, of course.

generally I'd try to kill the controller (so it lets you mess about without re-reconciling -- there is also a label you can add to the topology to tell the controller not to reconcile: clabernetes/ignoreReconcile), then change the entrypoint of the launcher to sleep infinity, exec into it, play around, and see if anything else interesting pops up

@flaminidavid
Author

flaminidavid commented May 1, 2024

Hey, thanks! Got the privilege issue sorted (basically --profile=netadmin), but it still fails exporting the image. I tried changing the discard_unpacked_layers setting on the node but nothing changed. Looks like it finds the image after the puller downloads it but cannot move it over. Is there any way to mount the pull secret on the launcher? (it works just fine if I do it manually).

INFO |               clabernetes | image "IMAGE" is present, begin copy to docker daemon...
INFO |               clabernetes | time="2024-05-01T14:32:12Z" level=fatal 
msg="failed to resolve reference \"IMAGE\": unexpected status from HEAD request 
to https://REGISTRY: 401 Unauthorized"

WARN |               clabernetes | image re-pull failed, this can happen when containerd 
sets `discard_unpacked_layers` sets to true and we don't have appropriate pull 
secrets for pulling the image. will continue attempting image pull through but 
this may fail, error: exit status 1
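
The warning above refers to containerd's `discard_unpacked_layers` setting. As a rough aid for checking what a node is actually configured with, here is a minimal Python sketch that scans the text of a containerd config file for that key. The sample fragment is illustrative only (it is not taken from this issue), and it assumes the containerd 1.x layout in which the key sits under the CRI plugin's `containerd` table; the function name is hypothetical.

```python
import re

# Illustrative fragment in the style of /etc/containerd/config.toml.
# Assumption: containerd 1.x, config version 2. Not taken from the issue.
SAMPLE_CONFIG = '''
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  discard_unpacked_layers = true
'''


def discard_unpacked_layers(config_text: str):
    """Return True or False if the key is set, or None if it is absent.

    Commented-out lines are ignored because the key must be the first
    non-whitespace token on its line.
    """
    match = re.search(
        r"^\s*discard_unpacked_layers\s*=\s*(true|false)\b",
        config_text,
        flags=re.MULTILINE,
    )
    if match is None:
        return None
    return match.group(1) == "true"


if __name__ == "__main__":
    print(discard_unpacked_layers(SAMPLE_CONFIG))
```

As discussed later in the thread, changing the value is not enough on its own: containerd has to be restarted and the image re-pulled so that all layers are kept on the node.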

@hellt
Member

hellt commented May 1, 2024

Oh this is what gives us pain...

But before we get to the image pull, can you elaborate, for my understanding, on what the issue was with privileged mode and the profiles?
I would like to document this.

As for the image move, we have been trying to chat with various people from cloud providers about an option to allow copying the image that was pulled to a worker, but it seems that unless a cloud provider lets you configure the containerd runtime, there is not much we can do...

At the same time, the image "move", albeit super annoying, shouldn't stop the image from being pulled into a launcher; the launcher will just have to pull it again instead of copying it from the worker.

@carlmontanari
Contributor

> --profile=netadmin

as in a kubectl flag or something else? the deployment the manager spawns should be privileged by default so there "shouldn't" be any other things you need to do unless there is some aws/eks specific thing you had to do (and if so please elaborate more for future me/folks :))

> Is there any way to mount the pull secret on the launcher

not exactly, no. you can mount a docker config which could include whatever auth you want. otherwise this setup is intentional so that we never have pull secrets on the launcher and so that we can use the host nodes to cache the images. this whole layer nonsense really blows the spot though. but... like roman said, this shouldn't stop things (assuming the image is publicly pullable), it will just stop us from using the host nodes as the cache/letting k8s pull stuff work; the launcher will then just fall back to pulling the image directly (like from its docker daemon I mean).

if you're looking for the docker daemon config, you set it via spec.imagePull.dockerDaemonConfig, pointing it at a secret in the namespace of your topology, and then we just mount it for you.

@flaminidavid
Author

flaminidavid commented May 1, 2024

quickly replying here:

Privileged profile: I had been fiddling with pod security standards as per https://kubernetes.io/docs/concepts/security/pod-security-standards/. I ended up applying the privileged profile to the clab namespace. Also, when debugging with kubectl debug you need to pass the profile, e.g., kubectl debug -n c9s-blah --profile=[sysadmin|netadmin].

Pull secrets: The registry is private (has to be in this case). The puller works fine with the pull secrets provided; it's the moving over to Docker from the launcher that fails. I found spec.imagePull.dockerDaemonConfig by looking at the code. The issue with passing Docker daemon config is that there's nowhere to provide auth creds in there (that I know of).
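
The comment above does not say exactly how the privileged profile was applied to the namespace. One common way is the Pod Security Admission namespace label, so the following Python sketch builds a Namespace manifest carrying `pod-security.kubernetes.io/enforce: privileged`. The namespace name is just the placeholder used in this thread, and the function name is hypothetical; whether this matches the poster's actual change is an assumption.

```python
import json


def privileged_namespace(name: str) -> dict:
    """Build a Namespace manifest that enforces the 'privileged' Pod Security
    Standards level through the standard Pod Security Admission label."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Standard Pod Security Admission label; "privileged" is the
                # least restrictive of the three levels.
                "pod-security.kubernetes.io/enforce": "privileged",
            },
        },
    }


if __name__ == "__main__":
    # "c9s-blah" is the placeholder namespace name used in this thread.
    print(json.dumps(privileged_namespace("c9s-blah"), indent=2))
```

Since kubectl accepts JSON manifests, the printed output could be piped into `kubectl apply -f -`.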

@carlmontanari
Contributor

re priv -- ok, so you had to set pss stuff which probably had some tighter stuff by default for eks I assume; but after this, the launcher/docker should run fine right? (assuming this since the pod should be privileged)

yeah the discard_unpacked_layers thing is the bane of our existence. you mentioned you tried to set it but it still fails? it "shouldn't" if that is set nicely. do you manage the nodes? did containerd get restarted and the image re-pulled? I imagine if the image was present even if it was all set nicely the image would have to get re-pulled so it has all the layers.

regarding secrets/daemon config: blah, yeah probably this is my bad since I was thinking you could stuff auth stuff in daemon config but probably never tested since I just use pull through thing anyway. I can add spec.imagePull.dockerConfig and have it be basically same setup as the daemon config -- then you can set auth stuff that way. in the meantime could you try to pop on the launcher and do a docker login and pull your image just to make sure all that part works?

thanks!
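
For context on what "auth stuff" in a Docker client config looks like: the client's `config.json` keeps per-registry credentials under an `auths` map, where `auth` is the base64 encoding of `username:password`. The Python sketch below builds such a file. The registry host and credentials are placeholders, the function name is hypothetical, and how clabernetes consumes the file is not shown here.

```python
import base64
import json


def docker_config_json(registry: str, username: str, password: str) -> str:
    """Return the text of a Docker client config.json holding basic-auth
    credentials for a single registry."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps({"auths": {registry: {"auth": token}}}, indent=2)


if __name__ == "__main__":
    # Placeholder values, not real credentials.
    print(docker_config_json("registry.example.com", "user", "pass"))
```

Registries that hand out short-lived tokens need the file regenerated whenever the token expires.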

@flaminidavid
Author

Privileged profile: yes, that's correct. All working fine now on that front.

discard_unpacked_layers: correct, I tried but still failed with the same message. Restarted containerd and all. Deleted the images so they would be pulled clean again. Maybe I messed something else up.

spec.imagePull.dockerConfig: this would be quite nice to have. I worked around it by spawning a local (insecure) registry inside the c9s-blah namespace and using my own puller.

Now I'm facing a different problem when the launcher tries to bring up VXLAN tunnels. I think what's happening is that if the payload container (i.e., the node that the launcher spins up) fails the first time, the launcher doesn't clean up after itself and vx-... interfaces are left behind. This causes it to fail the second time with "file exists" errors when it tries to create the same interface (or some dependency).

I deleted the leftover interface before the launcher got to that step (e.g., ip link del clab-ee8bc3a2, see below) and it sort of worked (VXLAN setup didn't fail), but the node still exited after booting (there was connectivity between the nodes for a brief moment). There were some bridge interfaces left behind that I didn't try to delete. I ended up deleting the pod via kubectl so it would be recreated from scratch, and this worked. I'll keep testing; this was just a 2-node setup. I suppose we need a different issue for this new problem anyway.

10: clab-ee8bc3a2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether aa:c1:ab:e9:d7:a8 brd ff:ff:ff:ff:ff:ff
    alias vx-blah-eth1
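
To spot leftovers like the one above without reading `ip link` output by eye, a small Python sketch can pick out interfaces whose alias starts with `vx-`. It parses text in the format shown in this comment; the function name is hypothetical, and treating every `vx-` alias as a launcher leftover is an assumption.

```python
import re

# The interface listing quoted in the comment above.
SAMPLE = """\
10: clab-ee8bc3a2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether aa:c1:ab:e9:d7:a8 brd ff:ff:ff:ff:ff:ff
    alias vx-blah-eth1
"""


def vxlan_aliased_interfaces(ip_link_output: str) -> list[str]:
    """Return the names of interfaces whose alias starts with 'vx-'."""
    names = []
    current = None
    for line in ip_link_output.splitlines():
        # Header lines look like "10: name: <FLAGS> ..." or "10: name@peer: ...".
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            current = header.group(1)
            continue
        # Alias lines are indented and belong to the most recent header.
        alias = re.match(r"^\s+alias\s+(\S+)", line)
        if alias and current and alias.group(1).startswith("vx-"):
            names.append(current)
    return names


if __name__ == "__main__":
    print(vxlan_aliased_interfaces(SAMPLE))  # prints ['clab-ee8bc3a2']
```

Each returned name could then be removed with `ip link del <name>`, as was done manually here.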

@hellt
Member

hellt commented May 2, 2024

Interesting, thanks for the detailed report. I never saw anything remotely similar to what you're having, but maybe that is because I haven't had topologies that did not run to completion.

Typically what I do when a topo is all messed up is delete the whole namespace where the topology was deployed. That is, I deploy my topo in its own namespace, which makes all c9s resources namespaced.
So when things go south, I remove the whole namespace and try again. Maybe useless advice, but it's what I'd try first =)

@carlmontanari
Contributor

> Privileged profile: yes, that's correct. All working fine now on that front.

nice, copy, thanks!

> discard_unpacked_layers

continues to be the bane of our existence, and now yours too maybe 🤣

> spec.imagePull.dockerConfig

cool, will try to work that up over the next week or so. not hard, just need some time to do it! if you are up for a pr that would be cool too -- basically would just follow the same pattern as the daemon config, so shouldn't be too terrible.

I'm not 100% sure I'm tracking on the rest, but if/when the launcher fails, that container should restart, so I don't think there should be any cruft left. if you are talking about the services (vx-blah) then those should be reconciled every time the topology is reconciled (and deleted via owner ref when the topo is deleted) so those shouldn't have cruft either.

I guess the general thing here is: the launcher should crash if it can't continue, and that crash should give us a new container w/ fresh stuff. if that's not the case we can fix that, but I think/hope that should work. by deleting the pod you are doing the same thing that the launcher would do when it crashes, so I think/hope that will be ok. maybe the pod didn't restart because you changed the entrypoint or something while troubleshooting?

but yeah if you can document and get another issue rolling that would be cool! then we can keep this open to track the docker config bits! thanks @flaminidavid !

@flaminidavid
Author

I'm running into some more EKS fun given the particularities of my setup. I'll report back when those are sorted (and open another issue for the interface cleanup on failure, it's still happening, although I believe the container failures are caused by something else). Thank you!

@carlmontanari
Contributor

@flaminidavid fyi the docker config stuff is in v0.1.0 now -- so that should be one less problem for ya at least 🙃

@flaminidavid
Author

Thanks a million. I got my setup to work consistently in EKS, still using a local registry for now. The last tranche of launcher failures was caused by a background process that identified some router container processes as a threat (can't talk specifics here). Just in case anyone else runs into this and goes down a rabbit hole debugging container boot, container processes, and VXLAN setup: it wasn't any of that. The cleanup probably works well in any other situation. I think we can close this one.

@hellt
Member

hellt commented May 25, 2024

Good to know, @flaminidavid.
Since you had a good run with EKS, do you think we can go as far as to say that c9s runs on EKS without manual tinkering, provided it is a clean EKS deployment and privileged pods can run there?

@flaminidavid
Author

I'd say so, yes.

Aside from the discard_unpacked_layers issue, which we should be able to work around by passing Docker config, everything else seems to work fine in general. "Just make sure nothing else is messing with the cluster" would be my advice 🙃.

@hellt
Member

hellt commented May 25, 2024

sweet, ty.

I will have to try to onboard a simple Nokia SR Linux topology there to make sure I capture the steps in the docs, so that others have a smoother, documented experience.

I will close this now.

@flaminidavid
Author

Just confirming: overriding the Docker config via spec.imagePull.dockerConfig worked just fine (and I switched pullThroughOverride back from never to auto). This is working with AWS ECR, too.
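
As a rough illustration of the final setup described above, this Python sketch builds the fragment of a topology spec that points `spec.imagePull.dockerConfig` at a secret and sets `pullThroughOverride` to `auto`. The secret name and function name are hypothetical. That the value is a secret name follows the daemon-config pattern described earlier in the thread, and placing `pullThroughOverride` under `spec.imagePull` is an assumption; neither is confirmed by the thread.

```python
import json


def image_pull_patch(secret_name: str) -> dict:
    """Build a merge-patch style fragment for the topology's image pull
    settings discussed in this thread."""
    return {
        "spec": {
            "imagePull": {
                # Assumption: the value names a secret in the topology's
                # namespace, mirroring spec.imagePull.dockerDaemonConfig.
                "dockerConfig": secret_name,
                # Assumption: pullThroughOverride sits next to dockerConfig.
                # The thread mentions the values "never" and "auto".
                "pullThroughOverride": "auto",
            }
        }
    }


if __name__ == "__main__":
    # "my-docker-config" is a hypothetical secret name.
    print(json.dumps(image_pull_patch("my-docker-config"), indent=2))
```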
