
Allow sidecars to terminate gracefully #1131

Open · Tracked by #7413
pierretasci opened this issue Jul 26, 2019 · 24 comments
Labels
kind/feature, lifecycle/frozen

Comments

@pierretasci

Expected Behavior

A sidecar might need to complete any pending work before being torn down (eg, uploading metadata).

Actual Behavior

Sidecars are replaced with a nop image as soon as a TaskRun completes, preventing any cleanup work.

Steps to Reproduce the Problem

  1. Create a TaskRun.
  2. Use an AdmissionController to inject a sidecar that streams output from the Task to another service.

We would expect all of the output to be uploaded as the task runs, but whatever chunk isn't uploaded before the sidecar is torn down is presumably lost.

Additional Info

There are ways around this, but it seems to me that, at the very least, a sidecar should receive the kill signal so it can handle shutdown however it needs to. A timeout can always be inserted to ensure it doesn't hang forever.

@dibyom
Member

dibyom commented Jul 26, 2019

/kind bug
/cc @sbwsg

@tekton-robot tekton-robot added the kind/bug label Jul 26, 2019
@ghost

ghost commented Jul 29, 2019

There was some discussion of this problem on the original design proposal. I think the main issue raised was that you can't guarantee an arbitrary sidecar has a kill binary available, so exec'ing into a container and running kill on a pid won't always work. I'll do some reading on how kubernetes kills containers in a pod and see what it would take to do the same for Tekton.

@pierretasci Do you have ideas for ways this could be implemented that would support arbitrary containers?

@pierretasci
Author

Yeah, I read through the discussion on the original proposal and I think it makes sense. I do think that it wouldn't be particularly harmful to send a SIGTERM or SIGKILL into the container arbitrarily with the contract that if you don't handle shutdown in XX seconds then we will perform our nop container swap.

If the container doesn't have shutdown logic, no harm done, it seems. At worst, it would introduce an XX-second delay to completing each task.

I am of course open to other ideas but I do think stateful sidecars (logging is another use case that comes to mind) will want a shutdown mechanism.

@ghost ghost self-assigned this Jul 29, 2019
@ghost

ghost commented Jul 30, 2019

I do think that it wouldn't be particularly harmful to send a SIGTERM or SIGKILL into the container

I totally agree, I'm just not entirely sure yet how this would be implemented if the container doesn't include a kill binary.

@ghost

ghost commented Aug 5, 2019

I've been skimming through the k8s codebase looking for the spot where the container termination flow is implemented and it looks as though the precise implementation of the grace period is delegated to the container runtime.

@ghost

ghost commented Aug 27, 2019

Another alternative here would be to introduce an optional contract for sidecars: If you want to receive a SIGTERM / shutdown warning then your sidecar must contain a kill binary on $PATH. If a sidecar provides this then Tekton can execute a similar container shutdown procedure to Kubernetes' before switching out the sidecar container's image to nop. Sidecars that don't implement the contract will simply be stopped without warning.

This would provide enough of a stopgap until the sidecars KEP lands I think.

@pierretasci
Author

Yeah, that is sort of what I was getting at above. If there is a kill binary, we provide graceful shutdown; otherwise, the timeout and hot-swap seem like a more than reasonable contract.

I suspect the timeout will actually be enough for most sidecars, but having a dedicated way to shut down if it is required will help cover edge cases better.

@ghost

ghost commented Aug 28, 2019

OK cool so then the steps to implement are something like:

  1. Exec kill in sidecar to send SIGTERM to PID 1
  2. After termination grace period (30s default) exec kill again to send SIGKILL to PID 1
  3. If exec'ing kill doesn't work in step 1, or the sidecar doesn't stop after step 2, do a "nop swap" of the sidecar's image field and forcefully shut it down.

Sounds doable! I'll try to find some time this week or next to implement.
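
Here's a minimal sketch (in Go, using client-go) of what steps 1 and 2 could look like, assuming the sidecar image ships a kill binary on $PATH; the function and its wiring are illustrative, not actual Tekton code:

```go
// Package sidecar sketches exec'ing `kill -<sig> 1` inside a sidecar container.
package sidecar

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// signalSidecar sends the given signal (e.g. "TERM", then "KILL") to PID 1 of
// the sidecar container. If the image has no kill binary the exec fails, and
// the caller should fall back to the "nop swap" from step 3.
func signalSidecar(config *rest.Config, client *kubernetes.Clientset, ns, pod, container, sig string) error {
	req := client.CoreV1().RESTClient().Post().
		Resource("pods").Name(pod).Namespace(ns).SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: container,
			Command:   []string{"kill", "-" + sig, "1"},
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return err
	}
	return exec.Stream(remotecommand.StreamOptions{Stdout: os.Stdout, Stderr: os.Stderr})
}
```

The caller would send TERM first, wait out the grace period, retry with KILL, and fall back to the nop swap if the exec itself errors out.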

@Mohjive

Mohjive commented Mar 10, 2020

There’s no need for a ‘kill’ binary in the container. It’s possible to send SIGTERM to the container’s process 1: https://www.ctl.io/developers/blog/post/gracefully-stopping-docker-containers/
It’s up to that process to handle the signal, e.g. https://gobyexample.com/signals if the image is just a Go binary.
If the container doesn’t exit within the grace period it should be killed, and as a last resort removed (and replaced with the nop image).

@Mohjive

Mohjive commented Mar 10, 2020

Also consider the new sidecar lifecycle coming in a future k8s version (maybe 1.19): kubernetes/enhancements#753

@ghost

ghost commented Mar 11, 2020

Many thanks @Mohjive. You mention that there's no need for a kill binary in the Pod's container but then you forgot to demonstrate how to send SIGTERM without it 😆 . Assume I'm a Linux noob. How do I send SIGTERM to a process without kill? I can't run docker stop using the kubernetes controller, or can I?

Thanks for the pointer to the sidecar lifecycle KEP. It's definitely on our radar. We'd be very happy to remove the nop image hack and replace it with that feature once it's merged and k8s has committed to its support.

@Mohjive

Mohjive commented Mar 11, 2020

Docker/k8s isn’t Linux, so you have to apply a different way of thinking when using containers. Containers don’t run a full operating system, so there’s no init process or daemons (it’s possible to have an init process in a container, but that’s a special use case).

I’m fairly new to k8s, but deleting a pod will send SIGTERM to all containers in that pod and after a grace period it will send SIGKILL unless the container has exited. There might be other ways as well, but someone else has to explain that. :)

When a container receives a SIGTERM it will be delivered to process 1 in the container. Process 1 is the one defined by the ‘CMD’ argument in the Dockerfile, the command specified in ‘kubectl run’, or set some other way, e.g. in your CRD. See https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html for more details.
And then process 1 has to handle the signal, as I posted above.
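
To make that concrete, here is a minimal sketch of a sidecar's process 1 handling SIGTERM in Go; the flush step is a placeholder for whatever pending work the sidecar needs to finish:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Register for SIGTERM/SIGINT before starting work so no signal is missed.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	// ... the sidecar's normal work (e.g. streaming logs) runs concurrently ...

	<-sigs // block until the kubelet (or an exec'd kill) signals process 1
	fmt.Println("SIGTERM received, flushing pending work...")
	// flushPendingWork() // hypothetical cleanup: upload the remaining chunk
	os.Exit(0)
}
```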

@ghost

ghost commented Mar 11, 2020

deleting a pod will send SIGTERM to all containers in that pod and after a grace period it will send SIGKILL unless the container has exited. There might be other ways as well, but someone else has to explain that. :)

True, we could delete the pod. We do want to keep it around for its logs though, at least until the user deletes the TaskRun that spawned it, or until we have some other established mechanism for persisting the logs outside of Kubernetes, such as through the Tekton Results proposal.

From my perspective the sidecar lifecycle KEP seems like the ideal solution. I'd much prefer the platform exposed a mechanism for this rather than us layering in workarounds.

@Mohjive

Mohjive commented Mar 11, 2020

Ok, so I did some quick research, and what I found is that it should be possible to run the docker CLI (if the k8s cluster uses the docker daemon shim) or the containerd CLI (https://github.com/projectatomic/containerd/blob/master/docs/cli.md) directly on the executor. So if there exists a Tekton container in the job pod, it could monitor the job container and use the right CLI to stop the sidecar containers when the job has exited.

@vdemeester
Member

So, as @sbwsg said, killing/deleting the Pod should be considered "not possible", at least for now. We already do that when cancelling a TaskRun, and I feel that was a mistake…

Ok, so I did some quick research and what I found is that it should be possible to run docker cli (if the k8s cluster uses docker daemon shim) or containerd cli (https://github.com/projectatomic/containerd/blob/master/docs/cli.md) directly on the executor, so if there exists a Tekton container in the job pod it could monitor the job container and use the right cli to stop the sidecar containers, when the job has exited.

We can't rely on a specific runtime CLI, as we have absolutely no guarantee about which runtime the Kubernetes cluster we are deployed on uses. It would also mean we wouldn't support some runtimes, and it breaks abstraction… Tekton doesn't have to know what runtime Kubernetes uses.

What "could" be done though, is to do more or less the same as what we are doing for our steps, aka wrapping the sidecar command with ours (similar to entrypoint), and have this wrapper manage the sidecar process:

  • starting it
  • acting in case of failure (aka the process died, we should "exit" with the same error code and message, to pop the information up)
  • kill gently with SIGTERM (aka "please stop my little process") the process and exit with 0 if all went well

The trick for the wrapper, is to know when to to "gently kill" the sidecar process. We could do the same as we do with the entrypoint wrapper, using a file to communicate.
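
Here is what such a wrapper could look like, assuming a file-based stop signal like the one the entrypoint wrapper uses; the /tekton/downward/stop path is hypothetical:

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// Start the real sidecar command, passed to us as arguments.
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		os.Exit(1)
	}

	// Poll for a "please stop" file written when the steps have finished,
	// mirroring how the entrypoint wrapper communicates via files.
	go func() {
		for {
			if _, err := os.Stat("/tekton/downward/stop"); err == nil {
				cmd.Process.Signal(syscall.SIGTERM) // the gentle kill
				return
			}
			time.Sleep(time.Second)
		}
	}()

	// Pop the sidecar's exit status up, as described above.
	if err := cmd.Wait(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			os.Exit(exitErr.ExitCode())
		}
		os.Exit(1)
	}
}
```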

@shahnewazrifat

Currently, the sidecar is not behaving like a sidecar should.
The sidecar should be able to choose when to kill itself. In my use case, I'm trying to process (conversion of format, etc.) and upload some test results even if the test step failed.

So, if there were no nop image hack in place, my sidecar would just complete its task and the container status would be terminated. Isn't it better to leave it up to the user how and when to terminate the sidecar?

If it's absolutely necessary for Tekton to take this responsibility to kill the sidecar, can we at least have an option to disable it?

Like inside the sidecar definition, a flag like:
forceTermination: false
which would disable the nop image replacement mechanism.

Some more insight into why I think the newly introduced "Finally" wasn't good enough for me to do the same:
Finally is a separate task. We're trying to avoid using a PVC and moved to a single-task design for our pipeline, since PVC creation can take time. With Finally, if I want to share disk, I must use a shared workspace.

This forces us to modify certain Dockerfiles' WORKDIR and even COPY commands to make sure they are within /workspace. It can also get messy since the workspace already has the original repo code cloned into it and now we're copying something unrelated inside.

Thanks!

@shahnewazrifat

shahnewazrifat commented Jul 15, 2020

Any thoughts, @vdemeester / @sbwsg?
Sorry about the noise :D

I can contribute to this fix if you agree.
I'd prefer a ForceSidecarTermination bool flag in TaskRunSpec, true by default, to give the user control over disabling the nop image swapping.

@ghost

ghost commented Jul 15, 2020

Hey, no apology necessary, it sometimes takes us a while to give new designs a proper look.

I am not opposed to the idea of introducing a mechanism to disable the "nop image swap" and let sidecars terminate themselves.

There is no workaround that I can think of which would work today without code changes. The reason this is so tricky is the requirement for the Sidecar to run to completion even if a Step failed. If this weren't a constraint, you could program your last Step to wait for a signal from the Sidecar before the Step and Task end. But unfortunately Steps after the failed one will simply not run, so the Task will end immediately, killing the Sidecar no matter what.

Looking at the code as it currently stands, Sidecar is an alias of Step, so we'd need to make Sidecar its own struct to add this new Sidecar-specific field (see the sketch below). I like the idea that this is configured on a per-sidecar basis, at least in the initial version. Keeps the scope of change relatively small, I think.
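
For illustration, something like the following; the field name mirrors the proposal above, the package is assumed to be the existing API package where Step lives, and none of this is actual Tekton API:

```go
package v1beta1

// Sidecar, split out of the current `type Sidecar = Step` alias so it can
// carry sidecar-specific fields. Hypothetical sketch only.
type Sidecar struct {
	Step `json:",inline"` // embed the existing Step fields

	// ForceTermination, when set to false, would disable the nop-image
	// swap and let the sidecar exit on its own. Defaults to true.
	ForceTermination *bool `json:"forceTermination,omitempty"`
}
```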

So the best next step would be to write a short design doc and present it to the API Working Group to propose the idea. If you can't make it to a WG feel free to write the doc and I'd be happy to communicate it out.

@shahnewazrifat

Thank you so much @sbwsg !
Really appreciate it.

I'll follow your advice and ping you back if needed

@skaegi
Contributor

skaegi commented Jul 27, 2020

@shahnewazrifat when you added a lifecycle/preStop hook to your sidecar, what did you see? Did it just not get called at all when we do the nop image swap?

@shahnewazrifat

shahnewazrifat commented Jul 28, 2020

@skaegi Precisely. It never got called. Here's the explanation from the Kubernetes docs. Looks like it only gets triggered under certain conditions:

PreStop

This hook is called immediately before a container is terminated due to an API request or management event such as liveness probe failure, preemption, resource contention and others. A call to the preStop hook fails if the container is already in terminated or completed state. It is blocking, meaning it is synchronous, so it must complete before the call to delete the container can be sent. No parameters are passed to the handler.

A more detailed description of the termination behavior can be found in Termination of Pods.

Interesting part being:

A call to the preStop hook fails if the container is already in terminated or completed state.
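
For reference, here is roughly how such a preStop hook attaches to a container, expressed with corev1 Go types (the names and command are placeholders). Per the docs quoted above, it never fires in this scenario: the container is already terminated/completed by the time any deletion flow would run the hook.

```go
package sidecar

import corev1 "k8s.io/api/core/v1"

// sidecarWithPreStop shows where the hook sits; names are placeholders.
func sidecarWithPreStop() corev1.Container {
	return corev1.Container{
		Name:  "uploader",
		Image: "example.com/uploader:latest", // hypothetical image
		Lifecycle: &corev1.Lifecycle{
			// Recent client-go calls this type LifecycleHandler;
			// versions before k8s 1.23 named it Handler.
			PreStop: &corev1.LifecycleHandler{
				Exec: &corev1.ExecAction{
					Command: []string{"sh", "-c", "upload-remaining-results"},
				},
			},
		},
	}
}
```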

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale label Oct 26, 2020
@ghost

ghost commented Oct 26, 2020

/remove-lifecycle stale
/lifecycle frozen

Open TEP: tektoncd/community#191

@tekton-robot tekton-robot added the lifecycle/frozen label and removed the lifecycle/stale label Oct 26, 2020
@ghost ghost removed their assignment Jun 21, 2021
@bobcatfish bobcatfish added the kind/feature label and removed the kind/bug label Jul 15, 2021
@lbernick
Member

Kubernetes is introducing support for sidecar containers in alpha in the 1.28 release (kubernetes/enhancements#753). Once this feature is more stable, it could help us address this issue by replacing our logic that terminates sidecars using the nop image.
