Docker exec always returns 0 - ignores exit code of process #5692

Closed
corrieb opened this issue Jul 11, 2017 · 19 comments · Fixed by #7969
Labels: component/portlayer/execution, component/tether, kind/defect (Behavior that is inconsistent with what's intended), priority/p0, source/customer (Reported by a customer, directly or via an intermediary), team/foundation

Comments

@corrieb
Contributor

corrieb commented Jul 11, 2017

User Statement:

As a container developer, I need to know the exit code of a process so that I can determine success or failure of the command I've run via docker exec.

Details:

Docker exec is the most obvious and cleanest way to test if a container has come up correctly. This is particularly important in multi-container deployments as you may need to block one container starting until another has come up.

Eg.

while true; do
   # Poll until mysqladmin succeeds inside the db container
   docker exec -it db mysqladmin --user="$DB_USER" --password="$DB_PASSWORD" version
   if [ $? -eq 0 ]; then
      break
   fi
   sleep 5
done

As it stands - with master build 12207 - docker exec always returns 0. This is not consistent with the behavior of docker engine. Makes no difference whether -it is used or not.

Obviously in the case of -d, we wouldn't expect this to work because docker exec exits before the process completes.
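
For illustration, here is a bounded variant of the polling loop above, written as a minimal sketch. The helper name `wait_for` and its arguments are hypothetical and not part of VIC or Docker. It relies on the wrapped command propagating its exit code, so on an affected build it would return success on the first attempt, which is exactly the problem described here.

```shell
#!/bin/bash
# Hypothetical helper: retry a command until it exits 0 or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Skip the sleep after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example call (assumes a container named db, as in the loop above):
#   wait_for 12 5 docker exec db mysqladmin --user="$DB_USER" --password="$DB_PASSWORD" version
```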

There is a simple testcase for this:

$ more Dockerfile 
FROM debian
COPY exit.sh /

$ more exit.sh 
#!/bin/bash
echo returning $1
exit $1

$ docker run --name test -d bensdoings/exit sleep 1000
b7b22cfb2271ee4d0e8b398b7df52869b7e791eddaf54913fc7367d5b1312722
$ docker exec test /exit.sh 5
returning 5
$ echo $?
0

Acceptance Criteria:

It should be relatively easy to test for this. We should add a regression test based on the test case above.
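
A regression check along these lines could be sketched as follows. This is only an illustration under stated assumptions, not the project's actual test: the function names `check_exit_code` and `run_exec_regression` are hypothetical, and the driver assumes a running container that has `/exit.sh` from the test case above.

```shell
#!/bin/bash
# Hypothetical helper: run a command and compare its exit status to an expected value.
# Usage: check_exit_code <expected> <command> [args...]
check_exit_code() {
    local expected="$1" actual=0
    shift
    "$@" >/dev/null 2>&1 || actual=$?
    if [ "$actual" -ne "$expected" ]; then
        echo "FAIL: expected exit code $expected, got $actual: $*"
        return 1
    fi
    echo "PASS: exit code $actual: $*"
}

# Hypothetical driver: assumes a running container with /exit.sh installed,
# for example the "test" container started in the test case above.
run_exec_regression() {
    local container="$1" code
    for code in 0 1 5 42; do
        check_exit_code "$code" docker exec "$container" /exit.sh "$code" || return 1
    done
}

# Example call: run_exec_regression test
```

With the behavior reported in this issue, every non-zero case would be expected to fail, because `docker exec` reports 0 regardless of what `/exit.sh` returns.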

Making this high priority. Given that we haven't implemented health check, the ability to determine health by running docker exec is an important capability.

@corrieb corrieb added component/portlayer/execution component/tether kind/defect Behavior that is inconsistent with what's intended priority/p0 labels Jul 11, 2017
@corrieb
Contributor Author

corrieb commented Jul 11, 2017

@caglar10ur believes this may be fixed. Will re-try tomorrow, but leave the bug open

@caglar10ur
Contributor

Not just a belief: the following is what I get with current master :)

[vagrant@devbox:/opt/go/src/github.com/vmware/vic(master)] docker exec -it a sh -c 'exit 4'
[vagrant@devbox:/opt/go/src/github.com/vmware/vic(master)] echo $?
4
[vagrant@devbox:/opt/go/src/github.com/vmware/vic(master)]

@corrieb
Contributor Author

corrieb commented Jul 11, 2017

@caglar10ur if I knew of a specific commit this was tied to, I'd have more confidence :) Seeing is believing

@anchal-agrawal
Contributor

Build 12207 (https://ci.vcna.io/vmware/vic/12207) is at addc8b1.

@corrieb
Contributor Author

corrieb commented Jul 12, 2017

@caglar10ur I built the latest master myself and got the same result. Tested on ESX and it works. Tested on vSphere and it doesn't. That's why we were seeing different results.

@corrieb corrieb changed the title Docker exec always returns 0 - ignores exit code of process Docker exec always returns 0 - ignores exit code of process (works on ESX, not vSphere) Jul 12, 2017
@corrieb corrieb changed the title Docker exec always returns 0 - ignores exit code of process (works on ESX, not vSphere) Docker exec always returns 0 - ignores exit code of process Jul 12, 2017
@mdubya66 mdubya66 added impact/doc/note Requires creation of or changes to an official release note priority/p2 and removed priority/p0 labels Aug 2, 2017
@mdubya66
Contributor

mdubya66 commented Aug 2, 2017

Belief is this is tied to vSphere host sync delay.

@stuclem
Contributor

stuclem commented Sep 12, 2017

An attempt at a release note entry:


  • docker exec always returns 0 and ignores the exit code of processes. #5692
    docker exec always returns 0, even if you specify -it. This is potentially due to a delay in vSphere host synchronization.

@corrieb @anchal-agrawal @mdubya66 is this OK? Thanks!

@stuclem stuclem closed this as completed Sep 13, 2017
@stuclem stuclem reopened this Sep 13, 2017
@stuclem
Contributor

stuclem commented Sep 18, 2017

@caglar10ur can you also please take a look at the release note above?

@stuclem stuclem removed the impact/doc/note Requires creation of or changes to an official release note label Oct 3, 2017
@shadjiiski

@stuclem, some additional info: the docker exec functionality is exposed in Admiral in the form of a health configuration for a command-based healthcheck. More technically, a user-defined command is executed in the container and the healthcheck action is successful if the exit code of that command is 0. Because of this issue, command-based healthcheck is always successful for containers provisioned on affected VCH hosts, even if the user-specified command does not exist in the scope of the container.

If we are documenting this as a known issue, we should probably add something about the command-based health configuration as well. If you are not familiar with the feature, some information is available in the GitHub wiki. cc @sergiosagu

@stuclem
Contributor

stuclem commented Jan 23, 2018

Thanks @shadjiiski and @sergiosagu. I updated the Release Note as follows:

  • docker exec always returns 0 and ignores the exit code of processes. #5692
    docker exec always returns 0, even if you specify -it. This issue is potentially due to a delay in vSphere host synchronization. If you configure command-based health checks in vSphere Integrated Containers Management Portal, the health checks are always successful for containers that are provisioned on affected VCHs, even if the user-specified command does not exist in the scope of the container. This is because command-based health checks are considered to be successful if the exit code of that command is 0.

Is this OK? Do we need to include this in the Admiral 1.3.0 RNs too?

@shadjiiski

@stuclem, thanks for the update, looks good. Yes, please update the Admiral release notes as well.

@stuclem
Contributor

stuclem commented Jan 23, 2018

@shadjiiski, done: https://github.com/vmware/admiral/releases/tag/vic_v1.3.0

Thanks!

@hickeng hickeng added source/customer Reported by a customer, directly or via an intermediary priority/p0 and removed priority/p2 labels Apr 26, 2018
@hickeng
Member

hickeng commented Apr 26, 2018

This is now blocking a product go-live

@gigawhitlocks gigawhitlocks self-assigned this Apr 26, 2018
@zjs
Member

zjs commented Apr 27, 2018

This is now blocking a product go-live

It seems like a user could wrap whatever command they want to run with logic to always print the return code via standard out. Then, whatever logic wants to check the command could look at the last line of exec's standard out instead of exec's RC.

This is inelegant, but it seems like it would unblock things in the short term.
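
A minimal sketch of that workaround, assuming the wrapped command finishes by printing a marker line of the form `rc=N`. The helper name `exec_with_stdout_rc` and the `rc=` marker are hypothetical choices made for this illustration; nothing here is provided by Docker or VIC.

```shell
#!/bin/bash
# Hypothetical helper: run a command whose last output line is "rc=N" and
# return N as the exit status, ignoring the runner's own exit code.
exec_with_stdout_rc() {
    local output last rc
    output=$("$@") || true
    last=$(printf '%s\n' "$output" | tail -n 1)
    case $last in
        rc=*) rc=${last#rc=} ;;
        *) echo "no rc marker in output" >&2; return 125 ;;
    esac
    case $rc in
        ''|*[!0-9]*) echo "malformed rc marker: $last" >&2; return 125 ;;
    esac
    # Pass through the command's output without the marker line.
    printf '%s\n' "$output" | sed '$d'
    return "$rc"
}

# Example call (single quotes so that $? is expanded inside the container):
#   exec_with_stdout_rc docker exec db sh -c 'mysqladmin version; echo "rc=$?"'
```

This only works when the container image has a shell and when the caller controls how the command is invoked, so it does not help tools that look solely at the exit code of `docker exec`.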

@shadjiiski

It seems like a user could wrap whatever command they want to run with logic to always print the return code via standard out

I just want to clarify that this is not going to resolve the healthcheck issue on the Admiral side without additional effort from the Admiral team (which was not planned for the upcoming release). Admiral checks the exit code of the command and makes no use of its standard output. cc @lazarin, @martin-borisov

Also, I am not sure if VCH now has an equivalent to the native Docker healthcheck, but if it does, I think you might still hit the exact same issue there. According to the Docker docs, the healthcheck also executes a command and checks its exit code.

@hickeng
Member

hickeng commented May 3, 2018

Running healthcheck via an exec is a terrible pattern for a cVM and comes with various overheads. It also means that healthcheck will not continue to run while the endpointVM is down.

If integrating with vSphere HA it makes much more sense for the healthcheck process to be dispatched from within the cVM and tied in to application heartbeat support. In this case the healthcheck on the docker API side can then watch for health alerts instead of performing heavyweight polling.

I would highly recommend some longevity/performance testing on the impact of dispatching an exec into a container every few minutes if Admiral is using this mechanism for health checking. IIRC we are not garbage collecting the exec configurations in any aggressive manner which means that list will grow significantly over time. Other than that testing I would not conflate this issue with healthcheck at all.

@gigawhitlocks
Contributor

I have a proof of concept fix for this bug stored locally. I'm going to push that to a branch today and @mavery will be taking the lead on turning that stopgap fix into a maintainable redesign of the exec flow.

@matthewavery matthewavery self-assigned this May 9, 2018
@matthewavery
Contributor

To move forward on this ticket, the first step is to create a design doc for the life cycle of a process in the container. We must design a path forward for handling all the transitions a process goes through in the tether/cVM. That is the step I am taking now; I will verify the design (with @hickeng) and then link it here. From there we can look at the potential patch that should work, and design it in such a way as to avoid creating more tech debt in a place where we really need to bring order to create stability. cc @mdubya66 @hickeng @gigawhitlocks

@sgairo
Contributor

sgairo commented May 9, 2018

Increasing estimate to 5 in order to account for design.
