Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Container cannot connect to node because it doesn't exist #757

Closed
mat1e opened this issue Oct 16, 2019 · 11 comments · Fixed by #789
Closed

Container cannot connect to node because it doesn't exist #757

mat1e opened this issue Oct 16, 2019 · 11 comments · Fixed by #789
Labels
bug An issue reporting a bug or a PR fixing one.

Comments

@mat1e
Copy link
Member

mat1e commented Oct 16, 2019

We recently updated our version of Jenkins to 2.176.3. And now a connection error with docker-agent randomly block the queue of jobs :

Refusing headers from remote: Unknown client name: docker-00026wu6nor9w

The docker container is ready and try to connect to the Jenkins master but the node doesn't exist yet.

I saw in the code of docker-plugin that the container is created and started before the Jenkins node. While the connection method is JNLP, the commands to download and run the remoting.jar are executed at the start of the container. But at this moment, the node wasn't added to Jenkins master.

Have you ever encountered this error? Is there a solution?

Is it possible to modify provision methods and create the Jenkins node before instanciate the container to fix this issue?

Jenkins version : 2.176.3
docker-plugin version : 1.1.7
docker engine version : 1.13.1

@mat1e
Copy link
Member Author

mat1e commented Oct 16, 2019

We are actually testing "Attach Docker container" wich seems to be a solution. But, in the documentation of the plugin, this functionnality is marked as experimental. Is this still the case ?

@pjdarton
Copy link
Member

pjdarton commented Oct 16, 2019

It's the newest option.
When it was originally introduced it had a bug (#628) that caused it to fail sometimes if the network wasn't perfect between Jenkins & docker, but that's now been fixed (in PR #693).
I am not aware of there being any other problems with it, but it hasn't had the same number of years of use that the other methods have.

I would suggest that you continue to use it and, if you do find bugs, you raise a bug report (providing awesome levels of clarity to make it easy to fix) ... and ideally also follow that with a PR that fixes the issue too.

@pjdarton
Copy link
Member

As for fixing the problem you saw with JNLP, it may be curable by changing the plugin code, but it's a tricky one.
It's tricky because, until the container has started, we don't know everything we need to know about the slave node we return to Jenkins ... but, if the container runs very quickly and has no retry logic inside it then we need to pass the slave node to Jenkins before we start the container.
i.e. both events have reasons why they need to happen before the other one...

@mat1e
Copy link
Member Author

mat1e commented Oct 17, 2019

Thank you for your help. I tried to change code by myself and indeed, it is tricky and have a lot of impacts on others methods of connection.
We tested "Attach Docker container" and it seems to solve our problem. So, we decide to gradually migrate our Docker Templates to this method. I will open bug report and suggest PR if we encounter any problem with it. For now I close this issue.

@mat1e mat1e closed this as completed Oct 17, 2019
@akomakom
Copy link

@pjdarton can you point me in the direction of this fix? I wasn't able to find it in jenkins jira.

When it was originally introduced it had a bug that caused it to fail sometimes if the network wasn't perfect between Jenkins & docker, but that's now been fixed.

@pjdarton
Copy link
Member

I've updated my comment (above) to include links.

@akomakom
Copy link

I've updated my comment (above) to include links.

I've seen other exceptions while using Attach, so I'm sticking with JNLP. For the record, I am using this workaround for this "node doesn't exist" problem with JNLP, and it's (so far) more reliable than Attach.

@pjdarton
Copy link
Member

So, I've been thinking ... and I suspect that there may be a few things we can do to improve this.

  1. Make the container-id field in the slave class settable after it's created (if it isn't already). The setter could throw if it's set twice, but we need to be able to create the slave node instance and then set the container id later.
  2. Ensure that all code that gets the container-id copes with it being null, mark the field as "might be null" etc. That'll affect the garbage-collection code and the UI jelly code, possibly more.
  3. Create the slave node instance without the container-id.
  4. Change the docker plugin's logic so that, for a JNLP-connected slave, the "pre-creation" logic adds the slave node to Jenkins before the container-id is known.
  5. Set the container-id the moment it's known (i.e. immediately after the "docker run" command completes).

In parallel with that, we:

  1. work on a separate PR for the official JNLP docker slave to add support for configurable retries (non-negative integer defaulting to none) and retry-delay (decimal number defaulting to 1 second and permitting both larger and smaller numbers but excluding zero) and, if that gets a favorable reception, we ...
  2. ... enhance this plugin with support for that configuration and have this plugin default to 20 retries half a second apart for newly-configured templates (i.e. existing templates should remain "no retries").
  3. ...and ensure that the new fields have nice help-text explaining what they do ... and ensuring that the "advanced" JNLP options for a custom start command also supports these too.

Disclaimer: When I say "we" I don't mean I'm promising to drive this myself (I've got a ton of not-docker-plugin work to do so I'm going to have to keep my time on this to a minimum) and, as you've got a decent workaround that works for you, I'm not really expecting you to do it either ... but if anyone has the time & inclination to sort this then I am happy to discuss/consult/advise/review changes.

Re: "I've seen other exceptions while using Attach"
Please provide details! While I understand that it isn't ideal for some use-cases, there are plenty of users for whom it is the best option (e.g. it's the only one that's guaranteed to work "out of the box" regardless of network configuration), so if there's issues with it I'd like to be aware of them...

@mat1e
Copy link
Member Author

mat1e commented Dec 20, 2019

@pjdarton, is there a work in progress on an update of JNLP connection ?
I've a workaround on it but I think we don't need to make the container-id field nullable because we get it at the creation.
The script for JNLP is called at the start of the container so we already got the container-id. The update may be to add the node to Jenkins before the container start and not before its creation.

createContainer --> addNode --> startContainer.

I would like your opinion on it.

@pjdarton
Copy link
Member

Yes, I agree that what we need is to "add node" before the container starts ... but for JNLP only (e.g. for SSH we add the node after the container has started and its SSH port is open for business).

As for "progress", my only "progress" on this was to post my thoughts on the subject (above). Been too busy on other things.
I did, at one point, fix the unit-tests though so the main CI build is no longer failing, so if you've got a suggested fix then please submit a PR - it should all stay "green".
...and it should stay green even if you add a SSH-connection test that has no retries and connects just once and connects immediately Jenkins gets the node added. That'll prove that your changes haven't improved JNLP at the expense of SSH.
Ideally, any PR for this would also add a unit-test that demonstrates how the existing code doesn't work so that, if anyone undoes the fix, we get a unit-test failure. At present, despite this issue with the code, the unit-tests are green. I'd guess that, on the cloudbees CI Jenkins, it takes sufficiently long for the CI server's docker daemon to start the container that Jenkins has added the node before the container tries to connect to Jenkins. If we could figure out a unit-test that fails if we don't add the node before we start the container then such a test would be able to prove this issue is fixed...

...but, just to set expectations: I'm off soon, just 2 working days left this year (with those already 100% committed to other tasks), so I'm unlikely to do much until mid-Jan at the earliest.

@mat1e mat1e reopened this Jan 13, 2020
@mat1e
Copy link
Member Author

mat1e commented Jan 13, 2020

@pjdarton,
I have a workaround at this repository.
I would like your opinion on this solution. For the moment, I have not found any other way to do otherwise without impacting attach and jnlp code.
I need to do more tests but this solution is simple and seems to be working.

pjdarton added a commit to pjdarton/docker-plugin that referenced this issue Apr 15, 2020
Refactor DockerTransientNode constructor so the launcher is provided
post-construction.
Refactor DockerComputerConnector.beforeContainerStarted(...) so it takes
a node (that holds a containerId) instead of a containerId.
Refactor DockerTemplate.doProvision(...) so that the DockerTransientNode
is created without a launcher before we start the container.
Enhance DockerComputerJNLPConnector so that it adds the node to Jenkins
immediately before the container is created, ensuring that it'll be
known to Jenkins (just) before the JNLP process starts.
pjdarton added a commit to pjdarton/docker-plugin that referenced this issue Apr 28, 2020
Refactor DockerComputerConnector.beforeContainerStarted(...) and
afterContainerStarted(...) so it takes a node (that holds a containerId)
instead of a containerId.
Refactor DockerTransientNode constructors so the launcher can be
provided post-construction.
Refactor DockerTemplate.doProvision(...) so that the DockerTransientNode
is created without a launcher before we start the container and the
launcher is only set after the container is started.
Enhance DockerComputerJNLPConnector so that it adds the node to Jenkins
immediately before the container is created, ensuring that it'll be
known to Jenkins (just) before the JNLP process starts.
@pjdarton pjdarton added the bug An issue reporting a bug or a PR fixing one. label Aug 13, 2020
pjdarton added a commit that referenced this issue Oct 9, 2020
Refactor DockerComputerConnector.beforeContainerStarted(...) and afterContainerStarted(...) so it takes a node (that holds a containerId) instead of a containerId.
Refactor DockerTransientNode constructors so the launcher can be provided post-construction.
Refactor DockerTemplate.doProvision(...) so that the DockerTransientNode is created without a launcher before we start the container and the launcher is only set after the container is started.
Enhance DockerComputerJNLPConnector so that it adds the node to Jenkins immediately before the container is created, ensuring that it'll be known to Jenkins (just) before the JNLP process starts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An issue reporting a bug or a PR fixing one.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants