RFE: Avoid using broken templates/clouds #626

Closed
pjdarton opened this issue Feb 22, 2018 · 1 comment

@pjdarton
Member

I've found that docker hosts crash and burn - sometimes they just stop accepting connections, sometimes they accept connections but never answer (hence the need for a readTimeout), sometimes they answer but always fail.
I've also found that it's dead easy to enter a template that doesn't work and always fails to spin up, e.g. because the image can't be pulled.

In both cases, the docker plugin blindly ignores the failures and keeps trying to spin up containers. It also keeps telling Jenkins "yes, I can do this" (i.e. `DockerCloud.canProvision(Label)` returns true) even when it's doomed to failure. That means Jenkins doesn't ask any other cloud to try, so one failing cloud/template can take out your entire Jenkins server.
We need to fix that - one failing template or DockerCloud shouldn't stop the other templates/clouds from being able to shoulder the load.

Proposal:

  • Add a back-off period (in seconds) to the cloud configuration. Default it to a minute or two.
  • If a template fails, we record the failure and the failure time in a transient field (so it isn't persisted) in the template.
  • When DockerCloud.canProvision(Label) is called, it takes into account the time of the last failure and excludes any templates that failed too recently. This may well mean that we answer "no" when we would otherwise say "yes".
  • When DockerCloud.provision(...) is called, we sort our templates in order of last failure so that templates that have never failed come first and the most recently failed templates are considered last. That way we'll allow all templates to be tried in sequence until we find one that works or end up ruling them all out.
  • More serious issues, like timeouts or a failure to log in, should be recorded against the cloud in similar (transient) fields and should rule out the entire cloud for a period.
  • The configuration UI should show these issues, allowing the administrator to see the last exception(s) that occurred and when they happened, to guide them in how to fix things. e.g. if a template can't work because we've got invalid (or missing) credentials to the registry, that should show up in the UI for the template. If a docker host locks up and stops responding, causing timeouts when talking to it, that should show up in the UI for the cloud as a whole. (This would tie in nicely with RFE: Add a "Test Template" button to the configuration UI #625.)
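
To make the proposal above concrete, here is a rough sketch of the back-off bookkeeping in plain Java. It is not the plugin's code: `SketchTemplate`, `SketchCloud`, `recordFailure`, `isBackingOff` and `candidates` are hypothetical names invented for illustration, labels are simplified to plain strings, and the current time is passed in explicitly so the behaviour is easy to check.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a template; not the plugin's DockerTemplate. */
final class SketchTemplate {
    final String label;
    // transient, so the failure state would not be persisted with the configuration
    transient Instant lastFailure;      // null means "has never failed"
    transient Throwable lastException;  // kept so a configuration UI could display it

    SketchTemplate(String label) { this.label = label; }

    void recordFailure(Throwable cause, Instant when) {
        lastException = cause;
        lastFailure = when;
    }

    /** True while the most recent failure is still inside the back-off window. */
    boolean isBackingOff(Instant now, Duration backOff) {
        return lastFailure != null && now.isBefore(lastFailure.plus(backOff));
    }
}

/** Hypothetical stand-in for a cloud; not the plugin's DockerCloud. */
final class SketchCloud {
    final Duration backOff;             // the configurable back-off period
    final List<SketchTemplate> templates = new ArrayList<>();
    transient Instant lastFailure;      // cloud-wide problems, e.g. timeouts
    transient Throwable lastException;

    SketchCloud(Duration backOff) { this.backOff = backOff; }

    void recordFailure(Throwable cause, Instant when) {
        lastException = cause;
        lastFailure = when;
    }

    /** Answers "no" if the whole cloud, or every matching template, failed too recently. */
    boolean canProvision(String label, Instant now) {
        if (lastFailure != null && now.isBefore(lastFailure.plus(backOff))) {
            return false;
        }
        return templates.stream()
                .anyMatch(t -> t.label.equals(label) && !t.isBackingOff(now, backOff));
    }

    /** Matching templates: never-failed first, most recently failed last. */
    List<SketchTemplate> candidates(String label) {
        return templates.stream()
                .filter(t -> t.label.equals(label))
                .sorted(Comparator.comparing(
                        (SketchTemplate t) -> t.lastFailure,
                        Comparator.nullsFirst(Comparator.<Instant>naturalOrder())))
                .collect(Collectors.toList());
    }
}
```

In this sketch a provisioning loop would walk `candidates(label)` in order, call `recordFailure` on any template that throws, and stop at the first one that works. Keeping the last exception next to its timestamp is what would let the configuration UI show both.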
@pjdarton
Member Author

Merged.
It'll be in the next release.
