RFE: Avoid using broken templates/clouds #626

pjdarton · 2018-02-22T17:16:28Z

I've found that docker hosts crash and burn - sometimes they just stop accepting connections, sometimes they accept connections but never answer (hence the need for a readTimeout), sometimes they answer but always fail.
I've also found that it's dead easy to enter a template that doesn't work and always fails to spin up, e.g. because the image can't be pulled.

In both cases, the docker plugin blindly ignores the failures and keeps trying to spin up containers. It also keeps telling Jenkins "yes, I can do this" (i.e. `DockerCloud.canProvision(Label)` returns true) even when the attempt is doomed to fail. As a result, Jenkins doesn't ask any other cloud to try, so one failing cloud/template can take out your entire Jenkins server.
We need to fix that - one failing template or DockerCloud shouldn't stop the other templates/clouds from being able to shoulder the load.

Proposal:

Add a back-off period (in seconds) to the cloud configuration. Default it to a minute or two.
If a template fails, we record the failure and the failure time in a transient field (so it isn't persisted) in the template.
When DockerCloud.canProvision(Label) is called, it takes into account the time of the last failure and excludes any templates that failed too recently. This may well mean that we answer "no" when we would otherwise say "yes".
When DockerCloud.provision(...) is called, we sort our templates in order of last failure so that templates that have never failed come first and the most recently failed templates are considered last. That way we'll allow all templates to be tried in sequence until we find one that works or end up ruling them all out.
More serious issues, like timeouts or a failure to log in, should be recorded against the cloud in similar (transient) fields and should rule out the entire cloud for the back-off period.
The configuration UI should show these issues, allowing the administrator to see the last exception(s) that occurred and when they happened, to guide them in how to fix things. For example, if a template can't work because we've got invalid (or missing) credentials to the registry, that should show up in the UI for the template. If a docker host locks up and stops responding, causing timeouts when talking to it, that should show up in the UI for the cloud as a whole. (This would tie in nicely with RFE: Add a "Test Template" button to the configuration UI #625.)
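
To make the proposal above a bit more concrete, here is a rough sketch of the bookkeeping it describes. It is not the plugin's actual code: `FailureTracker`, `SketchTemplate`, `SketchCloud` and `provisioningOrder` are hypothetical names invented for illustration, labels are simplified to plain strings, and the current time is passed in explicitly so the behaviour is easy to check. Skipping templates that are still backing off inside `provisioningOrder` is also an assumption on my part; the proposal only asks for the sort.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: remembers only the most recent failure. */
class FailureTracker {
    private Instant lastFailure;      // null means "has never failed"
    private Throwable lastException;  // kept so a config UI could display it

    synchronized void recordFailure(Instant when, Throwable cause) {
        lastFailure = when;
        lastException = cause;
    }

    synchronized Instant getLastFailure() { return lastFailure; }

    synchronized Throwable getLastException() { return lastException; }

    /** True while we are still inside the back-off window after a failure. */
    synchronized boolean isBackingOff(Instant now, Duration backOff) {
        return lastFailure != null && now.isBefore(lastFailure.plus(backOff));
    }
}

/** Stand-in for a template; not the plugin's real template class. */
class SketchTemplate {
    final String label;
    private transient FailureTracker failures; // transient, so never persisted

    SketchTemplate(String label) { this.label = label; }

    synchronized FailureTracker failures() {
        if (failures == null) failures = new FailureTracker();
        return failures;
    }
}

/** Stand-in for a cloud; not the plugin's real DockerCloud. */
class SketchCloud {
    private final List<SketchTemplate> templates;
    private final Duration backOff;            // the configurable back-off period
    private transient FailureTracker failures; // cloud-wide problems, e.g. timeouts

    SketchCloud(List<SketchTemplate> templates, Duration backOff) {
        this.templates = templates;
        this.backOff = backOff;
    }

    synchronized FailureTracker failures() {
        if (failures == null) failures = new FailureTracker();
        return failures;
    }

    private boolean usable(SketchTemplate t, String label, Instant now) {
        return t.label.equals(label) && !t.failures().isBackingOff(now, backOff);
    }

    /** Answers "no" if the cloud, or every matching template, failed too recently. */
    boolean canProvision(String label, Instant now) {
        if (failures().isBackingOff(now, backOff)) return false;
        for (SketchTemplate t : templates) {
            if (usable(t, label, now)) return true;
        }
        return false;
    }

    /** Never-failed templates first, most recently failed templates last. */
    List<SketchTemplate> provisioningOrder(String label, Instant now) {
        List<SketchTemplate> candidates = new ArrayList<>();
        for (SketchTemplate t : templates) {
            if (usable(t, label, now)) candidates.add(t); // assumption: skip backing-off ones
        }
        candidates.sort(Comparator.comparing(
                (SketchTemplate t) -> t.failures().getLastFailure(),
                Comparator.nullsFirst(Comparator.<Instant>naturalOrder())));
        return candidates;
    }
}
```

A provisioning loop would then walk `provisioningOrder(...)`, call `recordFailure(...)` on a template when its container fails to start (or on the cloud when the host itself times out), and move on to the next candidate. The stored exception and timestamp are what the configuration UI could display.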

pjdarton · 2018-03-27T11:27:03Z

Merged.
It'll be in the next release.

pjdarton mentioned this issue Feb 22, 2018

Prevent over-provisioning #622

Merged

pjdarton self-assigned this Mar 8, 2018

pjdarton mentioned this issue Mar 9, 2018

Avoid broken clouds and templates #634

Merged

pjdarton closed this as completed Mar 27, 2018

pjdarton mentioned this issue Dec 10, 2019

RFE: Avoid using broken templates/clouds jenkinsci/openstack-cloud-plugin#280

Open

pjdarton mentioned this issue Aug 13, 2020

RFE: Add a "Test Template" button to the configuration UI #625

Open
