I've found that docker hosts crash and burn - sometimes they just stop accepting connections, sometimes they accept connections but never answer (hence the need for a readTimeout), sometimes they answer but always fail.
I've also found that it's dead easy to enter in a template that doesn't work and always fails to spin up, e.g. the image can't be pulled.
In both cases, the docker plugin will blindly ignore the failures and continue to try to spin up containers. It will also continue to tell Jenkins "yes, I can do this" (i.e. `DockerCloud.canProvision(Label)` will return true) even when it's doomed to failure. This means that Jenkins doesn't ask any other cloud to try, so one failing cloud/template can take out your entire Jenkins server.
We need to fix that - one failing template or DockerCloud shouldn't stop the other templates/clouds from being able to shoulder the load.
Proposal:
Add a back-off period (in seconds) to the cloud configuration. Default it to a minute or two.
If a template fails, we record the failure and the failure time in a transient field (so it isn't persisted) in the template.
When DockerCloud.canProvision(Label) is called, it takes into account the time of the last failure and excludes any templates that failed too recently. This may well mean that we answer "no" when we would otherwise say "yes".
When DockerCloud.provision(...) is called, we sort our templates in order of last failure so that templates that have never failed come first and the most recently failed templates are considered last. That way we'll allow all templates to be tried in sequence until we find one that works or end up ruling them all out.
More serious issues, such as timeouts or a failure to log in, should be recorded against the cloud in similar (transient) fields and should rule out the entire cloud for the back-off period.
The configuration UI should show these issues, allowing the administrator to see the last exception(s) that occurred and when they happened, to guide them in how to fix things. For example, if a template can't work because we've got invalid (or missing) credentials to the registry, that should show up in the UI for the template. If a docker host locks up and stops responding, causing timeouts when talking to it, that should show up in the UI for the cloud as a whole. (This would tie in nicely with RFE: Add a "Test Template" button to the configuration UI #625.)
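To make the template-level part of the proposal concrete, here is a rough sketch of the back-off check and the failure-ordered sort. It is not the plugin's actual code: `TemplateSketch`, `CloudSketch`, `recordFailure`, `isBackingOff` and `provisioningOrder` are hypothetical names invented for illustration, label matching is left out, and the real `DockerCloud.canProvision(Label)` and `DockerCloud.provision(...)` would need to be adapted rather than replaced.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a template; NOT the plugin's real template class. */
class TemplateSketch {
    final String name;
    // transient: runtime-only failure state, skipped when the configuration is serialized
    private transient Instant lastFailureTime; // null means "has never failed"
    private transient Throwable lastFailure;   // kept so a UI could display it

    TemplateSketch(String name) {
        this.name = name;
    }

    synchronized void recordFailure(Throwable cause, Instant when) {
        this.lastFailure = cause;
        this.lastFailureTime = when;
    }

    synchronized Instant getLastFailureTime() {
        return lastFailureTime;
    }

    synchronized Throwable getLastFailure() {
        return lastFailure;
    }

    /** True if this template failed less than one back-off period before 'now'. */
    synchronized boolean isBackingOff(Instant now, Duration backOff) {
        return lastFailureTime != null && now.isBefore(lastFailureTime.plus(backOff));
    }
}

/** Hypothetical stand-in for the cloud; label matching is omitted for brevity. */
class CloudSketch {
    private final List<TemplateSketch> templates;
    private final Duration backOff; // the proposed per-cloud back-off setting

    CloudSketch(List<TemplateSketch> templates, Duration backOff) {
        this.templates = new ArrayList<>(templates);
        this.backOff = backOff;
    }

    /** Answers "no" when every template has failed too recently. */
    boolean canProvision(Instant now) {
        return templates.stream().anyMatch(t -> !t.isBackingOff(now, backOff));
    }

    /** Never-failed templates first, most recently failed templates last. */
    List<TemplateSketch> provisioningOrder() {
        List<TemplateSketch> ordered = new ArrayList<>(templates);
        ordered.sort(Comparator.comparing(
                TemplateSketch::getLastFailureTime,
                Comparator.nullsFirst(Comparator.<Instant>naturalOrder())));
        return ordered;
    }
}
```

Passing `now` in as a parameter (rather than calling `Instant.now()` inside) is only there to keep the sketch easy to test. The same pair of transient fields on the cloud itself would cover the "rule out the entire cloud" case.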