EOF/connection reset errors in UI w/ Nomad 0.10.4 #557
Comments
We are seeing this with Nomad 0.10.3 and 0.10.5 |
Same problem on 0.11.0 and 0.11.1 |
We've reverted to using the built-in UIs because of this issue. Despite our best efforts, we were unable to work out where the problem lies and couldn't justify investing more time. |
It's a bug in the Nomad API SDK - I believe they might have fixed it recently, let me check |
I believe this is the fix hashicorp/nomad#5970 |
@jippi that seems to be an older bug though and the calls we are seeing returning EOF don't seem to have anything to do with the GC/GcAlloc endpoint, unless there's some underlying logic I am missing? |
I've pushed |
@jippi I've just deployed pr-566 but can see no difference. My initial suspicion was something network-related that may terminate connections prematurely, but running Hashi-UI both as a Docker container and as a binary directly on the system, as well as taking any load-balancing out of the equation, did not solve the problem. There is another clue, at least in our case: this seems to start happening after a short while of scrolling up and down on the Services screen, for example. It then errors for a while, then starts working again for a bit, and so on. Some form of rate-limiting was my initial instinct, but I couldn't find any mechanism that could be causing that behavior. So I'm still in the dark as to what the actual cause may be. |
Hey 👋 This issue affects all Nomad versions from 0.10.3 onward (https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#0103-january-29-2020), where hashicorp/nomad#7002 was introduced.

Findings

We started analysing what has changed in recent Nomad releases. The earliest version reported with this issue in this thread is 0.10.3 (thanks @MattPumphrey), so we found this: Unfortunately this is a breaking change on Nomad's side, as they set the default limits too low:

We think the reason only some people face EOF errors is the number of jobs they are running. In our situation we run quite a lot of them, which is why it appeared quickly for us. We also tested with a lower number of jobs and the default limits, which resulted in no errors, but changing the limits to 50 triggered EOF errors.

Fix

Easiest fix for now is to change Nomad's agent limits to higher or

For hashi-ui itself we could add connection limits as configuration, or reuse connections better so they are not exhausted so quickly. |
Hah, that's some nice sleuthing @melkorm. Thank you! We've got a pretty large number of jobs, so I'm guessing that's why we saw it straight off. Am I being incredibly dumb, or is this completely lacking from the Nomad documentation? We went over it with a fine-tooth comb to make sure we weren't missing any limit settings, and I've just done it again now and still can't find any reference to these configuration parameters. |
@thisisjaid https://www.nomadproject.io/docs/configuration/#limits it's in the agent's configuration docs. Unfortunately Nomad absolutely fails at versioning their docs, and finding out which features landed when ends with going through the changelog :/ Once I crashed a whole cluster by adding configuration for logs in JSON format 🙈 because we were running a lower version than the one in which this option was introduced. |
Bah, the one damn page I didn't go through, as I assumed it was just generic information based on the misleading header ("Overview"). Agreed on doc versioning; I've had trouble with that as well, with the auto_revert job spec option if I remember correctly. I've recently pointed out some other issues with the telemetry config documentation. Nomad docs could generally use some work. In any case, thanks a bunch for tracking this down! This can probably be closed now. |
Has anyone experimented with increasing HTTP and RPC limits without setting them to 0? I suspect it's very environment dependent (number of jobs running, potentially resources allocated to hashi-ui, etc). I have been experiencing this in our prod environment since upgrading to 0.10.4, and deployed new limits of 500 each for HTTP and RPC today. So far the errors have stopped. I'm curious whether this will be a fix, or just a band-aid that keeps the errors from popping up as quickly. I'm hoping it will smooth things over until the number of jobs we have running significantly increases. Would appreciate anyone's thoughts! And great find @melkorm !! |
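For anyone landing here later, the raised limits described above might look roughly like this in the Nomad agent configuration. This is a sketch, not a recommendation: the parameter names are the ones documented under the agent `limits` block, the value 500 is simply the one reported in the comment above, and the right number for a given cluster is an open question in this thread. Setting either value to 0 disables that limit entirely.

```hcl
# Nomad agent config fragment (assumed to be merged into the existing agent config).
limits {
  # Max concurrent HTTP connections per client IP; 500 is the value tried above.
  http_max_conns_per_client = 500

  # Max concurrent RPC connections per client IP; 500 is the value tried above.
  rpc_max_conns_per_client = 500
}
```

The agent needs a restart (or whatever rollout procedure you normally use) for the new limits to apply, and the `limits` block only exists from Nomad 0.10.3 onward.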
We still keep this at I think we can close this issue, perhaps we could add something to readme about it if anyone runs into it in the future. |
We have raised the Nomad limits, but with too many services this still happens. The biggest problem is that during a production outage Nomad is rescheduling many services to new hosts while our team is troubleshooting at the same time. The result is so many open connections to the Nomad servers that Nomad itself becomes unstable. Is there a way to reuse hashi-ui connections to Nomad, so that there aren't so many? I mean, we can switch to the official Nomad UI, but it lacks a lot of stuff. |
I have the same problem, but I don't have Nomad, only Consul. I increased |
I still see this issue with nomad 1.0.3 and consul 1.9.3 |
We've just upgraded our staging environment to Nomad 0.10.4, and post-upgrade we have started seeing seemingly random errors in HashiUI on the Nomad UI side of the app. The full error received in the UI is:
Get http://172.17.0.1:4646/v1/job/some-job-name?namespace=default&region=global&stale=&wait=120000ms: EOF
Network communication across the board seems ok, we've tested manual requests to the API from both within the HashiUI container and outside, we've also tested bypassing the nginx load-balancer that upstreams to the HashiUI app and we get the same result.
Nomad 0.10.4
Consul 1.7.2
docker 17.04.0-ce
hashiui image - jippi/hashi-ui:pr-556
Anyone else seeing anything similar with 0.10.4?