
Socket timeouts with HTTP/1.1 pipelining under load #438

Open
atavistock opened this issue Dec 5, 2024 · 11 comments

@atavistock

This is a pretty esoteric edge case, but I'm filing this issue for informational purposes.

In an ongoing effort around performance and load testing, I've found that with HTTP/1.1 pipelining under load we're seeing an exception in about 1 in 1,000 requests. It looks like the connection is closed prematurely, but it's distinctly not a client-side reset (ECONNRESET) or a timeout threshold being hit (ETIMEDOUT). The stack trace is below, followed by a sketch of the kind of pipelined traffic involved.

 22:50:55.858 [error] ** (RuntimeError) Unable to obtain transport_info: :enotconn
     (bandit 1.6.0) lib/bandit/transport_info.ex:26: anonymous fn/1 in Bandit.TransportInfo.init/1
     (bandit 1.6.0) lib/bandit/pipeline.ex:70: Bandit.Pipeline.build_conn!/5
     (bandit 1.6.0) lib/bandit/pipeline.ex:31: Bandit.Pipeline.run/4
     (bandit 1.6.0) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
     (bandit 1.6.0) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
     (bandit 1.6.0) /deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
     (stdlib 6.0.1) gen_server.erl:2163: :gen_server.try_handle_continue/3
     (stdlib 6.0.1) gen_server.erl:2072: :gen_server.loop/7
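For readers unfamiliar with the setup, here is a minimal sketch of what HTTP/1.1 pipelining looks like at the socket level, assuming a plain-HTTP server listening on 127.0.0.1:4000 (the host, port, path, pipeline depth, and module name are placeholders, not details from this report). Several requests are written back-to-back on one connection before any responses are read:

    defmodule PipelineRepro do
      # Placeholder connection details; adjust to the server under test.
      @host {127, 0, 0, 1}
      @port 4000
      @pipeline_depth 10

      def run do
        {:ok, socket} = :gen_tcp.connect(@host, @port, [:binary, active: false])

        request = "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"

        # Write all requests at once so the server sees them queued on the
        # socket, then read back whatever arrives and close, roughly mimicking
        # an aggressive pipelining client.
        :ok = :gen_tcp.send(socket, String.duplicate(request, @pipeline_depth))
        drain(socket)
        :gen_tcp.close(socket)
      end

      defp drain(socket) do
        case :gen_tcp.recv(socket, 0, 1_000) do
          {:ok, _data} -> drain(socket)
          {:error, _reason} -> :ok
        end
      end
    end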
@KazW

KazW commented Dec 7, 2024

We've observed this as well in a high-traffic environment that is basically a cache, where most requests are served within 200-400 µs at a rate of around 5 req/ms. Today we rolled back to Cowboy and the errors dropped to 0.
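For anyone following along, moving a Phoenix app between the two servers is an endpoint adapter change. A minimal sketch, assuming a standard Phoenix app (the my_app and MyAppWeb.Endpoint names are placeholders):

    # config/config.exs: run the endpoint on Bandit
    config :my_app, MyAppWeb.Endpoint,
      adapter: Bandit.PhoenixAdapter

    # ...or roll back to Cowboy, the Phoenix default
    # config :my_app, MyAppWeb.Endpoint,
    #   adapter: Phoenix.Endpoint.Cowboy2Adapter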

@mtrudel
Owner

mtrudel commented Dec 8, 2024

Interesting! This is one of those 'this should never happen' cases. A repro for this would be extremely handy!

@atavistock your stack trace notes Bandit version 1.6.0. Is that because you're running off a branch? As I'm sure you're aware, pipelining support only landed in 1.6.1.

@atavistock
Author

It was the branch that fixed the HTTP/1.1 pipelining problem, before it was released as 1.6.1, so the version was still reported as 1.6.0, but it's the same code as 1.6.1.

@KazW

KazW commented Dec 8, 2024

I dug through our logs to find some representative examples of how this error (or one similar to it) shows up in our systems.

We're using EMQX (an MQTT broker) with a Phoenix app as the authentication/authorization webhook backend for it.
For the endpoints EMQX hits, all the data is cached in ETS (a rough sketch of that kind of endpoint follows below).
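As a hypothetical illustration of such an endpoint (the module, table, and key names are assumptions, not code from this deployment), the webhook handler answers from an in-memory ETS table rather than hitting a database:

    defmodule AuthCache do
      @table :auth_cache

      # Create the table once at application start.
      def init do
        :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
      end

      # Fast in-memory lookup, consistent with the sub-millisecond responses
      # described above.
      def allowed?(client_id) do
        case :ets.lookup(@table, client_id) do
          [{^client_id, :allow}] -> true
          _ -> false
        end
      end
    end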

On October 31st, we switched from Cowboy to Bandit 1.5.7.

Here is a typical example; we see an error on both the client side and the server side (~17 million occurrences over the month we had Bandit in production):

Client:
2024-12-06T21:18:48.331216+00:00 [error] msg: http_server_query_failed, mfa: emqx_authz_http:authorize/4(117), peername: 10.0.104.96:57952, clientid: 9710-001014, reason: {resource_error,#{msg => "resource not connected",reason => not_connected}}, resource: <<"emqx_authz_http:1">>
2024-12-06T21:18:48.545390+00:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: Connection refused">>, name: <<"emqx_authz_http:1">>
2024-12-06T21:18:48.545978+00:00 [warning] msg: health_check_failed, mfa: emqx_resource_manager:handle_connected_health_check/1(883), id: <<"emqx_authz_http:1">>, status: disconnected
2024-12-06T21:18:48.792359+00:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: Connection refused">>, name: <<"emqx_authz_http:1">>

Server:
21:14:48.585 [error] ** (Bandit.HTTPError) Header read timeout
21:14:48.585 [error] ** (Bandit.HTTPError) Header read timeout

There was one crash during the month we had Bandit in production, and it happened after we upgraded to 1.6.1; here are the logs from it:

Client:
2024-12-06T21:18:50.118833+00:00 [warning] msg: emqx_connector_on_start_already_started, mfa: emqx_bridge_http_connector:start_pool/2(253), pool_name: <<"emqx_authz_http:1">>
2024-12-06T21:18:50.132837+00:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(530), reason: econnrefused, worker: <0.3020.0>
2024-12-06T21:18:50.132839+00:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(530), reason: econnrefused, worker: <0.3022.0>
2024-12-06T21:18:50.135854+00:00 [warning] msg: start_resource_failed, mfa: emqx_resource_manager:start_resource/2(604), id: <<"emqx_authz_http:1">>, reason: econnrefused
2024-12-06T21:18:50.353251+00:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(530), reason: econnrefused, worker: <0.3178.0>
2024-12-06T21:18:50.353820+00:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: Connection refused">>, name: <<"emqx_authn_http:2">>
2024-12-06T21:18:50.354215+00:00 [warning] msg: health_check_failed, mfa: emqx_resource_manager:handle_connected_health_check/1(883), id: <<"emqx_authn_http:2">>, status: disconnected

Server:
21:18:50.133 [error] ** (Bandit.HTTPError) Header read socket error: :closed
21:18:50.134 [error] ** (Bandit.HTTPError) Header read socket error: :closed
21:18:50.133 [error] GenServer #PID<0.24702.0> terminating
** (Bandit.TransportError) Unrecoverable error: closed
(bandit 1.6.1) lib/bandit/http1/socket.ex:437: Bandit.HTTPTransport.Bandit.HTTP1.Socket.socket_error!/1
(bandit 1.6.1) lib/bandit/http1/socket.ex:344: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_headers/4
(bandit 1.6.1) lib/bandit/http1/socket.ex:422: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_on_error/2
(bandit 1.6.1) lib/bandit/pipeline.ex:223: Bandit.Pipeline.handle_error/7
(bandit 1.6.1) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
(bandit 1.6.1) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
(bandit 1.6.1) /app/deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
(stdlib 6.1) gen_server.erl:2335: :gen_server.try_handle_continue/3
Last message: {:continue, :handle_connection}
21:18:50.134 [error] ** (Bandit.HTTPError) Header read socket error: :closed
21:18:50.134 [error] GenServer #PID<0.24703.0> terminating
** (Bandit.TransportError) Unrecoverable error: closed
(bandit 1.6.1) lib/bandit/http1/socket.ex:437: Bandit.HTTPTransport.Bandit.HTTP1.Socket.socket_error!/1
(bandit 1.6.1) lib/bandit/http1/socket.ex:344: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_headers/4
(bandit 1.6.1) lib/bandit/http1/socket.ex:422: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_on_error/2
(bandit 1.6.1) lib/bandit/pipeline.ex:223: Bandit.Pipeline.handle_error/7
(bandit 1.6.1) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
(bandit 1.6.1) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
(bandit 1.6.1) /app/deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
(stdlib 6.1) gen_server.erl:2335: :gen_server.try_handle_continue/3
Last message: {:continue, :handle_connection}
21:18:50.134 [error] GenServer #PID<0.24704.0> terminating
** (Bandit.TransportError) Unrecoverable error: closed
(bandit 1.6.1) lib/bandit/http1/socket.ex:437: Bandit.HTTPTransport.Bandit.HTTP1.Socket.socket_error!/1
(bandit 1.6.1) lib/bandit/http1/socket.ex:344: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_headers/4
(bandit 1.6.1) lib/bandit/http1/socket.ex:422: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_on_error/2
(bandit 1.6.1) lib/bandit/pipeline.ex:223: Bandit.Pipeline.handle_error/7
(bandit 1.6.1) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
(bandit 1.6.1) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
(bandit 1.6.1) /app/deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
(stdlib 6.1) gen_server.erl:2335: :gen_server.try_handle_continue/3
Last message: {:continue, :handle_connection}

@mtrudel
Owner

mtrudel commented Dec 18, 2024

@KazW I just fixed the 'one instance' error you indicated above; that should be (and now is) silently eaten since it occurs during Bandit's error fallback routines.

@mtrudel
Owner

mtrudel commented Dec 18, 2024

@KazW I don't think your Header timeout errors are related to the original issue that @atavistock filed here. I'd be interested in knowing whether you have seen any client-side evidence of the header timeout / connection refusal errors when running Cowboy. Cowboy is quiet by default in the face of most errors (read timeouts included), so it's possible that they happen there as well if you're only looking at server logs.

@mtrudel
Owner

mtrudel commented Dec 18, 2024

@atavistock do you have any details about what transport (HTTP/HTTPS) and topology (localhost/direct/proxied) is inducing those :enotconn scenarios? They look an awful lot like early client closures to me, but it's possible they could be tied up in handshaking if they were HTTPS connections.

@KazW

KazW commented Dec 18, 2024

> @KazW I don't think your Header timeout errors are related to the original issue that @atavistock filed here. I'd be interested in knowing whether you have seen any client-side evidence of the header timeout / connection refusal errors when running Cowboy. Cowboy is quiet by default in the face of most errors (read timeouts included), so it's possible that they happen there as well if you're only looking at server logs.

@mtrudel That's why I included the client-side logs; all the errors on the client side vanished the moment we reverted to Cowboy.

@webdevcat-me

webdevcat-me commented Dec 19, 2024

Hello! I am experiencing an issue with Phoenix that, I believe, is related to Bandit and might be related to this issue? I posted steps to reproduce on the Elixir Forum. The issue I encounter is only with Bandit and only with HTTPS. Cowboy does not exhibit this behavior, nor does Bandit when using just HTTP. Would you like me to submit a separate issue for the behavior I am experiencing?

Edit: The issue I was encountering has disappeared since hpax was bumped to version 1.0.2 yesterday 61d975d 🎉

@atavistock
Author

Hi @webdevcat-me -

I think it's a different issue. The issue I had only happens when the client is sending requests via HTTP/1.1 pipelining (which is kind of unusual in the real world anyway); in every other case the exception I was seeing does not occur.

@webdevcat-me

> HTTP/1.1 pipelining

My mistake, @atavistock, I didn't catch that.
