
Socket timeouts with HTTP/1.1 pipelining under load #438

Open
atavistock opened this issue Dec 5, 2024 · 11 comments

@atavistock

This is a pretty esoteric edge case, but I'm filing this issue for informational purposes.

In an ongoing effort around performance and load testing, I've found that with HTTP/1.1 pipelining under load we're seeing an exception in about 1 in 1,000 requests. It looks like the connection is closed prematurely, but it's distinctly not a client-side reset (ECONNRESET) or a timeout threshold being hit (ETIMEDOUT). The stack trace is below, followed by a sketch of the kind of pipelined traffic involved.

 22:50:55.858 [error] ** (RuntimeError) Unable to obtain transport_info: :enotconn
     (bandit 1.6.0) lib/bandit/transport_info.ex:26: anonymous fn/1 in Bandit.TransportInfo.init/1
     (bandit 1.6.0) lib/bandit/pipeline.ex:70: Bandit.Pipeline.build_conn!/5
     (bandit 1.6.0) lib/bandit/pipeline.ex:31: Bandit.Pipeline.run/4
     (bandit 1.6.0) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
     (bandit 1.6.0) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
     (bandit 1.6.0) /deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
     (stdlib 6.0.1) gen_server.erl:2163: :gen_server.try_handle_continue/3
     (stdlib 6.0.1) gen_server.erl:2072: :gen_server.loop/7
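For readers unfamiliar with the setup, here is a minimal sketch of what HTTP/1.1 pipelining looks like at the socket level, assuming a plain-HTTP server listening on 127.0.0.1:4000 (the host, port, path, pipeline depth, and module name are placeholders, not details from this report). Several requests are written back-to-back on one connection before any responses are read:

    defmodule PipelineRepro do
      # Placeholder connection details; adjust to the server under test.
      @host {127, 0, 0, 1}
      @port 4000
      @pipeline_depth 10

      def run do
        {:ok, socket} = :gen_tcp.connect(@host, @port, [:binary, active: false])

        request = "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"

        # Write all requests at once so the server sees them queued on the
        # socket, then read back whatever arrives and close, roughly mimicking
        # an aggressive pipelining client.
        :ok = :gen_tcp.send(socket, String.duplicate(request, @pipeline_depth))
        drain(socket)
        :gen_tcp.close(socket)
      end

      defp drain(socket) do
        case :gen_tcp.recv(socket, 0, 1_000) do
          {:ok, _data} -> drain(socket)
          {:error, _reason} -> :ok
        end
      end
    end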
@KazW

KazW commented Dec 7, 2024

We've observed this as well in a high-traffic environment that is basically a cache, where most requests are served within 200-400 µs at a rate of around 5 req/ms. Today we rolled back to Cowboy and the errors dropped to 0.
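For anyone following along, moving a Phoenix app between the two servers is an endpoint adapter change. A minimal sketch, assuming a standard Phoenix app (the my_app and MyAppWeb.Endpoint names are placeholders):

    # config/config.exs: run the endpoint on Bandit
    config :my_app, MyAppWeb.Endpoint,
      adapter: Bandit.PhoenixAdapter

    # ...or roll back to Cowboy, the Phoenix default
    # config :my_app, MyAppWeb.Endpoint,
    #   adapter: Phoenix.Endpoint.Cowboy2Adapter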

@mtrudel
Owner

mtrudel commented Dec 8, 2024

Interesting! This is one of those 'this should never happen' cases. A repro for this would be extremely handy!

@atavistock your stack trace notes Bandit version 1.6.0. Is that because you're running off a branch? As I'm sure you're aware, pipelining support only landed in 1.6.1.

@atavistock
Author

It was the branch that fixed the HTTP/1.1 pipelining problem, before it was released as 1.6.1, so the version was still reported as 1.6.0, but it's the same code as 1.6.1.

@KazW

KazW commented Dec 8, 2024

I dug through our logs to find some representative examples of how this error (or one similar to it) shows up in our systems.

We're using EMQX (an MQTT broker) with a Phoenix app as the authentication/authorization webhook backend for it.
For the endpoints EMQX hits, all the data is cached in ETS (a rough sketch of that kind of endpoint follows below).
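As a hypothetical illustration of such an endpoint (the module, table, and key names are assumptions, not code from this deployment), the webhook handler answers from an in-memory ETS table rather than hitting a database:

    defmodule AuthCache do
      @table :auth_cache

      # Create the table once at application start.
      def init do
        :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
      end

      # Fast in-memory lookup, consistent with the sub-millisecond responses
      # described above.
      def allowed?(client_id) do
        case :ets.lookup(@table, client_id) do
          [{^client_id, :allow}] -> true
          _ -> false
        end
      end
    end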

On October 31st, we switched from Cowboy to Bandit 1.5.7.

Here is a typical example; we see an error on both the client side and the server side (~17 million occurrences over the month we had Bandit in production):

Client:
2024-12-06T21:18:48.331216+00:00 [error] msg: http_server_query_failed, mfa: emqx_authz_http:authorize/4(117), peername: 10.0.104.96:57952, clientid: 9710-001014, reason: {resource_error,#{msg => "resource not connected",reason => not_connected}}, resource: <<"emqx_authz_http:1">>
2024-12-06T21:18:48.545390+00:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: Connection refused">>, name: <<"emqx_authz_http:1">>
2024-12-06T21:18:48.545978+00:00 [warning] msg: health_check_failed, mfa: emqx_resource_manager:handle_connected_health_check/1(883), id: <<"emqx_authz_http:1">>, status: disconnected
2024-12-06T21:18:48.792359+00:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: Connection refused">>, name: <<"emqx_authz_http:1">>

Server:
21:14:48.585 [error] ** (Bandit.HTTPError) Header read timeout
21:14:48.585 [error] ** (Bandit.HTTPError) Header read timeout

There was one crash during the month we had Bandit in production, and it happened after we upgraded to 1.6.1; here are the logs from it:

Client:
2024-12-06T21:18:50.118833+00:00 [warning] msg: emqx_connector_on_start_already_started, mfa: emqx_bridge_http_connector:start_pool/2(253), pool_name: <<"emqx_authz_http:1">>
2024-12-06T21:18:50.132837+00:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(530), reason: econnrefused, worker: <0.3020.0>
2024-12-06T21:18:50.132839+00:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(530), reason: econnrefused, worker: <0.3022.0>
2024-12-06T21:18:50.135854+00:00 [warning] msg: start_resource_failed, mfa: emqx_resource_manager:start_resource/2(604), id: <<"emqx_authz_http:1">>, reason: econnrefused
2024-12-06T21:18:50.353251+00:00 [error] msg: http_connector_get_status_failed, mfa: emqx_bridge_http_connector:do_get_status/2(530), reason: econnrefused, worker: <0.3178.0>
2024-12-06T21:18:50.353820+00:00 [warning] msg: alarm_is_activated, mfa: emqx_alarm:do_actions/3(418), message: <<"resource down: Connection refused">>, name: <<"emqx_authn_http:2">>
2024-12-06T21:18:50.354215+00:00 [warning] msg: health_check_failed, mfa: emqx_resource_manager:handle_connected_health_check/1(883), id: <<"emqx_authn_http:2">>, status: disconnected

Server:
21:18:50.133 [error] ** (Bandit.HTTPError) Header read socket error: :closed
21:18:50.134 [error] ** (Bandit.HTTPError) Header read socket error: :closed
21:18:50.133 [error] GenServer #PID<0.24702.0> terminating
** (Bandit.TransportError) Unrecoverable error: closed
(bandit 1.6.1) lib/bandit/http1/socket.ex:437: Bandit.HTTPTransport.Bandit.HTTP1.Socket.socket_error!/1
(bandit 1.6.1) lib/bandit/http1/socket.ex:344: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_headers/4
(bandit 1.6.1) lib/bandit/http1/socket.ex:422: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_on_error/2
(bandit 1.6.1) lib/bandit/pipeline.ex:223: Bandit.Pipeline.handle_error/7
(bandit 1.6.1) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
(bandit 1.6.1) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
(bandit 1.6.1) /app/deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
(stdlib 6.1) gen_server.erl:2335: :gen_server.try_handle_continue/3
Last message: {:continue, :handle_connection}
21:18:50.134 [error] ** (Bandit.HTTPError) Header read socket error: :closed
21:18:50.134 [error] GenServer #PID<0.24703.0> terminating
** (Bandit.TransportError) Unrecoverable error: closed
(bandit 1.6.1) lib/bandit/http1/socket.ex:437: Bandit.HTTPTransport.Bandit.HTTP1.Socket.socket_error!/1
(bandit 1.6.1) lib/bandit/http1/socket.ex:344: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_headers/4
(bandit 1.6.1) lib/bandit/http1/socket.ex:422: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_on_error/2
(bandit 1.6.1) lib/bandit/pipeline.ex:223: Bandit.Pipeline.handle_error/7
(bandit 1.6.1) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
(bandit 1.6.1) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
(bandit 1.6.1) /app/deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
(stdlib 6.1) gen_server.erl:2335: :gen_server.try_handle_continue/3
Last message: {:continue, :handle_connection}
21:18:50.134 [error] GenServer #PID<0.24704.0> terminating
** (Bandit.TransportError) Unrecoverable error: closed
(bandit 1.6.1) lib/bandit/http1/socket.ex:437: Bandit.HTTPTransport.Bandit.HTTP1.Socket.socket_error!/1
(bandit 1.6.1) lib/bandit/http1/socket.ex:344: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_headers/4
(bandit 1.6.1) lib/bandit/http1/socket.ex:422: Bandit.HTTPTransport.Bandit.HTTP1.Socket.send_on_error/2
(bandit 1.6.1) lib/bandit/pipeline.ex:223: Bandit.Pipeline.handle_error/7
(bandit 1.6.1) lib/bandit/http1/handler.ex:12: Bandit.HTTP1.Handler.handle_data/3
(bandit 1.6.1) lib/bandit/delegating_handler.ex:18: Bandit.DelegatingHandler.handle_data/3
(bandit 1.6.1) /app/deps/thousand_island/lib/thousand_island/handler.ex:417: Bandit.DelegatingHandler.handle_continue/2
(stdlib 6.1) gen_server.erl:2335: :gen_server.try_handle_continue/3
Last message: {:continue, :handle_connection}

@mtrudel
Owner

mtrudel commented Dec 18, 2024

@KazW I just fixed the 'one instance' error you indicated above; that should be (and now is) silently eaten since it occurs during Bandit's error fallback routines.

@mtrudel
Owner

mtrudel commented Dec 18, 2024

@KazW I don't think your Header timeout errors are related to the original issue that @atavistock filed here. I'd be interested in knowing whether you have seen any client-side evidence of the header timeout / connection refusal errors when running Cowboy. Cowboy is quiet by default in the face of most errors (read timeouts included), so it's possible that they happen there as well if you're only looking at server logs.

@mtrudel
Owner

mtrudel commented Dec 18, 2024

@atavistock do you have any details about what transport (HTTP/HTTPS) and topology (localhost/direct/proxied) is inducing those :enotconn scenarios? They look an awful lot like early client closures to me, but it's possible they could be tied up in handshaking if they were HTTPS connections.

@KazW

KazW commented Dec 18, 2024

> @KazW I don't think your Header timeout errors are related to the original issue that @atavistock filed here. I'd be interested in knowing whether you have seen any client-side evidence of the header timeout / connection refusal errors when running Cowboy. Cowboy is quiet by default in the face of most errors (read timeouts included), so it's possible that they happen there as well if you're only looking at server logs.

@mtrudel That's why I included the client-side logs; all the errors on the client side vanished the moment we reverted to Cowboy.

@webdevcat-me

webdevcat-me commented Dec 19, 2024

Hello! I am experiencing an issue with Phoenix that, I believe, is related to Bandit and might be related to this issue? I posted steps to reproduce on the Elixir Forum. The issue I encounter is only with Bandit and only with HTTPS. Cowboy does not exhibit this behavior, nor does Bandit when using just HTTP. Would you like me to submit a separate issue for the behavior I am experiencing?

Edit: The issue I was encountering has disappeared since hpax was bumped to version 1.0.2 yesterday 61d975d 🎉

@atavistock
Author

Hi @webdevcat-me -

I think it's a different issue. The issue I had only happens when the client is sending requests via HTTP/1.1 pipelining (which is kind of unusual in the real world anyway); in every other case the exception I was seeing does not occur.

@webdevcat-me

> HTTP/1.1 pipelining

My mistake, @atavistock, I didn't catch that.
