
mk_event_add can corrupt the priority bucket queues #6838

Closed
PettitWesley opened this issue Feb 13, 2023 · 7 comments

@PettitWesley
Contributor

PettitWesley commented Feb 13, 2023

Fluent Bit Priority Queue and Keepalive Networking Crashes Root Cause and Solutions

Credit

I want to be clear that I only wrote this explainer; @matthewfala discovered this issue.

Background

Stack Trace obtained from issue repro

(gdb) bt
#0  0x00007fd0bba0bca0 in raise () from /lib64/libc.so.6
#1  0x00007fd0bba0d148 in abort () from /lib64/libc.so.6
#2  0x000000000045599e in flb_signal_handler (signal=11) at /tmp/fluent-bit-1.9.10/src/fluent-bit.c:581
#3  <signal handler called>
#4  0x00000000004fd80e in __mk_list_del (prev=0x0, next=0x0) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:87
#5  0x00000000004fd846 in mk_list_del (entry=0x7fd0b4a42a60) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:93
#6  0x00000000004fe703 in prepare_destroy_conn (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:443
#7  0x00000000004fe786 in prepare_destroy_conn_safe (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:469
#8  0x00000000004ff04b in cb_upstream_conn_ka_dropped (data=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:724
#9  0x00000000004e7cf5 in output_thread (data=0x7fd0b612e100) at /tmp/fluent-bit-1.9.10/src/flb_output_thread.c:298
#10 0x0000000000500712 in step_callback (data=0x7fd0b60f4ac0) at /tmp/fluent-bit-1.9.10/src/flb_worker.c:43
#11 0x00007fd0bd6cc44b in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd0bbac752f in clone () from /lib64/libc.so.6

How the priority event loop works

Fluent Bit currently uses a priority event loop to schedule work. Key to understanding this bug are the following facts:

  • The priority event loop sorts events into different bucket queues based on priority. A bucket queue groups events that all have the same priority; for example, all network socket events. The bucket queues are implemented as monkey lists (mk_list) which are doubly linked lists.
  • Each Monkey Event (mk_event) has a _priority_head which can link it into a priority bucket queue doubly linked list.
  • The “event loop” is actually two loops, an inner and outer loop.
    • The inner loop is run by flb_event_priority_live_foreach. It walks the events in the priority buckets, from highest priority to lowest. The macro has a max_iter argument, which is set to FLB_ENGINE_LOOP_MAX_ITER (10), so the inner loop never processes more than 10 events. Consequently, not all events in the priority buckets are necessarily processed in a single inner loop cycle.
    • The outer loop calls flb_event_priority_live_foreach once per iteration, meaning it processes at most 10 events per iteration. It then runs clean-up code at the end of the iteration, including flb_upstream_conn_pending_destroy_list, which frees all flb_upstream_conn objects pending deletion in the destroy queue. More on this in the next section. A simplified sketch of this structure follows the list.
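
For orientation, here is a minimal sketch of that structure. It is illustrative only: the sketch_* names and constants are assumptions made for the sketch, not the actual fluent-bit definitions; only the identifiers called out in the comments are real.

#include <monkey/mk_core/mk_list.h>

/* Illustrative sketch: each priority level owns one bucket queue (a doubly
 * linked mk_list); a triggered event is linked into the queue for its
 * priority through its _priority_head node. */
#define SKETCH_N_PRIORITIES   8    /* assumed for the sketch */
#define SKETCH_LOOP_MAX_ITER  10   /* mirrors FLB_ENGINE_LOOP_MAX_ITER */

struct sketch_priority_loop {
    struct mk_list bucket_queues[SKETCH_N_PRIORITIES];
};

/* One outer-loop iteration, conceptually:
 *   1. inner loop (flb_event_priority_live_foreach): handle at most
 *      SKETCH_LOOP_MAX_ITER events, always taking from the highest
 *      priority non-empty bucket first;
 *   2. clean-up: e.g. flb_upstream_conn_pending_destroy_list(), which
 *      frees connections queued for destruction.
 */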

Relevant Code:

Monkey Event Key Background

Usages of mk_event_add in the current code

Network Code Key Background

  • flb_upstream_conn_timeouts: This code checks all flb_upstream_conn to determine if any have timed out, either on connection establishment or on keepalive idle timeout. It runs via a timer event in the event loop; timer events have the highest priority and run before network events. It calls prepare_destroy_conn on the timed-out connections.
  • prepare_destroy_conn: Calls mk_event_del on the flb_upstream_conn event, closes the socket, and then adds the connection to the destroy_queue. The destroy queue holds connections that are ready to be freed.
  • flb_upstream_conn_pending_destroy: Actually destroys and frees all flb_upstream_conn in the destroy_queue. It runs as part of the clean-up phase of the event loop's outer loop described in the previous section: How the priority event loop works. A simplified sketch of this two-phase teardown follows this list.
  • flb_upstream_conn events are not removed at the end of an async HTTP request: The HTTP client calls flb_io_net_read, which calls out to net_io_read_async, which (as noted in Usages of mk_event_add in the current code) does not remove the read event after the coroutine resumes.
  • When keepalive networking is enabled and a connection is available (no longer used by an active HTTP request), an event is added for the socket to monitor whether it was closed by the remote server: https://github.com/fluent/fluent-bit/blob/v1.9.10/src/flb_upstream.c#L761
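
The two-phase teardown described above can be sketched as follows. This is a simplified model, not the actual src/flb_upstream.c code; the sketch_* names are made up and the real calls are only referenced in comments.

#include <stdlib.h>

/* Simplified model: a timer event stages timed-out connections, and the
 * end of each outer loop iteration frees them. */
struct sketch_conn {
    int fd;
    struct sketch_conn *next;   /* link into the destroy queue */
};

static struct sketch_conn *destroy_queue = NULL;

/* Phase 1 (timer event, highest priority): in the real code,
 * prepare_destroy_conn() calls mk_event_del() on the connection's event,
 * closes the socket, then queues the connection for destruction. */
static void sketch_prepare_destroy(struct sketch_conn *c)
{
    c->next = destroy_queue;
    destroy_queue = c;
}

/* Phase 2 (end of each outer loop iteration): the real
 * flb_upstream_conn_pending_destroy() frees everything queued above. */
static void sketch_pending_destroy(void)
{
    while (destroy_queue != NULL) {
        struct sketch_conn *c = destroy_queue;
        destroy_queue = c->next;
        free(c);
    }
}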

Problem Statement

Currently, mk_event_add can corrupt the priority event loop bucket queue when it is called more than once for a specific event/file descriptor.

This can cause a crash directly, or it can indirectly cause the stack-trace noted in the background section.

How mk_event_add can corrupt the bucket queue

The problem is that mk_event_add sets the event's _priority_head prev and next pointers to NULL. If the event was previously added, was triggered, and is currently sitting in a bucket queue, this corrupts the queue.

Code reference: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/mk_core/mk_event_epoll.c#L142
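
Paraphrasing the linked code (not a verbatim excerpt), the problematic part of mk_event_add looks roughly like this:

/* Paraphrased from mk_event_epoll.c -- not an exact copy. The
 * priority-queue linkage is reset unconditionally on every add: */
event->priority = MK_EVENT_PRIORITY_DEFAULT;
event->_priority_head.prev = NULL;
event->_priority_head.next = NULL;
/* If this event is currently linked into a bucket queue (it was added
 * before and has been triggered), its neighbours still point at it but
 * it no longer points back: the doubly linked list is now corrupted. */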

[Diagram: Priority Event Loop Corruption]

The diagram above shows the corruption specifically for the case of connection events. This is because the connection event code is the current place where mk_event_add can be called repeatedly for the same file descriptor.

Before the call to mk_event_add the connection event is in the priority queue doubly linked list. It has a link to the next and previous elements, and the next and previous elements have links to it.

After the pointers are set to NULL, the list is corrupted. The next and previous events still point to the connection event, but it no longer points to them. If we traverse the list over the next pointers (as the event loop does), we reach the conn event and then hit a NULL reference, leading directly to a segmentation fault.
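
For reference, the mk_list delete helpers seen in the stack trace boil down to pointer rewiring like the following (a paraphrased sketch of the shape of mk_list.h, not the exact code), which is why NULLed linkage leads straight to a NULL dereference:

/* Paraphrased shape of monkey's intrusive list delete helpers. */
struct sketch_list {
    struct sketch_list *prev;
    struct sketch_list *next;
};

static inline void sketch__list_del(struct sketch_list *prev,
                                    struct sketch_list *next)
{
    prev->next = next;   /* crashes here if prev == NULL ... */
    next->prev = prev;   /* ... or here if next == NULL */
}

static inline void sketch_list_del(struct sketch_list *entry)
{
    sketch__list_del(entry->prev, entry->next);
}

/* Frame #4 in the Background stack trace, __mk_list_del (prev=0x0,
 * next=0x0), shows this same shape with both neighbour pointers NULL. */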

How queue corruption could lead to the reported segfault

This corruption can also explain the stack trace noted in the background.

This could happen if the following events occur:

WIP

Proposed Solution

Fix mk_event_add

mk_event_add should use mk_list_del to remove the event from the priority queue if it is already present, instead of simply setting the _priority_head references to NULL.

It of course should only call mk_list_del if the event is in a bucket queue. This leads to a problem. The only way to tell whether the event is already in a bucket queue is whether its _priority_head references are non-NULL. However, event memory is currently not required to be zeroed or initialized, so we have no way of knowing whether a non-NULL list reference means the event is in a list or the client code simply did not initialize it.
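
Assuming initialization is guaranteed (see the next section), the fix inside mk_event_add could look roughly like the sketch below. It is not the final patch, only an illustration of the intended logic:

/* Sketch only. With guaranteed initialization, "both pointers NULL"
 * reliably means "not linked into any bucket queue". */
if (event->_priority_head.prev != NULL &&
    event->_priority_head.next != NULL) {
    /* Already queued from an earlier trigger: unlink cleanly so the
     * neighbouring list nodes are rewired... */
    mk_list_del(&event->_priority_head);
}
/* ...and only then reset the linkage for the fresh registration. */
event->_priority_head.prev = NULL;
event->_priority_head.next = NULL;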

This problem leads to the second part of the solution.

Require event initialization

Currently there is a macro, MK_EVENT_ZERO, for initializing events. It should be updated to set the _priority_head references to NULL, and all existing usages of monkey events should be updated to use this initialization. This is a relatively small amount of work; as noted in an earlier section, there are only 18 usages of mk_event_add in the current code.

Current MK_EVENT_ZERO: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/include/monkey/mk_core/mk_event.h#L119
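
One possible shape for the updated initialization is below. This is not the macro's current definition and is only a sketch of the intent: a zeroed event, explicitly NULL _priority_head linkage, and (per the next sub-section) a default priority.

/* Sketch of an updated MK_EVENT_ZERO -- not the current definition.
 * Requires <string.h> for memset. */
#define MK_EVENT_ZERO(e)                                   \
    do {                                                   \
        memset((e), 0, sizeof(struct mk_event));           \
        (e)->_priority_head.prev = NULL;                   \
        (e)->_priority_head.next = NULL;                   \
        (e)->priority = MK_EVENT_PRIORITY_DEFAULT;         \
    } while (0)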

optional: mk_event_add should not reset event priority

As noted by Leonardo in the comment on this issue, mk_event_add currently resets the event priority to MK_EVENT_PRIORITY_DEFAULT: https://github.com/fluent/fluent-bit/blob/v2.0.9/lib/monkey/mk_core/mk_event_epoll.c#L144

This was likely done so that events always have a default priority. However, if we do the work to ensure events are always properly initialized with MK_EVENT_ZERO, then we can set the priority to default there as well. This is the design we would likely choose if we were building the system from scratch.

Alternate Solution - check queue in mk_event_add

Another solution would be for mk_event_add to iterate over all priority bucket queues and determine whether the event is already in one of them before deciding whether mk_list_del should be called.

This would work and would eliminate the need to update all existing usages of monkey events to properly initialize them. However, this solution is inefficient and wastes CPU cycles. If the code were designed from scratch, we would almost certainly choose to always initialize events properly to solve this problem.
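
For completeness, the check in this alternate solution would amount to scanning every bucket queue on each mk_event_add, roughly like the sketch below (illustrative names; the cost is proportional to the total number of queued events):

#include <monkey/mk_core/mk_list.h>

/* Sketch of the rejected membership check: O(total queued events). */
static int sketch_event_is_queued(struct mk_list *bucket_queues,
                                  int n_buckets,
                                  struct mk_list *needle)
{
    int i;
    struct mk_list *head;

    for (i = 0; i < n_buckets; i++) {
        mk_list_foreach(head, &bucket_queues[i]) {
            if (head == needle) {
                return 1;   /* event already linked into a bucket */
            }
        }
    }
    return 0;
}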

Optional Proposals for consideration

Always call mk_event_del in flb_upstream_conn_release

It seems less than ideal that events left over from a previous use of a connection can remain even after the connection is released. A call to mk_event_del could simply be added at the beginning of flb_upstream_conn_release.
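
Sketched, the change is a single unregister call at the top of the release path. The wrapper below is hypothetical and the field names (evl, event) are assumptions that should be verified against src/flb_upstream.c; only mk_event_del is the real API.

/* Hypothetical sketch -- not a verified patch to flb_upstream_conn_release. */
static void sketch_conn_release(struct flb_upstream_conn *u_conn,
                                struct mk_event_loop *evl)
{
    /* Proposed addition: drop any leftover event registration before the
     * connection is recycled or returned to the keepalive pool. */
    mk_event_del(evl, &u_conn->event);

    /* ...the existing release logic would continue from here... */
}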

Open question: even in the non-keepalive case, could failing to remove the event lead to a new file descriptor with the same number triggering an old event?

Replace max_iter with loop that iterates over all network events (all events higher than given priority)

As noted in How the priority event loop works there is a max_iter that controls the number of iterations of the inner loop for each iteration of the outer loop.

This means that not all events in the priority buckets are run in a single outer loop iteration before the clean-up code runs. Note that the inner loop calls epoll_wait on each iteration to pick up any new events that were added to the triggered list.

While this does not appear to cause an issue as far as we can tell, it is potentially not ideal. We could update the priority event loop's inner loop to iterate over all events in the bucket queues at or above a given priority. That cutoff could be the network event priority, so that all pending network events run in a single outer loop iteration before the clean-up code runs.
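
A rough sketch of "drain everything at or above a cutoff priority" is below. It is illustrative only (made-up names, no epoll_wait re-polling, no dispatch details), just to show the shape of the change:

#include <monkey/mk_core/mk_list.h>

/* Sketch: drain every bucket at or above cutoff_priority before returning
 * to the outer loop, instead of stopping after a fixed max_iter.
 * In this sketch, a lower number means a more urgent priority. */
static void sketch_drain_high_priority(struct mk_list *bucket_queues,
                                       int cutoff_priority)
{
    int prio;
    struct mk_list *node;
    struct mk_list *tmp;

    for (prio = 0; prio <= cutoff_priority; prio++) {
        mk_list_foreach_safe(node, tmp, &bucket_queues[prio]) {
            mk_list_del(node);
            /* ...dispatch the event that owns this _priority_head node... */
        }
    }
}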

@PettitWesley
Contributor Author

@edsiper @leonardo-albertovich @matthewfala The doc is very long since I try to explain the code for a new dev, but skip to the "Problem Statement" section to understand quickly what needs to be fixed. I think the diagram makes it pretty clear. And then check the "Proposed Solution" for discussion on some difficulties in fixing it.

@leonardo-albertovich
Collaborator

First of all, thank you for the detailed document, I really appreciate the effort you put into it.

I think the right thing to do is ensure all initialization is properly done by the default macros and to check for priority queue linkage in _mk_event_add just like we do in _mk_event_del.

The problem I see with that, and something I didn't find in the writeup (could've missed it), is that _mk_event_add seems to reset the event's priority to MK_EVENT_PRIORITY_DEFAULT, which doesn't seem to be what we want (i.e., having fds "surreptitiously" switch priorities when they are going through TLS handshake or regular I/O), so we might want to keep that in mind.

I don't think the optional approaches are in line with what we strive for when it comes to code clarity so I wouldn't pursue them.

As for the alternate solution, I think it would not be reliable because:

  1. If you have garbage in the pointers you would end up dereferencing unmapped memory or worse.
  2. It would be expensive, like, really expensive considering how mk_list works.

I think it would be valuable to take a step back after applying the initialization fix and evaluate the priority system, because it seems to me that it could be simplified by integrating part of it into the core event loop, which would greatly reduce the amount of moving parts and complexity (i.e. flb_event_priority_live_foreach is a bit over the top, and this has already caused issues with injected events which would just fall into place if we refactored the code).

@PettitWesley
Contributor Author

Good point, setting priority to default should happen in the initialization code, not in mk_event_add, I'll add this.

@PettitWesley
Contributor Author

I don't think the optional approaches are in line with what we strive for when it comes to code clarity so I wouldn't pursue them.

@leonardo-albertovich this includes the optional change to call mk_event_del in flb_upstream_conn_release?

PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Feb 13, 2023
Full issue explanation and design here:
fluent#6838

Signed-off-by: Wesley Pettit <[email protected]>
@leonardo-albertovich
Collaborator

Yes, that's included. It doesn't seem straightforward enough to ensure that future contributors (or even us, after context switching a number of times over weeks or months) would not accidentally introduce a bug, and there is also the case of the downstream layer (which probably has its own issues, honestly), so I am in favor of centralized fixes/rework rather than localized fixes whenever possible.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label May 16, 2023
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 21, 2023