
mk_event_add can corrupt the priority bucket queues #6838

Closed
PettitWesley opened this issue Feb 13, 2023 · 7 comments

@PettitWesley
Contributor

PettitWesley commented Feb 13, 2023

Fluent Bit Priority Queue and Keepalive Networking Crashes Root Cause and Solutions

Credit

I want to be clear that I only wrote this explainer; @matthewfala discovered this issue.

Background

Stack Trace obtained from issue repro

(gdb) bt
#0  0x00007fd0bba0bca0 in raise () from /lib64/libc.so.6
#1  0x00007fd0bba0d148 in abort () from /lib64/libc.so.6
#2  0x000000000045599e in flb_signal_handler (signal=11) at /tmp/fluent-bit-1.9.10/src/fluent-bit.c:581
#3  <signal handler called>
#4  0x00000000004fd80e in __mk_list_del (prev=0x0, next=0x0) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:87
#5  0x00000000004fd846 in mk_list_del (entry=0x7fd0b4a42a60) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:93
#6  0x00000000004fe703 in prepare_destroy_conn (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:443
#7  0x00000000004fe786 in prepare_destroy_conn_safe (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:469
#8  0x00000000004ff04b in cb_upstream_conn_ka_dropped (data=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:724
#9  0x00000000004e7cf5 in output_thread (data=0x7fd0b612e100) at /tmp/fluent-bit-1.9.10/src/flb_output_thread.c:298
#10 0x0000000000500712 in step_callback (data=0x7fd0b60f4ac0) at /tmp/fluent-bit-1.9.10/src/flb_worker.c:43
#11 0x00007fd0bd6cc44b in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd0bbac752f in clone () from /lib64/libc.so.6

How the priority event loop works

Fluent Bit currently uses a priority event loop to schedule work. Key to understanding this bug are the following facts:

  • The priority event loop sorts events into different bucket queues based on priority. A bucket queue groups events that all have the same priority; for example, all network socket events. The bucket queues are implemented as monkey lists (mk_list) which are doubly linked lists.
  • Each Monkey Event (mk_event) has a _priority_head which can link it into a priority bucket queue doubly linked list.
  • The “event loop” is actually two loops, an inner and outer loop.
    • The inner loop is run by flb_event_priority_live_foreach. It walks the events in the priority buckets, from highest priority to lowest. The macro has a max_iter argument, which is set to FLB_ENGINE_LOOP_MAX_ITER (10), so the inner loop never processes more than 10 events. Consequently, not all events in the priority buckets are necessarily processed in a single inner loop cycle.
    • The outer loop calls flb_event_priority_live_foreach once per iteration, meaning it processes at most 10 events per iteration. It then runs clean-up code at the end of the iteration, including flb_upstream_conn_pending_destroy_list, which frees all flb_upstream_conn objects pending deletion in the destroy queue. More on this in the next section. A simplified sketch of this structure follows the list.
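
For orientation, here is a minimal sketch of that structure. It is illustrative only: the sketch_* names and constants are assumptions made for the sketch, not the actual fluent-bit definitions; only the identifiers called out in the comments are real.

#include <monkey/mk_core/mk_list.h>

/* Illustrative sketch: each priority level owns one bucket queue (a doubly
 * linked mk_list); a triggered event is linked into the queue for its
 * priority through its _priority_head node. */
#define SKETCH_N_PRIORITIES   8    /* assumed for the sketch */
#define SKETCH_LOOP_MAX_ITER  10   /* mirrors FLB_ENGINE_LOOP_MAX_ITER */

struct sketch_priority_loop {
    struct mk_list bucket_queues[SKETCH_N_PRIORITIES];
};

/* One outer-loop iteration, conceptually:
 *   1. inner loop (flb_event_priority_live_foreach): handle at most
 *      SKETCH_LOOP_MAX_ITER events, always taking from the highest
 *      priority non-empty bucket first;
 *   2. clean-up: e.g. flb_upstream_conn_pending_destroy_list(), which
 *      frees connections queued for destruction.
 */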

Relevant Code:

Monkey Event Key Background

Usages of mk_event_add in the current code

Network Code Key Background

  • flb_upstream_conn_timeouts: This code checks all flb_upstream_conn to determine if any have timed out, either on connection establishment or on keepalive idle timeout. It runs via a timer event in the event loop; timer events have the highest priority and run before network events. It calls prepare_destroy_conn on the timed-out connections.
  • prepare_destroy_conn: Calls mk_event_del on the flb_upstream_conn event, closes the socket, and then adds the connection to the destroy_queue. The destroy queue holds connections that are ready to be freed.
  • flb_upstream_conn_pending_destroy: Actually destroys and frees all flb_upstream_conn in the destroy_queue. It runs as part of the clean-up phase of the event loop's outer loop described in the previous section: How the priority event loop works. A simplified sketch of this two-phase teardown follows this list.
  • flb_upstream_conn events are not removed at the end of an async HTTP request: The HTTP client calls flb_io_net_read, which calls out to net_io_read_async, which (as noted in Usages of mk_event_add in the current code) does not remove the read event after the coroutine resumes.
  • When keepalive networking is enabled and a connection is available (no longer used by an active HTTP request), an event is added for the socket to monitor whether it was closed by the remote server: https://github.com/fluent/fluent-bit/blob/v1.9.10/src/flb_upstream.c#L761
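
The two-phase teardown described above can be sketched as follows. This is a simplified model, not the actual src/flb_upstream.c code; the sketch_* names are made up and the real calls are only referenced in comments.

#include <stdlib.h>

/* Simplified model: a timer event stages timed-out connections, and the
 * end of each outer loop iteration frees them. */
struct sketch_conn {
    int fd;
    struct sketch_conn *next;   /* link into the destroy queue */
};

static struct sketch_conn *destroy_queue = NULL;

/* Phase 1 (timer event, highest priority): in the real code,
 * prepare_destroy_conn() calls mk_event_del() on the connection's event,
 * closes the socket, then queues the connection for destruction. */
static void sketch_prepare_destroy(struct sketch_conn *c)
{
    c->next = destroy_queue;
    destroy_queue = c;
}

/* Phase 2 (end of each outer loop iteration): the real
 * flb_upstream_conn_pending_destroy() frees everything queued above. */
static void sketch_pending_destroy(void)
{
    while (destroy_queue != NULL) {
        struct sketch_conn *c = destroy_queue;
        destroy_queue = c->next;
        free(c);
    }
}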

Problem Statement

Currently, mk_event_add can corrupt the priority event loop bucket queue when it is called more than once for a specific event/file descriptor.

This can cause a crash directly, or it can indirectly cause the stack-trace noted in the background section.

How mk_event_add can corrupt the bucket queue

The problem is that mk_event_add sets the event's _priority_head prev and next pointers to NULL. If the event was previously added, was triggered, and is currently sitting in a bucket queue, this corrupts the queue.

Code reference: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/mk_core/mk_event_epoll.c#L142
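
Paraphrasing the linked code (not a verbatim excerpt), the problematic part of mk_event_add looks roughly like this:

/* Paraphrased from mk_event_epoll.c -- not an exact copy. The
 * priority-queue linkage is reset unconditionally on every add: */
event->priority = MK_EVENT_PRIORITY_DEFAULT;
event->_priority_head.prev = NULL;
event->_priority_head.next = NULL;
/* If this event is currently linked into a bucket queue (it was added
 * before and has been triggered), its neighbours still point at it but
 * it no longer points back: the doubly linked list is now corrupted. */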

[Diagram: Priority Event Loop Corruption]

The diagram above shows the corruption specifically for the case of connection events. This is because the connection event code is the current place where mk_event_add can be called repeatedly for the same file descriptor.

Before the call to mk_event_add the connection event is in the priority queue doubly linked list. It has a link to the next and previous elements, and the next and previous elements have links to it.

After the pointers are set to NULL, the list is corrupted. The next and previous events still point to the connection event, but it no longer points to them. If we traverse the list over the next pointers (as the event loop does), we reach the conn event and then hit a NULL reference, leading directly to a segmentation fault.
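
For reference, the mk_list delete helpers seen in the stack trace boil down to pointer rewiring like the following (a paraphrased sketch of the shape of mk_list.h, not the exact code), which is why NULLed linkage leads straight to a NULL dereference:

/* Paraphrased shape of monkey's intrusive list delete helpers. */
struct sketch_list {
    struct sketch_list *prev;
    struct sketch_list *next;
};

static inline void sketch__list_del(struct sketch_list *prev,
                                    struct sketch_list *next)
{
    prev->next = next;   /* crashes here if prev == NULL ... */
    next->prev = prev;   /* ... or here if next == NULL */
}

static inline void sketch_list_del(struct sketch_list *entry)
{
    sketch__list_del(entry->prev, entry->next);
}

/* Frame #4 in the Background stack trace, __mk_list_del (prev=0x0,
 * next=0x0), shows this same shape with both neighbour pointers NULL. */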

How queue corruption could lead to the reported segfault

This corruption can also explain the stack trace noted in the background.

This could happen if the following events occur:

WIP

Proposed Solution

Fix mk_event_add

mk_event_add should use mk_list_del to remove the event from the priority queue if it is already present, instead of simply setting the _priority_head references to NULL.

It of course should only call mk_list_del if the event is in a bucket queue. This leads to a problem. The only way to tell whether the event is already in a bucket queue is whether its _priority_head references are non-NULL. However, event memory is currently not required to be zeroed or initialized, so we have no way of knowing whether a non-NULL list reference means the event is in a list or the client code simply did not initialize it.
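
Assuming initialization is guaranteed (see the next section), the fix inside mk_event_add could look roughly like the sketch below. It is not the final patch, only an illustration of the intended logic:

/* Sketch only. With guaranteed initialization, "both pointers NULL"
 * reliably means "not linked into any bucket queue". */
if (event->_priority_head.prev != NULL &&
    event->_priority_head.next != NULL) {
    /* Already queued from an earlier trigger: unlink cleanly so the
     * neighbouring list nodes are rewired... */
    mk_list_del(&event->_priority_head);
}
/* ...and only then reset the linkage for the fresh registration. */
event->_priority_head.prev = NULL;
event->_priority_head.next = NULL;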

This problem leads to the second part of the solution.

Require event initialization

Currently there is a macro, MK_EVENT_ZERO, for initializing events. It should be updated to set the _priority_head references to NULL, and all existing usages of monkey events should be updated to use this initialization. This is a relatively small amount of work; as noted in an earlier section, there are only 18 usages of mk_event_add in the current code.

Current MK_EVENT_ZERO: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/include/monkey/mk_core/mk_event.h#L119
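
One possible shape for the updated initialization is below. This is not the macro's current definition and is only a sketch of the intent: a zeroed event, explicitly NULL _priority_head linkage, and (per the next sub-section) a default priority.

/* Sketch of an updated MK_EVENT_ZERO -- not the current definition.
 * Requires <string.h> for memset. */
#define MK_EVENT_ZERO(e)                                   \
    do {                                                   \
        memset((e), 0, sizeof(struct mk_event));           \
        (e)->_priority_head.prev = NULL;                   \
        (e)->_priority_head.next = NULL;                   \
        (e)->priority = MK_EVENT_PRIORITY_DEFAULT;         \
    } while (0)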

optional: mk_event_add should not reset event priority

As noted by Leonardo in the comment on this issue, mk_event_add currently resets the event priority to MK_EVENT_PRIORITY_DEFAULT: https://github.com/fluent/fluent-bit/blob/v2.0.9/lib/monkey/mk_core/mk_event_epoll.c#L144

This was likely done so that events always have a default priority. However, if we do the work to ensure events are always properly initialized with MK_EVENT_ZERO, then we can set the priority to default there as well. This is the design we would likely choose if we were building the system from scratch.

Alternate Solution - check queue in mk_event_add

Another solution would be for mk_event_add to iterate over all priority bucket queues and determine whether the event is already in one of them before deciding whether mk_list_del should be called.

This would work and would eliminate the need to update all existing usages of monkey events to properly initialize them. However, this solution is inefficient and wastes CPU cycles. If the code were designed from scratch, we would almost certainly choose to always initialize events properly to solve this problem.
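
For completeness, the check in this alternate solution would amount to scanning every bucket queue on each mk_event_add, roughly like the sketch below (illustrative names; the cost is proportional to the total number of queued events):

#include <monkey/mk_core/mk_list.h>

/* Sketch of the rejected membership check: O(total queued events). */
static int sketch_event_is_queued(struct mk_list *bucket_queues,
                                  int n_buckets,
                                  struct mk_list *needle)
{
    int i;
    struct mk_list *head;

    for (i = 0; i < n_buckets; i++) {
        mk_list_foreach(head, &bucket_queues[i]) {
            if (head == needle) {
                return 1;   /* event already linked into a bucket */
            }
        }
    }
    return 0;
}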

Optional Proposals for consideration

Always call mk_event_del in flb_upstream_conn_release

It seems less than ideal that events left over from a previous use of a connection can remain even after the connection is released. A call to mk_event_del could simply be added at the beginning of flb_upstream_conn_release.
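
Sketched, the change is a single unregister call at the top of the release path. The wrapper below is hypothetical and the field names (evl, event) are assumptions that should be verified against src/flb_upstream.c; only mk_event_del is the real API.

/* Hypothetical sketch -- not a verified patch to flb_upstream_conn_release. */
static void sketch_conn_release(struct flb_upstream_conn *u_conn,
                                struct mk_event_loop *evl)
{
    /* Proposed addition: drop any leftover event registration before the
     * connection is recycled or returned to the keepalive pool. */
    mk_event_del(evl, &u_conn->event);

    /* ...the existing release logic would continue from here... */
}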

Open question: even in the non-keepalive case, could failing to remove the event lead to a new file descriptor with the same number triggering an old event?

Replace max_iter with loop that iterates over all network events (all events higher than given priority)

As noted in How the priority event loop works there is a max_iter that controls the number of iterations of the inner loop for each iteration of the outer loop.

This means that not all events in the priority buckets are run in a single outer loop iteration before the clean-up code runs. Note that the inner loop calls epoll_wait on each iteration to pick up any new events that were added to the triggered list.

While this does not appear to cause an issue as far as we can tell, it is potentially not ideal. We could update the priority event loop's inner loop to iterate over all events in the bucket queues at or above a given priority. That cutoff could be the network event priority, so that all pending network events run in a single outer loop iteration before the clean-up code runs.
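
A rough sketch of "drain everything at or above a cutoff priority" is below. It is illustrative only (made-up names, no epoll_wait re-polling, no dispatch details), just to show the shape of the change:

#include <monkey/mk_core/mk_list.h>

/* Sketch: drain every bucket at or above cutoff_priority before returning
 * to the outer loop, instead of stopping after a fixed max_iter.
 * In this sketch, a lower number means a more urgent priority. */
static void sketch_drain_high_priority(struct mk_list *bucket_queues,
                                       int cutoff_priority)
{
    int prio;
    struct mk_list *node;
    struct mk_list *tmp;

    for (prio = 0; prio <= cutoff_priority; prio++) {
        mk_list_foreach_safe(node, tmp, &bucket_queues[prio]) {
            mk_list_del(node);
            /* ...dispatch the event that owns this _priority_head node... */
        }
    }
}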

@PettitWesley
Contributor Author

@edsiper @leonardo-albertovich @matthewfala The doc is very long since I try to explain the code for a new dev, but skip to the "Problem Statement" section to understand quickly what needs to be fixed. I think the diagram makes it pretty clear. And then check the "Proposed Solution" for discussion on some difficulties in fixing it.

@leonardo-albertovich
Collaborator

First of all, thank you for the detailed document, I really appreciate the effort you put into it.

I think the right thing to do is ensure all initialization is properly done by the default macros and to check for priority queue linkage in _mk_event_add just like we do in _mk_event_del.

The problem I see with that, and something I didn't find in the writeup (could've missed it), is that _mk_event_add seems to reset the event's priority to MK_EVENT_PRIORITY_DEFAULT, which doesn't seem to be what we want (i.e., having fds "surreptitiously" switch priorities when they are going through TLS handshake or regular I/O), so we might want to keep that in mind.

I don't think the optional approaches are in line with what we strive for when it comes to code clarity so I wouldn't pursue them.

As for the alternate solution, I think it would not be reliable because:

  1. If you have garbage in the pointers you would end up dereferencing unmapped memory or worse.
  2. It would be expensive, like, really expensive considering how mk_list works.

I think it would be valuable to take a step back after applying the initialization fix and evaluate the priority system, because it seems to me that it could be simplified by integrating part of it into the core event loop, which would greatly reduce the amount of moving parts and complexity (i.e. flb_event_priority_live_foreach is a bit over the top, and this has already caused issues with injected events which would just fall into place if we refactored the code).

@PettitWesley
Contributor Author

Good point, setting priority to default should happen in the initialization code, not in mk_event_add, I'll add this.

@PettitWesley
Contributor Author

I don't think the optional approaches are in line with what we strive for when it comes to code clarity so I wouldn't pursue them.

@leonardo-albertovich this includes the optional change to call mk_event_del in flb_upstream_conn_release?

PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Feb 13, 2023
Full issue explanation and design here:
fluent#6838

Signed-off-by: Wesley Pettit <[email protected]>
@leonardo-albertovich
Collaborator

Yes, that's included. It doesn't seem straightforward enough to ensure that future contributors (or even us, after context switching a number of times over weeks or months) would not accidentally introduce a bug, and there is also the case of the downstream layer (which probably has its own issues, honestly), so I am in favor of centralized fixes/rework rather than localized fixes whenever possible.

@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label May 16, 2023
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 21, 2023