mk_event_add can corrupt the priority bucket queues #6838
Comments
@edsiper @leonardo-albertovich @matthewfala The doc is very long since I tried to explain the code for a new dev, but skip to the “Problem Statement” section to understand quickly what needs to be fixed. I think the diagram makes it pretty clear. Then check the “Proposed Solution” for discussion of some difficulties in fixing it.
First of all, thank you for the detailed document, I really appreciate the effort you put into it. I think the right thing to do is to ensure all initialization is properly done by the default macros and to check for priority queue linkage in `mk_event_add`. The problem I see with that, and something I didn't find in the writeup (could've missed it), is that `mk_event_add` also resets the event priority to `MK_EVENT_PRIORITY_DEFAULT`. I don't think the optional approaches are in line with what we strive for when it comes to code clarity, so I wouldn't pursue them. As for the alternate solution, I don't think it would be reliable.

I think it would be valuable to take a step back after applying the initialization fix and evaluate the priority system, because it seems to me that it could be simplified by integrating part of it into the core event loop, which would greatly reduce the amount of moving parts and complexity.
Good point, setting priority to default should happen in the initialization code, not in `mk_event_add`.
@leonardo-albertovich this includes the optional change to call `mk_event_del` in `flb_upstream_conn_release`?
Full issue explanation and design here: fluent#6838 Signed-off-by: Wesley Pettit <[email protected]>
Yes, that's included. It doesn't seem straightforward enough to ensure that future contributors (or even us, after context switching a number of times over weeks or months) would not accidentally introduce a bug, and there is also the case of the downstream layer as well (which probably has its own issues honestly), so I am in favor of centralized fixes / rework rather than localized fixes whenever possible.
Fluent Bit Priority Queue and Keepalive Networking Crashes: Root Cause and Solutions
Credit
I want to be clear that I only wrote this explainer; @matthewfala discovered this issue.
Background
Stack Trace obtained from issue repro
How the priority event loop works
Fluent Bit currently uses a priority event loop to schedule work. Key to understanding this bug are the following facts (a simplified model follows the list):

- The priority bucket queues are lists (`mk_list`), which are doubly linked lists.
- Each event (`mk_event`) has a `_priority_head` which can link it into a priority bucket queue doubly linked list.
- The event loop inner loop is `flb_event_priority_live_foreach`. The inner loop goes over events in the priority buckets, in order of most priority to least priority. The macro has a `max_iter` argument which is set to `FLB_ENGINE_LOOP_MAX_ITER` (10). This means that the inner loop will not iterate over more than 10 events, so all events in the priority buckets may not be processed in a single inner loop cycle.
- The event loop outer loop calls `flb_event_priority_live_foreach` once per iteration, meaning it processes up to 10 events per iteration. It then calls clean up code at the end of the iteration. This includes calling `flb_upstream_conn_pending_destroy_list`, which cleans up and frees all `flb_upstream_conn` objects that are pending deletion in the destroy queue. More on this in the next section.
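The following is a small self-contained model of the behavior described above. It is illustrative only, not fluent-bit code, and uses simplified placeholder names throughout; the point is that at most `MAX_ITER` events are handled per outer-loop iteration, so the cleanup phase runs while events can still be waiting in the buckets.

```c
/*
 * Self-contained model (NOT fluent-bit code) of the scheduling described above.
 * Events are pre-sorted by priority to keep the model short.
 */
#include <stdio.h>

#define MAX_ITER 10   /* stands in for FLB_ENGINE_LOOP_MAX_ITER */
#define N_EVENTS 14   /* more pending events than one inner loop will handle */

struct demo_event {
    int fd;
    int priority;     /* 0 = highest priority */
};

int main(void)
{
    struct demo_event queued[N_EVENTS];
    int head = 0;     /* queued[head .. N_EVENTS-1] are still pending */
    int i;

    for (i = 0; i < N_EVENTS; i++) {
        queued[i].fd = 100 + i;
        queued[i].priority = (i < 2) ? 0 : 1;   /* two timer events, then network events */
    }

    /* one outer-loop iteration: the "inner loop" handles at most MAX_ITER events */
    for (i = 0; i < MAX_ITER && head < N_EVENTS; i++) {
        printf("handle fd=%d priority=%d\n", queued[head].fd, queued[head].priority);
        head++;
    }

    /* cleanup phase of the outer loop: in fluent-bit this is where
     * flb_upstream_conn_pending_destroy_list() frees queued connections */
    printf("cleanup runs with %d events still queued\n", N_EVENTS - head);
    return 0;
}
```

Running this prints ten handled events and then shows the cleanup phase executing with four events still queued.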
Relevant Code:
Monkey Event Key Background
- `mk_event_add`: Adds the given file descriptor to the epoll interest list. Sets the `_priority_head` `next` and `prev` pointers to `NULL`.
- `mk_event_del`: Removes the given file descriptor from the epoll interest list. If the `_priority_head` is set, calls `mk_list_del` to remove it from the priority bucket queue. (An abridged sketch of the structures these calls operate on follows.)
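For orientation, here is an abridged and partly assumed view of the pieces involved. The real definitions live in lib/monkey (`mk_core/mk_event.h`, `mk_core/mk_list.h`) and contain additional members; only the fields named in this document are shown.

```c
/* Abridged/illustrative view only; not the actual monkey definitions. */
struct mk_list {
    struct mk_list *prev;
    struct mk_list *next;
};

struct mk_event {
    int            fd;              /* file descriptor registered with epoll */
    int            priority;        /* e.g. FLB_ENGINE_PRIORITY_NETWORK      */
    struct mk_list _priority_head;  /* links the event into one bucket queue */
    /* handler, mask, status, data, ... omitted here                         */
};

/*
 * Behavior described in the bullets above (paraphrased, not the exact source):
 *   mk_event_add():  registers the fd with epoll and sets
 *                    event->_priority_head.prev/next = NULL
 *   mk_event_del():  removes the fd from epoll and, if _priority_head is set,
 *                    calls mk_list_del(&event->_priority_head)
 */
```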
Usages of mk_event_add in the current code
- `./plugins/in_syslog/syslog_conn.c:133`: creates a new `fd` for each new connection and adds a new `mk_event` for each new `fd`. Memory is malloc allocated.
- `./plugins/in_tcp/tcp_conn.c:317`: the connection event is initialized with `MK_EVENT_NEW`. Memory is malloc allocated.
- `./plugins/in_forward/fw_conn.c:142`: the connection event is initialized with `MK_EVENT_NEW`. Memory is malloc allocated.
- `./plugins/in_mqtt/mqtt_conn.c:99`
- `./plugins/in_opentelemetry/http_conn.c:236`: the connection event is initialized with `MK_EVENT_NEW`. Memory is calloc allocated.
- `./plugins/in_http/http_conn.c:236`: the connection event is initialized with `MK_EVENT_NEW`. Memory is calloc allocated.

The next group of usages is for `flb_upstream_conn` socket file descriptor events. These events always have priority `FLB_ENGINE_PRIORITY_NETWORK`, which has the value of `1`, the second highest priority. The only events that have more priority (`0`) are: `FLB_ENGINE_PRIORITY_CB_SCHED`, `FLB_ENGINE_PRIORITY_CB_TIMER`, `FLB_ENGINE_PRIORITY_SHUTDOWN`, `FLB_ENGINE_PRIORITY_FLUSH`. Events for a single socket `fd` are added multiple times throughout the course of an HTTP request. If keepalive is enabled, more calls to `mk_event_add` will be made. (A sketch of the add/yield/delete pattern these usages share follows the list.)

- `./src/flb_network.c:442` `net_connect_async`: adds the event after calling `MK_EVENT_ZERO`. In theory, connecting should be the first time an event is added for any socket. In `flb_upstream.c` the `flb_upstream_conn` is calloc allocated.
- `./src/flb_io.c:215` and `./src/flb_io.c:280` `net_io_write_async`: typical async networking code where the `flb_upstream_conn` event is added before a coro-yield, and removed afterwards with `mk_event_del`.
- `./src/flb_io.c:360` `net_io_read_async`: the `flb_upstream_conn` event is added before a coro-yield. There is no code to remove it after resume, though.
- `./src/tls/flb_tls.c:101` `io_tls_event_switch`: helper function to add the `flb_upstream_conn` event before a coro-yield. For writes it is removed afterwards, for reads it is not.
- `./src/tls/flb_tls.c:399`: the `flb_upstream_conn` event is added before a coro-yield in `flb_tls_session_create`. The event is removed afterwards.
- `./src/flb_upstream.c:761` `flb_upstream_conn_release`: a close watch is added for the `flb_upstream_conn` event for keepalive connections that will be recycled.
- `./src/flb_network.c:856` `flb_dns_ares_socket`: creates a socket just for async DNS and then creates a new `mk_event` in the DNS lookup context. AFAICT this is the only place where this event/socket is added to the event loop. The `flb_dns_lookup_context` which contains the event is calloc allocated.

The input collector events below are removed with `mk_event_del` when the collector pauses.

- `./src/flb_input.c:949` `collector_start`: the `flb_input_collector` event is added to the event loop. The `flb_input_collector` is malloc allocated.
- `./src/flb_input.c:1157` `flb_input_collector_resume`: the `flb_input_collector` event is re-added when a paused input is resumed. `flb_input_collector_pause` removes the event. This is another repeated `mk_event_add` call.
- `./src/flb_log.c:160` `flb_log_worker_init`: the `flb_worker` event is added for log messages. The worker is calloc allocated.
- `./src/flb_log.c:259` `flb_log_create`: the log pipe event is added to the event loop. The `flb_log` struct is calloc allocated.
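Several of the `flb_io.c` / `flb_tls.c` usages above follow the same add/yield/delete shape. Below is a heavily simplified sketch of that pattern; every identifier is a placeholder rather than the fluent-bit API. The point is the asymmetry: the write path deletes its event after resuming, while the read path leaves it registered.

```c
/* Placeholder declarations, not fluent-bit APIs. */
void demo_event_add(int fd);   /* stands in for mk_event_add on the conn event */
void demo_event_del(int fd);   /* stands in for mk_event_del                   */
void demo_coro_yield(void);    /* coroutine yields back to the event loop      */

void demo_io_write_async(int fd)
{
    demo_event_add(fd);        /* watch the connection fd for writability      */
    demo_coro_yield();         /* resumed by the event loop when it triggers   */
    demo_event_del(fd);        /* write path removes its event afterwards      */
}

void demo_io_read_async(int fd)
{
    demo_event_add(fd);        /* the same fd is added again, for readability  */
    demo_coro_yield();
    /* no demo_event_del() here: the read event stays registered after resume  */
}
```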
Network Code Key Background
- `flb_upstream_conn_timeouts`: This code checks all `flb_upstream_conn` to determine if any have timed out, either the connection establishment timeout or the keepalive idle timeout. This code runs via a timer event in the event loop. Timer events have the highest priority and run before network events. This code calls `prepare_destroy_conn` on the timed out connections.
- `prepare_destroy_conn`: Calls `mk_event_del` on the `flb_upstream_conn` event, closes the socket, and then adds it to the `destroy_queue`. The destroy queue is for connections that are ready to be freed. (This path is sketched below.)
- `flb_upstream_conn_pending_destroy`: This code actually destroys and frees all `flb_upstream_conn` in the `destroy_queue`. This code runs as part of the “clean up code” part of the event loop outer loop described in the previous section: How the priority event loop works.
- `flb_upstream_conn` events are not removed at the end of an async HTTP request: the HTTP client calls `flb_io_net_read`, which calls out to `net_io_read_async`, which, as noted in the previous section (Usages of mk_event_add in the current code), does not remove the read event after coro-resume.
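Here is an illustrative-only model of the timeout path described above, using placeholder types and names rather than the fluent-bit API. It shows the two-phase shape: the timer handler unlinks the event and closes the socket immediately, but the memory is only freed later in the cleanup phase.

```c
/* Placeholder types and declarations, not fluent-bit APIs. */
struct demo_conn {
    int               fd;
    struct demo_conn *next_pending;   /* link into the destroy queue */
};

void demo_event_del(struct demo_conn *conn);   /* stands in for mk_event_del */
void demo_close(int fd);
extern struct demo_conn *demo_destroy_queue;   /* connections waiting to be freed */

/* Runs from the timeout timer (highest priority, before network events);
 * roughly the role of flb_upstream_conn_timeouts() calling prepare_destroy_conn(). */
void demo_prepare_destroy_conn(struct demo_conn *conn)
{
    demo_event_del(conn);              /* unlink the event from epoll / bucket queue */
    demo_close(conn->fd);              /* the socket is closed immediately...        */
    conn->next_pending = demo_destroy_queue;
    demo_destroy_queue = conn;         /* ...but the memory is only freed later, in
                                        * the outer loop's cleanup phase
                                        * (flb_upstream_conn_pending_destroy)        */
}
```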
Problem Statement
Currently, `mk_event_add` can corrupt the priority event loop bucket queue when it is called more than once for a specific event/file descriptor. This can cause a crash directly, or it can indirectly cause the stack trace noted in the background section.
How mk_event_add can corrupt the bucket queue
The problem is that `mk_event_add` sets the event's `_priority_head` `prev` and `next` pointers to NULL. If the event was already added previously, was triggered, and is in a bucket queue, this will corrupt the queue.

Code reference: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/mk_core/mk_event_epoll.c#L142

The diagram above shows the corruption specifically for the case of connection events. This is because the connection event code is the current place where `mk_event_add` can be called repeatedly for the same file descriptor.

Before the call to `mk_event_add`, the connection event is in the priority queue doubly linked list. It has a link to the next and previous elements, and the next and previous elements have links to it.

After the pointers are set to NULL, the list is corrupted. The next and previous events still have pointers to the connection event, but it does not have any pointers to them. If we traverse the list over the `next` pointers (as happens in the event loop), we will get to the conn event and then hit a NULL reference, directly leading to a segmentation fault.
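To make the failure mode concrete, here is a small self-contained demo. It is not fluent-bit code; a minimal doubly linked list stands in for `mk_list`. It builds a three-node queue, NULLs the middle node's links the way `mk_event_add` resets `_priority_head`, and then walks the list forward the way the event loop does.

```c
/*
 * Self-contained demo (NOT fluent-bit code). Running it is expected to crash
 * with a segmentation fault, mirroring the NULL dereference described above.
 */
#include <stdio.h>
#include <stddef.h>

struct node { struct node *prev, *next; };

static void list_init(struct node *head)
{
    head->prev = head;
    head->next = head;
}

static void list_append(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

int main(void)
{
    struct node head, a, conn, b;

    list_init(&head);
    list_append(&head, &a);
    list_append(&head, &conn);   /* stands in for the connection event */
    list_append(&head, &b);

    /* what the current mk_event_add effectively does to an already-queued event */
    conn.prev = NULL;
    conn.next = NULL;

    /* forward traversal, as the event loop does over a bucket queue:
     * a -> conn -> conn.next is now NULL, so the loop soon dereferences NULL */
    for (struct node *it = head.next; it != &head; it = it->next) {
        printf("visiting %p\n", (void *) it);
    }
    return 0;
}
```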
How queue corruption could lead to the reported segfault
This corruption can also explain the stack trace noted in the background.
This could happen if the following events occur:
WIP
Proposed Solution
Fix mk_event_add
`mk_event_add` should remove the event from the priority queue if it is already present, using `mk_list_del`. It should do this instead of setting the `_priority_head` references to NULL.

It of course should only call `mk_list_del` if the event is in the bucket queue. This leads to a problem. The only way to tell if the event is already in a bucket queue is if its `_priority_head` references are not NULL. However, currently, event memory is not required to be zero'd or initialized. Thus, we have no way of knowing if a non-NULL list reference indicates the event is in a list or if the client code just did not properly initialize it.

This problem leads to the second part of the solution.
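A minimal sketch of the proposed check, assuming the initialization guarantee described in the next section. The helper names below are illustrative; the real change would live in `lib/monkey/mk_core/mk_event_epoll.c` (and the other event loop backends).

```c
#include <stddef.h>

/* Placeholder list type standing in for mk_list. */
struct demo_list { struct demo_list *prev, *next; };

static void demo_list_del(struct demo_list *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->prev = NULL;
    node->next = NULL;
}

/* Called from mk_event_add for the event's _priority_head. With mandatory
 * initialization, non-NULL links can only mean "currently in a bucket queue". */
static void demo_unlink_if_queued(struct demo_list *priority_head)
{
    if (priority_head->prev != NULL && priority_head->next != NULL) {
        demo_list_del(priority_head);   /* instead of blindly NULL-ing the links */
    }
}
```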
Require event initialization
Currently there is a macro, `MK_EVENT_ZERO`, for initializing events. This should be updated to set the `_priority_head` references to NULL. All existing usages of monkey events should be updated to use this initialization. This is a relatively small amount of work; as noted in an earlier section, there are only 18 usages of `mk_event_add` in the current code.

Current `MK_EVENT_ZERO`: https://github.com/fluent/fluent-bit/blob/v1.9.10/lib/monkey/include/monkey/mk_core/mk_event.h#L119
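For illustration, this is the kind of initialization every event would need before its first use. It is a sketch only; in practice the existing `MK_EVENT_ZERO` at the link above would be extended rather than a new macro introduced.

```c
#include <stddef.h>

/* The optional proposal further below would also set (ev)->priority to
 * MK_EVENT_PRIORITY_DEFAULT here, instead of in mk_event_add. */
#define DEMO_EVENT_INIT(ev)                 \
    do {                                    \
        (ev)->_priority_head.prev = NULL;   \
        (ev)->_priority_head.next = NULL;   \
    } while (0)
```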
optional: mk_event_add should not reset event priority
As noted by Leonardo in the comment on this issue, `mk_event_add` currently resets the event priority to `MK_EVENT_PRIORITY_DEFAULT`: https://github.com/fluent/fluent-bit/blob/v2.0.9/lib/monkey/mk_core/mk_event_epoll.c#L144

This was likely done so that events always have a default priority. However, if we do the work to ensure that events are always properly initialized using `MK_EVENT_ZERO`, then we can set the priority to the default in that macro as well. This seems like a more ideal code design that we would likely choose if we designed this system from nothing now.

Alternate Solution - check queue in mk_event_add
Another solution would be to iterate over all priority bucket queues in `mk_event_add` and determine whether the event is already in a list before deciding if `mk_list_del` should be called.

This would work and would eliminate the need to update all existing usages of monkey events to properly initialize them. However, this solution is inefficient and wastes CPU cycles. If the code were designed from scratch, we would almost certainly choose to always initialize events properly to solve this problem.
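For comparison, a sketch of what the membership scan would look like, with placeholder types and names. The nested loop on every `mk_event_add` call is the wasted CPU cost referred to above.

```c
#include <stddef.h>

/* Placeholder list type standing in for mk_list. */
struct demo_list { struct demo_list *prev, *next; };

static int demo_event_is_queued(struct demo_list *buckets, int n_buckets,
                                struct demo_list *node)
{
    int p;
    struct demo_list *it;

    for (p = 0; p < n_buckets; p++) {
        for (it = buckets[p].next; it != &buckets[p]; it = it->next) {
            if (it == node) {
                return 1;   /* found: the caller should unlink it first */
            }
        }
    }
    return 0;
}
```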
Optional Proposals for consideration
Always call mk_event_del in flb_upstream_conn_release
It seems not ideal that events leftover from a previous usage of a connection can remain even after the connection is released. A call to `mk_event_del` could simply be added at the beginning of `flb_upstream_conn_release`.

Open question: even in the non-keepalive case, could not removing the event lead to a new file descriptor with the same number triggering an old event?
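A minimal sketch of the proposal, not a drop-in patch. It assumes the connection struct exposes its event and event loop as `event` and `evl` members; treat both names as assumptions and check the real `struct flb_upstream_conn` definition.

```c
#include <fluent-bit/flb_upstream.h>

int flb_upstream_conn_release(struct flb_upstream_conn *u_conn)
{
    /* proposed addition: unconditionally drop any leftover event watch
     * from the previous use of this connection */
    mk_event_del(u_conn->evl, &u_conn->event);

    /* ... existing release / keepalive recycling logic continues here ... */
}
```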
Replace max_iter with loop that iterates over all network events (all events higher than given priority)
As noted in How the priority event loop works, there is a `max_iter` that controls the number of iterations of the inner loop for each iteration of the outer loop.

This means that not all events in the priority buckets are run in a single outer loop iteration before the clean up code runs. It should be noted that the inner loop calls epoll_wait at each iteration to pick up any new events that were added to the triggered list.
While this is not an issue as far as we can tell, it is potentially not ideal. We could consider updating the priority event loop inner loop to iterate over all events in the bucket queues that have a given priority or higher. This could be the network event priority, so that all pending network events are run in a single outer loop iteration before the clean up code runs.
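A sketch of what the inner loop condition could look like under this proposal. Placeholder names are used throughout; none of these are fluent-bit APIs.

```c
/* Placeholder declarations for the sketch below. */
struct demo_event { int priority; };   /* 0 = highest priority */

#define DEMO_PRIORITY_NETWORK 1        /* stands in for FLB_ENGINE_PRIORITY_NETWORK */

extern int  demo_engine_running;
extern void demo_collect_triggered_events(void);      /* epoll_wait + bucket insert */
extern struct demo_event *demo_peek_highest(void);    /* NULL if buckets are empty  */
extern struct demo_event *demo_pop_highest(void);
extern void demo_handle(struct demo_event *ev);
extern void demo_cleanup(void);  /* flb_upstream_conn_pending_destroy_list(), etc. */

void demo_engine_loop(void)
{
    while (demo_engine_running) {
        demo_collect_triggered_events();

        /* replaces the max_iter-bounded inner loop: drain the buckets while the
         * most urgent pending event is at network priority or higher */
        while (demo_peek_highest() != NULL &&
               demo_peek_highest()->priority <= DEMO_PRIORITY_NETWORK) {
            demo_handle(demo_pop_highest());
        }

        demo_cleanup();
    }
}
```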