[pkg/stanza] Compare performance with and without buffer in LogEmitter
#35454
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
I have performed the following two tests with OpenTelemetry Collector Contrib v0.111.0, using Go 1.22.8, on my laptop with Ubuntu 24.04:

1. Run the existing benchmark test
I believe the above results show that simply removing the buffering from LogEmitter would have a catastrophic effect on the Filelog receiver's performance. Before removing the buffer, we should implement buffering at a different stage. See #35456 for details.
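To give a rough idea of the kind of comparison being made here, below is a minimal, self-contained micro-benchmark sketch. The `entry` type and `consume` function are illustrative stand-ins, not the actual `pkg/stanza` benchmark or `LogEmitter` code; the batch size of 100 matches the buffer size mentioned in this issue.

```go
package emitter

import "testing"

// entry stands in for a stanza log entry; the real type lives in pkg/stanza/entry.
type entry struct{ body string }

// sink keeps the compiler from optimizing the consume calls away.
var sink int

// consume simulates downstream processing of a batch of entries.
func consume(batch []*entry) {
	sink += len(batch)
}

// BenchmarkBufferedEmit flushes entries downstream in batches of 100,
// mimicking the current buffered LogEmitter behavior.
func BenchmarkBufferedEmit(b *testing.B) {
	const maxBatchSize = 100
	batch := make([]*entry, 0, maxBatchSize)
	for i := 0; i < b.N; i++ {
		batch = append(batch, &entry{body: "log line"})
		if len(batch) == maxBatchSize {
			consume(batch)
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		consume(batch)
	}
}

// BenchmarkUnbufferedEmit hands every entry downstream immediately,
// mimicking a LogEmitter with the batch buffer removed.
func BenchmarkUnbufferedEmit(b *testing.B) {
	for i := 0; i < b.N; i++ {
		consume([]*entry{{body: "log line"}})
	}
}
```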
…nous (#35669)

#### Description

This PR changes the `LogEmitter` to accept a synchronous consumer callback function for processing a batch of log entries as an alternative to sending log entry batches to a channel. The components that use the `LogEmitter` (adapter and parser) have been adapted accordingly. Also, in the case of the adapter, the log entries are converted directly, rather than sending them over a channel to the converter and receiving the converted results over a different channel.

#### Link to tracking issue

Fixes #35453

#### Testing

I did some initial performance tests using the `TestLogLargeFiles` load test to see how this change affects the performance. Below are the results:

**Before the change (i.e. with async log entry batch processing)**

```
=== RUN   TestLogLargeFiles/filelog-largefiles-2Gb-lifetime
2024/10/08 09:02:53 | Sent:17,769,795 logs (179,507/sec) | Received:17,755,188 items (179,346/sec)
=== RUN   TestLogLargeFiles/filelog-largefiles-6GB-lifetime
2024/10/08 09:06:29 | Sent:42,857,755 logs (216,465/sec) | Received:42,851,987 items (216,424/sec)

Test                                          |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items|
----------------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:|
LogLargeFiles/filelog-largefiles-2Gb-lifetime |PASS  |    100s|    73.1|    78.4|        106|        118|  18249451|      18249451|
LogLargeFiles/filelog-largefiles-6GB-lifetime |PASS  |    200s|    87.5|    98.1|        110|        116|  44358460|      44358460|
```

**After the change (i.e. with sync log entry batch processing)**

```
=== RUN   TestLogLargeFiles/filelog-largefiles-2Gb-lifetime
2024/10/08 10:09:51 Agent RAM (RES): 139 MiB, CPU:71.7% | Sent:17,802,561 logs (179,836/sec) | Received:17,788,273 items (179,680/sec)
=== RUN   TestLogLargeFiles/filelog-largefiles-6GB-lifetime
2024/10/08 10:05:15 Agent RAM (RES): 140 MiB, CPU:95.6% | Sent:42,912,030 logs (216,744/sec) | Received:42,904,306 items (216,689/sec)

Test                                          |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items|
----------------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:|
LogLargeFiles/filelog-largefiles-2Gb-lifetime |PASS  |    100s|    74.8|    78.9|        127|        139|  17984687|      17984687|
LogLargeFiles/filelog-largefiles-6GB-lifetime |PASS  |    200s|    89.3|   100.9|        134|        140|  43376210|      43376210|
```

Those results seem to indicate comparable throughput, but with increased resource consumption, especially in terms of memory.

I also did a test comparing the performance between the synchronous and asynchronous log emitter using the same methodology as in #35454. The results were the following, and indicate an increase in the time it takes to read the generated log file (see #35454 for details on how the file is generated and how the test is executed):

- Async Log Emitter: ~8s
- Sync Log Emitter: ~12s

<details>
<summary>output-async.log</summary>

```
=== Step 3: Thu Oct 10 08:54:23 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 2.209674e+06
=== Step 4: Thu Oct 10 08:54:25 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 5.428103e+06
=== Step 5: Thu Oct 10 08:54:26 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 7.337017e+06
=== Step 6: Thu Oct 10 08:54:27 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 9.258843e+06
=== Step 7: Thu Oct 10 08:54:29 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.3082428e+07
=== Step 8: Thu Oct 10 08:54:31 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="3744d4cb-5080-427c-8c16-a96cb40a57d4",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.6519068e+07
```

</details>

<details>
<summary>output-sync.log</summary>

```
=== Step 2: Thu Oct 10 08:51:27 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.580891e+06
=== Step 3: Thu Oct 10 08:51:28 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 3.01034e+06
=== Step 4: Thu Oct 10 08:51:29 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 4.434627e+06
=== Step 5: Thu Oct 10 08:51:31 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 7.416612e+06
=== Step 6: Thu Oct 10 08:51:34 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.0496072e+07
=== Step 7: Thu Oct 10 08:51:36 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.3523882e+07
=== Step 8: Thu Oct 10 08:51:37 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.4929707e+07
=== Step 9: Thu Oct 10 08:51:39 CEST 2024
otelcol_receiver_accepted_log_records{receiver="filelog",service_instance_id="dcf5371b-19eb-47b3-a820-756c1832b448",service_name="otelcontribcol",service_version="0.111.0-dev",transport=""} 1.6519105e+07
```

</details>

---------

Signed-off-by: Florian Bacher <[email protected]>
Co-authored-by: Andrzej Stencel <[email protected]>
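The core idea of the change can be shown with a small, self-contained sketch. The type and field names below are illustrative, not the actual `pkg/stanza` `LogEmitter` API: instead of writing each completed batch to a channel that a separate goroutine drains, the emitter invokes a consumer callback synchronously.

```go
package emitter

import "context"

// Entry is a stand-in for a stanza log entry.
type Entry struct{ Body string }

// ConsumerFunc processes a batch of entries synchronously.
type ConsumerFunc func(ctx context.Context, entries []*Entry) error

// AsyncEmitter models the previous behavior: completed batches are sent to a
// channel, and a separate converter goroutine (not shown) receives them.
type AsyncEmitter struct {
	LogChan chan []*Entry
}

func (e *AsyncEmitter) flush(batch []*Entry) {
	e.LogChan <- batch // hand off to the consumer goroutine
}

// SyncEmitter models the new behavior: the consumer callback runs directly in
// the emitter's flush path, so the caller observes backpressure immediately.
type SyncEmitter struct {
	Consumer ConsumerFunc
}

func (e *SyncEmitter) flush(ctx context.Context, batch []*Entry) error {
	return e.Consumer(ctx, batch)
}
```

The synchronous path removes the intermediate channels and goroutines, but a slow consumer now directly slows down the file reader, which is consistent with the ~8s vs ~12s read times reported above.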
This changes the Log Emitter to run the `consumerFunc` on the whole batch, instead of splitting the batch into individual entries and calling `consumerFunc` on each of them. This doesn't change much while the Log Emitter has its own `batch` buffer, but if we remove the `batch` buffer (see open-telemetry#35456), this should prevent the performance drop described in open-telemetry#35454.
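A simplified sketch of the difference described in the note above; the types are illustrative stand-ins, not the actual `LogEmitter` code.

```go
package emitter

import "context"

type entry struct{ body string }

type consumerFunc func(ctx context.Context, entries []*entry) error

// flushPerEntry calls the consumer once for every entry in the batch. With the
// internal batch buffer removed, this per-record overhead is what causes the
// performance drop described in #35454.
func flushPerEntry(ctx context.Context, consume consumerFunc, batch []*entry) error {
	for _, e := range batch {
		if err := consume(ctx, []*entry{e}); err != nil {
			return err
		}
	}
	return nil
}

// flushWholeBatch calls the consumer once with the entire batch, so downstream
// conversion and consumption can amortize per-call costs.
func flushWholeBatch(ctx context.Context, consume consumerFunc, batch []*entry) error {
	return consume(ctx, batch)
}
```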
Component(s)
pkg/stanza
Describe the issue you're reporting
TL;DR: Let's measure the performance impact of removing the 100-log buffer from the Stanza adapter's LogEmitter.
The problem:
The `LogEmitter`'s buffer for Stanza entries (in the `batch` field) is a source of data loss during non-graceful collector shutdowns: the items in the buffer are not persisted, yet they are already marked as sent by the file consumer.

What to do about it:
In an effort to make the Filelog receiver (and other Stanza-based receivers) 100% reliable, we want to remove any possibility of data loss. The simplest solution is to remove this buffering and send each log record through the OTLP pipeline separately, as a separate Log Resource. This may have a performance impact, though, so the first step is to measure that potential impact. Once we have this data, we can decide on the next steps.
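As a rough sketch of what "each log record as a separate Log Resource" could look like, using the collector's `pdata` API; the `emitEachRecord` helper is hypothetical and not an existing `pkg/stanza` function.

```go
package emitter

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// emitEachRecord converts every log body into its own plog.Logs payload with a
// single ResourceLogs entry and sends it downstream immediately, i.e. with no
// intermediate buffering in the emitter.
func emitEachRecord(ctx context.Context, next consumer.Logs, bodies []string) error {
	for _, body := range bodies {
		ld := plog.NewLogs()
		lr := ld.ResourceLogs().AppendEmpty().ScopeLogs().AppendEmpty().LogRecords().AppendEmpty()
		lr.Body().SetStr(body)
		if err := next.ConsumeLogs(ctx, ld); err != nil {
			return err
		}
	}
	return nil
}
```

Whether the per-payload overhead of this approach is acceptable is exactly what the measurement requested below is meant to establish.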
Expected output:
A comparison (e.g. in the form of a table) of collector performance between the current standard build of the collector and a build with this buffer removed, or reduced to hold only one log, which is probably roughly equal in performance to not having the buffer at all.
Additional context:
This issue was created as a result of the discussion in #31074; specifically, see #31074 (comment) and later comments.