Batch processor can deadlock/freeze using Go 1.23 timers #11332
Comments
This is indeed due to the recent updates to the behaviour of time.Timer. The lines here
Specifically, with the recent update, a quick fix will be
But this fix will need to go in when the Go module for the batch processor is updated to 1.23 (or use a compile-time directive).
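The referenced lines and the suggested patch aren't reproduced above, but the pattern at play is presumably the familiar stop-and-drain timer idiom. A rough sketch of the two variants, illustrative only and not the actual batch processor code:

```go
package main

import "time"

// stopAndDrain is the pre-Go 1.23 idiom: Timer.C used to be buffered, so
// after a failed Stop the channel had to be drained to discard a stale value
// before the timer was reused. With the Go 1.23.0/1.23.1 semantics, Stop can
// report false while the pending value is discarded, so this receive can
// block forever.
func stopAndDrain(t *time.Timer) {
	if !t.Stop() {
		<-t.C
	}
}

// stopOnly relies on the Go 1.23 guarantee that a receive from t.C after
// Stop returns will block rather than yield a stale value, so no drain is
// needed. As noted above, this only becomes safe once the module requires
// Go 1.23.
func stopOnly(t *time.Timer) {
	t.Stop()
}

func main() {
	t := time.NewTimer(10 * time.Millisecond)
	time.Sleep(20 * time.Millisecond) // let the timer expire unobserved
	stopOnly(t)
	_ = stopAndDrain // kept for comparison; calling it here could hang on Go 1.23.0/1.23.1
}
```

Which variant is correct depends on the timer-channel semantics in effect, which is why the fix is tied to the batch processor module's Go version.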
This is a big problem for all our builds, because we use 1.23 for our Docker images. See open-telemetry/opentelemetry-collector-releases#638
Proposing to downgrade the version of Go used to produce the binary, to give more time for implementing a solution for open-telemetry/opentelemetry-collector#11332 without impacting end users of the binary. Signed-off-by: Alex Boten <[email protected]>
An option to reduce the impact on users is to downgrade the version of Go used for the artifacts produced by the releases repo: open-telemetry/opentelemetry-collector-releases#685. This doesn't solve the problem, but it reduces the impact while the solution is being worked on.
An alternative: bump the batch processor to 1.23?
Another (dirty) alternative using build directives: a different implementation of the batch processing with build tags, as sketched below.
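A rough sketch of that build-tag approach, with hypothetical file, package, and helper names standing in for the real batch processor code:

```go
//go:build go1.23

// timer_go123.go (hypothetical file name): compiled only with Go 1.23 or
// newer. A companion file tagged //go:build !go1.23 would keep the existing
// drain-based implementation for older toolchains.

package batchprocessor

import "time"

// stopShardTimer is a hypothetical helper: with the Go 1.23 timer semantics
// in effect, Stop guarantees that a later receive from t.C will block rather
// than return a stale value, so no drain is required.
func stopShardTimer(t *time.Timer) {
	t.Stop()
}
```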
Apparently this bug was also reported in Go, since it breaks backwards compatibility: golang/go#69312
It appears this is fixed in Go 1.23.2; as @mx-psi found out, the artifacts for the next release will be published with 1.23.2: #11334 (comment)
@jamesmoessis Using the latest go1.23 version should resolve the issue for you.
FYI a fix from OTel Go: You can also build with
Thank you all for addressing the issue so quickly 🙇
Describe the bug
When compiled with Go 1.23, the batch processor can get into the following stuck state, waiting to receive from a stopped timer channel that never sends.
This is the goroutine that is forever waiting for the timerCh to send after it's stopped. Given that our batch processor is synchronised on a single shard/goroutine, this causes the entire collector pipeline to deadlock.
We end up with thousands of goroutines that are forever waiting on a chan receive from the shard:
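As a simplified illustration of that pile-up (the stack traces are not reproduced here, and this is not the actual batch processor code):

```go
package main

import "time"

// A minimal model of a single-shard batcher: one goroutine owns the shard
// and every producer hands items to it over an unbuffered channel.
func main() {
	newItem := make(chan int)

	go func() {
		timer := time.NewTimer(time.Millisecond)
		time.Sleep(10 * time.Millisecond) // let the timer expire unobserved

		// The pre-1.23 drain idiom. With Go 1.22 the expired value sits in
		// the buffered channel and this receive returns at once; with Go
		// 1.23.0/1.23.1 semantics the pending value can be discarded by
		// Stop, and the shard goroutine can block here forever.
		if !timer.Stop() {
			<-timer.C
		}

		// Never reached while the receive above is stuck, so items are no
		// longer consumed from newItem.
		for item := range newItem {
			_ = item // batch the item
		}
	}()

	// Every producer (e.g. an OTLP receiver request goroutine) now blocks on
	// this send, and blocked goroutines pile up until the process is OOM
	// killed.
	newItem <- 1
}
```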
Temporary fix
We were able to fix it by disabling Go 1.23 timers. Looking into the nature of this bug, and seeing that Go 1.23 changed the behaviour of timer channels, I grew suspicious that it was something related to Go 1.23 timers. Since setting GODEBUG="asynctimerchan=1", the issue has not come back in several days (it was happening several times a day before). That leads me to believe this has fixed the issue, but I have not determined the exact cause.

I also noticed that the upstream collector is still compiled with Go 1.22, so this issue wouldn't come up there yet; it would only affect people compiling their own collectors. So this ticket can serve as a future warning, particularly as it's unlikely to be caught by any tests.
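For reference, the workaround can be applied either at run time, by exporting GODEBUG=asynctimerchan=1 in the collector's environment, or baked into a binary with a //go:debug directive in the main package. A minimal sketch of the latter, with the surrounding file purely illustrative:

```go
// Reverts this binary to the pre-1.23 asynchronous timer channels, the same
// effect as running it with GODEBUG=asynctimerchan=1 in the environment.
//go:debug asynctimerchan=1

package main

func main() {
	// ... start the collector distribution as usual ...
}
```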
Steps to reproduce
This is very difficult to reproduce; we have hundreds of pods running the collector, and this only happens to one of the pods every so often.
What did you expect to see?
Collector continuing to run normally.
What did you see instead?
The collector deadlocks: it accepts new requests but fails to process them. This results in a large amount of stack memory being allocated, because many goroutines are created by the OTLP receiver, as well as heap growth until the process gets OOM killed.
What version did you use?
OpenTelemetry Collector v0.109.0, compiled manually.
I believe this was also happening in 0.110.0, but the stacks I have are from 0.109.0.
What config did you use?
Environment
Manually compiled with our own internal distribution of the collector.
Go 1.23.0, arm64, running on an ubuntu:jammy container.