
Batch processor can deadlock/freeze using Go 1.23 timers #11332

Closed
jamesmoessis opened this issue Oct 2, 2024 · 10 comments · Fixed by open-telemetry/opentelemetry-collector-releases#688
Labels: bug (Something isn't working)

@jamesmoessis

jamesmoessis commented Oct 2, 2024

Describe the bug

When compiled with Go 1.23, the batch processor can get into the following stuck state, where it waits to receive from a stopped timer's channel that will never send.

This is the goroutine that is forever waiting for the timer channel to send after the timer has been stopped:

goroutine 65 [chan receive, 19 minutes]:
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).stopTimer(0x40008c4900)
    go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:250 +0x4c
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).processItem(0x40008c4900, {0x45919a0?, 0x43956371c0?})
    go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:239 +0xcc
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).startLoop(0x40008c4900)
    go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:220 +0x138
created by go.opentelemetry.io/collector/processor/batchprocessor.(*shard).start in goroutine 1
    go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:184 +0x74

Given that our batch processor is synchronised on a single shard/goroutine, this causes the entire collector pipeline to deadlock.
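For context, the relevant structure looks roughly like this (a simplified sketch; the field and method names only approximate the real batch_processor.go). Every producer funnels items into the shard's channel, so once the shard goroutine blocks inside stopTimer, every producer blocks on that send:

package batchsketch // hypothetical package, for illustration only

import "time"

type shard struct {
	newItem chan any
	timer   *time.Timer
}

func (b *shard) consume(item any) {
	b.newItem <- item // producers pile up here once the loop below is stuck ("chan send")
}

func (b *shard) processItem(item any) {
	// append the item to the pending batch; may stop or reset b.timer,
	// which is where the Go 1.23 hang happens (<-b.timer.C after Stop)
}

func (b *shard) startLoop() {
	for {
		select {
		case item := <-b.newItem:
			b.processItem(item)
		case <-b.timer.C:
			// timeout reached: flush the current batch
		}
	}
}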

We end up with thousands of goroutines that are forever blocked on a chan send to the shard:

goroutine 169 [chan send, 19 minutes]:
go.opentelemetry.io/collector/processor/batchprocessor.(*singleShardBatcher).consume(0x45919a0?, {0x41810edb60?, 0x41cad43b00?}, {0x45919a0, 0x4569c487d0})
    go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:276 +0x30
go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).ConsumeTraces(0x4000e12580, {0x50165e8, 0x788df80}, {0x460a8569f0, 0x41d82d52a0})
    go.opentelemetry.io/collector/processor/[email protected]/batch_processor.go:346 +0x6c
<rest of stack omitted>

Temporary fix

We were able to work around it by disabling the new Go 1.23 timer implementation. Looking into the nature of this bug, and seeing that Go 1.23 changed the behaviour of timer channels, I grew suspicious that it was related to the new timers. Since setting GODEBUG="asynctimerchan=1", the issue has not come back in several days (it was happening several times a day before). That leads me to believe the setting fixed it, but I have not determined the exact cause.

I also noticed that the upstream collector is still compiled with Go 1.22, so this issue wouldn't surface there yet; it would only affect people compiling their own collectors. This ticket can therefore serve as a warning for the future, particularly as the bug is unlikely to be caught by any tests.

Steps to reproduce

This is very difficult to reproduce; we have hundreds of pods running the collector and it only happens to one of them every so often.

What did you expect to see?

Collector continuing to run normally.

What did you see instead?

The collector deadlocks: it accepts new requests but fails to process them. This results in a large amount of stack memory being allocated because many goroutines are created by the OTLP receiver, and the heap keeps growing until the process gets OOM-killed.

What version did you use?
OpenTelemetry Collector v0.109.0, compiled manually.
I believe this was also happening on v0.110.0, but the stacks I have are from v0.109.0.

What config did you use?

  batch:
    send_batch_size: 4096
    send_batch_max_size: 10_000

Environment
Manually compiled with our own internal distribution of the collector.
Go 1.23.0.
arm64
Running in an ubuntu:jammy container.

@jamesmoessis added the bug label on Oct 2, 2024
@shivanthzen

This is indeed due to the recent changes to the behaviour of time.Timer. The offending lines are here:

if b.hasTimer() && !b.timer.Stop() {
	<-b.timer.C
}

Specifically, with the recent update, timer.Stop() also clears timer.C, so the receive on the next line waits forever.

A quick fix would be:

if b.hasTimer() {
	b.timer.Stop()
}

But this fix would need to go in when the Go module for the batch processor is updated to 1.23 (or be guarded by a compile-time directive).
I do not see any other use of time.Timer in the entire codebase.
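For what it's worth, a stop-and-drain that is safe under both semantics is also possible; a minimal sketch (an illustration only, not the committed fix):

if b.hasTimer() && !b.timer.Stop() {
	select {
	case <-b.timer.C: // pre-1.23 semantics: discard the value already buffered in timer.C
	default: // Go 1.23 semantics: Stop cleared the channel, nothing will ever arrive, so don't block
	}
}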

@bogdandrutu
Member

This is a big problem for all our builds, because we use Go 1.23 for our Docker images. See open-telemetry/opentelemetry-collector-releases#638

codeboten added a commit to codeboten/opentelemetry-collector-releases that referenced this issue Oct 2, 2024
Proposing to downgrade the version of Go used to produce the binary, to give more time to implement a solution for open-telemetry/opentelemetry-collector#11332 without impacting end users of the binary.

Signed-off-by: Alex Boten <[email protected]>
@codeboten
Contributor

An option to reduce the impact on users is to downgrade the version of Go used for the artifacts produced by the releases repo: open-telemetry/opentelemetry-collector-releases#685

This doesn't solve the problem, but it reduces the impact while a proper fix is being worked on.

@bogdandrutu
Member

bogdandrutu commented Oct 2, 2024

An alternative: bump the batch processor module to Go 1.23?

@shivanthzen

shivanthzen commented Oct 2, 2024

Another dirty alternative using build constraints: provide different implementations of the timer handling with tags //go:build go1.23 and //go:build !go1.23.
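Roughly like this (an illustrative sketch only; the file names and stopTimer bodies are hypothetical):

// stop_timer_go123.go (hypothetical file)
//go:build go1.23

package batchprocessor

func (b *shard) stopTimer() {
	if b.hasTimer() {
		b.timer.Stop() // Go 1.23 semantics: Stop also clears timer.C, no drain needed
	}
}

// stop_timer_pre123.go (hypothetical file)
//go:build !go1.23

package batchprocessor

func (b *shard) stopTimer() {
	if b.hasTimer() && !b.timer.Stop() {
		<-b.timer.C // old semantics: drain the value buffered in timer.C
	}
}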

@shivanthzen

Apparently this bug was also reported against Go itself, since it breaks backwards compatibility: golang/go#69312
Since the fix was backported, updating the Go version should be sufficient to fix this.

@codeboten
Contributor

It appears this is fixed in Go 1.23.2. As @mx-psi found out, the artifacts for the next release will be published with 1.23.2: #11334 (comment)

@codeboten codeboten added this to the next release milestone Oct 3, 2024
@shivanthzen

@jamesmoessis Using the latest Go 1.23 patch release should resolve the issue for you.

@pellared
Member

pellared commented Oct 4, 2024

FYI a fix from OTel Go:

You can also build with GODEBUG=asynctimerchan=1
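For a custom distribution, one way to bake that default into the binary (rather than setting the environment variable at runtime) is a //go:debug directive in the main package. A sketch, assuming your own main.go:

// main.go of the internal distribution (sketch). The //go:debug directive
// must appear before the package clause of package main and sets the default
// GODEBUG for the built binary (supported since Go 1.21).

//go:debug asynctimerchan=1

package main

func main() {
	// ... start the collector service as usual ...
}

Go 1.23 also accepts an equivalent godebug asynctimerchan=1 line in go.mod, if editing main.go is not convenient.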

@jamesmoessis
Author

Thank you all for addressing the issue so quickly 🙇
