Flaky test: TestTracingGoldenData/otlp-opencensus timed out after 10m #33865
Comments
Pinging code owners for testbed: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Looking into this a bit. Successful runs of this test take about 3 seconds (example). From the panic stack trace, this one in particular looks interesting:
It looks like the test itself succeeds, but shutting down the in-process collector blocks on this Wait() call. Since the test is flaky, the WaitGroup usage likely has a bug that is only hit intermittently. Looking further, we see the collector is still running, which is blocking
We can then find the other goroutine that held the lock, which is the opencensus receiver. The receiver reports status during shutdown if the server closes with an unexpected error.
Reporting status during shutdown is currently broken, as filed in open-telemetry/opentelemetry-collector#9824; this is an occurrence of that bug. |
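The blocking pattern described in this comment can be reproduced in isolation. The following is a toy model, not the collector's code: `toyService`, `serve`, `stop`, and `simulateShutdown` are hypothetical names, and the unbuffered status channel is an assumption standing in for whatever the real status-reporting path blocks on during shutdown.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// toyService is a hypothetical stand-in for a service that owns a
// receiver goroutine and a status channel.
type toyService struct {
	status chan string // unbuffered: a send blocks until someone receives
	wg     sync.WaitGroup
}

// serve simulates a receiver goroutine whose server has just been closed.
// If reportOnClose is true, it tries to report status. Nothing reads from
// the status channel during shutdown in this model, so the send never
// completes and wg.Done is never reached.
func (s *toyService) serve(reportOnClose bool) {
	defer s.wg.Done()
	if reportOnClose {
		s.status <- "server closed with unexpected error"
	}
}

// stop waits for the receiver goroutine, giving up after timeout.
// It returns true only if the goroutine finished in time.
func (s *toyService) stop(timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		s.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// simulateShutdown runs one start/stop cycle and reports whether
// shutdown completed within 100ms.
func simulateShutdown(reportOnClose bool) bool {
	s := &toyService{status: make(chan string)}
	s.wg.Add(1)
	go s.serve(reportOnClose)
	return s.stop(100 * time.Millisecond)
}

func main() {
	fmt.Println("no status report, clean stop:", simulateShutdown(false))
	fmt.Println("status report during shutdown, clean stop:", simulateShutdown(true))
}
```

In this model the hang only appears when the receiver takes the status-reporting path, which would be consistent with a failure that shows up intermittently rather than on every run.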
Pinging code owners for receiver/opencensus: @open-telemetry/collector-approvers. See Adding Labels via Comments if you do not have permissions to add labels yourself. |
This is the root cause blocking call:
The next step is to investigate if we're simply missing an expected status that's returned when shutting down the receiver, or if there is actually a bug in shutting it down. The referenced core issue is why this test is hitting timeouts, but during a successful test run the status shouldn't need to be reported, from what I can tell. So there may be an actual bug in the opencensus receiver that results in it not shutting down properly, or not properly handling error messages that should be expected. |
Agree with this, it might be missing the check for
I will open a PR and do a few runs until we confirm it's resolved. |
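The general shape of such a check can be sketched as follows. The specific error that the comment above refers to is not shown in this thread, so this example uses `http.ErrServerClosed` from the Go standard library purely as an assumed stand-in; `isExpectedShutdownErr`, `shouldReportStatus`, and the two sample errors are hypothetical names, not the receiver's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sample errors for the demonstration (hypothetical).
var (
	errWrappedClosed = fmt.Errorf("serve: %w", http.ErrServerClosed)
	errListener      = errors.New("listener failed")
)

// isExpectedShutdownErr reports whether err is a normal result of
// closing a server. The set of sentinels checked here is an assumption;
// a real receiver would list the errors its own servers return.
func isExpectedShutdownErr(err error) bool {
	return err == nil || errors.Is(err, http.ErrServerClosed)
}

// shouldReportStatus returns true only for errors that are not part of
// a normal shutdown, so a clean stop never touches the status path.
func shouldReportStatus(err error) bool {
	return !isExpectedShutdownErr(err)
}

func main() {
	fmt.Println("report wrapped ErrServerClosed:", shouldReportStatus(errWrappedClosed))
	fmt.Println("report listener failure:", shouldReportStatus(errListener))
}
```

Using `errors.Is` rather than `==` keeps the check working when the sentinel is wrapped with `%w` somewhere along the call chain.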
…tiplexer status (#34093) Fixes #33865 --------- Signed-off-by: odubajDT <[email protected]>
Still hitting this even after #34093 was merged: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/10010459335/job/27671703924#step:7:1503 |
…thub.com/jonboulle/clockwork` (#34221) **Description:** - replaces `github.com/tilinna/clock` with `github.com/jonboulle/clockwork` in `chrony receiver` **Link to tracking Issue:** #34190 **Connected PRs:** - #34224 - #34226 **Notes:** The failing check is not caused by changes in this PR, it's a flaky test #33865 --------- Signed-off-by: odubajDT <[email protected]>
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments. |
Component(s)
testbed
Describe the issue you're reporting
Seeing a lot of occurrences of this test timeout in CI:
E.g. https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9761790955/job/26943836035?pr=33856#step:7:904
Note this is different from #27295: that issue is caused by the same port being bound twice, while this one seems to be due to
testbed.(*inProcessCollector).Stop
taking too long (deadlock?)
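One way to confirm a suspected deadlock in a stop call without waiting for the 10m test timeout is to wrap it in a short deadline and dump all goroutine stacks when the deadline passes. This is a minimal sketch, not testbed code: `stopWithDeadline` is a hypothetical helper, and the `stop` argument stands in for whatever shutdown function is under suspicion.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"time"
)

// stopWithDeadline runs stop in its own goroutine and waits up to
// timeoutMs milliseconds. It returns true and an empty dump if stop
// finished in time; otherwise it returns false and a dump of all
// goroutine stacks taken at the moment the deadline expired.
func stopWithDeadline(stop func(), timeoutMs int) (bool, string) {
	done := make(chan struct{})
	go func() {
		stop()
		close(done)
	}()
	select {
	case <-done:
		return true, ""
	case <-time.After(time.Duration(timeoutMs) * time.Millisecond):
		var buf bytes.Buffer
		// debug=2 prints stacks in the same form as an unrecovered panic.
		_ = pprof.Lookup("goroutine").WriteTo(&buf, 2)
		return false, buf.String()
	}
}

func main() {
	ok, _ := stopWithDeadline(func() {}, 50)
	fmt.Println("fast stop finished:", ok)

	// select {} blocks forever, imitating a stop call that never returns.
	ok, dump := stopWithDeadline(func() { select {} }, 50)
	fmt.Println("blocked stop finished:", ok, "got stack dump:", len(dump) > 0)
}
```

The dump shows which goroutines are parked on a `Wait`, a channel send, or a lock, which is the same information the test timeout panic prints, only available after milliseconds instead of minutes.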