Operator enters CrashLoopBackOff after multiple errors #3471
Comments
Having the same issue.
@itayvolo That's a good observation, and I can confirm that deleting the existing ServiceMonitor lets the operator start again. (This suggests that the issue is that the pod isn't checking for the existence of the service monitor before trying to create it. However, it doesn't explain why the same operator with an identical configuration works on our other 2 clusters.)
The issues should be fixed by #3447. Can you check in your environments?
I am also looking forward to the new release that includes fix #3447 🙏🏼.
@iblancasa My env looks like the below, running the latest version:

```json
{
"level": "INFO",
"timestamp": "2024-11-19T08:41:23Z",
"message": "Starting the OpenTelemetry Operator",
"opentelemetry-operator": "0.113.0",
"opentelemetry-targetallocator": "ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.113.0",
"operator-opamp-bridge": "ghcr.io/open-telemetry/opentelemetry-operator/operator-opamp-bridge:0.113.0",
"auto-instrumentation-java": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.33.5",
"auto-instrumentation-nodejs": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.53.0",
"auto-instrumentation-python": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.48b0",
"auto-instrumentation-dotnet": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.2.0",
"auto-instrumentation-go": "ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.17.0-alpha",
"auto-instrumentation-apache-httpd": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4",
"auto-instrumentation-nginx": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4",
"feature-gates": "operator.collector.default.config,-operator.collector.targetallocatorcr,-operator.golang.flags,operator.observability.prometheus,-operator.sidecarcontainers.native,-operator.targetallocator.mtls",
"build-date": "2024-11-08T16:18:49Z",
"go-version": "go1.22.8",
"go-arch": "amd64",
"go-os": "linux",
"labels-filter": [],
"annotations-filter": [],
"enable-multi-instrumentation": true,
"enable-apache-httpd-instrumentation": true,
"enable-dotnet-instrumentation": true,
"enable-go-instrumentation": false,
"enable-python-instrumentation": true,
"enable-nginx-instrumentation": false,
"enable-nodejs-instrumentation": true,
"enable-java-instrumentation": true,
"create-openshift-dashboard": false,
"zap-message-key": "message",
"zap-level-key": "level",
"zap-time-key": "timestamp",
"zap-level-format": "uppercase"
}
```

Build date was `2024-11-08T16:18:49Z`.
Facing the same issue with v0.113.0.
I meant with a build from `main`.
I was wondering for 2-3 days why the operator was crashing. Finally someone created this issue; today I was hoping to create an issue for it myself. Any luck on when we will have a new release with the fix? Can otel do a patch release for it? The otel operator is randomly crashing in all my clusters with the same error log :(
The fix for that is to make sure the labels below exist on the operator Deployment, as it's fixed here:

```yaml
app.kubernetes.io/name: "opentelemetry-operator"
control-plane: "controller-manager"
```

edit:

edit2: The TL;DR version is,
Hi @anakineo. Thanks for your testing. I'll take a look to see how this can be improved.

Update: when the OpenTelemetry Operator is deployed using the manifests from this repository, these are the labels that are part of the operator Deployment:

```yaml
labels:
  app.kubernetes.io/name: opentelemetry-operator
  control-plane: controller-manager
```

So I think that what should be adapted is the Helm chart, to match this.

Can you provide more info? I'm unable to reproduce.
@iblancasa Yes, they are from this repository, and the deployment should follow the source. IMO it's better to use the recognized labels in this case.

I assume you mean the list error causing the operator to fail to start is not reproducible. Unfortunately I don't have the logs at hand, but I recall seeing "source failed to sync" errors which, to my understanding, seem to relate to the cache start. Also, as far as I can tell,

Thanks for your investigation.
Just wanted to add another impact story: we're also seeing this problem with 0.113.0. I had to revert the upgrade to maintain stability in production.
Sorry, I understood the operator was crashing for you even after the fix. A new release should be done soon with the new operator version.
[edit: We're still getting these errors after upgrading.]

Then after some indeterminate amount of time, we start getting the same cache sync-related errors shown in the OP, and the operator crashes.

What we've tried:

But the problem persists.

I note that after deleting the operator deployment, the automatically created ServiceMonitor was left lying around, which we deleted by hand before re-installing. I wonder if this has something to do with the problem.
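For reference, the automatically created self-monitoring ServiceMonitor would look roughly like the sketch below; the metadata, selector labels, and port name are illustrative assumptions, not copied from the operator's actual manifests.

```yaml
# Illustrative ServiceMonitor sketch; names, namespace, labels, and port are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opentelemetry-operator          # assumed name
  namespace: opentelemetry-operator     # assumed namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-operator   # the labels discussed above
      control-plane: controller-manager
  endpoints:
    - port: metrics                     # assumed metrics port name
```

If a stale object like this is left behind after the Deployment is deleted, the operator trying to re-create it on startup is exactly the step that the earlier comments suggest can fail.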
I think this issue should be re-opened. After some time, sometimes a few hours, sometimes a few days, the operator will crash, and then we get this error in the logs:
Component(s)
No response
What happened?
Description
We're running the latest version of the operator — `v0.113.0` via `v0.74.2` of the operator Helm chart — but have a number of issues that are causing the operator to crashloop. Until a few versions ago, everything was running fine, and in 2 other clusters, nearly identical installations of the operator and its collectors continue to run without issues. The operator and collectors are installed in each cluster via Argo CD.

We've resorted to deleting the operator (including its CRDs) and the collectors from the problematic cluster, then re-installing, as we were concerned about CRDs not having been upgraded correctly, but after re-installation, the problem persists.
At least one of the issues in the logs looks like it should be fixed by #3447, and I believe that fix is included with `v0.113.0`.

Steps to Reproduce
Install operator via Helm chart `v0.74.2`, with the following `values.yaml`:
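The reporter's actual `values.yaml` is not included above. Purely as an illustration, a minimal file for the chart might look like the sketch below; the keys are assumptions about the chart's schema, and the image values are taken from versions mentioned in this issue, not from the reporter's configuration.

```yaml
# Hypothetical minimal values.yaml sketch; keys are assumptions about the
# opentelemetry-operator chart's schema, not the reporter's real file.
manager:
  collectorImage:
    repository: otel/opentelemetry-collector-contrib   # collector image named in this issue
    tag: 0.113.0
  serviceMonitor:
    enabled: true        # assumed: enables the self-monitoring ServiceMonitor discussed above
admissionWebhooks:
  certManager:
    enabled: true        # assumed: cert-manager-issued webhook certificates
```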
Expected Result
Operator runs without issue.
Actual Result
Operator enters crashloop.
Kubernetes Version
1.31.2
Operator version
v0.113.0
Collector version
otel/opentelemetry-collector-contrib:0.113.0
Environment information
Environment
Log output
Additional context
No response