Operator enters CrashLoopBackOff after multiple errors #3471

Closed
dhess opened this issue Nov 18, 2024 · 18 comments
Labels: bug (Something isn't working) · duplicate (This issue or pull request already exists)

Comments


dhess commented Nov 18, 2024

Component(s)

No response

What happened?

Description

We're running the latest version of the operator (v0.113.0, via v0.74.2 of the operator Helm chart), but a number of errors are causing the operator to crashloop. Until a few versions ago, everything ran fine, and in two other clusters, nearly identical installations of the operator and its collectors continue to run without issues. The operator and collectors are installed in each cluster via Argo CD.

We've resorted to deleting the operator (including its CRDs) and the collectors from the problematic cluster, then re-installing, as we were concerned about CRDs not having been upgraded correctly, but after re-installation, the problem persists.

At least one of the issues in the logs looks like it should be fixed by #3447, and I believe that fix is included with v0.113.0.

Steps to Reproduce

Install the operator via Helm chart v0.74.2, with the following values.yaml:

manager:
  collectorImage:
    repository: "otel/opentelemetry-collector-contrib"

  serviceMonitor:
    enabled: true

  prometheusRule:
    enabled: true
    defaultRules:
      enabled: true

admissionWebhooks:
  certManager:
    issuerRef:
      group: cert-manager.io
      kind: ClusterIssuer
      name: internal-legacy-issuer

Expected Result

Operator runs without issue.

Actual Result

Operator enters crashloop.

Kubernetes Version

1.31.2

Operator version

v0.113.0

Collector version

otel/opentelemetry-collector-contrib:0.113.0

Environment information

Environment

  • Talos v1.8.3
  • Argo CD v2.12.3+6b9cd82

Log output

manager {"level":"INFO","timestamp":"2024-11-18T13:48:48Z","message":"Starting the OpenTelemetry Operator","opentelemetry-operator":"0.113.0","opentelemetry-collector":"otel/opentelemetry-collector-contrib:0.113.0","opentelemetry-targetallocator":"ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.113.0","operator-opamp-bridge":"ghcr.io/open-telemetry/opentelemetry-operator/operator-opamp-bridge:0.113.0","auto-instrumentation-java":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.33.5","auto-instrumentation-nodejs":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.53.0","auto-instrumentation-python":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.48b0","auto-instrumentation-dotnet":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.2.0","auto-instrumentation-go":"ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.17.0-alpha","auto-instrumentation-apache-httpd":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4","auto-instrumentation-nginx":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4","feature-gates":"operator.collector.default.config,-operator.collector.targetallocatorcr,-operator.golang.flags,operator.observability.prometheus,-operator.sidecarcontainers.native,-operator.targetallocator.mtls","build-date":"2024-11-08T16:18:49Z","go-version":"go1.22.8","go-arch":"amd64","go-os":"linux","labels-filter":],"annotations-filter":],"enable-multi-instrumentation":true,"enable-apache-httpd-instrumentation":true,"enable-dotnet-instrumentation":true,"enable-go-instrumentation":false,"enable-python-instrumentation":true,"enable-nginx-instrumentation":false,"enable-nodejs-instrumentation":true,"enable-java-instrumentation":true,"create-openshift-dashboard":false,"zap-message-key":"message","zap-level-key":"level","zap-time-key":"timestamp","zap-level-format":"uppercase"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:48Z","logger":"setup","message":"the env var WATCH_NAMESPACE isn't set, watching all namespaces"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"setup","message":"Prometheus CRDs are installed, adding to scheme."}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"setup","message":"Openshift CRDs are not installed, skipping adding to scheme."}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"setup","message":"Cert-Manager is not available to the operator, skipping adding to scheme."}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector","path":"/mutate-opentelemetry-io-v1beta1-opentelemetrycollector"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1beta1-opentelemetrycollector"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector","path":"/validate-opentelemetry-io-v1beta1-opentelemetrycollector"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1beta1-opentelemetrycollector"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/convert"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.builder","message":"Conversion webhook enabled","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:53Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-v1-pod"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpAMPBridge","path":"/mutate-opentelemetry-io-v1alpha1-opampbridge"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-opampbridge"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpAMPBridge","path":"/validate-opentelemetry-io-v1alpha1-opampbridge"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-opampbridge"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"setup","message":"starting manager"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.metrics","message":"Starting metrics server"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"starting server","name":"health probe","addr":":8081"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.metrics","message":"Serving metrics server","bindAddress":"0.0.0.0:8080","secure":false}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.webhook","message":"Starting webhook server"}
manager I1118 13:48:54.147049       1 leaderelection.go:254] attempting to acquire leader lease monitoring/9f7554c3.opentelemetry.io...
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.webhook","message":"Serving webhook server","host":"","port":9443}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"controller-runtime.certwatcher","message":"Starting certificate watcher"}
manager I1118 13:48:54.645848       1 leaderelection.go:268] successfully acquired lease monitoring/9f7554c3.opentelemetry.io
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"collector-upgrade","message":"looking for managed instances to upgrade"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1beta1.OpenTelemetryCollector"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ConfigMap"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceAccount"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Service"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Deployment"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.DaemonSet"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.StatefulSet"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.Ingress"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v2.HorizontalPodAutoscaler"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodDisruptionBudget"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.ServiceMonitor"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","source":"kind source: *v1.PodMonitor"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting Controller","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","logger":"instrumentation-upgrade","message":"looking for managed Instrumentation instances to upgrade"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1alpha1.OpAMPBridge"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.ConfigMap"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.ServiceAccount"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.Service"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting EventSource","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","source":"kind source: *v1.Deployment"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:54Z","message":"Starting Controller","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:57Z","message":"Stopping and waiting for non leader election runnables"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:57Z","message":"Stopping and waiting for leader election runnables"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.ConfigMap Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","message":"error received after stop sequence was engaged","error":"failed to list: Timeout: failed waiting for *v1beta1.OpenTelemetryCollector Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:512"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.Service Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","message":"Could not wait for Cache to sync","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","error":"failed to wait for opentelemetrycollector caches to sync: failed to get informer from cache: Timeout: failed waiting for *v1.ConfigMap Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:200\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:205\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","message":"error received after stop sequence was engaged","error":"failed to wait for opentelemetrycollector caches to sync: failed to get informer from cache: Timeout: failed waiting for *v1.ConfigMap Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:512"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.Deployment Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.ServiceAccount Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:57Z","message":"Starting workers","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge","worker count":1}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:57Z","message":"Shutdown signal received, waiting for all workers to finish","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge"}
manager {"level":"INFO","timestamp":"2024-11-18T13:48:57Z","message":"All workers finished","controller":"opampbridge","controllerGroup":"opentelemetry.io","controllerKind":"OpAMPBridge"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1beta1.OpenTelemetryCollector Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.PodMonitor Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.DaemonSet Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:48:57Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.StatefulSet Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:00Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.Ingress Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:06Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v2.HorizontalPodAutoscaler Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:07Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.PodDisruptionBudget Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:07Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.ServiceMonitor Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"INFO","timestamp":"2024-11-18T13:49:09Z","message":"Stopping and waiting for caches"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:09Z","message":"error received after stop sequence was engaged","error":"failed to list: Timeout: failed waiting for *v1alpha1.Instrumentation Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:512"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:09Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1alpha1.OpAMPBridge Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager W1118 13:49:09.146384       1 reflector.go:484] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: watch of *v1.PodMonitor ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
manager {"level":"INFO","timestamp":"2024-11-18T13:49:09Z","message":"Stopping and waiting for webhooks"}
manager {"level":"INFO","timestamp":"2024-11-18T13:49:09Z","logger":"controller-runtime.webhook","message":"Shutting down webhook server with timeout of 1 minute"}
manager {"level":"INFO","timestamp":"2024-11-18T13:49:09Z","message":"Stopping and waiting for HTTP servers"}
manager {"level":"INFO","timestamp":"2024-11-18T13:49:09Z","logger":"controller-runtime.metrics","message":"Shutting down metrics server with timeout of 1 minute"}
manager {"level":"INFO","timestamp":"2024-11-18T13:49:09Z","message":"shutting down server","name":"health probe","addr":":8081"}
manager {"level":"INFO","timestamp":"2024-11-18T13:49:09Z","message":"Wait completed, proceeding to shutdown the manager"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:09Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.Deployment Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:09Z","logger":"controller-runtime.source.EventHandler","message":"failed to get informer from cache","error":"Timeout: failed waiting for *v1.ConfigMap Informer to sync","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1.1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:76\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:53\nk8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:54\nk8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33\nsigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind[...]).Start.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:64"}
manager {"level":"ERROR","timestamp":"2024-11-18T13:49:09Z","logger":"setup","message":"problem running manager","error":"error creating service monitor: servicemonitors.monitoring.coreos.com \"opentelemetry-operator-metrics-monitor\" already exists","stacktrace":"main.main\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/main.go:517\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.8/x64/src/runtime/proc.go:271"}
Stream closed EOF for monitoring/opentelemetry-operator-65f47c6cc4-g8wbj (manager)

Additional context

No response

dhess added the bug (Something isn't working) and needs triage labels Nov 18, 2024
@itayvolo

Having the same issue.
It looks like the operator creates a ServiceMonitor (opentelemetry-operator-metrics-monitor), and when a new pod starts up, the ServiceMonitor creation error appears.
If I delete the ServiceMonitor and let the operator recreate it, it works (until the next pod interruption).

dhess (Author) commented Nov 18, 2024

@itayvolo That's a good observation, and I can confirm that deleting the opentelemetry-operator-metrics-monitor and then restarting the operator does work around the problem until the pod is restarted for any reason.

(This suggests that the operator isn't checking for the existence of the ServiceMonitor before trying to create it. However, that doesn't explain why the same operator with an identical configuration works on our other two clusters.)
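
For illustration only (this is not the operator's actual code): a minimal controller-runtime sketch of a create-or-tolerate pattern for the ServiceMonitor, which would avoid treating "already exists" as fatal. The ensureServiceMonitor name and the update-on-conflict behaviour are assumptions, not the upstream implementation.

// Hypothetical sketch, not the operator's code: create the ServiceMonitor,
// and tolerate one that already exists instead of crashing.
package sketch

import (
	"context"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureServiceMonitor is an assumed helper name; the real operator code differs.
func ensureServiceMonitor(ctx context.Context, c client.Client, sm *monitoringv1.ServiceMonitor) error {
	err := c.Create(ctx, sm)
	if apierrors.IsAlreadyExists(err) {
		// A previous pod (or another replica) already created it: fetch the
		// live object and update it in place rather than exiting.
		existing := &monitoringv1.ServiceMonitor{}
		if getErr := c.Get(ctx, client.ObjectKeyFromObject(sm), existing); getErr != nil {
			return getErr
		}
		sm.ResourceVersion = existing.ResourceVersion
		return c.Update(ctx, sm)
	}
	return err
}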

@anakineo

Deleting the ServiceMonitor works for me as well, but it appears release v0.113.0 didn't include the fix from #3447.

iblancasa (Contributor) commented Nov 19, 2024

The issues should be fixed by #3447. Can you check in your environments?

Rohlik commented Nov 19, 2024

I am also looking forward to a new release that includes fix #3447 🙏🏼.

@anakineo

@iblancasa My env looks like the following, running the latest v0.113.0:

{
    "level": "INFO",
    "timestamp": "2024-11-19T08:41:23Z",
    "message": "Starting the OpenTelemetry Operator",
    "opentelemetry-operator": "0.113.0",
    "opentelemetry-targetallocator": "ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.113.0",
    "operator-opamp-bridge": "ghcr.io/open-telemetry/opentelemetry-operator/operator-opamp-bridge:0.113.0",
    "auto-instrumentation-java": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.33.5",
    "auto-instrumentation-nodejs": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.53.0",
    "auto-instrumentation-python": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.48b0",
    "auto-instrumentation-dotnet": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.2.0",
    "auto-instrumentation-go": "ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.17.0-alpha",
    "auto-instrumentation-apache-httpd": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4",
    "auto-instrumentation-nginx": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4",
    "feature-gates": "operator.collector.default.config,-operator.collector.targetallocatorcr,-operator.golang.flags,operator.observability.prometheus,-operator.sidecarcontainers.native,-operator.targetallocator.mtls",
    "build-date": "2024-11-08T16:18:49Z",
    "go-version": "go1.22.8",
    "go-arch": "amd64",
    "go-os": "linux",
    "labels-filter": [],
    "annotations-filter": [],
    "enable-multi-instrumentation": true,
    "enable-apache-httpd-instrumentation": true,
    "enable-dotnet-instrumentation": true,
    "enable-go-instrumentation": false,
    "enable-python-instrumentation": true,
    "enable-nginx-instrumentation": false,
    "enable-nodejs-instrumentation": true,
    "enable-java-instrumentation": true,
    "create-openshift-dashboard": false,
    "zap-message-key": "message",
    "zap-level-key": "level",
    "zap-time-key": "timestamp",
    "zap-level-format": "uppercase"
}

The build date was 2024-11-08T16:18:49Z, but #3447 was merged after that. Judging by the release notes, I can see v0.113.0 was cut at 99b6c6f, before the fix as well. Am I missing something here?

@bhvishal9

Facing the same issue with v0.113.0

@iblancasa (Contributor)

[quoting anakineo's comment above: the v0.113.0 startup log and the observation that the build date predates the merge of #3447]

I meant with a build from main.

iblancasa added the duplicate (This issue or pull request already exists) label and removed the needs triage label Nov 19, 2024
dhess (Author) commented Nov 19, 2024

The issues should be fixed by #3447. Can you check in your environments?

As shown in the bug report, we are running v0.113.0, which is the latest release at the time of writing. (edit: ahh, it appears #3447 isn't included in v0.113.0. Can we get a new release?)

@116davinder

I had been wondering for 2-3 days why the operator was crashing; finally someone created this issue. I was about to open one for it myself today.

Any idea when we will have a new release with the fix? Can OTel do a patch release for it? The OTel operator is randomly crashing in all my clusters with the same error log :(

anakineo commented Nov 19, 2024

[quoting the exchange above: anakineo's v0.113.0 startup log and build-date observation, and iblancasa's reply "I meant with a build from main."]

ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:main@sha256:145216a3debc585a182e16915dd128364737a3e758f14de0826080394afe795d has the fix. But we ran into another issue: the operator wasn't able to list the Deployment: no deployments found with the specified label.

The fix for that is to make sure the labels below exist on the operator Deployment, as fixed here

app.kubernetes.io/name: "opentelemetry-operator"
control-plane: "controller-manager"

edit:
@iblancasa Would it make sense to use app.kubernetes.io/component: controller-manager instead of control-plane: "controller-manager" for the list? IMO the two labels should ideally be included in the Helm chart by default, since if they don't exist the operator fails to start.
app.kubernetes.io/component: controller-manager is already fixed in the Helm chart (app.kubernetes.io/name: "opentelemetry-operator" as well), and app.kubernetes.io/component is one of the well-known label keys.

edit2:
I did some further testing: the claim that the operator fails to start if the labels don't exist is NOT true. I set up a local env, and the operator does start despite the no deployments found with the specified label error. We did observe the operator failing to start following the no deployments found with the specified label error, which I guess is a separate thing.

The TL;DR version is: ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:main@sha256:145216a3debc585a182e16915dd128364737a3e758f14de0826080394afe795d has the fix, and if there are errors, make sure the label control-plane: "controller-manager" is added to the values file.
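
For context, a rough sketch (assumptions, not the operator's actual code) of the kind of label-based lookup that the no deployments found with the specified label error implies: list Deployments in the operator's namespace that carry the two labels above.

// Rough sketch of a label-based lookup for the operator's own Deployment.
package sketch

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// findOperatorDeployment lists Deployments in the given namespace that carry
// the expected labels and returns the first match.
func findOperatorDeployment(ctx context.Context, c client.Client, namespace string) (*appsv1.Deployment, error) {
	var list appsv1.DeploymentList
	err := c.List(ctx, &list,
		client.InNamespace(namespace),
		client.MatchingLabels{
			"app.kubernetes.io/name": "opentelemetry-operator",
			"control-plane":          "controller-manager",
		},
	)
	if err != nil {
		return nil, err
	}
	if len(list.Items) == 0 {
		// This is the condition behind "no deployments found with the specified label".
		return nil, fmt.Errorf("no deployments found with the specified label")
	}
	return &list.Items[0], nil
}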

iblancasa (Contributor) commented Nov 20, 2024

Hi @anakineo. Thanks for your testing. I'll take a look to see how this can be improved.
Thanks again for your feedback and help!

Update:

@iblancasa Would it make sense to use app.kubernetes.io/component: controller-manager instead of control-plane: "controller-manager" for the list.

When the OpenTelemetry Operator is deployed using the manifests from this repository, these are the labels that are part of the operator deployment.

  labels:
    app.kubernetes.io/name: opentelemetry-operator
    control-plane: controller-manager

So I think it's the Helm chart that should be adapted to match this.

I set up a local env, and the operator does start despite the no deployments found with the specified label error. We did observe the operator failing to start following the no deployments found with the specified label error, which I guess is a separate thing.

Can you provide more info? I'm unable to reproduce.

@anakineo

When the OpenTelemetry Operator is deployed using the manifests from this repository, these are the labels that are part of the operator deployment.

@iblancasa Yes, they are from this repository, and the deployment should follow the source. IMO it's better to use well-known labels in this case.

Can you provide more info? I'm unable to reproduce.

I assume you mean that the list error causing the operator to fail to start is not reproducible. Unfortunately I don't have the logs at hand, but I recall seeing source failed to sync errors which, to my understanding, seem related to cache start. Also, as far as I can tell, the OperatorMetrics runnable simply logs the error and always returns, so I believe my observation that the list Deployment error caused the operator to fail to start is not relevant to this issue.
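
To illustrate that last point, a minimal sketch of a controller-runtime Runnable whose Start logs a failure and returns nil, so the manager does not treat it as fatal (createServiceMonitor is a hypothetical placeholder, not the operator's real function):

// Minimal sketch of a "log and return" manager Runnable.
package sketch

import (
	"context"

	"github.com/go-logr/logr"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

type operatorMetrics struct {
	log logr.Logger
}

// Compile-time check that operatorMetrics satisfies manager.Runnable.
var _ manager.Runnable = operatorMetrics{}

func (o operatorMetrics) Start(ctx context.Context) error {
	if err := createServiceMonitor(ctx); err != nil {
		// Log and swallow the error: returning it would make the manager treat
		// this runnable as fatal and begin stopping the other runnables.
		o.log.Error(err, "error creating Service Monitor for operator metrics")
	}
	return nil
}

func createServiceMonitor(ctx context.Context) error { return nil } // placeholder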

Thanks for your investigation.

@fdanielson

Just wanted to add another impact story: we're also seeing this problem with v0.113.0. I had to revert the upgrade to maintain stability in production.

@iblancasa (Contributor)

Can you provide more info? I'm unable to reproduce.

I assume you mean that the list error causing the operator to fail to start is not reproducible. Unfortunately I don't have the logs at hand, but I recall seeing source failed to sync errors which, to my understanding, seem related to cache start. Also, as far as I can tell, the OperatorMetrics runnable simply logs the error and always returns, so I believe my observation that the list Deployment error caused the operator to fail to start is not relevant to this issue.

Thanks for your investigation.

Sorry. I understood that the operator was crashing for you after the list Deployment error even with the fix. So it seems we are OK now.

A new release with the new operator version should be out soon.
Closing this issue, since it was already fixed by #3447.

cterence added a commit to cterence/homelab-gitops that referenced this issue Nov 23, 2024
dhess (Author) commented Dec 5, 2024

Can we please get a release with this fix? edit: never mind, I see it's been included in 0.114.0, thanks!

dhess (Author) commented Dec 10, 2024

[edit: This seems to have resolved itself finally. We're no longer getting any errors when the operator restarts. Nope, it's back.]

We're still getting these errors after upgrading to 0.114.1. It starts with the following:

opentelemetry-operator-75c9bff9b5-5gpk4 manager {"level":"ERROR","timestamp":"2024-12-10T06:49:10Z","logger":"operator-metrics-sm","message":"error creating Service Monitor for operator metrics","error":"error getting owner references: no deployments found with the specified label","stacktrace":"github.com/open-telemetry/opentelemetry-operator/internal/operator-metrics.OperatorMetrics.Start\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/internal/operator-metrics/metrics.go:75\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}

Then after some indeterminate amount of time, we start getting the same cache sync-related errors shown in the OP, and the operator crashes.

What we've tried:

  • Upgraded to 0.75.0 of the Helm chart, which installs v0.114.1 of the operator.
  • Deleted the operator deployment, collector deployment, and CRDs, then re-installed.

But the problem persists. The values.yaml we're using is as shown in the OP.

I note that after deleting the operator deployment, the automatically-created service monitor was left lying around, which we deleted by hand before re-installing. I wonder if this has something to do with the problem.
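
On that hunch, a small sketch (an assumption about how it could be wired, not the operator's current code) of attaching an owner reference so the ServiceMonitor would be garbage-collected along with the operator Deployment instead of being left behind:

// Sketch only: give the auto-created ServiceMonitor an owner reference to the
// operator Deployment so Kubernetes garbage-collects it when the Deployment is
// deleted, rather than leaving it behind.
package sketch

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ownServiceMonitor marks deploy as the controller owner of sm; it returns an
// error if sm already has a different controller owner.
func ownServiceMonitor(sm *monitoringv1.ServiceMonitor, deploy *appsv1.Deployment, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(deploy, sm, scheme)
}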

dhess (Author) commented Dec 26, 2024

I think this issue should be re-opened. After some time, sometimes a few hours, sometimes a few days, the operator will crash, and then we get this error in the logs:

manager {"level":"ERROR","timestamp":"2024-12-26T14:23:55Z","logger":"operator-metrics-sm","message":"error creating Service Monitor for operator metrics","error":"error getting owner references: no deployments found with the specified label","stacktrace":"github.com/open-telemetry/opentelemetry-operator/internal/operator-metrics.OperatorMetrics.Start\n\t/home/runner/work/opentelemetry-operator/opentelemetry-operator/internal/operator-metrics/metrics.go:75\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:226"}          
