Operator enters CrashLoopBackOff after multiple errors #3471
Comments
Having the same issue.
@itayvolo That's a good observation, and I can confirm that deleting the existing ServiceMonitor lets the operator start again. (This suggests that the issue is that the pod isn't checking for the existence of the service monitor before trying to create it. However, it doesn't explain why the same operator with an identical configuration works on our other 2 clusters.)
The issues should be fixed by #3447. Can you check in your environments?
I am also looking forward to the new release that includes fix #3447 🙏🏼.
@iblancasa My env looks like the below, running the latest version:

```json
{
"level": "INFO",
"timestamp": "2024-11-19T08:41:23Z",
"message": "Starting the OpenTelemetry Operator",
"opentelemetry-operator": "0.113.0",
"opentelemetry-targetallocator": "ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.113.0",
"operator-opamp-bridge": "ghcr.io/open-telemetry/opentelemetry-operator/operator-opamp-bridge:0.113.0",
"auto-instrumentation-java": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.33.5",
"auto-instrumentation-nodejs": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.53.0",
"auto-instrumentation-python": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.48b0",
"auto-instrumentation-dotnet": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.2.0",
"auto-instrumentation-go": "ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.17.0-alpha",
"auto-instrumentation-apache-httpd": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4",
"auto-instrumentation-nginx": "ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4",
"feature-gates": "operator.collector.default.config,-operator.collector.targetallocatorcr,-operator.golang.flags,operator.observability.prometheus,-operator.sidecarcontainers.native,-operator.targetallocator.mtls",
"build-date": "2024-11-08T16:18:49Z",
"go-version": "go1.22.8",
"go-arch": "amd64",
"go-os": "linux",
"labels-filter": [],
"annotations-filter": [],
"enable-multi-instrumentation": true,
"enable-apache-httpd-instrumentation": true,
"enable-dotnet-instrumentation": true,
"enable-go-instrumentation": false,
"enable-python-instrumentation": true,
"enable-nginx-instrumentation": false,
"enable-nodejs-instrumentation": true,
"enable-java-instrumentation": true,
"create-openshift-dashboard": false,
"zap-message-key": "message",
"zap-level-key": "level",
"zap-time-key": "timestamp",
"zap-level-format": "uppercase"
}
```

Build date was `2024-11-08T16:18:49Z`.
Facing the same issue with v0.113.0.
I meant with a build from `main`.
I was wondering for 2-3 days why the operator was crashing. Finally someone created this issue; today I was hoping to create an issue for it myself. Any luck on when we will have a new release with the fix? Can otel do a patch release for it? The otel operator is randomly crashing in all my clusters with the same error log :(
The fix for that is to make sure the labels below exist on the operator Deployment, as it's fixed here:

```yaml
app.kubernetes.io/name: "opentelemetry-operator"
control-plane: "controller-manager"
```

edit:

edit2: The TL;DR version is,
Hi @anakineo. Thanks for your testing. I'll take a look to see how this can be improved.

Update: when the OpenTelemetry Operator is deployed using the manifests from this repository, these are the labels that are part of the operator Deployment:

```yaml
labels:
  app.kubernetes.io/name: opentelemetry-operator
  control-plane: controller-manager
```

So I think that what should be adapted is the Helm chart, to match this.

Can you provide more info? I'm unable to reproduce.
@iblancasa Yes, they are from this repository, and the deployment should follow the source. IMO it's better to use the recognized labels in this case.

I assume you mean the list error causing the operator to fail to start is not reproducible. Unfortunately I don't have the logs at hand, but I recall seeing "source failed to sync" errors which, to my understanding, seem to relate to the cache start. Also, as far as I can tell,

Thanks for your investigation.
Just wanted to add another impact story: we're also seeing this problem with 0.113.0. I had to revert the upgrade to maintain stability in production.
Sorry, I understood the operator was crashing for you even after the fix. A new release should be done soon with the new operator version.
[edit: We're still getting these errors after upgrading.]

Then after some indeterminate amount of time, we start getting the same cache sync-related errors shown in the OP, and the operator crashes.

What we've tried:

But the problem persists.

I note that after deleting the operator deployment, the automatically created ServiceMonitor was left lying around, which we deleted by hand before re-installing. I wonder if this has something to do with the problem.
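For reference, the automatically created self-monitoring ServiceMonitor would look roughly like the sketch below; the metadata, selector labels, and port name are illustrative assumptions, not copied from the operator's actual manifests.

```yaml
# Illustrative ServiceMonitor sketch; names, namespace, labels, and port are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: opentelemetry-operator          # assumed name
  namespace: opentelemetry-operator     # assumed namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: opentelemetry-operator   # the labels discussed above
      control-plane: controller-manager
  endpoints:
    - port: metrics                     # assumed metrics port name
```

If a stale object like this is left behind after the Deployment is deleted, the operator trying to re-create it on startup is exactly the step that the earlier comments suggest can fail.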
I think this issue should be re-opened. After some time, sometimes a few hours, sometimes a few days, the operator will crash, and then we get this error in the logs:
Component(s)
No response
What happened?
Description
We're running the latest version of the operator — `v0.113.0` via `v0.74.2` of the operator Helm chart — but have a number of issues that are causing the operator to crashloop. Until a few versions ago, everything was running fine, and in 2 other clusters, nearly identical installations of the operator and its collectors continue to run without issues. The operator and collectors are installed in each cluster via Argo CD.

We've resorted to deleting the operator (including its CRDs) and the collectors from the problematic cluster, then re-installing, as we were concerned about CRDs not having been upgraded correctly, but after re-installation, the problem persists.
At least one of the issues in the logs looks like it should be fixed by #3447, and I believe that fix is included with `v0.113.0`.

Steps to Reproduce
Install operator via Helm chart `v0.74.2`, with the following `values.yaml`:
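The reporter's actual `values.yaml` is not included above. Purely as an illustration, a minimal file for the chart might look like the sketch below; the keys are assumptions about the chart's schema, and the image values are taken from versions mentioned in this issue, not from the reporter's configuration.

```yaml
# Hypothetical minimal values.yaml sketch; keys are assumptions about the
# opentelemetry-operator chart's schema, not the reporter's real file.
manager:
  collectorImage:
    repository: otel/opentelemetry-collector-contrib   # collector image named in this issue
    tag: 0.113.0
  serviceMonitor:
    enabled: true        # assumed: enables the self-monitoring ServiceMonitor discussed above
admissionWebhooks:
  certManager:
    enabled: true        # assumed: cert-manager-issued webhook certificates
```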
Expected Result
Operator runs without issue.
Actual Result
Operator enters crashloop.
Kubernetes Version
1.31.2
Operator version
v0.113.0
Collector version
otel/opentelemetry-collector-contrib:0.113.0
Environment information
Environment
Log output
Additional context
No response