OpenTelemetry collector failed to boot up when passing in match group references (${1}, ${2}, ...) to Prometheus receiver #35733

Open
chc5 opened this issue Oct 10, 2024 · 6 comments
Assignees
Labels: bug, receiver/prometheus, waiting for author

Comments

chc5 commented Oct 10, 2024

Component(s)

receiver/prometheus

What happened?

Description

Starting with v0.105.0, the OpenTelemetry Collector no longer works with my configuration, which relies on appending the port number to the address in order to scrape metrics from other Kubernetes pods with the Prometheus receiver. The same configuration worked on v0.104.0 and below. I noticed changes such as confmap.strictlyTypedInput and confmap.unifyEnvVarExpansion that may have made my configuration incompatible, and after further research I could not find an alternative way to address this.

Steps to Reproduce

Create a Prometheus receiver that uses relabel_configs with match group references in replacement (Prometheus substitutes each reference with the value of the matching group): https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

For example, in the OpenTelemetry Collector I set replacement to $$1:$$2 so that the $ signs are escaped from environment variable resolution:

- action: replace
  regex: ([^:]+)(?::\d+)?;(\d+)
  replacement: $$1:$$2
  source_labels:
  - __address__
  - __meta_kubernetes_pod_annotation_prometheus_io_port
  target_label: __address__
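
For readers unfamiliar with relabeling, here is a minimal Python sketch of what this replace rule is meant to do once the receiver sees the unescaped replacement `$1:$2`. It is an approximation under stated assumptions, not Prometheus's implementation: the helper name `relabel_replace` and the sample label values are hypothetical, and edge cases such as dropping a label whose new value is empty are not modeled.

```python
import re

# Matches Go-style group references such as $1 or ${1}.
_GROUP_REF = re.compile(r"\$(?:\{(\d+)\}|(\d+))")


def relabel_replace(labels, source_labels, regex, replacement,
                    target_label, separator=";"):
    """Approximate a Prometheus `replace` relabel action on a label dict."""
    # Source label values are joined with the separator before matching.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes must match the whole joined string.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels stay unchanged

    def group_value(ref):
        index = int(ref.group(1) or ref.group(2))
        if index > match.re.groups:
            return ""  # unknown group: expand to an empty string
        return match.group(index) or ""

    result = dict(labels)
    result[target_label] = _GROUP_REF.sub(group_value, replacement)
    return result


# Hypothetical sample labels for a pod annotated with prometheus.io/port=9102.
labels = {
    "__address__": "10.0.0.5:8080",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9102",
}
rewritten = relabel_replace(
    labels,
    ["__address__", "__meta_kubernetes_pod_annotation_prometheus_io_port"],
    r"([^:]+)(?::\d+)?;(\d+)",
    "$1:$2",  # what remains of $$1:$$2 after the $$ escape is resolved
    "__address__",
)
print(rewritten["__address__"])  # 10.0.0.5:9102
```

The first group captures the host, the optional non-capturing group discards any existing port, and the second group captures the annotated port, so the scrape address becomes host:annotated-port.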

Expected Result

The OpenTelemetry Collector should continue to support $$1:$$2, or provide an alternative that allows named variables to be passed in, such as $${__address__}:$${__meta_kubernetes_pod_annotation_prometheus_io_port}.

Actual Result

With $$1:$$2, the OpenTelemetry Collector fails to start with the following error:

Error: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/10/10 19:14:49 Failed to run the service: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
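
To make the error easier to reason about, here is a minimal Python sketch of the naming rule quoted in the message. It assumes a single expansion pass in which `$$` becomes a literal `$` and `${NAME}` or `${env:NAME}` is looked up in the environment. The function `expand` is hypothetical and is not the Collector's confmap resolver; only the name regex is taken from the error text.

```python
import re

# Name rule quoted verbatim in the error message.
ENV_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

# Either an escaped dollar ($$) or a ${NAME} / ${env:NAME} reference.
_TOKEN = re.compile(r"\$\$|\$\{(?:env:)?([^}]*)\}")


def expand(value, env):
    """Hypothetical single-pass expander used only for illustration."""
    def substitute(token):
        if token.group(0) == "$$":
            return "$"  # escape: emit a literal dollar sign
        name = token.group(1)
        if not ENV_NAME.match(name):
            raise ValueError(
                f'environment variable "{name}" has invalid name')
        return env.get(name, "")

    return _TOKEN.sub(substitute, value)


print(expand("$$1:$$2", {}))  # $1:$2
print(expand("spec.nodeName=${NODE_NAME}", {"NODE_NAME": "node-a"}))
```

Under these assumptions, `$$1:$$2` reaches the receiver as `$1:$2` without error, while a value that arrives as `${1}:${2}` is rejected because `1` and `2` are not valid variable names. That suggests checking whether something upstream rewrites or pre-expands the replacement before the Collector reads it.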

Collector version

v0.104.0 works, but any version higher than 0.104.0 produces this bug.

Environment information

Environment

OS:
Compiler(if manually compiled): golang:1.22

OpenTelemetry Collector configuration

exporters:
  googlecloud:
    metric:
      endpoint: monitoring.googleapis.com:443
      instrumentation_library_labels: "false"
      prefix: custom.googleapis.com
      service_resource_labels: "false"
      skip_create_descriptor: "true"
    project: test-tenant-project-id
processors:
  batch:
    send_batch_size: 500
    timeout: 10s
  filter/apps:
    metrics:
      include:
        match_type: regexp
        metric_names:
        - server_nio
  memory_limiter/prevent_oom:
    check_interval: 30s
    limit_percentage: 80
    spike_limit_percentage: 30
  metricstransform/apps:
    transforms:
    - action: update
      include: server_nio
      new_name: custom.googleapis.com/server/nio
      operations:
      - action: aggregate_labels
        aggregation_type: sum
        label_set:
        - state
      - action: toggle_scalar_data_type
  resource/container:
    attributes:
    - action: delete
      pattern: net.*
    - action: delete
      pattern: service.*
    - action: delete
      key: http.scheme
    - action: delete
      key: method
    - action: upsert
      key: cloud.region
      value: us-west1
    - action: upsert
      key: k8s.cluster.name
      value: test-cluster-name
receivers:
  prometheus/apps:
    config:
      scrape_configs:
      - job_name: prometheus-scraper
        kubernetes_sd_configs:
        - namespaces:
            names:
            - test-ns
          role: pod
          selectors:
          - field: spec.nodeName=${NODE_NAME},metadata.name!=${POD_NAME}
            label: foo.com/platform=gke
            role: pod
        metric_relabel_configs:
        - action: keep
          regex: server_nio
          source_labels:
          - __name__
        relabel_configs:
        - action: keep
          regex: true
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        - action: drop
          regex: true
          source_labels:
          - __meta_kubernetes_pod_container_init
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
          target_label: __scheme__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
          target_label: __metrics_path__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_type
          target_label: __param_type
        - action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $$1:$$2
          source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          target_label: __address__
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_org
          target_label: org
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_env
          target_label: env
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_instance_id
          target_label: instance_id
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_com_version
          target_label: runtime_version
        - action: replace
          replacement: clusters/test-cluster-name/pods/$$1
          source_labels:
          - __meta_kubernetes_pod_uid
          target_label: _uid
        scrape_interval: 60s
        scrape_timeout: 60s
        tls_config:
          insecure_skip_verify: true
    use_start_time_metric: false
service:
  extensions:
  - health_check
  pipelines:
    metrics/apps:
      exporters:
      - googlecloud
      processors:
      - memory_limiter/prevent_oom
      - batch
      - filter/apps
      - resource/container
      - metricstransform/apps
      receivers:
      - prometheus/apps
  telemetry:
    logs:
      level: debug
      output_paths: stdout
    metrics:
      address: :9091

Log output

Error: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$
2024/10/10 19:14:49 Failed to run the service: failed to get config: cannot resolve the configuration: cannot convert the confmap.Conf: environment variable "2" has invalid name: must match regex ^[a-zA-Z_][a-zA-Z0-9_]*$

Additional context

https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/env-vars.md#issues-of-current-behavior
#9984

chc5 added the bug and needs triage labels on Oct 10, 2024
github-actions bot added the receiver/prometheus label on Oct 10, 2024
dashpole (Contributor) commented

@mx-psi I haven't been following the configuration work closely enough to answer this. Do you know what prometheus users should do going forward?

dashpole removed the needs triage label on Oct 10, 2024
dashpole self-assigned this on Oct 10, 2024
mx-psi (Member) commented Oct 11, 2024

I am unable to reproduce this. With the original file I get the following errors:

Error log with file provided in original post:
❯ ./otelcol-contrib --config config.yaml 
2024-10-11T10:32:[email protected]/provider.go:59Configuration references unset environment variable{"name": "NODE_NAME"}
2024-10-11T10:32:[email protected]/provider.go:59Configuration references unset environment variable{"name": "POD_NAME"}
Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):

error decoding 'exporters': error reading configuration for "googlecloud": decoding failed due to the following error(s):

'metric.skip_create_descriptor' expected type 'bool', got unconvertible type 'string', value: 'true'
'metric.instrumentation_library_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
'metric.service_resource_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
2024/10/11 10:32:07 collector server run finished with error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):

error decoding 'exporters': error reading configuration for "googlecloud": decoding failed due to the following error(s):

'metric.skip_create_descriptor' expected type 'bool', got unconvertible type 'string', value: 'true'
'metric.instrumentation_library_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
'metric.service_resource_labels' expected type 'bool', got unconvertible type 'string', value: 'false'
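
The three decoding errors above come from boolean settings that are quoted in the original YAML and are therefore parsed as strings. As a rough aid for spotting such values before starting the Collector, here is a minimal Python sketch. It assumes the config has already been loaded into nested dicts and lists, and the helper `find_quoted_bools` is hypothetical, not part of any Collector tooling.

```python
def find_quoted_bools(node, path=""):
    """Return dotted paths whose value is the string 'true' or 'false'."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            hits.extend(find_quoted_bools(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_quoted_bools(value, f"{path}[{index}]"))
    elif isinstance(node, str) and node.lower() in ("true", "false"):
        hits.append(path)
    return hits


# Fragment of the exporter settings as a YAML parser would load them.
config = {
    "exporters": {
        "googlecloud": {
            "metric": {
                "skip_create_descriptor": "true",  # quoted: a string
                "service_resource_labels": False,  # unquoted: a bool
            }
        }
    }
}
print(find_quoted_bools(config))
# ['exporters.googlecloud.metric.skip_create_descriptor']
```

Removing the quotes from each reported value gives the decoder a real boolean.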

With a fixed file:

Fixed file:
extensions:
  health_check:
exporters:
  googlecloud:
    metric:
      endpoint: monitoring.googleapis.com:443
      instrumentation_library_labels: false
      prefix: custom.googleapis.com
      service_resource_labels: false
      skip_create_descriptor: true
    project: test-tenant-project-id
processors:
  batch:
    send_batch_size: 500
    timeout: 10s
  filter/apps:
    metrics:
      include:
        match_type: regexp
        metric_names:
        - server_nio
  memory_limiter/prevent_oom:
    check_interval: 30s
    limit_percentage: 80
    spike_limit_percentage: 30
  metricstransform/apps:
    transforms:
    - action: update
      include: server_nio
      new_name: custom.googleapis.com/server/nio
      operations:
      - action: aggregate_labels
        aggregation_type: sum
        label_set:
        - state
      - action: toggle_scalar_data_type
  resource/container:
    attributes:
    - action: delete
      pattern: net.*
    - action: delete
      pattern: service.*
    - action: delete
      key: http.scheme
    - action: delete
      key: method
    - action: upsert
      key: cloud.region
      value: us-west1
    - action: upsert
      key: k8s.cluster.name
      value: test-cluster-name
receivers:
  prometheus/apps:
    config:
      scrape_configs:
      - job_name: prometheus-scraper
        kubernetes_sd_configs:
        - namespaces:
            names:
            - test-ns
          role: pod
          selectors:
          - field: spec.nodeName=${NODE_NAME},metadata.name!=${POD_NAME}
            label: foo.com/platform=gke
            role: pod
        metric_relabel_configs:
        - action: keep
          regex: server_nio
          source_labels:
          - __name__
        relabel_configs:
        - action: keep
          regex: true
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        - action: drop
          regex: true
          source_labels:
          - __meta_kubernetes_pod_container_init
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
          target_label: __scheme__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
          target_label: __metrics_path__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_type
          target_label: __param_type
        - action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $$1:$$2
          source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          target_label: __address__
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_org
          target_label: org
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_env
          target_label: env
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_instance_id
          target_label: instance_id
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_com_version
          target_label: runtime_version
        - action: replace
          replacement: clusters/test-cluster-name/pods/$$1
          source_labels:
          - __meta_kubernetes_pod_uid
          target_label: _uid
        scrape_interval: 60s
        scrape_timeout: 60s
        tls_config:
          insecure_skip_verify: true
    use_start_time_metric: false
service:
  extensions:
  - health_check
  pipelines:
    metrics/apps:
      exporters:
      - googlecloud
      processors:
      - memory_limiter/prevent_oom
      - batch
      - filter/apps
      - resource/container
      - metricstransform/apps
      receivers:
      - prometheus/apps
  telemetry:
    logs:
      level: debug
      output_paths: stdout
    metrics:
      address: :9091

The config validates (I get a different error, but that is just caused by my local setup):

Logs with fixed config
❯ ./otelcol-contrib --config fixed-config.yaml 
2024-10-11T10:33:05.084-0400	info	[email protected]/service.go:136	Setting up own telemetry...
2024-10-11T10:33:05.084-0400	warn	[email protected]/service.go:191	service::telemetry::metrics::address is being deprecated in favor of service::telemetry::metrics::readers
2024-10-11T10:33:05.084-0400	info	telemetry/metrics.go:70	Serving metrics	{"address": ":9091", "metrics level": "Normal"}
2024-10-11T10:33:05.085-0400	debug	builders/builders.go:24	Beta component. May change in the future.{"kind": "exporter", "data_type": "metrics", "name": "googlecloud"}
2024-10-11T10:33:05.085-0400	debug	builders/builders.go:24	Beta component. May change in the future.{"kind": "processor", "name": "metricstransform/apps", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400	debug	builders/builders.go:24	Beta component. May change in the future.{"kind": "processor", "name": "resource/container", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400	debug	builders/builders.go:24	Alpha component. May change in the future.	{"kind": "processor", "name": "filter/apps", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400	info	[email protected]/metrics.go:98	Metric filter configured{"kind": "processor", "name": "filter/apps", "pipeline": "metrics/apps", "include match_type": "regexp", "include expressions": [], "include metric names": ["server_nio"], "include metrics with resource attributes": null, "exclude match_type": "", "exclude expressions": [], "exclude metric names": [], "exclude metrics with resource attributes": null}
2024-10-11T10:33:05.085-0400	debug	builders/builders.go:24	Beta component. May change in the future.{"kind": "processor", "name": "batch", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.085-0400	debug	builders/builders.go:24	Beta component. May change in the future.{"kind": "processor", "name": "memory_limiter/prevent_oom", "pipeline": "metrics/apps"}
2024-10-11T10:33:05.086-0400	info	memorylimiter/memorylimiter.go:151	Using percentage memory limiter	{"kind": "processor", "name": "memory_limiter/prevent_oom", "pipeline": "metrics/apps", "total_memory_mib": 31765, "limit_percentage": 80, "spike_limit_percentage": 30}
2024-10-11T10:33:05.086-0400	info	memorylimiter/memorylimiter.go:75	Memory limiter configured{"kind": "processor", "name": "memory_limiter/prevent_oom", "pipeline": "metrics/apps", "limit_mib": 25412, "spike_limit_mib": 9529, "check_interval": 30}
2024-10-11T10:33:05.086-0400	debug	builders/builders.go:24	Beta component. May change in the future.{"kind": "receiver", "name": "prometheus/apps", "data_type": "metrics"}
2024-10-11T10:33:05.086-0400	debug	builders/extension.go:48	Beta component. May change in the future.	{"kind": "extension", "name": "health_check"}
2024-10-11T10:33:05.072-0400	warn	[email protected]/provider.go:59	Configuration references unset environment variable	{"name": "NODE_NAME"}
2024-10-11T10:33:05.072-0400	warn	[email protected]/provider.go:59	Configuration references unset environment variable	{"name": "POD_NAME"}
2024-10-11T10:33:05.087-0400	info	[email protected]/service.go:208	Starting otelcol-contrib...	{"Version": "0.111.0", "NumCPU": 20}
2024-10-11T10:33:05.087-0400	info	extensions/extensions.go:39	Starting extensions...
2024-10-11T10:33:05.087-0400	info	extensions/extensions.go:42	Extension is starting...	{"kind": "extension", "name": "health_check"}
2024-10-11T10:33:05.087-0400	info	[email protected]/healthcheckextension.go:33	Starting health_check extension	{"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-10-11T10:33:05.088-0400	info	extensions/extensions.go:59	Extension started.	{"kind": "extension", "name": "health_check"}
2024-10-11T10:33:05.114-0400	error	graph/graph.go:426	Failed to start component	{"error": "error finding default application credentials: google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information", "type": "Exporter", "id": "googlecloud"}
2024-10-11T10:33:05.114-0400	info	[email protected]/service.go:270	Starting shutdown...
2024-10-11T10:33:05.114-0400	info	healthcheck/handler.go:132	Health Check state change	{"kind": "extension", "name": "health_check", "status": "unavailable"}
2024-10-11T10:33:05.115-0400	info	extensions/extensions.go:66	Stopping extensions...
2024-10-11T10:33:05.115-0400	info	[email protected]/service.go:284	Shutdown complete.
Error: cannot start pipelines: error finding default application credentials: google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information; failed to shutdown pipelines: no existing monitoring routine is running
2024/10/11 10:33:05 collector server run finished with error: cannot start pipelines: error finding default application credentials: google: could not find default credentials. See https://cloud.google.com/docs/authentication/external/set-up-adc for more information; failed to shutdown pipelines: no existing monitoring routine is running

@TylerHelmuth could you also take a look? Could this be operator-specific? (It is unclear what environment we are talking about here.)

chc5 (Author) commented Oct 11, 2024

I believe the error that you're seeing is related to googlecloudexporter not having the right credentials. I've trimmed down the config to use only the Prometheus receiver along with basic processors and exporters. I hope this config lets you reproduce the main error on your end:

Revised config:
exporters:
  debug:
    verbosity: detailed
processors:
  batch:
    send_batch_size: 500
    timeout: 10s
receivers:
  prometheus/apps:
    config:
      scrape_configs:
      - job_name: prometheus-scraper
        kubernetes_sd_configs:
        - namespaces:
            names:
            - test-ns
          role: pod
          selectors:
          - field: spec.nodeName=${NODE_NAME},metadata.name!=${POD_NAME}
            label: foo.com/platform=gke
            role: pod
        metric_relabel_configs:
        - action: keep
          regex: server_nio
          source_labels:
          - __name__
        relabel_configs:
        - action: keep
          regex: true
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scrape
        - action: drop
          regex: true
          source_labels:
          - __meta_kubernetes_pod_container_init
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_scheme
          target_label: __scheme__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_path
          target_label: __metrics_path__
        - action: replace
          regex: (.+)
          source_labels:
          - __meta_kubernetes_pod_annotation_prometheus_io_type
          target_label: __param_type
        - action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $$1:$$2
          source_labels:
          - __address__
          - __meta_kubernetes_pod_annotation_prometheus_io_port
          target_label: __address__
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_org
          target_label: org
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_env
          target_label: env
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_instance_id
          target_label: instance_id
        - action: replace
          source_labels:
          - __meta_kubernetes_pod_label_com_version
          target_label: runtime_version
        - action: replace
          replacement: clusters/test-cluster-name/pods/$$1
          source_labels:
          - __meta_kubernetes_pod_uid
          target_label: _uid
        scrape_interval: 60s
        scrape_timeout: 60s
        tls_config:
          insecure_skip_verify: true
    use_start_time_metric: false
service:
  pipelines:
    metrics/apps:
      exporters:
      - debug
      processors:
      - batch
      receivers:
      - prometheus/apps
  telemetry:
    logs:
      level: debug
      output_paths: stdout
    metrics:
      address: :9091

mx-psi (Member) commented Oct 11, 2024

After adding the health_check extension it works fine for me.

I tested this with the following steps (Linux amd64 machine):

❯ curl -L -o contrib0.111.tar.gz https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.111.0/otelcol-contrib_0.111.0_linux_amd64.tar.gz
❯ tar pfx contrib0.111.tar.gz 
❯ ./otelcol-contrib --config config.yaml 

and it seems to run fine. So again, I think this may be something specific to how you are running your Collector. Are you using the operator?


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
