Add 'reason' attribute to otelcol_exporter_send_failed_* metrics
#10157
Labels: collector-telemetry (healthchecker and other telemetry collection issues)
Is your feature request related to a problem? Please describe.
I am interested in monitoring data loss which occurs when exporting data from one instance of the Collector to another, specifically using the loadbalancingexporter.
At the moment I only see a coarse-grained metric that counts export failures but tells me nothing about the cause. Was it a permanent or a retryable error? Was the endpoint misconfigured, or did the downstream receiver actively reject the data?
I can look in the logs for details on specific failures, but this is tedious and the logs are harder to interpret than a metric.
Describe the solution you'd like
I propose adding a reason dimension to the otelcol_exporter_send_failed_* metrics. The reason could be the gRPC status code of the response (as I understand it, gRPC status is used as the internal representation of this kind of problem).
Describe alternatives you've considered
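A minimal sketch of how the reason value might be derived, assuming the failure surfaces as a Go error that wraps something carrying a gRPC-style status code. The statusCoder interface, the exportError type, and the wrapSendError and reasonFor helpers are hypothetical names used only for illustration, not the Collector's API; real code would read the status from the gRPC error itself.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// statusCoder is a HYPOTHETICAL interface standing in for an error that
// carries a gRPC-style status code. Real code would inspect the gRPC status.
type statusCoder interface {
	StatusCode() string
}

// exportError is a hypothetical error returned by a downstream send.
type exportError struct {
	code string // e.g. "Unavailable", "ResourceExhausted"
	msg  string
}

func (e *exportError) Error() string      { return e.code + ": " + e.msg }
func (e *exportError) StatusCode() string { return e.code }

// wrapSendError is a hypothetical helper. Wrapping with %w keeps the
// status discoverable via errors.As further up the call stack.
func wrapSendError(code, msg string) error {
	return fmt.Errorf("sending batch: %w", &exportError{code: code, msg: msg})
}

// reasonFor maps an export error to a low-cardinality reason value.
func reasonFor(err error) string {
	var sc statusCoder
	switch {
	case err == nil:
		return ""
	case errors.As(err, &sc):
		// The wrapped error carries a status code: use it directly.
		return sc.StatusCode()
	case errors.Is(err, context.DeadlineExceeded):
		return "DeadlineExceeded"
	case errors.Is(err, context.Canceled):
		return "Canceled"
	default:
		// No status available: fall back to a single catch-all value.
		return "Unknown"
	}
}
```

Mapping to gRPC code names keeps the attribute low-cardinality: gRPC defines a fixed set of 17 status codes, so the new dimension adds only a bounded number of series per exporter. The context checks and the "Unknown" fallback cover errors that carry no status code.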
It is possible to correlate export failure metrics with the downstream receiver's error metrics. We can also correlate with known failure causes, such as memorylimiterprocessor errors in the downstream Collector, which can mean that the upstream export failed. We can also check the logs and, depending on the system, metrics derived from those logs.
However, all of this is much more work.
Additional context
We could also consider adding a similar attribute to the otelcol_receiver_refused_spans metric.
I have had a look at the code, and it seems like a fairly small change in or around exporter/exporterhelper/obsexporter.go.
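To illustrate why the change looks small, here is a sketch of recording the extra dimension, using an in-memory map as a stand-in for the real counter. The failureKey and sendFailedCounter types and the RecordSendFailed and Count methods are hypothetical; the Collector's actual instrumentation uses the OpenTelemetry metrics API rather than a map. The only point shown is that reason becomes one more key in the attribute set, alongside the exporter name these metrics already carry.

```go
package main

import "sync"

// failureKey is the HYPOTHETICAL attribute set for one time series:
// the existing exporter name plus the proposed reason.
type failureKey struct {
	Exporter string
	Reason   string
}

// sendFailedCounter is an in-memory stand-in for a counter such as
// otelcol_exporter_send_failed_spans. It is not the Collector's real
// instrumentation, only a sketch of the extra dimension.
type sendFailedCounter struct {
	mu     sync.Mutex
	counts map[failureKey]int64
}

func newSendFailedCounter() *sendFailedCounter {
	return &sendFailedCounter{counts: make(map[failureKey]int64)}
}

// RecordSendFailed adds the number of items that failed to send under the
// given exporter and reason. An empty reason is recorded as "Unknown" so
// that every failure lands in some series.
func (c *sendFailedCounter) RecordSendFailed(exporter, reason string, items int64) {
	if reason == "" {
		reason = "Unknown"
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[failureKey{Exporter: exporter, Reason: reason}] += items
}

// Count returns the current value of one (exporter, reason) series.
func (c *sendFailedCounter) Count(exporter, reason string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[failureKey{Exporter: exporter, Reason: reason}]
}
```

The same shape would apply to otelcol_receiver_refused_spans, keyed by the receiver instead of the exporter.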