[receiver/jaegerreceiver] Thrift packets drop #34462

Open
larsn777 opened this issue Aug 7, 2024 · 4 comments
Comments

Contributor

larsn777 commented Aug 7, 2024

Component(s)

receiver/jaeger

Describe the issue you're reporting

When using the Jaeger receiver, we can periodically lose data on the collector because of a high incoming rate of Thrift packets. In this case the user does not even know that they are losing data on the receiver, since no metrics report these drops. Given that Jaeger libraries can send Thrift data without delivery confirmation (oneway methods), the user has no way at all to know that data loss is occurring.

One possible solution is to export metrics describing the number of processed and dropped Thrift packets. I can prepare and open the corresponding PR a little later.
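
To illustrate the kind of counters meant here, below is a minimal sketch that is not taken from the receiver's code. It assumes the receiver hands packets to a bounded queue and drops them when the queue is full; `packetStats`, `enqueue`, and `drain` are hypothetical names used only for this example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// packetStats holds hypothetical counters that a receiver could export as metrics.
type packetStats struct {
	processed atomic.Int64
	dropped   atomic.Int64
}

// enqueue tries to hand a packet to the processing queue without blocking.
// If the queue is full, the packet is dropped and the drop is counted.
func enqueue(queue chan<- []byte, pkt []byte, stats *packetStats) bool {
	select {
	case queue <- pkt:
		return true
	default:
		stats.dropped.Add(1)
		return false
	}
}

// drain consumes everything currently in the queue and counts it as processed.
func drain(queue <-chan []byte, stats *packetStats) {
	for {
		select {
		case <-queue:
			stats.processed.Add(1)
		default:
			return
		}
	}
}

func main() {
	queue := make(chan []byte, 2) // deliberately small to force drops
	stats := &packetStats{}
	for i := 0; i < 5; i++ {
		enqueue(queue, []byte{byte(i)}, stats)
	}
	drain(queue, stats)
	// prints "processed=2 dropped=3"
	fmt.Printf("processed=%d dropped=%d\n", stats.processed.Load(), stats.dropped.Load())
}
```

With counters like these exported, a rising drop count would at least make the loss visible on the receiver side.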

larsn777 added the "needs triage" label Aug 7, 2024
Contributor

github-actions bot commented Aug 7, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Member

yurishkuro commented Aug 7, 2024

Curious why you are still using UDP exporters, which afaik are only available in jaeger SDKs which are retired.

But speaking of those SDKs, the PR you have is not going to solve the problem, because it only tracks packets that were received but not processed. The other vector for loss is packets not even making it to the receiver because of overload. Jaeger SDKs had a more reliable mechanism for that type of loss by including the count in the packets, such that the receiver would be able to detect the difference between number of spans sent and received from a client:

jaeger_agent_client_stats_batches_sent_total 0
jaeger_agent_client_stats_connected_clients 0
jaeger_agent_client_stats_spans_dropped_total{cause="full-queue"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="send-failure"} 0
jaeger_agent_client_stats_spans_dropped_total{cause="too-large"} 0

Contributor Author

larsn777 commented Aug 8, 2024

Hello

> Curious why you are still using UDP exporters, which afaik are only available in jaeger SDKs which are retired.

The short answer: legacy code. We want to move away from the Jaeger SDK on the client side, but we have more than 3k microservices, so updating the client libraries can take quite a long time.

> Jaeger SDKs had a more reliable mechanism for that type of loss by including the count in the packets, such that the receiver would be able to detect the difference between number of spans sent and received from a client

Yes, I know that client libraries can send statistics about the number of sent batches and errors. However, processing these statistics will not solve every data-loss problem:

  1. As far as I understand from the Thrift schema, sending statistics is optional. So there may be SDK versions whose client libraries do not send these statistics to the receiver.
  2. Even if we start processing client statistics on the receiver, we still need packet rejection metrics from the receiver itself. Otherwise it will most likely be difficult to determine exactly where the data loss occurs.
  3. With the OTelCol agent <-> gateway deployment scheme, agents can be placed on nodes that run a large number of services. To process the statistics correctly, we need to be able to clearly attribute them to each individual service with its own SDK instance.
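
Points 1 and 3 can be sketched as follows. This is not the draft code mentioned below, only a minimal illustration under stated assumptions: every batch carries a service name and some per-instance client identifier, the statistics are optional, and the counters are cumulative per client. All type, field, and function names here are hypothetical.

```go
package main

import "fmt"

// clientKey identifies one SDK instance of one service (hypothetical).
type clientKey struct {
	Service  string
	ClientID string
}

// clientStats mirrors the kind of drop counters a client may report.
// They are assumed to be cumulative since the client started.
type clientStats struct {
	FullQueueDropped int64
	TooLargeDropped  int64
	FailedToEmit     int64
}

// batch is a simplified view of a received batch; Stats is nil when the
// client did not send statistics (they are optional).
type batch struct {
	Service  string
	ClientID string
	Stats    *clientStats
}

// registry keeps the latest reported stats per client instance.
type registry struct {
	latest       map[clientKey]clientStats
	withoutStats int64 // batches that carried no statistics at all
}

func newRegistry() *registry {
	return &registry{latest: make(map[clientKey]clientStats)}
}

func (r *registry) observe(b batch) {
	if b.Stats == nil {
		r.withoutStats++
		return
	}
	r.latest[clientKey{Service: b.Service, ClientID: b.ClientID}] = *b.Stats
}

// totalDropped sums the latest cumulative counters across all known clients.
func (r *registry) totalDropped() int64 {
	var sum int64
	for _, s := range r.latest {
		sum += s.FullQueueDropped + s.TooLargeDropped + s.FailedToEmit
	}
	return sum
}

func main() {
	r := newRegistry()
	r.observe(batch{Service: "svc-a", ClientID: "c1", Stats: &clientStats{FullQueueDropped: 1, FailedToEmit: 2}})
	r.observe(batch{Service: "svc-a", ClientID: "c2", Stats: &clientStats{TooLargeDropped: 3}})
	r.observe(batch{Service: "svc-b", ClientID: "c3"}) // an SDK that sends no stats
	r.observe(batch{Service: "svc-a", ClientID: "c1", Stats: &clientStats{FullQueueDropped: 4, FailedToEmit: 2}})
	fmt.Printf("clients=%d dropped=%d withoutStats=%d\n", len(r.latest), r.totalDropped(), r.withoutStats)
}
```

Keying by service and client instance keeps two instances of the same service from overwriting each other's counters, and the separate `withoutStats` counter shows how much traffic comes from clients that report nothing.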

In fact, I already have draft code in which the receiver processes client statistics. If I have some free time, I will try to open a PR for it in the near future.

Contributor

github-actions bot commented Dec 2, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label Dec 2, 2024
3 participants