
Generating trace-level metrics #1968

Closed
adirmatzkin opened this issue Jan 2, 2023 · 4 comments
Labels
enhancement (New feature or request), stale (Used for stale issues / PRs)

Comments

@adirmatzkin
Contributor

Is your feature request related to a problem? Please describe.
As of today, the MetricsGenerator has 2 metrics processors - service graphs and span metrics.
When trying to calculate SLIs for a "whole business service" (whether it lives in a single app or is distributed across multiple apps), the metrics that exist today don't help much, and calculating them from the stored traces would consume a lot of resources (or might not be possible at all).
Some use cases I can think of (relevant for my org and probably for others):

  • Exposing "whole business service" metrics to build dashboards / reports for stakeholders on top of them.
  • Being able to see long term trends (months or even years).
  • Monitoring SLOs based on trace telemetry.

Describe the solution you'd like
Another processor for the MetricsGenerator that generates trace-level metrics for ingested spans.
I guess the metrics and dimensions could be similar to those of the span-metrics processor.

Some challenges I can already think of (mostly due to the nature of a trace - it is built from distributed spans that don't necessarily arrive together under a single root span):

  1. Incoming spans come from different sources, meaning possible delays. How long should we "wait" for all spans of a trace to be ingested? (probably should be configurable).
  2. What should we use as the "trace name"? (root span name?) or more generally - how should we differentiate between different traces that have the same root span but serve different purposes? (and maybe even traverse different services).
  3. When should we consider a trace as failed? For example:
  • Handled errors
  • Multiple services that work together asynchronously and have errors downstream
  • A missing operation span (for instance, if a service didn't pick up a message in an event-driven system)

Maybe in order to handle those challenges, we will need to provide some kind of configuration for those specific needs.
The more I think about it, the more I wonder whether handling those challenges won't lead us to configure each "different" trace with unique conditions that basically describe the trace itself... Perhaps we could define an MVP for this and start with something.

Describe alternatives you've considered

  • Creating a Prometheus exporter that exposes those metrics, using Tempo's search API as the raw data layer - that way I could theoretically give it a configuration for filtering traces, search for the latest ones (minus some buffer), and use their DurationMs or status to generate metrics.
  • Querying the "raw" traces directly to calculate the metrics.
  • Using the upcoming TraceQL to aggregate over all traces.

Additional context
Thread on Slack
@cristiangsp @amitsetty

@joe-elliott joe-elliott added the enhancement (New feature or request) label Jan 3, 2023
@joe-elliott
Member

joe-elliott commented Jan 3, 2023

I am on board with this idea, but would like a little more definition in the metrics you would want to generate.

Incoming spans come from different sources, meaning possible delays. How long should we "wait" for all spans of a trace to be ingested? (probably should be configurable).

Agreed that just making a configurable timeout is the best we can do.

What should we use as the "trace name"? (root span name?) or more generally - how should we differentiate between different traces that have the same root span but serve different purposes? (and maybe even traverse different services).

Likely the root span name is the correct choice for the trace name. Currently we have the ability to dynamically add attributes in span metrics. If we carry this functionality to trace level metrics will it be sufficient to handle your concern? i.e. you could add an attribute to the root span to differentiate between different purposes.

There are definitely cases this wouldn't handle, but it may be a good starting point.

When should we consider a trace as failed?

Likely should be configurable. By default I'd say only if the root span has a failed status. It's not uncommon in distributed systems for a piece of the request to fail but for the whole to be considered successful. However, I can see some operators wanting to consider a request failed if any span has failed.

Missing operation span (for instance, if a service didn't pick up a msg in an event driven system)

This would be difficult b/c you'd have to encode some of this knowledge into Tempo through configuration. e.g. Tempo would have to know that a specific span is expected under certain circumstances. More generally we could count "broken traces" with trace level metrics where any trace that contained a missing span was considered broken, but I don't think this covers what you're wanting.

I think this is a great idea, but we should start with an agreed upon MVP as suggested. Once the basic code to do trace level metrics is in it would be easier to experiment with and extend as necessary.

@adirmatzkin
Contributor Author

adirmatzkin commented Jan 5, 2023

It got a bit messy and long, so besides quoting, I'm adding small titles 😄

Metrics specs

I am on board with this idea, but would like a little more definition in the metrics you would want to generate.

Awesome!
As for the metrics themselves, I was thinking about starting with something simple, just like what today's span-metrics processor emits:

  • A counter that counts the total number of traces: something like tracemetrics_calls_total
  • A histogram that tracks the distribution of trace durations: something like tracemetrics_latency

Those metrics can be used to calculate "RED" metrics, which I believe cover most use cases.

As for their dimensions, I think they will be a bit different from the span-metrics ones:
What do you think about having root_service, root_span_name, status, and trace_kind (async/sync) as default labels? (Configurable to extend, of course.)
What I mean by "trace_kind" is to distinguish between:

  1. Synchronous traces: The root span is the first (obviously) and the last thing that happens.
    Endpoint requests are the classic example here.
  2. Asynchronous traces: The root span is the first, but a different span is the last thing to happen.
    For example, a first service publishes a message consumed by a second service.

If you think about it for a second, maybe we don't even care about generating metrics for "synchronous" traces, because today's span-metrics already give the exact same benefit via the root-span calculations... Something to think about... We'll probably keep it all for now to keep things simple and clear.
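
To make this a bit more concrete, here's a minimal sketch (in Go, using the Prometheus client - not Tempo code) of how the two proposed metrics and the default labels suggested above could be declared. The metric names, label set, and buckets are only the proposal from this comment, not an existing API:

```go
// Minimal sketch (not Tempo code) of the metrics proposed above, using the
// Prometheus Go client. Names, labels, and buckets are only a proposal.
package tracemetrics

import "github.com/prometheus/client_golang/prometheus"

// Default labels suggested above; they would be configurable/extendable.
var defaultLabels = []string{"root_service", "root_span_name", "status", "trace_kind"}

var (
	// Counts completed traces, analogous to span-metrics' calls counter.
	traceCalls = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "tracemetrics_calls_total",
		Help: "Total number of completed traces.",
	}, defaultLabels)

	// Tracks the distribution of whole-trace durations.
	traceLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "tracemetrics_latency",
		Help:    "Duration of completed traces in seconds.",
		Buckets: prometheus.DefBuckets,
	}, defaultLabels)
)

func init() {
	prometheus.MustRegister(traceCalls, traceLatency)
}

// observeTrace would be called once per trace, after the trace is considered
// complete (see the timeout / finishing-span discussion below).
func observeTrace(rootService, rootSpanName, status, traceKind string, durationSeconds float64) {
	traceCalls.WithLabelValues(rootService, rootSpanName, status, traceKind).Inc()
	traceLatency.WithLabelValues(rootService, rootSpanName, status, traceKind).Observe(durationSeconds)
}
```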

Trace declaration timeout

Incoming spans come from different sources, meaning possible delays. How long should we "wait" for all spans of a trace to be ingested? (probably should be configurable).

Agreed that just making a configurable timeout is the best we can do.

Agree about having a configurable timeout, but want to propose another idea:
What do you think about configuring a condition that declares a "finishing span"? That way we could "close" the lookup for more spans in a trace as soon as one of two conditions occurs - a "finishing span" arrives or the timeout expires.
I think most organizations (or tenants within one) have traces whose (normal) latency distribution is pretty wide (some on the scale of milliseconds, some maybe minutes). With that solution, perhaps we could hold a long timeout for the traces that actually need it (memory-utilization-wise) while finishing the many "quick" traces and "getting rid" of them ASAP.
Another reason to implement it is to generate metrics a bit closer to real time (otherwise, with a configured 5-minute timeout, even quick traces are only "let go" after 5 minutes...).
I'm imagining something like an agreed-upon attribute added to "last spans" to make this possible globally in an organisation.

Edited:
Just realized there is an OTel Collector processor that groups spans by trace. It could be used here to buffer spans until the whole trace arrives. Wondering whether we want to use this at the collector level and depend on it, or implement it ourselves and maybe just take inspiration.
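
To illustrate the "finishing span or timeout" idea, here's a rough sketch of how a processor could buffer spans per trace and close the trace on whichever condition fires first. The trace.finished attribute name and all the types below are hypothetical, assumed only for this example:

```go
// Rough sketch of the "finishing span or timeout" buffering, assuming a
// hypothetical trace.finished attribute that marks the last span of an
// async trace. Illustration only, not Tempo code.
package tracemetrics

import (
	"sync"
	"time"
)

type Span struct {
	TraceID    string
	ParentID   string // empty for the root span
	Name       string
	Attributes map[string]string
	Start, End time.Time
}

type traceBuffer struct {
	mu      sync.Mutex
	spans   map[string][]Span // open traces, keyed by trace ID
	timeout time.Duration
	flush   func(traceID string, spans []Span, reason string)
}

func newTraceBuffer(timeout time.Duration, flush func(string, []Span, string)) *traceBuffer {
	return &traceBuffer{spans: make(map[string][]Span), timeout: timeout, flush: flush}
}

func (b *traceBuffer) consume(s Span) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if _, open := b.spans[s.TraceID]; !open {
		// First span seen for this trace: start the timeout clock.
		time.AfterFunc(b.timeout, func() { b.close(s.TraceID, "timeout") })
	}
	b.spans[s.TraceID] = append(b.spans[s.TraceID], s)

	// "Finishing span" condition: close the trace early instead of waiting
	// for the full timeout.
	if s.Attributes["trace.finished"] == "true" {
		go b.close(s.TraceID, "finishing_span")
	}
}

// close flushes and forgets a trace; whichever of the two conditions fires
// first wins, and the later one finds nothing left to flush.
func (b *traceBuffer) close(traceID, reason string) {
	b.mu.Lock()
	spans := b.spans[traceID]
	delete(b.spans, traceID)
	b.mu.Unlock()

	if len(spans) > 0 {
		b.flush(traceID, spans, reason)
	}
}
```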

Trace name

What should we use as the "trace name"? (root span name?) or more generally - how should we differentiate between different traces that have the same root span but serve different purposes? (and maybe even traverse different services).

Likely the root span name is the correct choice for the trace name. Currently we have the ability to dynamically add attributes in span metrics. If we carry this functionality to trace level metrics will it be sufficient to handle your concern? i.e. you could add an attribute to the root span to differentiate between different purposes.

There are definitely cases this wouldn't handle, but it maybe a good starting point.

I'm not sure what you mean by "dynamically add attributes in span metrics" - I tried looking in the docs/code with no success... so I'm not so sure how to answer that 😢

I'll try to explain what I meant with an example:
Let's say a trace traverses 3 different services, passing messages through a messaging system. The tricky part is that only the second service knows what the third one will be. So it could be A > B > C or A > B > D, and only B (or C/D, of course) knows where the trace went (C or D).
Obviously, we don't want to "count" the first and the second trace types together... They pass through different services and probably do different things.

I'm aware that we don't want to differentiate between every single pair of traces with the smallest difference in some span - that's of course not what aggregate metrics are for... but I do think that a trace finishing in a completely different place is something we probably do want to differentiate, especially for "async" traces.

Unless I'm missing something about the dynamic attributes you mentioned, what do you think about adding another dimension for the last_span_name? I think it might be very handy here.
I also think we could leverage the last span for more things that I will explain in the following section 😄

Trace failure status

When should we consider a trace as failed?

Likely should be configurable. By default I'd say only if the root span has a failed status. It's not uncommon in distributed systems for a piece of a the request to fail but the whole to be considered successful. However, I can see some operators wanting to consider a request failed if any span has failed.

Totally agree that for "synchronous" traces, the root span status is what we want.
As usual... the async traces are the tricky ones... 😆 Continuing my suggestion to use the last span to differentiate between traces (in the "Trace name" section), and the challenge from the "Trace declaration timeout" section that could theoretically be solved by a "finishing span" attribute - what do you think about leveraging the last span again, this time its status, to decide whether an async trace is considered a failure? Once the timeout passes, the trace is considered finished, and if the last span's status is a failure, the whole trace is a failure.

Another way to help here might be a third metric (a counter) counting the failed spans within traces. That way anomalies inside big traces could be tracked easily, covering more types of failures.
Not so sure about its necessity - wondering what you guys think 😄
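
To make the rule concrete, here's a small sketch (only a proposal, reusing the hypothetical Span type from the buffering sketch above): synchronous traces take the root span's status, asynchronous traces take the last span's status:

```go
// Sketch of the failure rules discussed above (only a proposal, reusing the
// hypothetical Span type from the buffering sketch). Span status is
// simplified to a plain "status" attribute here.
func traceStatus(spans []Span) string {
	var root, last Span
	for _, s := range spans {
		if s.ParentID == "" {
			root = s // root span: no parent
		}
		if s.End.After(last.End) {
			last = s // span that finished last
		}
	}

	// Crude async detection: some span ends after the root span does
	// (see the sync/async caveats discussed further down in the thread).
	if last.End.After(root.End) {
		// Async trace: the last span's status decides.
		if last.Attributes["status"] == "error" {
			return "error"
		}
		return "ok"
	}
	// Sync trace: the root span's status decides.
	if root.Attributes["status"] == "error" {
		return "error"
	}
	return "ok"
}
```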

Broken traces

Missing operation span (for instance, if a service didn't pick up a msg in an event driven system)

This would be difficult b/c you'd have to encode some of this knowledge into Tempo through configuration. e.g. Tempo would have to know that a specific span is expected under certain circumstances. More generally we could count "broken traces" with trace level metrics where any trace that contained a missing span was considered broken, but I don't think this covers what you're wanting.

Agree that's a tricky one, I'll try to think about it more.
I'm wondering whether the suggestions in all the other sections already give us enough coverage here too... Maybe the missing piece is a "reason" for why a trace was "finished" - if certain traces normally carry the "magic attribute" that declares their end, and the reason for closing one of them is a timeout... that sounds suspicious 🤨

I think this is a great idea, but we should start with an agreed upon MVP as suggested. Once the basic code to do trace level metrics is in it would be easier to experiment with and extend as necessary.

Totally with you! We should start with something.
After hearing your feedback on my suggestions I will sum up all the requirements, and maybe we will decide to split this feature into smaller steps and loop in feedback as we go 💪🏼

@joe-elliott
Member

The root span is the first, but a different span is the last thing to happen. For example, a first service publishes a message consumed by a second service.

I really struggle with this definition of an async request. Clock skew can create this situation (even in a synchronous request), and an async request could easily fail to meet this definition if the async process returned before the HTTP request did.

Whatever we do I would like it to be in line with OTel semantics. Some links I found:

  • Span.kind can be used as a clue to determine if there were asynchronous processes.
  • There is another discussion about more generally signalling async vs sync, but it has not been resolved.

If detecting sync vs async is important then I would say we should use span.kind as well as a custom attribute on the span. Then if OTel ever agrees on how to signal async we can use that as well.

If you think about it for a second, maybe we don't even care about generating metrics for "synchronous" traces, because today's span-metrics actually give the exact same benefit for the root-spans calculations...

I think this is an important insight. Something we could do now that would provide value and be relatively simple is to add a root="true" label to root spans in span metrics. It would work for most traces and give a lot of what we're discussing here. What do you think about submitting this as a PR to get started? (@kovrus heads up on a potential span metrics change)
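
For illustration only (this is not Tempo's actual span-metrics code), the root="true" idea boils down to checking whether a span has a parent when building its metric labels; the function and label names here are just placeholders:

```go
// Illustration only, not Tempo's span-metrics implementation: a span with an
// empty parent span ID is the root of its trace, so it gets the extra label.
func spanMetricLabels(serviceName, spanName, parentSpanID string) map[string]string {
	labels := map[string]string{
		"service":   serviceName,
		"span_name": spanName,
	}
	if parentSpanID == "" {
		labels["root"] = "true"
	}
	return labels
}
```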

I'm not sure what do you mean by "dynamically add attributes in span metrics", tried looking in docs/code with no success... so not so sure about answering that

You can dynamically add attributes as labels to span metrics.

You have a lot of other thoughts in this issue, a lot of which I find very compelling. If you are interested in pushing this work forward, let's work together on defining one of these "trace level" analysis tools and implement it. Personally I think there are more interesting aggregations and analyses at the trace level than generating metrics (see below). My suggestion is to add the root label to span metrics for now and then pick one of these higher-level issues.

Personal .02:
Of everything you've discussed, trace clustering to create aggregate archetypal traces is the most interesting. We could aggregate traces and use one of many clustering techniques to determine the major "kinds" of traces present in Tempo.

Also, another "aggregate trace" tool I've always wanted to build is aggregate critical path analysis.

If either of these interest you they would be a very welcome addition to Tempo.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale (Used for stale issues / PRs) label Mar 11, 2023
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 26, 2023