Generating trace-level metrics #1968
Comments
I am on board with this idea, but would like a little more definition in the metrics you would want to generate.
Agreed that just making a configurable timeout is the best we can do.
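To make the configurable-timeout idea concrete, here is a minimal sketch, not Tempo code. It assumes a trace is declared complete once no new span has arrived for `idle_timeout_s` seconds. `TraceBuffer`, `PendingTrace`, and the span dicts are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PendingTrace:
    spans: list = field(default_factory=list)
    last_seen: float = 0.0  # time the most recent span of this trace arrived


class TraceBuffer:
    """Hypothetical buffer: groups spans by trace ID and declares a trace
    complete once it has been idle for the configured timeout."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.pending = {}  # trace ID -> PendingTrace

    def add_span(self, trace_id, span, now):
        trace = self.pending.setdefault(trace_id, PendingTrace())
        trace.spans.append(span)
        trace.last_seen = now

    def flush_expired(self, now):
        """Remove and return the traces that received no span for idle_timeout_s."""
        expired = {
            trace_id: trace.spans
            for trace_id, trace in self.pending.items()
            if now - trace.last_seen >= self.idle_timeout_s
        }
        for trace_id in expired:
            del self.pending[trace_id]
        return expired
```

A periodic call to `flush_expired` would hand the completed traces to whatever computes the trace-level metrics. The trade-off is the usual one: a short timeout can cut long-running traces in two, and a long one delays the metrics and holds more spans in memory.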
Likely the root span name is the correct choice for the trace name. Currently we have the ability to dynamically add attributes in span metrics. If we carry this functionality over to trace-level metrics, will it be sufficient to handle your concern? i.e. you could add an attribute to the root span to differentiate between different purposes. There are definitely cases this wouldn't handle, but it may be a good starting point.
Likely should be configurable. By default I'd say only if the root span has a failed status. It's not uncommon in distributed systems for a piece of the request to fail but the whole to be considered successful. However, I can see some operators wanting to consider a request failed if any span has failed.
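A rough sketch of what such a configurable failure rule could look like. The `policy` values and the span fields (`parent_id`, `status`) are assumptions made for this example, not an existing Tempo option or data model.

```python
def trace_failed(spans, policy="root"):
    """Decide whether a trace counts as failed.

    policy="root": failed only if the root span (no parent) has an error status.
    policy="any":  failed if any span in the trace has an error status.
    """
    if policy == "any":
        return any(s["status"] == "error" for s in spans)
    if policy == "root":
        return any(
            s["parent_id"] is None and s["status"] == "error" for s in spans
        )
    raise ValueError(f"unknown policy: {policy}")
```

With the default, a trace whose root succeeded but whose child span failed is still counted as successful; switching to `"any"` flips that.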
This would be difficult b/c you'd have to encode some of this knowledge into Tempo through configuration. e.g. Tempo would have to know that a specific span is expected under certain circumstances. More generally we could count "broken traces" with trace level metrics where any trace that contained a missing span was considered broken, but I don't think this covers what you're wanting.
I think this is a great idea, but we should start with an agreed-upon MVP as suggested. Once the basic code to do trace level metrics is in it would be easier to experiment with and extend as necessary.
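For the more general "broken traces" count mentioned in this comment, one possible reading is that a span is missing when some other span names it as its parent but it never arrived. A small sketch under that assumption, with hypothetical span fields `span_id` and `parent_id`:

```python
def is_broken(spans):
    """Treat a trace as broken if any span references a parent span
    that is not present in the trace."""
    known_ids = {s["span_id"] for s in spans}
    return any(
        s["parent_id"] is not None and s["parent_id"] not in known_ids
        for s in spans
    )


def count_broken(traces):
    """Count broken traces in a mapping of trace ID -> list of spans."""
    return sum(1 for spans in traces.values() if is_broken(spans))
```

This only catches gaps in the parent chain. A leaf span that is expected but never reported leaves no trace of itself, which is the case that would need extra configuration.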
It got a bit messy and long, so besides referencing quotes, I'm giving small titles 😄
Metrics specs
Awesome!
Those metrics can be used to calculate "RED" metrics, which I believe cover most use cases. As for their dimensions, I think they will be a bit different than the span-metrics ones:
If you think about it for a second, maybe we don't even care about generating metrics for "synchronous" traces, because today's span-metrics actually give the exact same benefit for the root-spans calculations... Something to think about... probably we will keep it all for now to keep it simple and clear.
Trace declaration timeout
Agree about having a configurable timeout, but want to propose another idea:
Edited:
Trace name
I'm not sure what you mean by
I'll try to explain what I meant with an example:
I'm aware that we don't want to differentiate between every single trace that has the smallest differences in every single span; that's of course not what the aggregate metrics are here for. But I do think that a trace that finishes in a whole different place is probably something we do want to differentiate, especially in "async" traces.
Unless I'm missing something about the dynamic attributes you have mentioned, what do you think about adding another dimension mentioning the
Trace failure status
Totally agree that if we are talking about "synchronous" traces - the root span status is what we want. Another way I'm thinking about helping here is maybe by having a third metric counting the failed spans in the traces (a counter). That way anomalies inside big traces could be tracked easily to cover more cases of failures of different types.
Broken traces
Agree that's a tricky one, I'll try to think about it more.
Totally with you! We should start with something.
I really struggle with this definition of an async request. Clock skew can create this situation (even in a synchronous request) and async requests could easily not meet this definition if the async process returned before the http request. Whatever we do I would like it to be in line with OTel semantics. Some links I found:
If detecting sync vs async is important then I would say we should use
I think this is an important insight. Something we could do now that would provide value and be relatively simple would be to add a root="true" label to root spans in spanmetrics. It would work for most traces and give a lot of what we're discussing here. What do you think about submitting this as a PR to get started? (@kovrus heads up on a potential span metrics change)
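As an illustration of the root="true" idea, here is a sketch that assumes a root span is one with no parent span ID. The other label names (`service`, `span_name`) and the span dict layout are placeholders for this example and not necessarily what the span metrics processor emits.

```python
from collections import Counter


def span_metric_labels(span):
    """Build the label set for one span, adding root="true" for root spans."""
    is_root = not span.get("parent_id")  # assumption: no parent ID means root
    return {
        "service": span["service"],
        "span_name": span["name"],
        "root": "true" if is_root else "false",
    }


def count_calls(spans):
    """Count spans per label set, similar in spirit to a calls counter."""
    calls = Counter()
    for span in spans:
        labels = span_metric_labels(span)
        calls[tuple(sorted(labels.items()))] += 1
    return calls
```

Filtering the resulting series on root="true" would then give request-level counts for traces that have a single, well-formed root, which is the "most traces" case mentioned above.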
You can dynamically add attributes as labels to span metrics.
You have a lot of other thoughts in this issue, many of which I find very compelling. If you are interested in pushing this work forward let's work together on defining one of these "trace level" analysis tools and implement that. Personally I think at the trace level there are more interesting aggregations and analysis than generating metrics (see below). My suggestion is to add the root label to spanmetrics for now and then pick one of these higher level issues.
Personal .02: Also, another "aggregate trace" tool I've always wanted to build is aggregate critical path analysis. If either of these interest you they would be a very welcome addition to Tempo.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
Is your feature request related to a problem? Please describe.
As of today, the MetricsGenerator has 2 metrics processors - Service graphs and Span metrics.
When trying to calculate SLIs for a "whole business service" (whether it is in a single app or distributed across multiple apps), the metrics that exist today don't help much, and calculating them from the stored traces would consume a lot of resources (or might not even be possible).
Some use cases I can think of (relevant for my org and probably more):
Describe the solution you'd like
Another processor for the MetricsGenerator to generate trace-level metrics for ingested spans.
I guess the metrics and dimensions could be similar to those on the span-metrics processor.
Some challenges I can already think of (mostly due to the nature of a trace - built from distributed spans that may not be under a single root span):
For example
Maybe in order to handle those challenges we will need to provide some kind of configuration for those specific needs.
The more I think about it, I wonder if handling those challenges won't bring us to configure each "different" trace with unique conditions that basically represent the trace itself... Perhaps we could define our MVP for this and start with something.
Describe alternatives you've considered
DurationMs or status and generate metrics.
TraceQL to aggregate over all traces.
Additional context
Thread on Slack
@cristiangsp @amitsetty