-
Notifications
You must be signed in to change notification settings - Fork 895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making tracing SDK metrics aware #381
Comments
The tracing API should not be made aware of metrics. A tracing SDK should be made configurable to support the kinds of extension you are interested in. The API already supports what is needed in this regard -- every metrics API has a context argument, and an integrated SDK could be configured to aggregate metrics on a per-span level. Issue #259 is a placeholder for this sort of discussion while #347 is still in-progress. This topic was discussed in today's Spec SIG meeting as well. I am looking for a recording. |
@jmacd Why do you see this more as a SDK integration than part of the API? Similar to Resources being shared between the two, why couldn't labels be a shared API construct in the same fashion? My understanding of the API vs SDK separation isn't the best either, any pointers to documentation that defines them? |
We have a Library Guidelines document that explains the API/SDK separation. The way I think about this is that the API should define semantics of the operations and leave implementation to the choice of SDK. The SDK should be able to implement metrics-aware-tracing and tracing-aware-metrics if it pleases.... So long as we set up the APIs with the necessary information crossing the API/SDK boundary and give useful semantics to the API operations. For tracing-aware-metrics, the task would be to develop a SDK that implements both tracing and metrics with the awareness that you want. There are several natural metrics implied by every span that could be automatically instrumented. There's a counter instrument for the number of span started, a measure instrument for the latency of each span, and in some cases it's useful to think of a distinct counter of the number of spans finished. Basically, I think we can accomplish the sorts of things you imagine without any new API surface. You had written about a |
I agree this should be handled by the SDK. I would really like to see this in the default SDK. |
I think to make this feasible and "easily" implementable in the SDK I think updating the API is required to a certain extent. With the rather specialized Span name that is currently being used in Distributed Tracing, such as "GET:/resource/{id}", using a more generic name similar to a metric name would be useful. A HTTP request could have the generic Span name "http_request", from this we could automatically generate metrics as required, "http_request_duration (histogram)", "http_request_status (counter)". We could generate these through SDK configuration and specify which data to extract from a Span to include in the metric, and which labels. For backwards compatibility we could allow a Span to have a "OperationName" format string that would be a template string, "{{label.method}}:{{label.path}}", that picks up different labels from the Span to build the name. This could potentially also just be a SDK configuration or exporter configuration so that it is only actually generated when required. This would also help solve one of the issues we ran into with OpenTracing where operation name is set when the span is created so to change it contribution libraries added special hooks to allow users to change operation name and other labels as part of the span creation[1][2]. I think the other interesting API change would be to try and unify the resource/label-set/attributes that are current being used for process/metrics/spans, and instead use a single type that can be assigned to every level. If we have a LabelSet that can handle merges, we can use it to be able to add labels to a Resource, TraceFactory, NamedTracer, NamedMetricsProvider (not sure what the real name is), Metric and Span. The would be merged from highest to lowest level with a a defined precedence of what labels are being kept/overwritten. The SDK can then be used to define what labels to drop, keep etc from Spans and Metrics, and a user of the API and/or the SDK will have a single construct to learn and understand instead of 4 different varieties of sets. 1 - https://github.com/opentracing-contrib/java-web-servlet-filter/blob/master/opentracing-web-servlet-filter/src/main/java/io/opentracing/contrib/web/servlet/filter/ServletFilterSpanDecorator.java |
@sfriberg Do you I understand your proposal (or at least part of it) correctly:
If yes, then below are my objections :) |
Thanks for the feedback @iNikem. Yes, that resembles what I'm looking at
The main reason is to make it easier for developers and users of the metric to know what to look for and be able have some control of what they want to be collected. A span might be sampled so not making it to the backend and thus getting a completely accurate metric won't be possible unless you have a metric as well. So in part yes it is pushing functionality further down in the stack, but mainly to allow easier implementation for a developer that instead of adding a span + two counters (total requests, total duration) that would be created automatically with a easily found name. |
@iNikem Thanks, had missed that. Is this something we purely aim to do in the collector or is the intent to support it at all levels in SpanProcessors? Having this capability feels like it would further warrant a basic naming scheme on the generation side so that you could easily write a rule for the specific span you are interested in using the name. If the span names comes with complex naming doing the filtering and renaming gets harder. |
I think this issue is straying onto a new topic. There seems to be a point of contention about span naming requirements in order to facilitate generating metrics from spans. I 100% agree, but I don't think we should discuss it in this ticket because it's a large topic. Responding to a misc point above: it should be possible to either: |
FWIW this is very similar what I have done for span naming for consideration when using Datadog as a backend. open-telemetry/opentelemetry-js#1316 (comment) Datadog has the concept of a generic "operation" (e.g. http.request) and an application-defined "resource" (GET /users/{id}). In terms of metrics, everything with the same operation can be sensibly aggregated (you can aggregate over all http.request) and the resource becomes a tag for further filtering/grouping. I've been naming spans |
It's not exactly that the current spans are too high cardinality, like too many to fit in a dropdown or a list. It's that they could be aggregated and compared against other spans.
This is exactly what Datadog does with span operation (http.request) and span resource (GET_/users/{id}). And in fact, it generates metrics automatically. The metric name is the span operation and the resource metric tag is the span resource. Both metric names and metric tags are low-cardinality; a metric name is not only low-carnality it means that everything with the name can be sensible aggregated (e.g. HTTP requests). |
At the end of the day, the important question is: What should instrumenters be doing?After instrumenting libraries, it certainly feels like there is substantial duplication. E.g. I'm instrumenting some RPC library:
Should instrumenters be writing to tracing, metrics, and log APIs? Or should they be writing to the highest-order API (traces) and expecting that SDKs will pull metrics from there? (Note that metrics are least common denominator; it's quite realistic for a consumer to have a metrics backend, but not an actual tracing backend.) This issue really deserves some attention; the entire premise of OTel is that instrumentations and backends can be developed in isolation, and this is a very large practical concern for anyone who has actually tried to write a portable instrumentation. |
@pauldraper, I 100% agree. My answer to the question
Is a definite no. We discussed this briefly in the metrics SIG last week -- the idea that we may want a way to generate metrics from Spans automatically. This is complicated by the internals of some tracing SDKs, which substitute NoOpSpans when a trace is not sampled. |
Until this is resolved sometime in 2022, I've built a little abstraction that registers metadata about span name and attributes. These are then used to generate several metrics with attributes for each span (currently I use it as a utility at the instrumentation API level):
The downside is this approach requires a good bit of additional metadata than just the span. The upside is that it generates some high-quality metrics. |
we doing similar thing. But apart from generating metrics from spans we also allow all metrics to be correlated to spans (do not confuse with exemplars). Our idea is this - spans and metrics are fundamentally semantically linked. Most of (synchronous) metrics are emitted during some logical unit of work covered by a span. We use semantical term "event" to outline the common abstraction for logical unit of work. Event is basically a span which has some predefined metadata (like type - "http.request", "db.query", name, userId, etc) and parent-child connections (just like span) - eventId/parentEventId. Technically event can be boiled down to a subclass of a Span. All metrics, spans and logs are logically associated to some event and events are chained in causal relationship which is propagated across service boundaries with CorrelationContext. Hence metrics now have causal parent-child connection to each other via eventId/parentEventId plus well known set of metadata associated to each metric. What it gives us - ability to query metrics in relation plus a set of common metadata always associated with each telemetry data. Let's talk about metrics relation. In our case we can have ServiceA and ServiceB both calling ServiceC. We see ServiceA's latency metric spiked because of ServiceC latency. If we just look at ServiceC p90 latency it might not show much since it aggregates across all calls from ServiceA and ServiceB and thus may be well within a range. With parent-child relationship on metrics however we can exactly correlate ServiceC's latency metric which caused ServiceA's latency spike. Kind of Now implementation wise we are experimenting with changing SDK's metric processors to always tag metrics with current event attributes which are pulled from Event.current() similar how Span.current() works. I am curious to hear if the folks here thought about similar problems and how they approach it. |
Thanks for the links, Christian. Seems like having Context as a Span parent solves at least half of the problem - now our "event" (context) is present in each Span. But in order to get metrics and spans share the same context, don't we need similar parent relationship between Metric and Context, i.e. each emitted metric also has Context associated? Or did I get it wrong?
|
Yes, that's correct. For languages with automatic context propagation like Java, the context would usually be the current one. |
I'm sure how using Context solves the core issue here. I wanted my metrics on my spans (or my span attributes in my metrics).
This is the core concern: where I should be instrumenting code with all three APIs? |
There are two opposing opinions expressed here:
(I also fear there is a third poorly conceived option floating around, where the consumer of the library has to configure a bunch of SDK stuff to convert spans to metrics, select tags, etc. That is absolutely bad; the library should be able to report metrics without voluminous end-user configuration.) |
As far i remember tracing could or even should just trace a certain percentage or subset of the function calls or requests to reduce the tracing resource impact which might slow down systems. Where metrics typically collected all the time. Ist this assumption still valid this days? |
That is correct. |
Making Tracing API metrics aware
One interesting aspect of Open Telemetry is the goal to provide tracing, metrics, and later a logging API. Right now these APIs are fairly separate so as an end-user or library owner I would need to write code to add a span and some more code to add a metric. This is not different that I would need to do today with OpenCensus or with OpenTracing and a metrics library (Prometheus, DropWizard, etc.), but I think OT would allow this to be simplified due to it's all in one nature.
With the metrics and API that takes a Measurement it feels like the integration with Spans would come fairly natural.
Why
Single code point extension, get metrics for "free".
As a end-user I want to write a little code as possible. Instead of couple of lines for metrics, a few more for tracing and one for logging, if I can get it all done in a single place that would be helpful.
Increased uptake of metrics library
Easily getting metrics from spans would be one further path that we could increase the uptake of the metrics library. Switching from OpenCensus and OpenTelemetry is pretty much required for tracing, but metrics doesn't have the same forced migration and will face a much more mature and larger ecosystem.
Span exploration
Currently the naming of Spans is rather close to that of using labels in metrics names, it ends up being an explosion in names that all have a different path, instead of having a HTTP request span that you can easily filter for properties you are interested in.
Filtering and sampling of Traces and Spans using metric data
One of the longer term interesting aspects of this closer relationship will be that metrics could be used by spans/traces do decide if a span should be after the trace is complete/tail-based sampling. Since the metrics name is know (or the same as the span name) as well as labels used for the metric you would be able to easily query a metrics db to get for example the 99%tile for the last 24 hour and explicitly store all of those metrics as well as any randomly sampled traces.
How
Others will probably have better suggestion, but an initial idea would be to have the constructor/builder pattern simply allow,
generateMetric(boolean)
as part of building the Span.As for labels I think it would be interesting to discuss if labels should be part of the span. One could think of it as being different layers of contextual data for a span. Top layer would be Resources (process level labels), Labels (general request labels), and Attributed (request unique (or high cardinality) labels). Metrics and spans would then be exported with these as appropriate.
Naming of spans could be the same as the metric for easy association and lookup, but for backwards compatibility we could allow a
Operation Name
the constructed through a templated string that could read a label, such as{{label.method}}:{{label:parameterized_url}}
for a HTTP request span.The text was updated successfully, but these errors were encountered: