-
Notifications
You must be signed in to change notification settings - Fork 895
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define span modelling conventions for HTTP client spans #1747
Comments
@jkwatson the simplest (and cheapest) model would be a single span for a logical HTTP request with all of the data listed above added as events. Would that model work, or do you have examples of how would be painful? |
I have no opinions on the matter...I just want to make sure we capture the recommendations of the experts. The last question is a stickier one, though, that wouldn't be captured via events, I don't think. |
Ideally, the higher level client should presume that instrumentation for the lower level client is present, and simply decorate the span that the lower level client produces with additional information. If no lower level instrumentation is available, the author of the higher-level instrumentation should create a separate instrumentation package for the lower lever client, rather than duplicate the work. For situations where there is a categorical difference between the two client levels (DB client span vs HTTP client span) I'm less certain. There will definitely be a nested client spans when there is not a 1:1 relationship between DB and HTTP semantics. For example, some DB clients may make more than one logical HTTP request per DB transaction. In that case there would be a client span for the DB transaction, containing multiple HTTP client spans. For databases/queues/etc where there is always 1:1 relationship between a DB transaction and an HTTP request; I'm tempted to say that the DB transaction should decorate the HTTP span rather than wrap it. Operation name would be the only potential conflict of interest in that case. |
This assumes a lot of coordination between library authors to make this possible. I don't think we should assume that coordination will exist or even be possible when we write down these modeling decisions. Also, I'm not sure how a higher level instrumentation could decorate spans generated by something that it calls into. How would that work? |
Good points. The model I was describing works better on the server side; on the client side it is difficult. It would be good to see some examples. |
I agree events would make more sence, there are feasibility concerns. The HTTP instrumentation layer (Java byte-code instumentation, .NET native tracing support, python/node.js monkey patching) usually do'not know about retries which are handled on the higher level by user app/library. If caller creates an
In practice, it's not always possible to track redirects since they could be handled on the lower level than instrumentation (depends on user-provided configuration).
I don't have an opinion, except it's rarely feasible, so it has to be optional.
The coordination may happen through something like terminal span/context - old proposal, ignoring nested client spans or backing-off if traceparent is set (as a last resort). |
Using INTERNAL here seems bad to me. The logical call is still logically a client. In fact, If you have client spans that are direct children of each other, it seems easy to implement a backend-side rule to collapse them together somehow, whereas it is impossible to know generically that the client spans belong to the internal parent. |
We've made this distinction exaclty for the backend and users: it should be possible to separate 'real' rpc calls from logical ones. It should help users understand when their client library makes external calls vs internal stuff. Suppressing anything on the backed would require buffering (you never know what finishes first) and then you can suppress based on some rules. It does not sound like a good generic solution to the problem (which seems to be double instrumentation? or verbosity?) Let's find generic solutions to these problems. |
If using the rule "logical plus potentially real if the root client, otherwise physical" is not enough, I would prefer introducing a span attribute like "is_logical" instead of marking the logical client as internal. |
There is a working example with the web instrumentation in javascript. It leverages the semantics of the W3C Navigation Timing, so maybe this could be a starting point? Two more remarks:
|
a few more things the spec needs to clarify:
|
There is still an currently unused attribute for that in the semantic convention generator: https://github.com/open-telemetry/build-tools/blob/c7067fa1f0a5c61756dcb93cbc56af566992ab4f/semantic-conventions/src/opentelemetry/semconv/model/semantic_attribute.py#L115 (#571 (comment)) |
Here are my thoughts on HTTP instrumentation approach, hope we can use some of them as a starting point. |
I think this may not always be the case. For example, AFAIK in Python if you only instrument the But the User Experience graphics (one client span mysteriously triggers multiple server spans) actually sounds like a good argument that this is really suboptimal. @open-telemetry/python-approvers PTAL, that blog post is quite insightful IMHO.
👍
I'm not so sure about the definition of
I think you convinced me. Having this 1:1 relationship between client and server spans is really preferable. I think we want to write that in the HTTP semantic conventions. |
+1 to have outermost client span per message exchange and different spans for redirects/retries (whenever possible).
The challenge with this is that it could result in a min of 2 client spans per RPC even for the majority of requests that do not do redirects and retries, which seems wasteful. |
Thanks for the feedback!
yeah, I agree that some attributes are controversial. What I don't fully understand is 'optional' attributes. Do we understand why would instrumentation set them? As a library owner, I want to stay lean and efficient and would avoid optional things (and redundancy). |
Beyond redundancy, as @yurishkuro mentioned (~99% requests don't have retries), it's also misleading. We never know if this logical client span captured retries because on the auto or native instrumentation layer, we don't know how user called into HTTP client. So I'd rather leave it for 1) users who want to have it 2) client libraries that run on top of HTTP. And I see a huge benefit in knowing about retries, and maybe we'll find ways to approach it, but it seems controversial now. We don't have to agree on this today, but we can make progress on RPC call level regardless of logical ones. |
That's a different philosophy than what was used when designing these. The problem is that on the client you typically only have the full URL and would need to parse it, while on the server side you always get a "split" URL, because that's how HTTP works. So the idea was that every instrumentation can report directly what it has available without doing data processing on the agent side. The backend (or an in-between processor like the collector) is responsible for normalizing the data. Exactly because data-processing on the agent can be expensive (and lossy!), and a lean agent is more important than a lean backend because the agent runs in more processes. |
Why? I'm sure there are cases where we would be wrong, but in general instrumenting the most high-level API will work very well. For example, in Java if you use HttpUrlConnection for example, it will follow redirects for you by default. So it would be easy to create a single span covering the "logical" HTTP call, probably easier than creating one span per redirect even (because for the former you can just instrument a public API, while the latter needs to instrument internals -- but that's just my assumption I haven't written an instrumentation for that).
That's many HTTP libraries, for example Python requests runs on top of Python urllib3 which runs on top of Python's http.client standard library module. |
Pinging @open-telemetry/python-approvers (and @mariojonke): The suggestion (to which I agree) that the child-most client spans are the most important ones contradicts the current implementation in Python where inner (urllib3) spans are suppressed if there is an outer (requests) span. Even with disabling this suppression (or instrumenting just urllib3), I'm not sure if a span would be created per request/redirect. Please take a look at this discussion / join it if you have an opinion here (if not, be prepared to change the HTTP client instrumentation 😄). |
@Oberon00 I haven't gone through the entire discussion but I also agree that the innermost span is the most important one and should be recorded (and labelled as CLIENT). The API doesn't allow it but when I first thought about this problem, I wanted to flip what Python instrumentation do today and instead of suppressing child spans, I wanted every instrumentation that generates the CLIENT spans to set its parent span's kind to INTERNAL. For example, if both requests and urllib were instrumented, urllib instrumentation would fetch the current span (generated by requests instrumentation) and set its kind to INTERNAL before generating it's own CLIENT span. If every client instrumentation did this by convention then we'd always have a single CLIENT span for one outgoing call. This is probably not doable though because of streaming exporter implementations among other possible limitations. |
I think having nested client spans that are marked as client is preferable to having nested client spans where only the innermost one is marked as client. If we want to get rid of additional outer spans in Python we could e.g. just drop the requests instrumentation because the inner spans are covered by urllib3. |
Here's java example https://gist.github.com/lmolkova/f34d2dfcb23522b2dc115462f2f76d20 - you can use high-level API however you want - there are no rules about it. You can do retries with something on higher-level (e.g. with project reactor or rxjava. We in Azure SDK, for example, have abstraction over httpclient to make it pluggable and always do retries with high-level API. I, unfortunately, have 0 data on how people usually do retries and can only assume things. I assume it's a wild west and high-level logical span is misleading for at least a significant number of users. |
This worth another discussion. I choose INTERNAL span for logical call because of topology map visualizations. They are presumably interested in client-level calls to show what service calls into. If there are multiple nested spans, they have no heuristics to tell how many outgoing connections to show. We need heuristics to tell them which client span represents the best destination (if they don't want to show all of them). It could be internal span or something else but in the spec. I recall @anuraaga had somewhat similar concerns for AWS SDK. E.g. 2 client spans - two destination internal + client span: 1 destination (in Azure Monitor we decided that's preferrable for our users) |
At least for Python, I assume the opposite, as retries are baked into most libraries. But even if that is not the case, at least we would add the information that some spans belong together, even if we have a disconnect at an even higher level. That information cannot be recovered on the backend. I would be open to use links instead of a parent span though: The high-level instrumentation sets some special context field like "retry_of"/"redirect_of" to the SpanContext of the previous try/request, and the sensor picks it up. EDIT: Of course there is also the practical argument that some users would complain if we change this now in Python, so you need to persuade at least the Python maintainers 😃 |
You could use resources to determine if these spans come from the same entity. We could also have a flag on the protocol to say if the parent was remote or not (the exporter knows this). If that is not enough/possible, we could discuss having a new span type "LOGICAL_CLIENT" for spans that expect to have more LOGICAL_CLIENT or CLIENT spans locally under them. But I think an hasRemoteParent on the protocol is what you want for your heuristic. |
Retry policies are usually app-specific - it's hard to tell what needs to be retried and how in general case. I assume users either give retry policies to Python clients or call into high-level api multiple times. Is there a good read/doc/example on this? |
Briefly searched: python But google tells it's not the only way https://julien.danjou.info/python-retrying/
I'm not saying there are a lot of users who do a simple loop over high-level API, but they're out there and would come to ask questions and request support from OTel maintainers and vendors. |
They will certainly complain if the current behavior changes, which is that only the first (most logical) client span is instrumented. So far I haven't seen any complaints about the current behavior on the other hand, but I'm not monitoring otel-python very closely, so @open-telemetry/python-approvers might know if some users where already suprised by the current behavior and expected the "bottom-most client" one suggested here (to which I agree that it is in theory superior). |
I would like again to remind about one specific use-case that any proposal has to support: application developer manually wrapping high-level http call into CLIENT span with HTTP semantic convention AND, possibly, manually injecting the SpanContext of that span into outgoing request. I think whatever auto-instrumentations do they at least have to cope with this scenario and don't offer too much of a surprise to the app developer. E.g. always clearing out injected context and force-injecting lower-level context may be an unpleasant surprise for them. |
This work has been completed as part of HTTP semantic convention stabilization. |
For consistency across languages and HTTP client libraries, we should provide suggested conventions for modeling details of HTTP client calls. Here is a short list of items to include in the modeling effort:
The text was updated successfully, but these errors were encountered: