Clarify usage of Distributed Tracing Extension by OpenTelemetry #912
base: v1.0.1
Conversation
Signed-off-by: Joao Grassi <joao.grassi@dynatrace.com>
6c0c753 to d5090aa
@joaopgrassi thanks a ton for doing this! Hopefully this will address some of the concerns that have come up in the past.
IMO, I think it would be fine to do this rename if that's what the group wants.
We also use the Distributed Tracing Extension in Azure SDKs, and it would be great to see clarification on how it should be used in the CloudEvents spec. I have a tiny concern about renaming it. E.g. in Azure SDKs, we support legacy tracing libraries along with OpenTelemetry, and if someone ever wanted to plug in another tracing solution, they would still be able to do so and leverage the Distributed Tracing Extension on CloudEvents.
Yes, I agree. I also have mixed feelings about renaming it. I think keeping it as-is is fine, as long as this document is clear on how it is used and people coming into this can find clear guidance. It can still be used by other means, but that's the user's choice, so not much we can do at that point, I'm afraid.
> Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
> if used, MUST carry the same trace information contained in protocol specific tracing headers.
>
> Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
> carry the trace information of the starting trace of the transmission.
> In other words, it MUST NOT carry trace information of each individual hop, since this information is usually
> carried using protocol specific headers, understood by tools like [OpenTelemetry](https://opentelemetry.io/).
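For reference, the extension carries the W3C Trace Context fields as the `traceparent` and `tracestate` extension attributes. Below is a minimal illustrative sketch in plain Python (not any SDK's API) showing an event carrying the extension and a simplified check of the `traceparent` format:

```python
import re

# W3C Trace Context traceparent: version "00", 32-hex trace-id,
# 16-hex parent-id, 2-hex trace-flags, lowercase, dash-separated.
# (Simplified: the real spec also forbids all-zero trace/parent ids.)
TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

def has_valid_trace_context(event: dict) -> bool:
    """Check whether a CloudEvent (as a plain dict of its JSON
    attributes) carries a well-formed Distributed Tracing Extension."""
    tp = event.get("traceparent")
    return bool(tp) and TRACEPARENT_RE.match(tp) is not None

# An event carrying the extension, mapped as JSON attributes:
event = {
    "specversion": "1.0",
    "type": "com.example.created",
    "source": "/example",
    "id": "1234",
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "vendor=value",  # optional companion attribute
}

print(has_valid_trace_context(event))  # True
```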
I'm reading this part and I don't think the goal of this extension and OTEL are the same.
I think the goal of this extension is to do Event Lineage.
In lineage, we care only about event processing hops, while OTEL traces every remote call, even middlewares that don't impact the original event.
I think we are changing the meaning of the extension more than clarifying it.
Hi @pierDipi thanks for the input.
The semantic conventions we are trying to define in OTel open-telemetry/opentelemetry-specification#1978 do not define tracing of the transport layer (middlewares, HTTP transport, etc.). We are rather focused on tracing the "application layer", meaning what happened after I sent this event. This seems to align well with the link you mentioned:
> An Event lineage is a path in the event processing network that enables the tracing of the consequences of an event, or finding the reasons for a certain action.
The way we are planning to achieve this is: every time an event is created, a Span is created. Then, we take this Span's context and add it to the event's Distributed Tracing Extension.
Then, in every place this event arrives, a Processing span is created. This processing span is connected to the create span by adding the context from the Distributed Tracing Extension to it as a Link.
With this, you would be able to see all operations that happened because of the initial create operation.
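The create-span/processing-span flow described above can be sketched without any OTel dependency. The classes below are illustrative stand-ins, not the OpenTelemetry API; they only show how the consumer links back to the create-span context carried in the extension:

```python
import secrets
from dataclasses import dataclass, field

@dataclass
class SpanContext:  # stand-in for a trace context (not the OTel API)
    trace_id: str   # 32 hex chars
    span_id: str    # 16 hex chars

    def to_traceparent(self) -> str:
        return f"00-{self.trace_id}-{self.span_id}-01"

@dataclass
class Span:  # stand-in for a span with links
    name: str
    context: SpanContext = field(default_factory=lambda: SpanContext(
        secrets.token_hex(16), secrets.token_hex(8)))
    links: list = field(default_factory=list)

# Producer: a Create span is started and its context is injected
# into the event's Distributed Tracing Extension.
create_span = Span("create com.example.created")
event = {"id": "1", "type": "com.example.created",
         "traceparent": create_span.context.to_traceparent()}

# Consumer: a Processing span is started; the context extracted from
# the extension is attached as a *link*, not as the parent.
_version, trace_id, span_id, _flags = event["traceparent"].split("-")
processing_span = Span("process com.example.created")
processing_span.links.append(SpanContext(trace_id, span_id))

# The link points back to the Create span's trace:
print(processing_span.links[0].trace_id == create_span.context.trace_id)  # True
```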
Regarding the old text in this spec:
> Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
> if used, MUST carry the same trace information contained in protocol specific tracing headers.
This can't be true, because if we create a span to track the "event creation" we'll have context A. Then, when sending the event, say via HTTP, if the app is auto-instrumented, another span will be created for the physical send operation, so context B. The physical send operation is "outside" of the app's domain and we can't really modify the event at that point to inject the same context (unless the auto-instrumentation understands CloudEvents and does this).
> Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
> carry the trace information of the starting trace of the transmission.
"Starting trace of the transmission" is what the OpenTelemetry conventions are modeling with the "Create Span".
> Once the trace context is set on the event, it MUST NOT be modified.
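That immutability rule can be enforced with a small guard when an intermediary handles the event. This is an illustrative helper (hypothetical name, plain Python), not an API mandated by the spec:

```python
def set_trace_context(event, traceparent, tracestate=None):
    """Set the Distributed Tracing Extension only if it is absent:
    once set, the trace context MUST NOT be modified."""
    if "traceparent" in event:
        return  # an intermediary must leave existing context untouched
    event["traceparent"] = traceparent
    if tracestate is not None:
        event["tracestate"] = tracestate

# An event that already carries a trace context...
event = {"id": "1", "traceparent": "00-" + "a" * 32 + "-" + "b" * 16 + "-01"}

# ...is not overwritten by a later hop's attempt to set a new one:
set_trace_context(event, "00-" + "c" * 32 + "-" + "d" * 16 + "-01")
print(event["traceparent"].split("-")[1] == "a" * 32)  # True (unchanged)
```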
Here is a use-case with an app:
And this is what a trace will look like in Jaeger
Note the consumer link. If you click it, you go to the Create span. Links are not yet well supported by many back-ends, but we hope they will catch up and present this in a more linear way. @lmolkova had a similar example using Azure Monitor where links are shown below the create span, as you would expect.
> ## Using the Distributed Tracing Extension
>
> The
> [OpenTelemetry Semantic Conventions for CloudEvents](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/trace/semantic_conventions/cloudevents.md)
I get a 404 from this link - is this a placeholder for a new doc that's in the works?
Yes, it's a placeholder for when open-telemetry/opentelemetry-specification#1978 is merged.
On the 12/02 call we approved the overall direction of this PR, but we'll hold off on merging it until the referenced doc is ready. Then we'll revisit it to make sure that, between this PR and that doc, everything looks good from an end-user's perspective.
@duglin thanks for this update! One thing I should bring up: even if the PR in OpenTelemetry gets merged, it will still be marked experimental. The reason is that we are also working on the more general messaging conventions and finding things out "as we go". Since CloudEvents and Messaging overlap in some cases, it's likely that we'll need to make changes to the CloudEvents semantic conventions again as well. That could then also affect this document in the CloudEvents spec, but it's hard to say for now. So we might consider only merging this once the CloudEvents semantic conventions in OpenTelemetry are marked as stable. What do you think?
@joaopgrassi it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...
Great, will wait for the OT PR. I hope we include the HTTP header names in the CloudEvents spec here for clarity.
Whether it adds an improvement over what we have today is hard to say 😅 but it will definitely have to be adapted as we work on the OTel side. So let's leave it open for now, and I'll keep working on it. I just found out that we actually have code that produces this
Just to make it official... we're putting this PR "on hold" until the OT side of things is ready.
Wanted to share my view: in the last ~6-7 years I've seen a fair share of clients and projects going for stronger choreography, often mis-using especially Kafka as a message (queue), expecting "reliable processing" (whatever that means) and experiencing incredible difficulty tracking down anything leaving the happy path. Band-aids are usually hacked together and stuffed on top to somehow gain visibility & "control", often falling back to classic message queue metaphors (self-implemented dead-letter queues with manual copy-back jobs because processing was not implemented in a properly idempotent way, etc.). An enormous amount of effort & time goes down the drain for this; and it all starts with visibility.

Out of that I want to say: this is incredibly useful, please embrace it / improve on it. Writing down a clear convention for how to transport (CloudEvents Extension) and how to link (OpenTelemetry) things to the original event-creating span is excellent and the right way to do it. There's a range of use-cases that are just not properly solved otherwise:
The only solution I've seen doing it kind-of right (and this here is similar with links, but of course tackling it at the right level) has been Azure EventHub with the corresponding libraries. They also link to the original span, showing it then in Azure Monitor. However, it's incredibly badly documented (the only good explanation I found back then is in this medium article) and inconsistent between e.g. the C#, Java and node.js libraries... (or at least it was).

So I'd actually love to (optionally) see this support in the CloudEvents libraries, or a good hook that the OpenTelemetry libs hook into (once all is finalised and clarified, and judging from the last fine-adjustments I saw from @lmolkova I have a positive feeling), because:
And yes, in the most simple scenario (HTTP direct call, no batching, no magic) you'd also see the chain in the normal W3C HTTP trace headers and auto-instrumentation. But I think even in this scenario the additional link wouldn't hurt... just my (very long) 5 cents...
@joaopgrassi what's the status of things? It appears the OT PR might be merged.
Hi @duglin
@joaopgrassi cool - thanks for the update!
This Pull Request is stale because it has been open for 30 days with no activity.
Not stale, we are still working on bringing messaging conventions in OTel to stable.
/remove-lifecycle stale
This Pull Request is stale because it has been open for 30 days with no activity.
/remove-lifecycle stale
@joaopgrassi just a periodic ping to see if anything has changed.
Hi @duglin! So something did happen, actually! We have been working on an OTEP that re-defines the general messaging systems semantic conventions, and it was merged last week. Now the work is to start porting the text from the OTEP proposal to the OTel specification and hopefully mark it stable. Since CloudEvents has a lot of overlap with it, we have a tracking issue open-telemetry/semantic-conventions#654 to investigate how the new conventions overlap/conflict and how we should proceed. So, some movement, but nothing concrete for this yet. I will keep you/all posted on the progress.
@joaopgrassi great - thanks for the update
This Pull Request is stale because it has been open for 30 days with no activity.
/remove-lifecycle stale

I didn't forget about this. We are moving forward with the stabilization of the messaging conventions in OTel, so we're getting closer to digging into this :)
@joaopgrassi thanks for the update!
Background
We have a working group in the OpenTelemetry project that is making progress towards defining conventions around tracing for messaging systems.
As part of this effort, the group is defining how to trace applications that use CloudEvents. There is a PR open in the OpenTelemetry specification (Introduce new semantic conventions for CloudEvents) where you can see what the proposal is.
The PR addresses when to create spans, how the context will "flow" between event "hops", and what such spans should look like (name, attributes, type, etc.). The goal of this is also to define a standard for span structure. In theory, people looking at a trace for a "CloudEvents based system" should have the same experience everywhere.
In a nutshell, OpenTelemetry is going to heavily rely on the
Distributed Tracing Extension
defined in this spec. Via this extension, CloudEvents libraries (SDKs, etc.) that support OpenTelemetry can inject the span context into the event, and later on extract this context to continue the trace.

Proposed Changes
During one of the meetings, @clemensv suggested that it would make sense for the Distributed Tracing Extension specification to "link" to the OpenTelemetry semantic conventions. The idea is that people (SDK authors or users) coming to read this spec will see the link to OpenTelemetry and will be able to read there and get a full picture of why this is here and how it should be used.
As we all know, this caused confusion in the past (Remove distributed tracing extension), as its purpose was not clear, nor how it related to tracing frameworks such as OpenTelemetry. Now that OpenTelemetry is defining its usage, CloudEvents users will benefit from it and hopefully no more confusion will arise.
Summary of changes
Using the Distributed Tracing Extension section

Questions for this group
Should we rename the extension to OpenTelemetry Distributed Tracing Extension? Would this be a breaking change?

As an example, the Go SDK offers such a capability: OpenTelemetry ObservabilityService. Users only have to opt in to use this "service" and all their CloudEvents operations (Send, Receive) are traced automatically. It might not be possible to offer the same behavior in all SDKs, but at least we should aim to have something that makes users' lives easier, using the language constructs available (middlewares, wrappers, handlers, etc.).
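The opt-in "traced client" pattern can be sketched as a small wrapper. This is a hypothetical sketch in plain Python (the actual Go ObservabilityService API differs), only illustrating how a middleware can inject the extension without the caller doing anything:

```python
import functools
import secrets

def traced(send_fn):
    """Hypothetical middleware: wraps a send function so that every
    outgoing event carries a traceparent, injected transparently."""
    @functools.wraps(send_fn)
    def wrapper(event):
        # In a real SDK this context would come from the active span;
        # here we just generate fresh W3C-shaped ids for illustration.
        event.setdefault(
            "traceparent",
            f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01",
        )
        return send_fn(event)
    return wrapper

sent = []

@traced
def send(event):  # stand-in for an SDK client's Send operation
    sent.append(event)

send({"id": "1", "type": "com.example.created"})
print("traceparent" in sent[0])  # True
```

The caller never touches tracing concerns; opting in is just wrapping (or configuring) the client, which is the kind of behavior the comment above hopes each SDK can offer through its own constructs.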
Release Note
Note: Please take a moment to read through the PR open in the OpenTelemetry spec. If you have questions or concerns, or don't agree with something there, leave a comment and consider joining the public meeting we have every Thursday (Instrumentation: Messaging).