
Clarify usage of Distributed Tracing Extension by OpenTelemetry #912

Open. Wants to merge 1 commit into base: v1.0.1

Conversation

joaopgrassi

Background

We have a working group in the OpenTelemetry project that is making progress towards defining conventions around tracing for messaging systems.

As part of this effort, the group is defining how to trace applications that use CloudEvents. There is a PR open in the OpenTelemetry specification (Introduce new semantic conventions for CloudEvents) where you can see what the proposal is.

The PR addresses when to create spans, how the context will "flow" between event "hops", and what such spans should look like (name, attributes, type, etc.). The goal is also to define a standard span structure: in theory, people looking at a trace for a "CloudEvents based system" should have the same experience everywhere.

In a nutshell, OpenTelemetry is going to rely heavily on the Distributed Tracing Extension defined in this spec. Via this extension, CloudEvents libraries (SDKs, etc.) that support OpenTelemetry can inject the span context into the event, and later extract this context to continue the trace.
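To make the inject/extract flow concrete, here is a minimal sketch in plain Python. The helper names are illustrative (they are not from any CloudEvents SDK); the only assumption taken from the spec is that the extension surfaces as a `traceparent` (and optionally `tracestate`) attribute on the event, carrying a W3C Trace-Context value.

```python
from typing import Optional

def inject_trace_context(event: dict, traceparent: str, tracestate: str = "") -> dict:
    """Copy the current span context into the event's extension attributes.

    `traceparent`/`tracestate` are the attribute names defined by the
    Distributed Tracing Extension.
    """
    event["traceparent"] = traceparent
    if tracestate:
        event["tracestate"] = tracestate
    return event

def extract_trace_context(event: dict) -> Optional[str]:
    """Read the creation-time context back out on the consumer side."""
    return event.get("traceparent")

# Example event (CloudEvents required attributes only):
event = {"specversion": "1.0", "type": "com.example.order",
         "source": "/orders", "id": "1234"}
inject_trace_context(
    event, "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

A consumer that later receives this event calls `extract_trace_context` and hands the value to its tracing library to continue the trace.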

Proposed Changes

During one of the meetings, @clemensv suggested that it would make sense for the Distributed Tracing Extension specification to "link" to the OpenTelemetry semantic conventions. The idea is that people (SDK authors or users) coming to read this spec will see the link to OpenTelemetry and will be able to read there to get a full picture of why this extension exists and how it should be used.

As we all know, this was a cause for confusion in the past (Remove distributed tracing extension), as its purpose, and how it related to tracing frameworks such as OpenTelemetry, was not clear. Now that OpenTelemetry is defining its usage, CloudEvents users will benefit from it and hopefully no more confusion will arise.

Summary of changes

  • Make it clear that this extension is what OpenTelemetry will use to trace CloudEvents applications
  • Remove instructions on how/why this is used, as that is already defined by the OpenTelemetry specification
  • Add a direct link to the relevant OpenTelemetry specification in the Using the Distributed Tracing Extension section

Questions for this group

  • Should we rename this to OpenTelemetry Distributed Tracing Extension? Would this be a breaking change?
  • The OpenTelemetry semantic conventions for CloudEvents mention that SDKs/libraries should offer ways to automatically "manage" Span creation for users (auto-instrumentation). But, since it does not make much sense for the OTel spec to define behavior that CloudEvents SDKs should have, I think we should consider adding this to the CloudEvents spec somehow.

As an example, the Go SDK offers such a capability: OpenTelemetry ObservabilityService. Users only have to opt in to use this "service" and all their CloudEvents operations (Send, Receive) are traced automatically. It might not be possible to offer the same behavior in all SDKs, but at least we should aim to have something that makes users' lives easier, using the language constructs available (middlewares, wrappers, handlers, etc.).

Release Note

Clarify usage of Distributed Tracing Extension by OpenTelemetry

Note: Please take a moment to read through the PR open in the OpenTelemetry spec. If you have questions, concerns or don't agree with something there, leave a comment and consider joining the public meeting we have every Thursday (Instrumentation: Messaging)

Signed-off-by: Joao Grassi <joao.grassi@dynatrace.com>
@duglin
Collaborator

duglin commented Nov 19, 2021

@joaopgrassi thanks a ton for doing this! Hopefully this will address some of the concerns that have come up in the past.

Should we rename this to OpenTelemetry Distributed Tracing Extension? Would this be a breaking change?

IMO it would be fine to do this rename if that's what the group wants.

@lmolkova

lmolkova commented Nov 22, 2021

We also use the Distributed Tracing Extension in the Azure SDKs, and it would be great to see clarification in the CloudEvents spec on how it should be used.

I have a tiny concern about renaming it to OpenTelemetry Distributed Tracing Extension. It currently relies on the W3C Trace-Context standard only and there is nothing OpenTelemetry-specific about it.

E.g. in the Azure SDKs, we support legacy tracing libraries along with OpenTelemetry, and if someone ever wanted to plug in another tracing solution, they would still be able to do so and leverage the Distributed Tracing Extension on CloudEvents.

@joaopgrassi
Author

I have a tiny concern about renaming it to OpenTelemetry Distributed Tracing Extension. It currently relies on the W3C Trace-Context standard only and there is nothing OpenTelemetry-specific about it.

Yes, I agree. I also have mixed feelings about renaming it. I think keeping it as-is is fine, as long as this document is clear on how it is used and people coming to it can find clear guidance. It can still be used by other means, but that's the user's choice, so there is not much we can do at that point, I'm afraid.

Comment on lines -33 to -39
Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
if used, MUST carry the same trace information contained in protocol specific tracing headers.

Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.
In other words, it MUST NOT carry trace information of each individual hop, since this information is usually
carried using protocol specific headers, understood by tools like [OpenTelemetry](https://opentelemetry.io/).
Member

I'm reading this part and I don't think the goal of this extension and OTEL are the same.

I think the goal of this extension is to do Event Lineage.
In lineage, we care only about event processing hops, while OTel traces every remote call, even middlewares that don't impact the original event.

I think we are changing the meaning of the extension more than clarifying it.

Author

Hi @pierDipi thanks for the input.

The semantic conventions we are trying to define in OTel (open-telemetry/opentelemetry-specification#1978) do not cover tracing of the transport layer (middlewares, HTTP transport, etc.). We are rather focused on tracing the "application layer", meaning what happened after I sent this event. This seems to align well with the link you mentioned:

An Event lineage is a path in the event processing network that enables the tracing of the consequences of an event, or finding the reasons for a certain action.

The way we plan to achieve this: every time an event is created, a Span is created. Then we take this Span and add it to the event's Distributed Tracing Extension.

Then, at every place this event arrives, a Processing span is created. This Processing span is connected to the Create span by adding the context from the Distributed Tracing Extension to it as a Link.

With this, you would be able to see all operations that happened because of the initial create operation.
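The Create-span/Processing-span flow described above can be sketched in a few lines of plain Python. This is not the OpenTelemetry API (the helper names are made up for illustration); it only shows the mechanics: parse the event's `traceparent` extension into its W3C Trace-Context fields, and use that context as the Link target of the Processing span so it points back to the Create span.

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent into its fields: version, trace-id, parent-id, flags."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {"version": version, "trace_id": trace_id,
            "span_id": span_id, "flags": flags}

def processing_span_link(event: dict) -> dict:
    """Build the Link from the Processing span back to the Create span,
    using the context carried in the Distributed Tracing Extension."""
    ctx = parse_traceparent(event["traceparent"])
    return {"linked_trace_id": ctx["trace_id"],
            "linked_span_id": ctx["span_id"]}

# Event as received by a consumer, carrying the Create span's context:
event = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
link = processing_span_link(event)
# The Processing span would carry `link`, pointing at the Create span.
```

In a real instrumentation, `link` would be passed to the tracing library's span-creation call as a Link, which is exactly what lets back-ends connect every Processing span to the original Create operation.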

Regarding the old text in this spec:

Given a single hop event transmission (from sink to source directly), the Distributed Tracing Extension,
if used, MUST carry the same trace information contained in protocol specific tracing headers.

This can't be true, because if we create a span to track the "event creation" we'll have context A. Then, when sending the event, say via HTTP, if the app is auto-instrumented there will be another span created for the physical send operation, so context B. The physical send operation is "outside" of the app's domain, and we can't really modify the event at that point to inject the same context (unless the auto-instrumentation understands CloudEvents and does this).

Given a multi hop event transmission, the Distributed Tracing Extension, if used, MUST
carry the trace information of the starting trace of the transmission.

"starting trace of the transmission" is what the OpenTelemetry conventions model with the "Create Span".

Once the trace context is set on the event, it MUST NOT be modified.

Author

Here is a use-case with an app:
(screenshot of the example app omitted)

And this is what a trace will look like in Jaeger

Producer
(screenshot omitted)

Consumer
(screenshot omitted)

Note the consumer link. If you click it, you go to the Create span. Links are not yet well supported by many back-ends, but we hope they will catch up and present this in a more linear way. @lmolkova had a similar example using Azure Monitor where links are shown below the Create span, as you would expect.

## Using the Distributed Tracing Extension

The
[OpenTelemetry Semantic Conventions for CloudEvents](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/trace/semantic_conventions/cloudevents.md)
Collaborator

I get a 404 from this link - is this a placeholder for a new doc that's in the works?

Author

Yes, it's a placeholder for when it's merged open-telemetry/opentelemetry-specification#1978

@duglin
Collaborator

duglin commented Dec 6, 2021

On the 12/02 call we approved the overall direction of this PR but we'll hold off merging it until the referenced doc is ready. Then we'll revisit it to make sure that between this PR and that doc everything looks good from an end-user's perspective.

@joaopgrassi
Author

joaopgrassi commented Dec 7, 2021

@duglin thanks for this update!

One thing I should bring up is that even if the PR in OpenTelemetry gets merged, it will still be marked experimental. The reason is that we are also working on the more general messaging conventions and finding things out "as we go". Since CloudEvents and messaging overlap in some cases, it's likely that we'll need to make changes to the CloudEvents semantic conventions again as well. That could then also affect this document in the CloudEvents spec, but it's hard to say for now.

So we might consider only merging this once the CloudEvents semantic conventions in OpenTelemetry are marked as stable. What do you think?

@duglin
Collaborator

duglin commented Dec 8, 2021

@joaopgrassi it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...

@joaopgrassi joaopgrassi changed the title Clarify usage of Distributed Tracing Extension by OpenTelemetry (#1) Clarify usage of Distributed Tracing Extension by OpenTelemetry Dec 9, 2021
@grant
Member

grant commented Dec 9, 2021

Great, will wait for the OT PR:
open-telemetry/opentelemetry-specification#1978
Then look at this PR.

I hope we include the HTTP header names with the CloudEvents spec here for clarity.

@joaopgrassi
Author

@duglin

it's ok with me if you'd prefer to wait - just let us know when things are ready. The only reason I can think of for merging this sooner rather than later is if the intermediate version would be a significant improvement over what's there today. Totally up to you...

Whether it adds improvement over what we have today is hard to say 😅 but it will definitely have to be adapted as we work on the OTel side. So let's leave it open for now, and I'll keep working on it.

I just found out that we actually have code that produces this ce-traceparent, so I also have to look at that. It is actually interesting, because it just came up today during our OTel messaging meeting that we'll need some sort of header to propagate the initial creation context for messaging in general, and this seems like a good candidate for CloudEvents. Such a header helps avoid having to unpack the event to read the value of the extension, which could be costly.
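For readers unfamiliar with where `ce-traceparent` comes from: in the HTTP binary content mode, CloudEvents extension attributes are mapped to headers with a `ce-` prefix, so an intermediary can read the creation context without deserializing the event body. A minimal sketch (the helper is hypothetical, but the `ce-` header mapping is from the HTTP protocol binding):

```python
def creation_context_from_headers(headers: dict) -> str:
    """Read the event-creation trace context straight from HTTP headers,
    without touching the request body. HTTP header names are
    case-insensitive, so normalize before the lookup."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("ce-traceparent")

# Headers of a CloudEvent sent in HTTP binary content mode:
headers = {
    "Content-Type": "application/json",
    "ce-id": "1234",
    "ce-traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
}
```

This is what makes the extension attractive for cheap propagation: a proxy or broker only needs a header lookup, not a JSON parse of the payload.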

@duglin
Collaborator

duglin commented Jan 13, 2022

Just to make it official... we're putting this PR "on hold" until the OT side of things is ready.
People should feel free to still comment on it though.

@LeDominik

LeDominik commented Jun 26, 2022

Wanted to share my view: in the last ~6-7 years I've seen a fair share of clients and projects going for stronger choreography, often mis-using especially Kafka as a message queue, expecting "reliable processing" (whatever that means), and experiencing incredible difficulty tracking down anything that leaves the happy path. Band-aids are usually hacked together and stuffed on top to somehow gain visibility and "control", often falling back to classic message-queue metaphors (self-implemented dead-letter queues with manual copy-back jobs because processing was not implemented in a properly idempotent way, etc.). An enormous amount of effort and time goes down the drain for this, and it all starts with visibility.

Out of that I want to say: this is incredibly useful, so please embrace it and improve on it. Writing down a clear convention for how to transport (CloudEvents extension) and how to link (OpenTelemetry) things to the original event-creating span is excellent and the right way to do it.

There's a range of use-cases that are just not properly solved otherwise:

  • (most common) suboptimal auto-instrumentation - and this can happen easily. In many situations it can be a good idea to have an outbox pattern; you don't want to send out an event proclaiming a new state of the world and then fail internally. Here the event is logically created in the flow, within an RDBMS transaction, and then sent out by some work queue / "smart" in-house "best-practice" framework... so the technical tracing context, threads, etc. are suddenly different, and in most cases the context is not transported correctly
  • Intermediaries / proxies / converters -> this is used as the headline example by @joaopgrassi; just check out Knative Eventing and Dapr. I think it's safe to assume this will grow to funnel data between clouds, products, etc.
  • Batch send which is even supported by CloudEvents HTTP Binding and employed by many of those bespoke in-house "best-practice" frameworks trying to be "smart"
  • Batch consume (obvious) - so you cannot correlate back in at this low-level

The only solution I've seen doing it kind-of right (similar, with links, but of course tackling it at the right level) has been Azure Event Hubs with the corresponding libraries. They also link to the original span, showing it then in Azure Monitor. However, it's incredibly badly documented (the only good explanation I found back then is in this Medium article) and inconsistent between e.g. the C#, Java and Node.js libraries... (or at least it was)

So once everything is finalised and clarified (and judging from the last fine adjustments I saw from @lmolkova, I have a positive feeling), I'd actually love to (optionally) see this supported in the CloudEvents libraries, or a good hook that the OpenTelemetry libs can hook into, because:

  • encoding an object into the CloudEvents standard is actually the moment to create the CloudEvents Create <event_type> span and get that trace context into the CloudEvent
  • decoding a CloudEvent is actually the moment at which, well, processing starts, so create the CloudEvents Process <event_type> span and link your processing to the original CloudEvents Create span

And yes, in the most simple scenario (HTTP direct call, no batching, no magic) you'd also see the chain in the normal HTTP W3C trace-headers and auto-instrumentation. But I think also in this scenario the additional link wouldn't hurt...

Just my (very long) 5 cents...

@duglin
Collaborator

duglin commented Oct 7, 2022

@joaopgrassi what's the status of things? It appears the OT PR might be merged.

@joaopgrassi
Author

Hi @duglin
Yes, the initial PR adding CloudEvents semantic conventions to OTel was merged. The current status: there is a group working on messaging in general, trying to bring the messaging conventions to a stable 1.0 release. Because a lot of this overlaps with CloudEvents, I expect some things might change, especially around the trace structure. I'm part of the group, so I'm keeping tabs on it.

@duglin
Collaborator

duglin commented Oct 13, 2022

@joaopgrassi cool - thanks for the update!

@github-actions

github-actions bot commented Mar 2, 2023

This Pull Request is stale because it has been open for 30 days with
no activity. Mark as fresh by updating it, e.g., by adding the comment /remove-lifecycle stale.

@joaopgrassi
Author

Not stale, we are still working on bringing messaging conventions in OTel to stable.

@duglin
Collaborator

duglin commented Mar 10, 2023

/remove-lifecycle stale

@github-actions

github-actions bot commented May 7, 2023

This Pull Request is stale because it has been open for 30 days with
no activity. Mark as fresh by updating it, e.g., by adding the comment /remove-lifecycle stale.

@joaopgrassi
Author

/remove-lifecycle stale

@duglin
Collaborator

duglin commented Jul 5, 2023

@joaopgrassi just a periodic ping to see if anything has changed

@joaopgrassi
Author

joaopgrassi commented Jul 5, 2023

Hi @duglin! So something did happen, actually! We have been working on an OTEP that re-defines the general messaging-systems semantic conventions, and it was merged last week. Now the work is to port the text from the OTEP proposal to the OTel specification and hopefully mark it stable.

Since CloudEvents has a lot of overlap with it, we have a tracking issue (open-telemetry/semantic-conventions#654) to investigate how the new conventions overlap/conflict and how we should proceed.

So, some movement, but nothing concrete for this yet. I will keep you/all posted on the progress.

@duglin
Collaborator

duglin commented Jul 6, 2023

@joaopgrassi great - thanks for the update

@github-actions

This Pull Request is stale because it has been open for 30 days with
no activity. Mark as fresh by updating it, e.g., by adding the comment /remove-lifecycle stale.

@joaopgrassi
Author

/remove-lifecycle stale

I didn't forget about this. We are moving forward with the stabilization of the messaging conventions in OTel, so we're getting closer to digging into this :)

@duglin
Collaborator

duglin commented Sep 25, 2023

@joaopgrassi thanks for the update!
