OpenTelemetry Tracing API vs Tokio-Tracing API for Distributed Tracing #1571

Open
cijothomas opened this issue Feb 23, 2024 · 19 comments
Labels
A-log Area: Issues related to logs A-trace Area: issues related to tracing release:required-for-stable Must be resolved before GA release, or nice to have before GA.

Comments

@cijothomas
Member

cijothomas commented Feb 23, 2024

Background

The Rust ecosystem has two prominent tracing APIs: the OpenTelemetry Tracing API (OTel for short), delivered through the opentelemetry crate, and the Tokio tracing API, provided by the tracing crate. The OTel Tracing API adheres to the OpenTelemetry specification, ensuring alignment with OpenTelemetry Tracing implementations in other languages such as C++ and Java. Conversely, the Tokio tracing ecosystem, which predates OpenTelemetry, boasts widespread adoption, with many popular libraries already instrumented. The tracing-opentelemetry crate, maintained outside of the OpenTelemetry repositories, acts as a "bridge", enabling applications instrumented with tracing to work with OpenTelemetry.

The issue

The coexistence of the OTel Tracing API and Tokio-Tracing poses a dilemma, forcing end users to choose between two competing APIs. The choice is made harder by the absence of comprehensive documentation comparing the two options. A significant concern is the lack of tested interoperability between the APIs, which can cause problems, especially in applications where different layers use different tracing APIs, potentially leading to incomplete traces. This also affects log correlation scenarios.

A Comparison with OTel .NET

The OpenTelemetry .NET community encountered a similar challenge when the OTel Tracing API was introduced, as the .NET runtime library (shipped as the
DiagnosticSource package) already had a similar API in place. This issue was resolved through collaboration between OTel .NET maintainers and the .NET
runtime team, leading to the alignment of the .NET runtime's tracing API with the OTel specifications. This approach was later applied to the Metrics API as well. While the decision by OTel .NET to prioritize the .NET Runtime library's
API over its own for tracing/metrics has generally been successful, it has not been without its challenges. Despite declaring stability years ago, OTel .NET has yet to implement certain aspects of the OTel specification fully.

Although the outcomes in the .NET ecosystem might not directly forecast the success of similar efforts in Rust, they provide a valuable reference point.

Options for Consideration

  1. Deprecate Tokio-Tracing: This approach would align Rust with the OpenTelemetry strategies adopted by other languages. However, considering the popularity and active maintenance of the tracing crate in the Rust ecosystem, this path has the highest friction and is highly improbable.

  2. Deprecate OTel Tracing: Promoting Tokio-Tracing as the standard could be a feasible option, albeit one requiring comprehensive evaluation. This strategy would cause OTel Rust to deviate from its counterparts in other languages. Aligning Tokio-Tracing with the OTel Tracing specification could mitigate this concern, but it necessitates groundwork to identify gaps and propose solutions. Tokio-Tracing maintainers have shown willingness to accommodate reasonable changes, pending a clear set of requirements. This option does not eliminate the OTel Tracing API completely: it would remain to cover what is missing from Tokio-Tracing, and only those APIs that overlap or compete with Tokio-Tracing would need to be deprecated or removed.

  3. Maintain Both APIs: This alternative emphasizes the importance of ensuring seamless interoperability between the two APIs, allowing users to choose based on preference or specific needs without compromising trace completeness. Achieving this goal requires significant effort to identify and bridge any existing gaps in the interoperability story. Users should be able to choose freely between the two without worrying about broken traces.

  4. Do nothing: OTel Rust has made some special accommodations to help the tracing crate (and vice versa). We could just remove them and let each crate follow its own destiny. (Highly undesirable state, listed only for completeness.)

Are there more options? Please let us know in the comments!

Current State

The Rust tracing ecosystem is at a critical juncture. Active discussions between the OTel Rust team and the Tracing Rust team are taking place, with updates and deliberations shared on Cloud Native Slack. Interested individuals are encouraged to join the discussion on Slack (or right in this GitHub issue). All decisions and considerations will be posted on GitHub as well for wider visibility and to gather feedback.

Timeline

Resolving this issue is a prerequisite (though not the only one) for declaring the Tracing signal as GA (General Availability) for OTel Rust. Given the goal to achieve Tracing GA (alongside other milestones) soon, it's crucial that this issue is resolved promptly. A tentative deadline to reach a decision on the chosen path forward is set for April 30th, 2024, approximately 2 months from today.

Related issues

#1378 Tracing Propagation.
#1394 (comment)
Broken Trace example: #1690

@cijothomas cijothomas added A-trace Area: issues related to tracing release:required-for-stable Must be resolved before GA release, or nice to have before GA. A-log Area: Issues related to logs labels Feb 23, 2024
@cijothomas cijothomas pinned this issue Feb 23, 2024
@cijothomas
Member Author

Tagging @open-telemetry/rust-approvers
@jtescher as tracing-opentelemetry maintainer
@davidbarsky as tracing maintainer

@TommyCpp
Contributor

If we were to use tracing as the API, these are the deviations between the existing tracing API and the OTel tracing API:

  • - means no support today
  • ? means no support, but there is a potential way to implement it
  • + means supported
| Feature | Tokio Tracing |
|---|---|
| **TracerProvider** | |
| Create TracerProvider | - |
| Get a Tracer | - |
| Get a Tracer with schema_url | - |
| Get a Tracer with scope attributes | - |
| Associate Tracer with InstrumentationScope | - |
| Safe for concurrent calls | + |
| Shutdown (SDK only required) | N/A |
| ForceFlush (SDK only required) | N/A |
| **Trace / Context interaction** | |
| Get active Span | + |
| Set active Span | + |
| **Tracer** | Optional |
| Create a new Span | + |
| Documentation defines adding attributes at span creation as preferred | + |
| Get active Span | + |
| Mark Span active | +, same as tracing enter |
| **SpanContext** | |
| IsValid | ?, can add as event or metadata |
| IsRemote | ?, can add as event or metadata |
| Conforms to the W3C TraceContext spec | - |
| **Span** | Optional |
| Create root span | + |
| Create with default parent (active span) | + |
| Create with parent from Context | + |
| No explicit parent Span/SpanContext allowed | + |
| SpanProcessor.OnStart receives parent Context | - |
| UpdateName | - |
| User-defined start timestamp | - |
| End | +, when span closes (not exit) |
| End with timestamp | - |
| IsRecording | ?, can add attributes or metadata |
| IsRecording becomes false after End | ? |
| Set status with StatusCode (Unset, Ok, Error) | ?, can add as event or metadata |
| Safe for concurrent calls | + |
| events collection size limit | - |
| attribute collection size limit | - |
| links collection size limit | - |
| **Span attributes** | Optional |
| SetAttribute | + |
| Set order preserved | - |
| String type | + |
| Boolean type | + |
| Double floating-point type | + |
| Signed int64 type | + |
| Array of primitives (homogeneous) | + |
| null values documented as invalid/undefined | N/A |
| Unicode support for keys and string values | + |
| **Span linking** | Optional |
| Links can be recorded on span creation | ?, can add as metadata |
| Links can be recorded after span creation | ?, can add as metadata |
| Links order is preserved | ?, depends on subscriber |
| **Span events** | |
| AddEvent | + |
| Add order preserved | ?, depends on subscriber |
| Safe for concurrent calls | + |
| **Span exceptions** | |
| RecordException | ?, can add as events w/ special flags |
| RecordException with extra parameters | ?, can add as events w/ special flags, see tracing-error |
| **Sampling** | Optional |
| Allow samplers to modify tracestate | N/A |
| ShouldSample gets full parent Context | N/A |
| Sampler: JaegerRemoteSampler | N/A |
| New Span ID created also for non-recording Spans | |
| IdGenerators | ?, not supported directly but can add via subscribers |
| SpanLimits | - |
| Built-in SpanProcessors implement ForceFlush spec | N/A |
| Attribute Limits | - |
| Fetch InstrumentationScope from ReadableSpan | - |
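
To make the SpanContext rows (IsValid, IsRemote, W3C TraceContext conformance) more concrete, here is a minimal std-only sketch of what those requirements amount to. The `SpanContext` struct and its methods are hypothetical names for illustration; they are not the types of the opentelemetry or tracing crates. The sketch assumes only version `00` of the `traceparent` header and ignores `tracestate`.

```rust
/// Hypothetical, std-only model of a span context (illustration only,
/// not the type exposed by the `opentelemetry` crate).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpanContext {
    pub trace_id: u128,
    pub span_id: u64,
    pub flags: u8,
    /// True when the context was extracted from an incoming request.
    pub is_remote: bool,
}

/// Returns true if `s` contains only lowercase hex digits.
fn is_lower_hex(s: &str) -> bool {
    s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

impl SpanContext {
    /// IsValid: both the trace ID and the span ID must be non-zero.
    pub fn is_valid(&self) -> bool {
        self.trace_id != 0 && self.span_id != 0
    }

    /// The lowest bit of the flags byte is the "sampled" flag.
    pub fn is_sampled(&self) -> bool {
        self.flags & 0x01 == 0x01
    }

    /// Serialize as a version-00 W3C `traceparent` header value.
    pub fn to_traceparent(&self) -> String {
        format!("00-{:032x}-{:016x}-{:02x}", self.trace_id, self.span_id, self.flags)
    }

    /// Parse a version-00 `traceparent` value. The result is marked remote
    /// because it came from outside the current process.
    pub fn from_traceparent(header: &str) -> Option<SpanContext> {
        let parts: Vec<&str> = header.trim().split('-').collect();
        if parts.len() != 4 || parts[0] != "00" {
            return None;
        }
        if parts[1].len() != 32 || parts[2].len() != 16 || parts[3].len() != 2 {
            return None;
        }
        if !parts[1..].iter().all(|p| is_lower_hex(p)) {
            return None;
        }
        let ctx = SpanContext {
            trace_id: u128::from_str_radix(parts[1], 16).ok()?,
            span_id: u64::from_str_radix(parts[2], 16).ok()?,
            flags: u8::from_str_radix(parts[3], 16).ok()?,
            is_remote: true,
        };
        // All-zero IDs are invalid and are rejected.
        if ctx.is_valid() { Some(ctx) } else { None }
    }
}
```

A header such as `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` parses into a valid, remote, sampled context and serializes back to the same string. Rows marked "?" in the table would need this kind of information carried through span fields or subscriber-side state.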

@hdost
Contributor

hdost commented Feb 24, 2024

From the metrics perspective exemplars are also something to take into account.

@hdost
Contributor

hdost commented Mar 26, 2024

As requested in the community meeting:
I am partial towards Option 2.
Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

I would also say that we should see whether we can improve the inter-compatibility, as people will still try to use it directly.

Questions that are open from my perspective:

  • What is the minimal set of features we'd want to maintain?
  • Is the Tokio team amenable to supporting such features?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Update: I think really I'd be more 3 than 2. If we can promote inter-compatibility between the two then I think that's a greater win for the community at large. Because as I mentioned during the meeting we will still need to have "some" API anyway.

@ramosbugs
Contributor

I am partial towards Option 2.
Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

As a heavy user of direct OpenTelemetry instrumentation (e.g., using SpanBuilder a lot, along with span events and span links) in the backend of my AWS Lambda-based web app, I'm nervous reading this. There are almost 800 calls to set_attribute alone in my codebase, and moving off of a deprecated API used this heavily would be a major undertaking.

Which interfaces specifically would be deprecated?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

I suspect a lot of other OpenTelemetry users are also doing so in private repositories, so I agree that it's hard to measure. I would caution against inferring much from these public GitHub usage stats.

@TommyCpp
Contributor

Which interfaces specifically would be deprecated?

We don't know exactly yet. The idea is to bridge the gap between tracing and the OTel spec using a custom API and, where tracing already supports something, to deprecate the OTel counterpart in favor of it. We are still debating ideas and nothing has been decided yet, so any feedback is greatly appreciated!

@lalitb
Member

lalitb commented Mar 27, 2024

I vote for option 2, as there are challenges with other options:

  • Option 1: Tokio-Tracing is deeply ingrained in the Rust ecosystem, with wide adoption across many libraries and applications. It is not practically feasible to ask the community to migrate from tokio-tracing to the OTel tracing API.
  • Option 3: Keeping the OpenTelemetry API while also supporting tracing-opentelemetry would increase maintenance effort, confuse developers about which API to choose, and make interop across both a challenge (as seen here).
  • Option 4: This approach, while minimizing immediate effort, overlooks the broader implications for the Rust ecosystem's health and future growth.

Going with Option 2, we also need to evaluate introducing an extension API within OpenTelemetry to bridge the existing gaps between the OTel specification and Tokio-Tracing's functionality (e.g., Baggage support, Propagators).
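
As a rough illustration of what such an extension API might cover, the sketch below models baggage propagation over a text-map carrier using only the standard library. Everything here is an assumption for illustration: `Baggage`, `HeaderPropagator`, and `BaggagePropagator` are hypothetical names, not the opentelemetry crate's API, and the encoding is a simplified form of the W3C `baggage` header (no percent-encoding, and `;property` metadata is dropped).

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical baggage type: key/value pairs that travel with a request.
/// A BTreeMap keeps the serialized order deterministic.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Baggage(pub BTreeMap<String, String>);

/// Hypothetical extension-API trait for moving context in and out of
/// text headers. Not the trait defined by the `opentelemetry` crate.
pub trait HeaderPropagator {
    fn inject(&self, baggage: &Baggage, carrier: &mut HashMap<String, String>);
    fn extract(&self, carrier: &HashMap<String, String>) -> Baggage;
}

/// Simplified baggage propagator (assumption: values need no escaping).
pub struct BaggagePropagator;

impl HeaderPropagator for BaggagePropagator {
    fn inject(&self, baggage: &Baggage, carrier: &mut HashMap<String, String>) {
        if baggage.0.is_empty() {
            return;
        }
        // Encode as comma-separated `key=value` members.
        let value = baggage
            .0
            .iter()
            .map(|(k, v)| format!("{}={}", k, v))
            .collect::<Vec<_>>()
            .join(",");
        carrier.insert("baggage".to_string(), value);
    }

    fn extract(&self, carrier: &HashMap<String, String>) -> Baggage {
        let mut out = Baggage::default();
        if let Some(header) = carrier.get("baggage") {
            for member in header.split(',') {
                // Ignore optional `;property` metadata after the value.
                let member = member.split(';').next().unwrap_or("");
                if let Some((k, v)) = member.split_once('=') {
                    let (k, v) = (k.trim(), v.trim());
                    if !k.is_empty() {
                        out.0.insert(k.to_string(), v.to_string());
                    }
                }
            }
        }
        out
    }
}
```

Injecting a baggage containing `tenant=acme` and `region=eu` writes a single `baggage` entry into the carrier, and extracting from that carrier yields the same baggage back. The point of the sketch is that this surface has no counterpart in the tracing crate, so it would have to live on the OTel side under option 2.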

@lalitb
Member

lalitb commented Mar 27, 2024

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Direct consumption of opentelemetry-api could be for traces, metrics and logs, and I agree it is really hard to get the actual statistics for "traces" only :)

@julianocosta89
Member

OpenTelemetry comes from OpenCensus and OpenTracing merge.
The deprecation of those 2 parent projects took some time, but it happened.

IDK if I have a say, because I don't maintain OTel Rust, but I'd vote for Option 1 and invite the maintainers of tokio-tracing to join the OTel project as maintainers/approvers.
Basically continue what they are doing, but under CNCF umbrella and OpenTelemetry as main project.

IDK how much tokio-tracing follows the OTel specification and semantic convention, but another thing to highlight is that whenever we have the 3 signals stable in OTel Rust, we would have 2 different approaches for telemetry in Rust.

  • tokio-tracing for traces
  • OTel for metrics, logs and profiling ...

I'm biased but I see OTel as the future for Observability signals.

@lalitb
Member

lalitb commented Mar 27, 2024

Tagging for more inputs.
@hawkw, @davidbarsky, as tokio-tracing maintainers
@jtescher as tracing-opentelemetry, and opentelemetry-rust maintainer

@TommyCpp
Contributor

TommyCpp commented Mar 27, 2024

but another thing to highlight is that whenever we have the 3 signals stable in OTel Rust, we would have 2 different approaches for telemetry in Rust

Just to provide context here: I think if we move to tracing as the API for OTel Rust, we could unify all 3 signals under tracing.

  • Logs: per spec, we didn't define a new API in OTel Rust. We have bridges for the log facade and the tracing facade.
  • Metrics: I believe tracing has some WIP to support metrics collection.

@jtescher
Member

I haven't had much time recently to work on open source, but my perspective is that option 3 is likely optimal in the near term. I suspect that expressing the full otel API via tracing would be difficult and likely require some changes to the underlying library which would be orthogonal to their current designs and goals (trace ids on span creation, metrics in general, etc). It may be possible to express them via a large range of special fields but it seems likely that it would be worse than the current two API confusion. Someone could do a proof of concept trying to unify them though to be sure.

Option 3 could be done via clearer purposes for each API (e.g. low level "full" api via otel, or high level "limited but ergonomic and user-friendly" api via tracing macros) and examples of suggested architectural patterns (e.g. otel API "between" application boundaries, tracing within applications and crates, or similar sets of suggestions). But as already mentioned here it is somewhat more cumbersome and confusing than a consistent single API. Being not fully in control of the otel spec or the log/tracing ecosystems means the rust otel stack finds itself somewhat stuck in the middle.

@davidbarsky

I'm on vacation, so I'll be brief and try to expand next week/summarize my thoughts from Slack: pulling a .NET (paying attention to the intent, not the letter of the spec) is very much possible, down to the fact that propagators remained in a dedicated OTEL library for 2.5 years. I think tracing could have a native notion of propagators, but I don't think we have the bandwidth to figure that out/I'd rather wait for Tower to reach 1.0 before making decisions on that front.

@cijothomas
Member Author

down to the fact that propagators remained in a dedicated OTEL library for 2.5 years

In practice, this is still the case! So is Baggage. OTel .NET still maintains the API for these things, that are not covered by the .NET Runtime API. If we go with option2 here, I'd expect that we'll only eliminate those APIs for which there is a clear equivalent in tracing.

I'll also be on vacation for ~1 week. Once back, I'll write down more details on what option 2 could look like. I didn't want to spend too much time exploring any of the options without first seeing which one the community as a whole leans toward. There is no clear winner so far, but part of the reason could be the lack of specifics on what each option would really entail.

I'm not yet in a position to strongly support any option so far, however, I'll take a stab at exploring option 2 further.

@hdost
Contributor

hdost commented Mar 30, 2024

I guess I'll try to take a look at how we could go for option 3.

Off the top of my head, use cases to look at:

  • Apps + Libs Instrumented with tracing + tracing-opentelemetry + an exporter
    • this is the basic tracing to OTel case.
  • Apps + Libs Instrumented with opentelemetry + opentelemetry-appender-tracing
    • this is the basic OTel to tracing case.

Then some variant of the two where both tracing and Otel are used for instrumentation.

Those will be "advanced cases", but honestly it might be more common than one might think.

@cijothomas
Member Author

cijothomas commented Apr 9, 2024

Comment/Discussion from Community Meeting for option3:

Test to validate option 3:
A -> B -> C

A - uses tracing for producing span
B - uses otel tracing api for producing span
C - uses tracing for producing span

3 spans
SpanA
SpanB (parent=SpanA)
SpanC (parent=SpanB)

It may not be feasible to ask users to use the same API for all three, as they may not own or control some of them; e.g., B could be the reqwest crate.
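
The A -> B -> C test can be modelled with a small std-only sketch. This is a toy model under stated assumptions, not code from either crate: `Runtime`, `Api`, `SpanRecord`, `run_a_b_c`, and `parent_of` are hypothetical names, and the assumption is that each API tracks its own "active span" unless a bridge makes them share one.

```rust
/// A finished span in this toy model: its name and its parent's name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpanRecord {
    pub name: &'static str,
    pub parent: Option<&'static str>,
}

/// Which instrumentation API a layer uses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Api {
    Tracing,
    Otel,
}

/// Toy runtime: each API keeps its own stack of active spans.
/// With `bridged == true` both APIs share one stack (the interop goal).
pub struct Runtime {
    bridged: bool,
    tracing_stack: Vec<&'static str>,
    otel_stack: Vec<&'static str>,
    pub finished: Vec<SpanRecord>,
}

impl Runtime {
    pub fn new(bridged: bool) -> Self {
        Runtime { bridged, tracing_stack: Vec::new(), otel_stack: Vec::new(), finished: Vec::new() }
    }

    /// Pick the active-span stack that `api` can see.
    fn stack(&mut self, api: Api) -> &mut Vec<&'static str> {
        if self.bridged || api == Api::Tracing {
            &mut self.tracing_stack
        } else {
            &mut self.otel_stack
        }
    }

    /// Run `body` inside a span created through `api`.
    pub fn in_span(&mut self, api: Api, name: &'static str, body: impl FnOnce(&mut Runtime)) {
        // The parent is whatever span this API currently considers active.
        let parent = self.stack(api).last().copied();
        self.stack(api).push(name);
        body(self);
        self.stack(api).pop();
        self.finished.push(SpanRecord { name, parent });
    }
}

/// A uses tracing, B uses the OTel API, C uses tracing again.
pub fn run_a_b_c(bridged: bool) -> Vec<SpanRecord> {
    let mut rt = Runtime::new(bridged);
    rt.in_span(Api::Tracing, "SpanA", |rt| {
        rt.in_span(Api::Otel, "SpanB", |rt| {
            rt.in_span(Api::Tracing, "SpanC", |_| {});
        });
    });
    rt.finished
}

/// Look up the recorded parent of a span by name.
pub fn parent_of(spans: &[SpanRecord], name: &str) -> Option<&'static str> {
    spans.iter().find(|s| s.name == name).and_then(|s| s.parent)
}
```

In this model, `run_a_b_c(false)` leaves SpanB without a parent and attaches SpanC to SpanA, which is the broken shape; `run_a_b_c(true)` produces the expected SpanA, SpanB (parent=SpanA), SpanC (parent=SpanB) chain. A real validation test would assert the same parent relationships on exported spans.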

#1378 (comment) shows an example where logging and tracing (distributed tracing, aka spans) are used together, and correlation is broken when the tracing crate is used to produce the span instead of the OTel tracing API.

@cijothomas
Member Author

down to the fact that propagators remained in a dedicated OTEL library for 2.5 years

In practice, this is still the case! So is Baggage. OTel .NET still maintains the API for these things, that are not covered by the .NET Runtime API. If we go with option2 here, I'd expect that we'll only eliminate those APIs for which there is a clear equivalent in tracing.

I'll also be on vacation for ~1 week. Once back, I'll write down more details on what option 2 could look like. I didn't want to spend too much time exploring any of the options without first seeing which one the community as a whole leans toward. There is no clear winner so far, but part of the reason could be the lack of specifics on what each option would really entail.

I'm not yet in a position to strongly support any option so far, however, I'll take a stab at exploring option 2 further.

Took some time to get to this due to other priorities, but here are more details on one possible way to go with option2, including a prototype:
#1689

@diurnalist

diurnalist commented Apr 30, 2024

👋🏻 I am not a Rust developer so am coming from a very different perspective. My take is that, to my knowledge, every other ecosystem has opted for Option 1 long-term, Option 3 near-term. Specifying the API in OTel was (I assume) a large effort and we have seen the API evolve as developers have battle-tested it and provided feedback (e.g., lack of a synchronous gauge instrument, which is now in the spec.) My impression is that spec evolution is a pretty collaborative process, which is nice to observe.

In my view it would be a mistake to align on pre-existing instrumentation conventions, as OTel's mission has been to provide a standard API that instrumentations across languages/systems can adhere to. This is particularly important as it provides a path for libraries to provide instrumentation hooks to, e.g., automatically generate traces and metrics as part of their own business logic, kind of like BPF kernel tracepoints or USDT. And those hooks are written according to a wider specification and hence less vulnerable to governance issues that tend to come up in external libraries from time to time.

In the Go OTel SDK there are several "bridge" interfaces that help to close the gap b/w the OTel API and existing instrumentation libraries, e.g., the opencensus bridge. Perhaps this would be a way to pave the path towards wider OTel API adoption.

/$0.02 🙇🏻

@cijothomas
Member Author

As requested in the community meeting: I am partial towards Option 2. Specifically I don't think we'd eliminate the API surface as we're currently supporting basically all the needed features in the spec.

I would also say that we should see whether we can improve the inter-compatibility, as people will still try to use it directly.

Questions that are open from my perspective:

  • What is the minimal set of features we'd want to maintain?
  • Is the Tokio team amenable to supporting such features?

We know that the tracing crate is widely used in the community, but so is the opentelemetry SDK, it's hard to necessarily see how many directly instrument with OpenTelemetry.

Update: I think really I'd be more 3 than 2. If we can promote inter-compatibility between the two then I think that's a greater win for the community at large. Because as I mentioned during the meeting we will still need to have "some" API anyway.

@hdost After re-reading this, I am not entirely sure I understand the part where you said "I mentioned during the meeting we will still need to have "some" API anyway". I think @TommyCpp also mentioned this (in the metrics context, though).

If you look at the prototype, it has a tracing SDK only, with no tracing API; i.e., there is no span / span.start() / end() etc. We'll still need the opentelemetry crate itself, where we need to expose APIs for things not covered by tokio-tracing, e.g., Baggage, Propagators, Metrics, LogBridge etc.
https://github.com/cijothomas/opentelemetry-tracing/blob/main/src/opentelemetry_sdk.rs
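
To illustrate the "SDK only, no API" split in general terms, here is a hedged std-only sketch. It is not taken from the linked prototype; `SpanHooks`, `SdkOnly`, and `ExportedSpan` are hypothetical names. The assumption is that a tracing-style front end owns span creation and drives lifecycle callbacks, while the OTel side only receives them and exports finished spans, so it exposes no start/end calls of its own.

```rust
use std::collections::HashMap;

/// What the SDK side would hand to an exporter (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExportedSpan {
    pub name: &'static str,
    pub parent: Option<u64>,
    pub attributes: Vec<(&'static str, String)>,
}

/// Lifecycle callbacks that a tracing-style front end would invoke.
/// Hypothetical trait, loosely modelled on subscriber-style hooks.
pub trait SpanHooks {
    fn on_new_span(&mut self, id: u64, name: &'static str, parent: Option<u64>);
    fn on_record(&mut self, id: u64, key: &'static str, value: &str);
    fn on_close(&mut self, id: u64);
}

/// SDK-only component: it never creates spans itself, it only reacts to
/// hooks and collects finished spans for export.
#[derive(Default)]
pub struct SdkOnly {
    open: HashMap<u64, ExportedSpan>,
    pub exported: Vec<ExportedSpan>,
}

impl SpanHooks for SdkOnly {
    fn on_new_span(&mut self, id: u64, name: &'static str, parent: Option<u64>) {
        self.open.insert(id, ExportedSpan { name, parent, attributes: Vec::new() });
    }

    fn on_record(&mut self, id: u64, key: &'static str, value: &str) {
        // Attributes recorded for unknown or already closed spans are dropped.
        if let Some(span) = self.open.get_mut(&id) {
            span.attributes.push((key, value.to_string()));
        }
    }

    fn on_close(&mut self, id: u64) {
        // A span is exported once, when the front end reports it closed.
        if let Some(span) = self.open.remove(&id) {
            self.exported.push(span);
        }
    }
}
```

In this shape, user code would never call into `SdkOnly` directly. Features with no front-end equivalent (Baggage, Propagators, Metrics, LogBridge) would still need their own API surface in the opentelemetry crate.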

Could you check this? We can discuss it in the next SIG call and figure out where the gaps in our understanding are.

9 participants