Using diagnostic-channel for instrumentation #1263
I wasn't around when such early decisions were made, but I was also not aware of diagnostic-channel. One issue I see is that the diagnostic channel doesn't allow for any static analysis to make sure it is correct, whereas writing the instrumentations directly against a TypeScript API does. If we try to instrument modules which already support it then it might be a good place to start, but I don't see a strong reason to rewrite the instrumentations we have.
I would support this effort since this was the original intent of diagnostic-channel. Diagnostic channel shouldn't conflict with any of the existing stuff OpenTelemetry does. Mirroring what Roch mentioned, the main benefit here would be a super thin way for library writers to emit events (which could be used for anything, not just by APMs) instead of having to take a dependency on OpenTelemetry APIs. Currently no libraries emit these events. We do something similar for our existing .NET APM,
Will still need patching for context extraction and propagation. If an event is emitted from a db driver that is all well and good, but what is its parent? Similarly, how would HTTP headers be injected/extracted?
Injecting/extracting headers/metadata is something it does not currently do, but I think a similar pattern could be implemented,
The main advantage is that it's not reasonable to expect libraries to support OpenTelemetry internally, meaning that monkey-patching will always be necessary. The same applies to Node itself, where I don't think having OpenTelemetry built in would ever be considered. However, a simple diagnostic channel could eventually land in both, removing the need for any monkey-patching. This would also let tools outside the realm of APM consume these events, some of which might be completely unrelated to tracing. If the community moves towards events at the core of Node's diagnostics capability, cc @Qard since I know there are discussions around
For this one, I will admit I'm not sure if it should go in
That's not really an argument, since the project's instrumentation could be updated so that it does include static analysis. This is especially true if OpenTelemetry contributors also start contributing to the project.
Very interested in this. If this becomes a language-supported feature, then it makes obvious sense to support it. @rochdev @markwolff how would you suggest we get started with this? You mentioned few/no projects are currently exporting data using diagnostic channels. Does this mean we would still need to use the same plugins we do now, but instead of calling OpenTelemetry APIs in the patched methods, we would emit events? What component would listen for the events?
Yeah, no way Node.js would ever land the full OpenTelemetry in core, but a lower-level data channel is a definite possibility I’m working towards now. The shape of it is likely to evolve somewhat from the original Microsoft design though, so I’m not sure how much sense it makes to use it right now but definitely worth tracking for future consideration.
Node already has the

For reference about this system, I would recommend reading this doc: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/edit#heading=h.jh64i9l3vwa1

EDIT: This is also relevant: nodejs/TSC#853
Wouldn't it be better if the design stays as close as possible? The idea being that if the design is standard, then it could be implemented in every language and become an actual standard of its own on which OpenTelemetry would sit in every language. I don't have enough knowledge of how other languages work to know whether that would be feasible, but given the simplicity of the design I think it should definitely be possible.
The issue is mainly around the design of the publish method. As it is currently, the inputs to the function will always be allocated/constructed regardless of whether there is actually a subscriber. For something intended to be used very often, this could get expensive fast. It needs some changes to allow lazy input construction. Something like:

```js
channel.publish('my-event', () => {
  return { myData: someExpensiveThing() }
})
```

or

```js
if (channel.has('my-event')) {
  channel.publish('my-event', { myData: someExpensiveThing() })
}
```

The current model of sharing all events across a single global namespace is a bit worrying too. I'm considering a channel of channels where each module would get its own named channel. For example:

```js
// Find or create a named channel
const channel = dc.channel('some-async-thing')

module.exports = function someAsyncThing (a, b, c, callback) {
  // Publish an event to the named channel
  channel.publish('start', { args: [ a, b, c ] })
  setImmediate(() => {
    channel.publish('stop')
    callback(a, b, c)
  })
}

// elsewhere...

// Find or create a named channel
const channel = dc.channel('some-async-thing')

// Subscribe to named events of the named channel
channel.subscribe('start', recordStart)
channel.subscribe('stop', recordStop)
```

This would allow each named channel to have its own set of subscribers and publishers, making lookups faster. This way, an APM could attach hundreds of subscribers and only the ones on channels actually in use would matter during runtime lookups for event routing. I don't have a 100% solid idea of exactly what the final API should look like, but that's just some of what I've been thinking about on that.
For reference, @Qard opened a PR to land this into Node: nodejs/node#34895 (also the discussion in the diagnostics group)
Yep. Still very much experimental and will likely continue to evolve before it's ready to land.
@Qard I know you were deeply involved in the design of |
Yep, it has worked very well for Datadog so far. I'm happy to discuss whatever questions you might have at the next OTel JS library sync, or I can write something up explaining why things are the way they are in diagnostics_channel and how it could be a beneficial approach for OTel. In the meantime, if there's anything specific you want to know I'm happy to answer here. 🙂
@Qard Thanks for the quick response and willingness to engage! A brief summary of the design rationale would be great. In the meantime, I took a look at the dd-trace-js code, which helped me grok things a bit; I hadn't realized that DD is already using diagnostics_channel in production. From my read-through, it seems like Datadog is basically:

Is that right? If so, my primary questions are:

Thanks again for your help here and your hard work getting these APIs into Node! :)
The short/medium-term path forward for APMs is to continue with their existing patching but emit diagnostics_channel events instead of calling their APIs directly, and to layer module plugins over that to describe how to map the event data to their API calls. Some things will map fairly directly but some will need a bit of restructuring to match their data model. It's not in scope for diagnostics_channel events to always map to spans directly with no code, but it should take very minimal code. This is part of why TracingChannel has different names for every source, so you can attach different logic to map those events into spans. It does follow a generalized pattern though, to enable as much code reuse as possible for those plugins, and you can probably model some common patterns over similar models; for example, SQL client libraries can probably use the same or very similar plugins.

Eventually what is being captured and sent through diagnostics_channel in patches will be contributed upstream to those libraries, and we're already starting on that now. When those events get merged upstream we can start listening to those channels instead without needing any patches, which should be a huge improvement for performance. :)

The AsyncResource thing Datadog is doing currently is temporary. TracingChannel was designed with a shared context object to eliminate the need for that, and can also work in cooperation with AsyncLocalStorage:

```js
// publishing side
const { readFile } = require('fs')
const { tracingChannel } = require('diagnostics_channel')

// A TracingChannel is a container of channels that follow a consistent naming
// scheme and behavioural model.
//
// This will contain channels for:
// - tracing:something:start
// - tracing:something:end
// - tracing:something:asyncStart
// - tracing:something:asyncEnd
// - tracing:something:error
const channel = tracingChannel('something')

// traceCallback will trigger a start and end around the sync portion of the
// function given as the first argument. The second argument will set the
// position of the callback to wrap. The callback received by the given
// function will then be wrapped with another function which will trigger the
// asyncStart and asyncEnd events around the execution of the callback.
channel.traceCallback(readFile, -1, { file }, undefined, file, () => {})
```

```js
// subscribing side
const { channel } = require('diagnostics_channel')

// The start channel can be used as a scope for AsyncLocalStorage
const start = channel('tracing:something:start')
start.bindStore(spanStore, (context) => {
  const parentSpan = spanStore.getStore()
  const span = new Span({
    // get whatever is needed from the context data
  }, parentSpan)
  context.span = span
  return span
})

// By storing the span on the context object it's possible to manually restore
// context in the event that AsyncLocalStorage fails to propagate, for
// whatever reason. This is generally not recommended and should probably be
// considered a bug if AsyncLocalStorage is not propagating correctly.
const asyncStart = channel('tracing:something:asyncStart')
asyncStart.bindStore(spanStore, (context) => {
  return spanStore.getStore() || context.span
})
```

From the publishing side, the intent is that they will share any data they have in the context object and the consumer can extract what they want from that. For example, an http request will share the request and response objects, and the subscriber can extract things like method and path from that data.

As for the question about existing channels, TracingChannel can be used to map over existing channels that don't necessarily follow the naming convention, but if they don't match the behavioural convention it may not make sense. In the case of http we'll probably wrap a TracingChannel around the request/response lifetime, but as a separate set of events from what exists currently.

For a more complete example of the tracing side:

```js
const { channel } = require('diagnostics_channel')

const start = channel('tracing:something:start')
const end = channel('tracing:something:end')
const error = channel('tracing:something:error')
const asyncStart = channel('tracing:something:asyncStart')
const asyncEnd = channel('tracing:something:asyncEnd')

start.bindStore(spanStore, (context) => {
  const parentSpan = spanStore.getStore()
  const span = new Span({
    // get whatever data is needed from the context data
  }, parentSpan)
  context.span = span
  return span
})

// Error events can be attributed to the span
error.subscribe((context) => {
  const currentSpan = spanStore.getStore() || context.span
  if (!currentSpan) return
  currentSpan.setError(context.error)
})

// For a sync source, you could trigger a span end when the end event happens
end.subscribe((context) => {
  const currentSpan = spanStore.getStore()
  if (!currentSpan) return
  currentSpan.end({
    // get whatever additional data is needed from the context data.
    // With `traceSync(...)`, the context object will also contain
    // `result` and `error` where relevant.
  })
})

// For an async source, either asyncStart or asyncEnd could be used to mark
// the span end; which one you should use depends on whether you consider
// callback execution time to be part of the span. In either case the
// `result` or `error` properties will be set.
asyncStart.subscribe((context) => {
  const currentSpan = spanStore.getStore()
  if (!currentSpan) return
  currentSpan.end({
    // get whatever additional data is needed from the context data.
    // With `traceCallback(...)` and `tracePromise(...)`, the context object
    // will also contain `result` and `error` where relevant.
  })
})
```
Thanks @Qard! Those examples are super helpful. I was mostly asking because, for my own modules, I'd like to start emitting diagnostics channel events (rather than, e.g., directly depending on the OTel API). Then, if I use those modules at work, I would need to hook those events into OTel, so I wanted a fuller picture of how that would work as well. I think your response answers all my questions!

I ended up doing a deep dive into these APIs over the weekend (including watching your talk at Node Congress 2023), and I'm very excited to see all this work coming to fruition! While learning the APIs, I did have two small pieces of feedback. I realize that's a bit off topic for this issue, but I don't feel strongly enough about these ideas to open a new issue for them, so I just wanted to mention them to you briefly here. Feel free to ignore them/take them with a giant grain of salt — they may be totally off the mark (I'm still missing much context behind the design here). But they were:

But those are just nits. Again, thanks for all your work here!
Yeah, agreed that runStores is not the clearest API; I wasn't able to come up with any better names for it though. As for why it does both things: the original intent was that any event could trigger the bindStore wrapping, which originally used enterWith on the store, but that was too unreliable, so I changed it to use run on the store, which required making an actual function scope to wrap. I could have split the concepts at that point, but I felt it was probably best not to burden module authors with deciding data-relevance differences between the event need and the scoping need, and to instead just pass along all the data they have on hand and let the consumer decide what to do with it. This meant runStores and publish would basically always occur in the same place with the same arguments. We also didn't want to put users in a position where that ordering could change, so we designed it such that the scoping runs first: every event in the span should see the store value set with bindStore as the current store state in the subscribers for any of the channels. (Assuming context doesn't break for async events.)

For error/result, attaching it directly to the context object seemed the most appropriate. I don't foresee a need for additional properties in the future, as the level of abstraction significantly limits what actual context it could even have generically, but if changes need to happen in the future I think I'd go for something similar to executionAsyncResource(), where an external API is called to pull data out of a managed internal state object that is only in scope during the events, or something like that. In any case, we'll see if it ever comes to that. 🤷🏻♂️
I'm not sure. Maybe just
Yeah, I agree with putting the key directly on the context object; I was just proposing possibly putting it at
PRs are welcome if you have ideas for improvements. Best to make any changes now while it's still marked experimental. I would probably go for something like
I agree that now'd be the time to make changes, but I really don't feel like I have the bandwidth for a long discussion, unfortunately, which I suspect a PR would prompt. So I more just wanted to plant these ideas in your head, in case you think (with much more context than me) that they're worth pursuing. But now I'll stop hijacking this thread 😁
I like those methods for ease of use, but I don't think they'd really make the implementation changeable down the line if the Symbols are observable (i.e., added as keys on the object); to make it changeable, you'd probably have to put the error/result in a WeakMap that those helpers would access, which could work too.
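A sketch of the WeakMap variant being proposed here (all names hypothetical; this is not the actual diagnostics_channel implementation):

```js
// Internal state: outcomes keyed by context object, invisible to anyone
// who only inspects the context itself
const outcomes = new WeakMap()

// Would be called internally by the tracing machinery when an outcome arrives
function setResult (context, result) {
  outcomes.set(context, { result })
}

// Public accessor, similar in spirit to executionAsyncResource(): pull data
// out of managed internal state rather than reading a key off the object
function getResult (context) {
  const entry = outcomes.get(context)
  return entry && entry.result
}

const context = { op: 'query' }
setResult(context, 42)
console.log(getResult(context)) // 42
console.log(Object.keys(context)) // [ 'op' ]  (no observable result key)
```

Because the WeakMap is keyed by the context object itself, entries are garbage-collected with the context, and the storage scheme could change later without breaking consumers of the accessor.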
Not sure if this has been discussed before, so please let me know if that's the case, but I was wondering why the actual instrumentation from plugins and context propagation was not done using diagnostic-channel. I think it would make a lot of sense to report events from the instrumentation that could then be converted to spans within OpenTelemetry. That model would make it possible to use these events for other things outside the realm of OpenTelemetry or even telemetry in general, and since the surface area of sending events to a centralized channel is a lot smaller, this would be a lot more likely to be exposed from within the libraries directly, and even eventually by Node. This would eventually remove the need to have the instrumentation at the OpenTelemetry level, and plugins would only build spans based on events.
What do you think?
cc @markwolff since you are contributing to both projects.
(P.S. I checked both checkboxes because I think eventually such a diagnostic channel could be standardized across languages)