Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using diagnostic-channel for instrumentation #1263

Open
2 tasks done
rochdev opened this issue Jun 30, 2020 · 21 comments
Open
2 tasks done

Using diagnostic-channel for instrumentation #1263

rochdev opened this issue Jun 30, 2020 · 21 comments

Comments

@rochdev
Copy link
Member

rochdev commented Jun 30, 2020

  • This only affects the JavaScript OpenTelemetry library
  • This may affect other libraries, but I would like to get opinions here first

Not sure if this has been discussed before, so please let me know if that's the case, but I was wondering why the actual instrumentation from plugins and context propagation was not done using diagnostic-channel. I think it would make a lot of sense to report events from the instrumentation that could then be converted to spans within OpenTelemetry. That model would make it possible to use these events for other things outside the realm of OpenTelemetry or even telemetry in general, and since the surface area of sending events to a centralized channel is a lot smaller, this would be a lot more likely to be exposed from within the libraries directly, and even eventually by Node. This would eventually remove the need to have the instrumentation at the OpenTelemetry level, and plugins would only build spans based on events.

What do you think?

cc @markwolff since you are contributing to both projects.

(P.S. I checked both checkboxes because I think eventually such a diagnostic channel could be standardized across languages)

@dyladan
Copy link
Member

dyladan commented Jul 1, 2020

I wasn't around when such early decisions were made, but I was also not aware of diagnostic-channel. One issue I see is that the diagnostic channel doesn't allow for any static analysis to make sure it is correct, where writing the instrumentations directly using a typescript API does. If we try to instrument modules which already support it then it might be a good place to start, but I don't see a strong reason to rewrite the instrumentations we have.

@markwolff
Copy link
Member

I would support this effort since this was the original intent of diagnostic-channel. Diagnostic channel shouldn't conflict with any of the existing stuff OpenTelemetry does. Mirroring what Roch mentioned, the main benefit here would be a super thin way for library writers to emit events (which could be used for anything, not just by APMs) instead of having to take a dependency on OpenTelemetry APIs. Currently no libraries emit these events. We do something similar for our existing .NET APM, DiagnosticSource, but I'm not sure if its used for otel at all.

@dyladan
Copy link
Member

dyladan commented Jul 1, 2020

Will still need patching for context extraction and propagation. If an event is emitted from a db driver that is all well and good, but what is its parent? Similarly, how would http headers be injected/extracted.

@markwolff
Copy link
Member

We expose channel.bindToContext(fn) to enable subscribers to add their own propagation callback that publishers would use. The channel doesn't do any context propagation itself, it just uses whatever was passed to it.

Injecting/extracting headers/metadata is something it does not currently do, but I think a similar pattern could be implemented, channel.addMetadataPropagation(...). This adds a bit of complexity that blurs the lines between the API & the channel though. Do you know how common of a scenario this would be beyond http/grpc?

@rochdev
Copy link
Member Author

rochdev commented Jul 2, 2020

If we try to instrument modules which already support it then it might be a good place to start, but I don't see a strong reason to rewrite the instrumentations we have.

The main advantage is that it's not reasonable to expect libraries to support OpenTelemetry internally, meaning that monkey-patching will always be necessary. Similar applies to Node where I don't think it would ever be considered to have OpenTelemetry built-in. However, a simple diagnostic channel could eventually land in both, removing the need for any monkey-patching. This would also benefit tools outside the realm of APM to consume these events, some of which might be completely unrelated to tracing.

If the community moves towards events at the core of Node diagnostics capability, opentelemetry-js will also end up having to be rewritten anyway. It's also worth noting that "rewrite the instrumentation we have" is not 100% correct since they could be reused and moved to diagnostic-channel for the most part.

cc @Qard since I know there are discussions around diagnostic-channel going on in Node Diagnostics.

Similarly, how would http headers be injected/extracted.

For this one, I will admit I'm not sure if it should go in diagnostic-channel or not. As mentioned by @markwolff it may be a concern outside the realm of the project, but at the same time, if the idea is to propagate diagnostics-related metadata it might make sense. Otherwise it would fall on the libraries to expose these hooks, which is not ideal but that could still be done through monkey-patching, but the extent of the patching would be significantly reduced.

One issue I see is that the diagnostic channel doesn't allow for any static analysis to make sure it is correct, where writing the instrumentations directly using a typescript API does.

That's not really an argument since the project's instrumentation could be updated so that it does include static analysis. This is especially true if OpenTelemetry contributors also start contributing to the project.

@dyladan
Copy link
Member

dyladan commented Jul 2, 2020

cc @Qard since I know there are discussions around diagnostic-channel going on in Node Diagnostics.

Very interested in this. If this becomes a language-supported feature, then it makes obvious sense to support.

@rochdev @markwolff how would you suggest we get started with this? You mentioned few/no projects are currently exporting data using diagnostic channels. Does this mean we would still need to use the same plugins we do now, but instead of calling opentelemetry APIs in the patched methods, we would emit events? What component would listen for the events?

@Qard
Copy link

Qard commented Jul 2, 2020

Yeah, no way Node.js would ever land the full OpenTelemetry in core, but a lower-level data channel is a definite possibility I’m working towards now. The shape of it is likely to evolve somewhat from the original Microsoft design though, so I’m not sure how much sense it makes to use it right now but definitely worth tracking for future consideration.

@vmarchaud
Copy link
Member

vmarchaud commented Jul 2, 2020

Node already have the trace_events core module is pretty much focused on that use case, it already allow to "trace" dns/fs sync requests and more (you can also link trace between themselves).
However there are no javascript API exposed to user, not sure if it was discussed. I know @jasnell opened this PR to allow core code to emit more trace events.
cc @jasnell Do you have any insight about why it wasn't exposed to users ?

For reference about this system, i would recommend reading this doc: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/edit#heading=h.jh64i9l3vwa1

EDIT: This is also relevant: nodejs/TSC#853

@rochdev
Copy link
Member Author

rochdev commented Jul 6, 2020

The shape of it is likely to evolve somewhat from the original Microsoft design though

Wouldn't it be better if the design stays as close as possible? The idea being that if the design is standard, then it could be implemented in every language and become an actual standard of its own on which OpenTelemetry would sit in every language. I don't have enough knowledge of how other languages work to know if that would be feasible though, but given the simplicity of the design I think it should definitely be possible.

@Qard
Copy link

Qard commented Jul 6, 2020

The issue is mainly around the design of the publish method. As it is currently, the inputs to the function will always be allocated/constructed regardless of if there is actually a subscriber. For something intended to be used very often, this could get expensive fast. It needs some changes to allow lazy input construction. Something like:

channel.publish('my-event', () => {
  return { myData: someExpensiveThing() }
})

or

if (channel.has('my-event')) {
  channel.publish('my-event', { myData: someExpensiveThing() })
}

The current model of sharing all events across a single global namespace is a bit worrying too. I'm considering a channel of channels where each module would get its own named channel. For example:

// Find or create a named channel
const channel = dc.channel('some-async-thing')

module.exports = function someAsyncThing (a, b, c, callback) {
  // Publish an event to the named channel
  channel.publish('start', { args: [ a, b, c ] })
  setImmediate(() => {
    channel.publish('stop')
    callback(a, b, c)
  })
}

// elsewhere...

// Find or create a named channel
const channel = dc.channel('some-async-thing')

// Subscribe to named events of named channel
channel.subscribe('start', recordStart)
channel.subscribe('stop', recordStop)

This would allow each named channel to have its own set of subscribers and publishers, making lookups faster. This way, an APM could attached hundreds of subscribers and only the ones on channels actually in-use would actually matter during runtime lookups for event routing.

I don't have a 100% solid idea of exactly what the final API should look like, but that's just some of what I've been thinking about on that.

@vmarchaud
Copy link
Member

For reference, @Qard opened an PR to land this into Node: nodejs/node#34895 (also the discussion on the diagostics group)

@Qard
Copy link

Qard commented Aug 29, 2020

Yep. Still very much experimental and will likely continue to evolve before it's ready to land.

@ethanresnick
Copy link

@Qard I know you were deeply involved in the design of diagnostics_channel, so I'd be curious (if you're willing) to learn about whether things are now sufficiently far along on the Nodejs front that it'd be possible to have some sort of generic Otel adapter code to instrument modules using channels. I imagine the idea would be to take a list of channel names (for diagnostic channels or TracingChannels), plus some optional configuration, and convert the events from those channels into proper Otel spans/metrics/etc.

@Qard
Copy link

Qard commented Jun 10, 2023

Yep, it has worked very well for Datadog so far. I'm happy to discuss whatever questions you might have at the next OTel JS library sync, or I can write something up explaining why things are the way they are in diagnostics_channel and how it could be a beneficial approach for OTel. In the meantime, if there's anything specific you want to know I'm happy to answer here. 🙂

@ethanresnick
Copy link

@Qard Thanks for the quick response and willingness to engage!

A brief summary of the design rationale for diagnostics_channel, and how it could be applied with Otel, would be incredibly helpful (thanks!) — both for me and, I imagine, others trying to get up to speed on these new APIs. I did try reading the nodejs docs first, but I still don't have a great mental model of how the diagnostics_channel APIs are supposed to be used/fit together, or how stable different parts are.

In the meantime, I took a look at the dd-trace-js code, which helped me grok things a bit; I hadn't realized that DD is already using diagnostics_channel in production. From my read through, it seems like Datadog is basically:

  • monkeypatching the methods of the npm modules it's instrumenting, so that they publish to new channels;
  • doing the channel publishing within the context of a new AsyncResource, to correlate the messages;
  • then, in separate code, subscribing to these channels and converting their messages to spans.

Is that right?

If so, my primary questions are:

  • Eventually, I think the idea is for APM vendors to be able to collect traces without having to monkeypatch the modules they're instrumenting, right?

  • If so, it seems like the module authors would need to be able to write their code to publish messages without, knowing what tool is going to consume those messages to construct spans. My question is, with the new APIs, exactly what is that supposed to look like? In particular:

    • Is the idea that, with TracingChannel, module authors wouldn't need to create the AsyncResource that DD is creating, because of the shared context object passed to/created by the TracingChannel.trace*** methods?
    • For built-in Node modules that are already using diagnostics channels (say, the http.server.request.start/finish channels and the channels used by undici), is the idea that APMs would have to treat those channels specially, by knowing about some object in the messages (e.g., the request object) that has a consistent identity for message correlation?
    • Could you show a brief example of how this would all fit together? I.e., both what the module author would write, and then how external code (from Otel or elsewhere) would consume the module's output to correctly produce traces while managing all the trace context?

Thanks again, for your help here and your hard work getting these APIs into Node! :)

@Qard
Copy link

Qard commented Jun 13, 2023

The path forward for APMs is to short/medium-term continue with their existing patching but emit diagnostics_channel events instead of calling their APIs directly and layer module plugins over that to describe how to map the event data to their API calls. Some things will map fairly directly but some will need a bit of restructuring to match their data model. It's not in-scope for diagnostics_channel events to always be able to map to spans directly with no code, but it should take very minimal code. This is part of why TracingChannel has different names for every source, so you can attach different logic to map those events into spans. It does follow a generalized pattern though to enable as much code reuse as possible for those plugins, and you can probably model some common patterns over similar models, like SQL client libraries can probably use the same or very similar plugins.

Eventually what is being captured and sent through diagnostics_channel in patches will be contributed upstream to those libraries, and we're already starting on that now. When those events get merged upstream we can start listening to those channels instead without needing any patches, which should be a huge improvement for performance. :)

The AsyncResource thing Datadog is doing currently is temporary. TracingChannel was designed with a shared context object to eliminate the need for that, and also can work in cooperation with channel.bindStore(...) to wrap your context around any given TracingChannel scope. Here's an example:

// publishing side
const { readFile } = require('fs')
const { tracingChannel } = require('diagnostics_channel')

// A TracingChannel is a container of channels that follow a consistent naming
// scheme and behavioural model.
//
// This will contain channels for:
// - tracing:something:start
// - tracing:something:end
// - tracing:something:asyncStart
// - tracing:something:asyncEnd
// - tracing:something:error
const channel = tracingChannel('something')

// traceCallback will trigger a start and end around the sync portion of the
// function given as the first argument. The second argument will set the
// position of the callback to wrap. The callback received by the given
// function will then be wrapped with another function which will trigger the
// asyncStart and asyncEnd events around the execution of the callback.
channel.traceCallback(readFile, -1, { file }, undefined, file, () => {})

// subscribing side
const { channel } = require('diagnostics_channel')

// The start channel can be used as a scope for AsyncLocalStorage
const start = channel('tracing:something:start')

start.bindStore(spanStore, (context) => {
  const parentSpan = spanStore.getStore()
  const span = new Span({
    // get whatever is needed from the context data
  }, parentSpan)
  context.span = span
  return span
})

// By storing the span on the context object it's possible to manually restore
// context in the event that AsyncLocalStorage fails to propagate, for
// whatever reason. This is generally not recommend and should probably be
// considered a bug if AsyncLocalStoragevis not propagating correctly.
const asyncStart = channel('tracing:something:asyncStart')

asyncStart.bindStore(spanStore, (context) => {
  return spanStore.getStore() || context.span
})

From the publishing side, it's the intent that they will share any data they have in the context object and the consumer can extract what they want from that. For example, an http request will share the request and response objects and the subscriber can extract things like method and path from that data.

As for the question about existing channels, TracingChannel can be used to map over existing channels that don't necessarily need to follow the naming convention, but if the don't match the behavioural convention it may not make sense. In the case of http we'll probably wrap a TracingChannel around the request/response lifetime but as a separate set of events from what exists currently.

For a more complete example of the tracing side:

const { channel } = require('diagnostics_channel')

const start = channel('tracing:something:start')
const end = channel('tracing:something:end')
const error = channel('tracing:something:error')
const asyncStart = channel('tracing:something:asyncStart')
const asyncEnd = channel('tracing:something:asyncEnd')

start.bindStore(spanStore, (context) => {
  const parentSpan = spanStore.getStore()
  const span = new Span({
    // get whatever data is needed from the context data
  }, parentSpan)
  context.span = span
  return span
})

// Error events can be attributed to the span
error.subscribe((context) => {
  const currentSpan = spanStore.getStore() || context.span
  if (!currentSpan) return
  currentSpan.setError(context.error)
})

// For a sync source, you could trigger a span end when the end event happens
end.subscribe((context) => {
  const currentSpan = spanStore.getStore()
  if (!currentSpan) return
  currentSpan.end({
    // get whatever additional data is needed from the context data.
    // With `traceSync(...)`, the context object will also contain
    // `result` and `error` where relevant.
  })
})

// For an async source, either the asyncStart or asyncEnd could be used to mark the span end,
// which you should use depends on if you consider callback execution time as being part of the span.
// In either case the `result` or `error` properties will be set.
asyncStart.subscribe((context) => {
  const currentSpan = spanStore.getStore()
  if (!currentSpan) return
  currentSpan.end({
    // get whatever additional data is needed from the context data.
    // With `traceCallback(...)` and `tracePromise(...)`, the context object will
    // also contain `result` and `error` where relevant.
  })
})

@ethanresnick
Copy link

Thanks @Qard! Those examples are super helpful. I was mostly asking because, for my own modules, I'd like to start emitting diagnostics channel events (rather than, e.g., directly depending on the OTel API). Then, if I use those modules at work, I will need to hook those events into Otel, so I wanted a fuller picture of how that would work as well. I think your response answers all my questions!

I ended up doing a deep dive into these APIs over the weekend (including watching your talk at Node Congress 2023), and I'm very excited to see all this work coming to fruition!

While learning the APIs, I did have two small pieces of feedback. I realize that's a bit off topic to this issue, but I don't feel strongly enough about these ideas to open a new issue for them, so I just wanted to mention them to you briefly here. Feel free to ignore them/take them with a giant grain of salt — they may be totally off the mark (I'm still missing much context behind the design here). But, they were:

  • The runStores method name was pretty confusing to me... namely, I didn't expect that it would also publish on the underlying channel. In my mental model, running a function with some context is logically independent from publishing diagnostics events messages, so I didn't understand why it'd make sense to couple them and why the data that's fed to the store transformers would make sense as an event message. But, assuming there is some good reason to couple them for common use cases, which I suspect there is, then I still think the name could hint at that better. I also found the docs on runStores a bit confusing; I had to read the source code to understand it.

  • I wonder if it would make sense for the error/result properties that Node adds to the context object of TracingChannel messages to be named with symbols, rather than strings? Since error and result are pretty common property names, it seems like there's some risk of event publishers either accidentally overwriting them, or having the data they put in those properties get clobbered by Node. Also, if the TracingChannel APIs ever wants to add a new property to these objects in the future, there will be no safe/backwards-compatible way to do it except to make that new property use a Symbol, and then it might be weird to have two of the properties use strings while the new ones use symbols.

But those are just nits. Again, thanks for all your work here!

@Qard
Copy link

Qard commented Jun 13, 2023

Yeah, agreed that runStores is not the clearest API. I wasn't getting any better names for it though. As for why it does both things, the original intent was that any event could trigger the bindStore wrapping, which originally used enterWith on the store, but that was too unreliable so I changed it to use run on the store but that required making an actual function scope to wrap. I could have split the concepts at that point, but I felt it probably was best to not burned module authors so much with deciding data relevance differences between the event need and the scoping need and to instead just pass along all the data they have on hand and let the consumer decide what to do with it. This meant runStores and publish would basically always occur in the same place with the same arguments. We also didn't want to put the users in a position where that ordering could change as we designed it such that the scoping will run first so ever event in the span should have the store value set with bindStore as the current store state in the subscribers for any of the channels. (Assuming context doesn't break for async events.)

For error/result, attaching it directly to the context object seemed the most appropriate. I don't foresee a need for additional properties in the future as the level of abstraction significantly limits what actual context it even could have generically, but if changes need to happen in the future I think I'd go for something similar to executionAsyncResource() where an external API is called to pull data out of a managed internal state object where it's only in-scope during the events or something like that. In any case, we'll see if it ever comes to that. 🤷🏻‍♂️

@ethanresnick
Copy link

Yeah, agreed that runStores is not the clearest API. I wasn't getting any better names for it though.

I'm not sure. Maybe just runAndPublish?

For error/result, attaching it directly to the context object seemed the most appropriate.

Yeah, I agree with putting the key directly on the context object; I was just proposing possibly putting it at context[require('node:diagnostics_channel').error] and context[require('node:diagnostics_channel').result], where those would be symbols. I think that might make sense for avoiding conflicts, even if no more properties are added. (Although I think I also saw some chatter about possibly adding an async property too?)

@Qard
Copy link

Qard commented Jun 15, 2023

PRs are welcome if you have ideas for improvements. Best to make any changes now while it's still marked experimental. I would probably go for something like tracingChannel.getResult(context) and tracingChannel.getError(context) as an abstraction over that symbol storage or something like that to allow for possible future changes to the implementation details of how it works though. 🤔

@ethanresnick
Copy link

PRs are welcome if you have ideas for improvements. Best to make any changes now while it's still marked experimental.

I agree that now'd be the time to make changes, but I really don't feel like I have the bandwidth for a long discussion, unfortunately, which I suspect a PR would prompt. So I more just wanted to plant these ideas in your head, in case you think (with much more context than me) that they're worth pursuing. But now I'll stop hijacking this thread 😁

I would probably go for something like tracingChannel.getResult(context) and tracingChannel.getError(context) as an abstraction over that symbol storage or something like that to allow for possible future changes to the implementation details

I like those methods for ease-of-use, but I don't think they'd really make the implementation changeable down the line if the Symbols are observable (i.e., added as a key in the object); to make it changeable, you'd probably have to put the error/result in a weakmap that those helpers would access, which could work too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants