
Diagnostics Channel - Next steps #180

Closed · 8 tasks
mhdawson opened this issue Apr 2, 2018 · 24 comments
Labels: diag-deepdive-agenda, never-stale, Tracking-Issue

mhdawson (Member) commented Apr 2, 2018

The concept of a diagnostic channel was discussed in the Diagnostics summit as well as here: #134

In a nutshell, the concept is to provide an API that modules can use to publish events, and an API that consumers (such as APM vendors) can use to receive them. The key differences between the channel provided by these APIs and trace events are:

  • The operation is synchronous: the consumer can take action before the publish call completes on the emitter side.
  • Data is passed by the publish call (along with any other data reachable from it), and the consumer can mutate that data.
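As a rough illustration only (all names here are hypothetical, not a settled design), the pair of APIs might look something like this:

```js
// Hypothetical sketch of the publish/consume pair described above.
const subscribers = new Map(); // channel name -> array of handlers

// Consumer API: e.g. an APM agent registers interest in a named channel.
function subscribe(name, handler) {
  if (!subscribers.has(name)) subscribers.set(name, []);
  subscribers.get(name).push(handler);
}

// Publisher API: a module emits an event. Handlers run synchronously, so a
// consumer can act on (and mutate) `data` before publish() returns.
function publish(name, data) {
  const handlers = subscribers.get(name);
  if (!handlers) return; // no-op when nobody is listening
  for (const handler of handlers) handler(data);
}

// e.g. an HTTP client module might call:
//   publish('http.client.request', { method, url, headers });
```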

This approach requires both an implementation of the diagnostic channel and updates to modules so they use the API. An interim step where monkey patching is used to add calls to the API is seen as a good way to bootstrap.

Microsoft already has an experimental implementation, which we should evaluate to see what parts (if any) we can re-use.

An initial cut at next steps includes:

  • Create a new repo under the nodejs org where we can experiment with a new implementation - diagnostic-channel
  • Review the existing experimental implementation
  • Create a performance regression suite
  • Define the diagnostic publish API
  • Define the diagnostic consumer API
  • Pull what parts we can from the existing implementation into the publish API
  • Pull what parts we can from the existing implementation into the consumer API
  • Iterate until we hit a 0.1.0 release that is ready for broader review
mcollina (Member) commented Apr 3, 2018

Overhead and performance should be a first-level concern for this API. It should result in no-ops when not enabled, and get smart when enabled. I think we should trade usability for performance whenever possible, as long as the result still satisfies our use case.

jkrems (Contributor) commented Apr 3, 2018

@mcollina - [ ] Create performance regression suite?

mcollina (Member) commented Apr 3, 2018

@jkrems 👍 !

jkrems (Contributor) commented Apr 3, 2018

Added to the top post.

Qard (Member) commented Apr 4, 2018

I brought it up in the diag meeting just now, but I worked a bit on a proof-of-concept universal diagnostics channel recently, after I got back from the Ottawa summit and then Elastic{ON}. It's a very rough demo and definitely needs more perf focus before it's actually usable, but you can have a look at it here: https://github.com/Qard/universal-monitoring-bus

It uses subscription queries to express interest in certain messages on the channel, similar to trace_events but currently more flexible (sacrificing performance for flexibility while trying to identify interesting filter data). It does no direct graph-forming, instead directing the message bus into aggregators, which can live in-process or out-of-process, and it makes no assumptions about the transport mechanism. It is simply a callback that receives JSON-serialization-ready objects that you can do whatever you like with.

It also expresses both spans and metrics updates uniformly in a single message format, with purpose-specific extensions, allowing for an easy shared communication channel, aggregator re-use, and even correlation between span and metric data.
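In sketch form, the subscription-query idea might look like this (a simplification; names and shapes here are illustrative, not the actual universal-monitoring-bus API):

```js
// Illustrative only: a query-based message bus, not the real API.
const subscriptions = [];

// Express interest with a query object; matching messages are delivered to
// the callback as JSON-serialization-ready plain objects.
function subscribe(query, callback) {
  subscriptions.push({ query, callback });
}

function matches(query, message) {
  return Object.keys(query).every((key) => query[key] === message[key]);
}

// Spans and metrics share one message shape, with purpose-specific fields.
function publish(message) {
  for (const { query, callback } of subscriptions) {
    if (matches(query, message)) callback(message);
  }
}

subscribe({ type: 'span', name: 'db.query' }, (msg) => {
  process.stdout.write(JSON.stringify(msg) + '\n'); // e.g. forward to an aggregator
});
publish({ type: 'span', name: 'db.query', durationMs: 12 });
```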

Qard (Member) commented Aug 5, 2020

I'm picking this back up now and working on ideas for reducing/eliminating the runtime overhead. My main concern is that the prototype is basically just an event emitter, so it has to look up the subscribers on every publish. It does have a shouldPublish method to avoid constructing the input data when nothing is listening, but that is functionally identical to the listeners method of an event emitter and just results in an additional lookup. An approach I've been looking at is named channel objects that can be acquired once and reused, so the lookup happens only once, at the time the channel is acquired.
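A minimal sketch of that channel-object idea, with hypothetical names:

```js
// Hypothetical sketch: the subscriber lookup happens once, when the channel
// object is acquired, rather than on every publish.
const channels = new Map();

function channel(name) {
  if (!channels.has(name)) {
    channels.set(name, { name, subscribers: [] });
  }
  return channels.get(name);
}

// Publisher side: acquire once, reuse on the hot path.
const readFileChannel = channel('fs.readFile');

function onReadFile(path) {
  // Cheap guard: skip building the payload when nobody is listening.
  if (readFileChannel.subscribers.length === 0) return;
  const data = { path, time: Date.now() };
  for (const sub of readFileChannel.subscribers) sub(data);
}
```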

I also don't think the module patches and context management aspects should be included. For core adoption, we should stick strictly to the data-firehose component and strip it down to the bare essentials: something speedy and highly extensible from userland.

mmarchini added the diag-deepdive-agenda label Aug 5, 2020
Qard (Member) commented Aug 9, 2020

Here's my prototyping progress this week: https://gist.github.com/Qard/b21bb8cb1be03274f4a8c585c659adf2

I tried to follow the original design for the most part, but added regex subscribers and the concept of channel and subscriber objects, which take the lookup cost out-of-band from the publish so it doesn't have to do the lookup every time. The result is publish overhead on par with a single variable increment, which is pretty good!

What do people think? Is a simple text name a flexible enough way to search for events in the firehose? Another possibility is object matching, which I did in this prototype several years back. I think it's a bit friendlier and more flexible, but I couldn't figure out a good way to pull the object-style search out-of-band from the messaging like I did for the prototype above. 🤔
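For comparison, the two search styles might look roughly like this (both helpers are hypothetical):

```js
// Hypothetical sketch of name-based search with a regex subscriber.
const regexSubscribers = [];

function subscribeRegex(pattern, callback) {
  regexSubscribers.push({ pattern, callback });
}

function publish(name, message) {
  for (const { pattern, callback } of regexSubscribers) {
    if (pattern.test(name)) callback(name, message);
  }
}

subscribeRegex(/^fs\./, (name, message) => {
  console.log(name, message); // matches fs.readFile, fs.writeFile, ...
});
publish('fs.readFile', { path: '/tmp/x' });

// Object-style matching would instead express interest as a shape, e.g.
//   subscribeMatch({ module: 'fs' }, callback)
// which is friendlier but harder to move out-of-band from the publish.
```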

mmarchini (Contributor) commented:
Some questions from today's meeting:

  1. Any plans on supporting publishers in C++ (not necessarily as a public API, but as a mechanism to instrument C++ code in core)?
  2. There's publish and publishIfActive. IMO it would be better ergonomics for publish to, by default, only publish if active, and then have a forcePublish or something to publish even when inactive.
  3. I wonder if we should have a concept of a "parent publisher", so that publishing to a child would also publish to the parent. For example, have fs as a parent and fs.readFile as a child publisher: publishing to fs.readFile also publishes to fs (and, following 2, it only publishes to active publishers by default, so if fs is active and fs.readFile is not, it would only publish to fs).
  4. How does it compare to the existing trace_events module? Could it be a functionality of trace_events instead of a new module?
  5. Based on 4, and assuming we keep it separate from trace_events, do we want publisher instrumentation in the same places we have trace_events? How do we keep those in sync in that case? How do we decide what to instrument, and how do we measure the progress of instrumentation in core?
  6. Mentioned in the meeting and on Slack, but just reinforcing: I think starting with a userland module published to npm is the way to go to experiment with this, until we need something core-specific or until the API is closer to being solidified.
  7. In the current implementation subscribers receive messages synchronously. Should we offer async subscribers as an option, or even by default?
    a) Any chance we can provide a channel that allows publishers and subscribers to be in different isolates (so we could have a worker thread as a subscriber, for example)?

Overall I think it's a great start for the API, and it would be good to get input from potential consumers (APMs) and publishers (frameworks) before attempting to land it in core.

Qard (Member) commented Aug 12, 2020

  1. Not sure about C++ yet. I haven't figured out if there's anywhere specific we'd want to capture C++-side data and, if so, how we could do that in a reasonably fast way.
  2. The reason for publishIfActive(...) is to defer the input data construction so it only happens if the channel is active. It's the same as if (channel.shouldPublish()) channel.publish(data) though, so it's possibly better to tell users to use that (see the sketch after this list). At the moment it only really exists as an alternative API for comparing performance and ergonomics. I do think we should settle on one way to do it before landing, though.
  3. That complicates the publish logic somewhat and makes it more expensive to decide at runtime whether there is work to do. It would have to walk the tree upward on every publish, because the end of the branch might be inactive while other nodes in the graph are active. It also imposes structural opinions on the shape a channel can take from the perspective of the subscriber, whereas I'd prefer the publisher to be as simple as possible and to offload complicated/expensive logic to the subscriber, out-of-band as much as possible. Performance when not listening is a top priority here.
  4. The trace_events API is not consumable within JavaScript, so it's not really useful for production monitoring; it's more of a post-mortem analysis tool. It's also targeted more specifically at chrome://tracing. It's not really suitable for broadcasting arbitrary/unstructured data.
  5. There may be a few places where we broadcast data for both systems, but I don't see there being a lot of overlap.
  6. I disagree about making it a userland module. Most of the data that we (APM vendors) would want to gather from it comes from core APIs. Starting as a userland module would just make it even harder to get it into core (as evidenced by it taking literally half a decade for Node.js to get an async context management API, despite APMs asking for it repeatedly for years 😬).
  7. At the moment, having it sync is intentional. Timestamps can be added to messages on the consuming end that way. If we move it to async we lose that timing certainty and would likely need to do more within the API; possible, but I'm trying to avoid that unless it's really necessary. As for the worker thread idea, I've thought about it. Right now I'm looking more at the ergonomics and concepts of the API than at specific implementation details like that.
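For illustration, the two equivalent forms from (2) might be sketched like this (a plausible shape only; the prototype's actual signatures may differ):

```js
// A plausible shape for the two equivalent forms; not the prototype itself.
const channel = {
  subscribers: [],
  shouldPublish() { return this.subscribers.length > 0; },
  publish(data) { for (const fn of this.subscribers) fn(data); },
  // Defers payload construction until we know someone is listening.
  publishIfActive(makeData) {
    if (this.shouldPublish()) this.publish(makeData());
  },
};

// Form A: the channel defers construction of the payload.
channel.publishIfActive(() => ({ heavy: 'payload' }));
// Form B: the caller guards explicitly; equivalent behavior.
if (channel.shouldPublish()) channel.publish({ heavy: 'payload' });
```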

There's a lot to consider in how the ecosystem would want to engage with such an API if we're to deliver something satisfying, especially given our increasing focus on the long-term vision of the project. We need to make sure this is not just relevant now but will remain relevant and useful for the foreseeable future.

I have a few months to hack on this so there's plenty of room for iteration. We'll see how successful I am in getting this to a place where people are happy with it and it can deliver on the promise of a high-performance unified data firehose. 😅

Flarna (Member) commented Aug 14, 2020

Sorry for not participating in the meetings, but the meeting times we have are in general hard for me to catch.

Regarding (7): I think it's vital that this is sync, to allow subscribers to correlate with their current state/context. Just getting the info that, e.g., a DB query X was sent is not that helpful on its own. APMs correlate events to show, e.g., that query X happened while processing request Y. As X and Y in general originate from different modules, the events themselves can't hold this information.

I'm not sure we need the regex subscribers at all. Published data is only of use if the subscriber knows the structure of the content for each event. Therefore I think subscribers are quite specific, and using them for a possibly unbounded set of publishers doesn't sound that reasonable. Maybe for something like a log/batching subscriber which just pushes the data, as-is or stringified, on to the next consumer.

RafaelGSS (Member) commented:
Let me share my opinions too:

  1. I agree with @Flarna about the regex.
  2. I would like to see this prototype in action, taking a database module as an example, just to validate it and see the gains from it. I would like to help you with this if you agree 😄.

And I have a question too:

  1. Do you have some idea of what kind of information from node-core we would like to listen to/publish? I see this feature as a good fit for all kinds of diagnostics in Node, like CPU monitoring, memory...

mmarchini (Contributor) commented:
@Flarna good point on the regex subscriber. Can we turn the Diagnostics Channel into an EventEmitter and emit an event every time a publisher or subscriber is added or removed? This way we can replace the regex with a more flexible approach, where the user fetches all publishers, filters them to get what they want, and listens to the added/removed events so they can update the list of publishers they're subscribed to. This could even be extended (possibly in userland) as a RegExSubscriber, which keeps a cache of publishers that match a regex using the list + events.
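Purely illustrative, that registry idea could look something like this (all names are made up):

```js
// Illustrative only: a registry that emits events as publishers come and
// go, so a userland RegExSubscriber can keep its channel list current.
const { EventEmitter } = require('events');

class ChannelRegistry extends EventEmitter {
  constructor() {
    super();
    this.publishers = new Set();
  }
  addPublisher(name) {
    this.publishers.add(name);
    this.emit('publisherAdded', name);
  }
  removePublisher(name) {
    this.publishers.delete(name);
    this.emit('publisherRemoved', name);
  }
}

const registry = new ChannelRegistry();
const pattern = /^fs\./;
const matched = new Set(); // cached publishers matching the regex

registry.on('publisherAdded', (name) => {
  if (pattern.test(name)) matched.add(name);
});
registry.on('publisherRemoved', (name) => matched.delete(name));

registry.addPublisher('fs.readFile'); // matched now contains 'fs.readFile'
```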

Qard (Member) commented Aug 14, 2020

Yes, that's basically what I have locally, though it's not done yet. I'm replacing the regex-specific subscriber concept with a generalized object that has a function like shouldSubscribe(channelName), which does whatever it needs to do with the channel name and returns true or false to indicate whether it should listen; it then has onMessage(message) to receive the messages. The regex-matching subscriber can easily be implemented on top of that.
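Sketching from that description (the actual work-in-progress may differ), matching can happen once when a channel is created, keeping the publish path cheap:

```js
// Sketch of the generalized subscriber interface described above.
const allSubscribers = [];
const channels = new Map();

function channel(name) {
  if (!channels.has(name)) {
    // shouldSubscribe() runs once per channel, out-of-band from publish.
    const listeners = allSubscribers.filter((s) => s.shouldSubscribe(name));
    channels.set(name, {
      publish(message) {
        for (const sub of listeners) sub.onMessage(message);
      },
    });
  }
  return channels.get(name);
}

// A regex-matching subscriber is easily built on top:
allSubscribers.push({
  shouldSubscribe: (name) => /^http\./.test(name),
  onMessage: (message) => console.log(message),
});

channel('http.request').publish({ url: '/' });
```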

As for the need for sync: yes, I mostly agree with that assessment. There are performance implications, though, since it moves subscriber execution into the direct execution path. It also might be impossible to stay sync if we consider possibilities like subscribing from a worker thread, so context management may need to get baked in if we go that route.

This is all very early in the experimentation process. I've only been working on this for a week and a bit, so I'm still just experimenting with ideas and getting a feel for the possibilities and for what the ultimate implementation could look like. The code as it is now is intended only as a proof-of-concept of the API ergonomics and as a way to test the approximate performance impact of injecting such a thing into core code paths. The code is likely to evolve considerably before it is actually considered for core.

Flarna (Member) commented Aug 14, 2020

I'm curious why async is seen as more performant than sync in this case.
Async implies that some task needs to be created and enqueued, in addition to creating the data object to publish. In the case of workers, the data must additionally be serialized.
I would assume that subscribers either do very little work in the callback or trigger an async task as needed.

Qard (Member) commented Aug 14, 2020

I'm a bit uncertain also, but I've gotten comments suggesting moving APM agents into worker threads to take their overhead out of the main application, which seems possibly reasonable, in theory anyway. I'm personally a bit skeptical that it would be better, though. 🤔

Flarna (Member) commented Aug 14, 2020

Collecting the data must happen in the main thread. What can be moved is serialization/compression/correlation of the collected data and transmission to the APM server. But that is something APM tools have to work on, not the diagnostics channel.
The worst case would be publishing data to a worker whose subscriber just forwards it back to the main thread...

Qard (Member) commented Aug 14, 2020

Well, ideally, if the diagnostics channel does what it's intended to do, the need for APM code to run on the main thread should mostly disappear. Context management is the one challenge there, though, which leads me to think that might not be entirely feasible. If at all possible, I'd love to be able to fully isolate all APM code to a separate thread, but I'm not convinced that's doable without some fairly major changes to Node.js.

Flarna (Member) commented Aug 14, 2020

OK, so the main thread publishes data, either via sensors placed by the APM tool or via modules having this built in. This data is collected in a worker, processed, and transmitted.
Sounds fine so far.
Thinking of a generic DiagnosticsChannel, I still question the async publish to a worker, as one tool may run in one worker and another tool in a different worker. So somehow the publish target needs to be part of the subscriber.

mmarchini (Contributor) commented:
fwiw, "Timestamps can be added to messages on the consuming end that way" was enough to convince me that sync in the main thread is the best approach for the API we expose (I was thinking of timestamps coming from the publisher, but I guess it makes more sense on the subscriber) :)
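For example, a subscriber can stamp messages itself; a trivial sketch:

```js
// Because delivery is synchronous, a clock reading taken in the subscriber
// happens at effectively the same moment as the publish.
function onMessage(message) {
  const stamped = { ...message, receivedAt: process.hrtime.bigint() };
  // hand `stamped` off to the APM pipeline...
  return stamped;
}
```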

Qard (Member) commented Aug 14, 2020

Yeah, my intuition is that it's better to just keep it in the main thread and encourage APMs themselves to dispatch to their own threaded thing. Timestamps are an easy fix with a wrapper object, but context management gets harder, especially with multi-tenancy.

Qard (Member) commented Aug 23, 2020

Pull request opened! 🎉 nodejs/node#34895

github-actions (bot) commented:
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

github-actions bot added the stale label Nov 22, 2020
mhdawson (Member, Author) commented:
Removing stale as @Qard is working on it.
