Eliminate Monkey Patching for Diagnostic Instrumentation #134

mike-kaufman · 2017-12-17T23:41:06Z

Forked from issue #95 so we can track separately:

Problem: Can we standardise the way tracing information is sent from an instrumented module (called module-x below) to an APM agent?

Suggestions:

Maybe we can use the V8 trace events API
- Pros: No need for module-x to know about the APM agents
- Cons:
  - Currently no way to bind the trace event to the current context
  - Might be expensive to cross the barrier into C-land
  - Might not be detailed enough
  - Overhead of running in production?

Alternatively invent a new API in JavaScript land
- This could either be in Node core or a userland module
- If it’s a userland module, then module-x needs to detect if it’s present before it can send events to it. The APM agents need to do the same.
- What if there’s multiple versions of this module installed?
- Research if other languages have something similar

To frame the problem a bit more concisely: Current APM vendors have to monkey-patch libraries to produce diagnostic data. E.g., if you want to know specific details of a DB query, one would monkey-patch the DB driver to capture any necessary params and stats.

This is problematic because

- monkey-patching is brittle (it necessarily relies on internal implementation details),
- each APM vendor has to effectively do the same thing,
- multiple APM vendors loaded into the same process can trample on each other,
- ESM modules will require a custom loader to support monkey-patching.
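
To make the brittleness concrete, here is a minimal sketch of the kind of monkey patch described above. The `driver` object and its `query(sql, callback)` signature are hypothetical stand-ins for a real DB client, not any specific library. The wrapper silently depends on the argument order and on the shape of the callback's arguments, which is exactly what breaks when the library changes.

```javascript
// Hypothetical DB driver; stands in for a third-party module.
const driver = {
  query(sql, callback) {
    // Completes synchronously to keep the sketch short.
    callback(null, [{ id: 1 }]);
  }
};

const captured = [];

// The monkey patch: replace the method with a wrapper that records
// the arguments and the result, then delegates to the original.
function patchQuery(target) {
  const original = target.query;
  target.query = function patchedQuery(sql, callback) {
    const record = { sql, start: Date.now() };
    return original.call(this, sql, function wrappedCallback(err, rows) {
      record.error = err;
      record.rowCount = rows ? rows.length : 0;
      captured.push(record);
      return callback.apply(this, arguments);
    });
  };
}

patchQuery(driver);
driver.query('SELECT 1', () => {});
```

If two agents each apply a patch like this, the second wraps the first's wrapper, which is how vendors end up trampling on each other.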

We would like to get to a solution that has the following characteristics:

- De-facto standard APIs for producing & consuming messages (i.e., major APM vendors should buy into this).
- Ability to "phase in" use of the APIs by monkey-patching at first, and then pushing library maintainers to write to the "event sink" internally, thus eliminating the need for monkey-patching.
- API is back-compat (ideally down to Node v4).
- Ability to turn off event publication and then have "near-zero" overhead in any libraries that are leveraging the publication APIs.
- A solution that accounts for asynchronous context. That is, data events necessarily need to be correlated with asynchronous context. That said, we shouldn't conflate the problem of "monkey-patching for data events" (what we're trying to solve here) with "monkey-patching to track async continuations" (addressed by async-hooks).

AndreasMadsen · 2017-12-18T07:18:09Z

What are the cons of using async_hooks? I can think of a few but was this not discussed?

Flarna · 2017-12-18T08:13:06Z

AsyncHooks are fine for correlating the initiator and the corresponding callbacks for native calls. But I don't think it's really possible to extract the data passed to the functions.

Besides that I think AsyncHooks have some blind spots:

- Native extensions not using the AsyncHooks Embedder API (e.g. plain NAN or V8 APIs). This could be improved by integrating the AsyncHooks Embedder API into the NAN APIs used in such cases.
- APIs which are based on parsing some data stream where there is no 1:1 mapping of outgoing writes and incoming data - in the end AsyncHooks were never designed for such cases.

AndreasMadsen · 2017-12-18T08:41:48Z

In AsyncHooks the embedder can pass a custom resource object where they can store any information they like. Such as the data passed to a function.
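
A minimal sketch of that idea, assuming a hypothetical `QueryResource` class and resource type name. It uses `AsyncResource` and `createHook` from `async_hooks`. Note that the `init` hook fires inside the `AsyncResource` constructor, before the subclass has stored its data, so the hook keeps the resource and reads the data later in `before`.

```javascript
const { AsyncResource, createHook } = require('async_hooks');

// Hypothetical embedder resource that carries the query text.
class QueryResource extends AsyncResource {
  constructor(sql) {
    super('HypotheticalDbQuery');
    this.sql = sql; // data a consumer may want to read
  }
}

const resources = new Map(); // asyncId -> resource
const observed = [];

createHook({
  init(asyncId, type, triggerAsyncId, resource) {
    // resource.sql is not set yet at this point, so only remember the object.
    if (type === 'HypotheticalDbQuery') resources.set(asyncId, resource);
  },
  before(asyncId) {
    const resource = resources.get(asyncId);
    if (resource) observed.push(resource.sql);
  },
  destroy(asyncId) {
    resources.delete(asyncId);
  }
}).enable();

const resource = new QueryResource('SELECT 1');
resource.runInAsyncScope(() => {
  // The driver's callback would run here.
});
resource.emitDestroy();
```

Avoid `console.log` inside the hook callbacks; logging is itself an async operation and can recurse into the hooks, which is why the sketch pushes to an array instead.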

watson · 2017-12-18T09:10:06Z

Wouldn't you normally only implement the Embedder API on async boundaries - when queueing a callback, just before calling it, just after calling it, etc.? In APM we sometimes need hooks into other places of the code. For example, instrumenting all middleware functions in Express, or templating languages that don't do any async calls, are both quite common.

AndreasMadsen · 2017-12-18T09:36:46Z

Wouldn't you normally only implement the Embedder API on async boundaries - when queueing a callback, just before calling it, just after calling it, etc.? In APM we sometimes need hooks into other places of the code.

You could also use the Embedder API for making the context flow easier to understand. A database request, even if there is no queue, can involve a hundred async operations. Wrapping that with async_hooks seems very reasonable.

For example, instrumenting all middleware functions in Express, or templating languages that don't do any async calls, are both quite common.

Indeed, for the purely synchronous cases it doesn't make sense to use async_hooks.

I think trace_events is the way to go when async_hooks doesn't make sense (actually async_hooks emits trace_events).

Currently no way to bind the trace event to the current context

This we can fix. The V8 team is already working on it.

Might be expensive to cross the barrier into C-land

I've done a great deal of benchmarking with trace_events versus more naive logging. The price of crossing the JS/C++ boundary is offset by the great performance of trace_events logging versus more naive logging.

If someone would like to benchmark this for themselves, I have already implemented a basic version of trace_events on process.binding('trace_events'). See usage example here: https://github.com/nodejs/node/blob/master/lib/internal/trace_events_async_hooks.js

Might not be detailed enough

What details do you need? You can emit arbitrary events in trace_events you can also link things together. Yes, by default it only supports having two extra arguments per event, but these can contain strings so one could just JSON.stringify if there are more. A more involved implementation can also support more than two extra arguments.
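
A rough sketch of the JSON.stringify workaround mentioned above. The `emitTraceEvent` function is a hypothetical stand-in for an emitter limited to two extra arguments; it is not the real trace_events binding, and the category and field names are made up for illustration.

```javascript
// Hypothetical stand-in for a trace-event emitter that accepts
// at most two extra arguments per event.
const emitted = [];
function emitTraceEvent(category, name, arg1, arg2) {
  emitted.push({ category, name, arg1, arg2 });
}

// Pack an arbitrary detail object into the second argument as a JSON string.
function emitDetailed(category, name, id, details) {
  emitTraceEvent(category, name, id, JSON.stringify(details));
}

emitDetailed('db', 'query', 42, { host: 'db1', sql: 'SELECT 1', rows: 3 });

// A consumer reverses the packing.
const decoded = JSON.parse(emitted[0].arg2);
```

The serialization cost is paid on every event, so a producer would presumably only do this when the category is enabled.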

Overhead of running in production?

For sure trace_events will have the smallest overhead. It was built to be used in production.

Flarna · 2017-12-18T09:54:53Z

Maybe I haven't fully understood Async Hooks yet. If my task is to trace e.g. an http.get() call or just an fs.read(), which data do I get from AsyncHooks here?

In our case we usually monitor high level requests, e.g. a DB request, not all the 100 async calls needed to get it done. As a result we would like to extract the high level data, not the low level protocol messages.

Sure, the 100 async interactions in between need to be tracked also in a lot of cases, but just to correlate request and cb, without extracting specific data. Here async hooks look really fine.

danielkhan · 2017-12-18T10:06:13Z

We started working with the OpenCensus folks to create a generic, vendor neutral tracing API. Maybe this is something we can incorporate into Node.js as well. Did anyone already research using anything like OpenCensus, Dapper, OpenTracing? I'd almost expect that, as Google is driving that.
If not, I will put something together over xmas.

AndreasMadsen · 2017-12-18T10:44:50Z

Maybe I haven't fully understood Async Hooks yet. If my task is to trace e.g. an http.get() call or just an fs.read(), which data do I get from AsyncHooks here?

async_hooks will as a minimum trace the "atomic" requests. But you could also trace the non-atomic requests. You will typically end up doing that anyway for queues, which is why I don't see a problem with doing that too for non-queue cases.

mike-kaufman · 2017-12-18T16:09:11Z

Did anyone already research using anything like OpenCensus, Dapper, OpenTracing? I'd almost expect that, as Google is driving that. If not, I will put something together over xmas.

@danielkhan - I'm not aware of anything. Would be great to see something like this, at a minimum just an inventory of different approaches so we can be somewhat systematic. nodejs/node-eps#48 looks like an effort to expose tracing_events through JS, which also may suffice here.

Also, to be clear, we should scope this issue to addressing the "monkey-patching for gathering metadata about API calls" problem. As discussed above, ppl are monkey-patching for two distinct reasons - tracking async context and capturing details about API calls (e.g., params passed to a DB call). Async hooks is intended for the former. I'm not following suggestions to use Async Hooks APIs for the latter, but perhaps I'm missing something. If this is a valid approach, we should add it to the list above.

tedsuo · 2017-12-18T22:22:33Z

Ted from OpenTracing here. Would you consider the OpenTracing C++11 API, or a C API? We're in the process of integrating into a number of technologies, now would be a good time to start working with you!

https://github.com/opentracing/opentracing-cpp
Consider dynamic loading as a way to continue to ship node without needing to recompile in order to include a tracer: opentracing/opentracing-cpp#28

vdeturckheim · 2017-12-18T23:31:52Z

@mike-kaufman if I understand well, you are suggesting we would want some kind of proper overriding tooling in Node.js?

mike-kaufman · 2017-12-19T00:30:53Z

@vdeturckheim - not sure what you mean by "proper overriding tooling". What's been expressed here is that monkey patching is invasive & fragile - i.e., it relies on internal details of some library, and when those details change, monkey-patching breaks. Consequently, there's a desire to get away from it.

I think we'd like to find some solution where

- Module authors (e.g., DB client libraries) have a consistent way to publish "data events" about interesting things happening in their library.
- Interested consumers (e.g., APM vendors) can listen for those events & act on them.
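
The two roles above can be sketched with a tiny pub/sub registry. Everything here is illustrative: the `channel` function, the channel name, and the event shape are assumptions, not the API of any existing module.

```javascript
// Minimal pub/sub registry; names are illustrative only.
const channels = new Map();

function channel(name) {
  if (!channels.has(name)) channels.set(name, new Set());
  const subscribers = channels.get(name);
  return {
    subscribe(fn) { subscribers.add(fn); },
    unsubscribe(fn) { subscribers.delete(fn); },
    publish(event) {
      for (const fn of subscribers) fn(event);
    }
  };
}

// Producer side: a hypothetical DB client library publishes a data event.
function query(sql) {
  channel('hypothetical-db.query').publish({ sql, timestamp: Date.now() });
  return [];
}

// Consumer side: an APM agent subscribes without touching the library.
const seen = [];
channel('hypothetical-db.query').subscribe((event) => seen.push(event.sql));

query('SELECT 1');
```

The library and the agent only share the channel name and the event shape, so neither needs to know about the other's internals.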

mhdawson · 2018-01-03T19:39:26Z

@tedsuo last year Node.js integrated a trace engine from Google and we have been discussing how to add tracepoints etc. I'm not sure if the use of OpenTracing is relevant to this thread but it might be interesting as a discussion on its own. Ideally we would be able to plug in different trace engines under the APIs that we expose in C and JavaScript in Node.js for tracing. I've not read enough on OpenTracing to see how it would fit into that model. If you have some cycles to think about it and possibly give the diagnostics team an overview, I'd suggest opening a new issue to start that discussion.

SergeyKanzhelev · 2018-01-09T18:45:22Z

@mike-kaufman thanks for pointing me to this thread.

@danielkhan we are discussing node.js OpenCensus SDK with Google. I think diagnostics channel and OpenCensus are very aligned. We just need to understand the proper layering and requirements of those.

ofrobots · 2018-01-09T19:23:28Z

For background, OpenCensus (github) is a new project to make a language and vendor neutral distributed tracing API and SDK. This work is in concert with the w3c proposal for distributed trace context. We are hoping to extract Node.js instrumentation code from Stackdriver Trace and incorporate that into the OpenCensus SDK as a starting point. Some of the other languages already have fairly functional SDKs, but we haven't gotten that far on JavaScript just yet.

Even with this in place, the question asked by @mike-kaufman in the OP still applies. How do we connect producers of high level tracing data (e.g. express) to the consumers (APM, debugging tools). This may be trace-events (but OP has valid concerns) or it may be something like the diagnostic channel module from Microsoft. I see both of these attempting to do the same thing from different ends (low-level vs. high level), and one thing I would like to achieve at the upcoming diagnostic summit is whether there is a path to intersect.

Some thoughts on the specific concerns by OP:

Maybe we can use the V8 trace events API
Pros: No need for module-x to know about the APM agents
Cons:

Currently no way to bind the trace event to the current context

This is a problem. We need the notion of context to be well defined within Node core. Once this notion is available, it would be fairly easy to bind trace events to the context.

Might be expensive to cross the barrier into C-land

This is a problem. We will need a way for the VMs to accelerate (JIT) tracing calls from JavaScript. I think we would probably want the VMs to provide an intrinsic.

Might not be detailed enough

IMO, this is addressable.

Overhead of running in production?

If we can avoid the JavaScript to C transitions, it would be hard to beat the performance of trace-events with pure JS.

Trace-event is designed first of all as a low level tracing mechanism to be used primarily from C/C++. The additional benefit it provides is that it acts like a single bus that aggregates all performance event data. Diagnostic channel does the same thing, but it will be limited to only high-level (i.e. not too high frequency) performance event data. Perhaps diagnostic channel is a good starting point. Once/if we have the trace-event API available from JS, we can additionally inject the diagnostic channel data into trace events as well, giving us a single stream of all performance event data?

Just some thoughts.

mike-kaufman · 2018-01-24T16:56:04Z

Based on @ofrobots comment above, I think there's a path forward here by the following:

1. Land on a JS API for a pub/sub model for events (e.g., DiagnosticChannel or something equivalent).
2. Start building up monkey-patching libraries that use the above API. Consumers (e.g., APM vendors) can start consuming these, and each vendor doesn't need to do their own monkey patching.
3. Open appropriate PRs on monkey-patched libraries to make use of the API above, and start deprecating the monkey-patch libraries.
4. Resolve perf implications of using trace_events from JS.
5. Integrate the API from 1 above with trace_events.
6. open-census messages can be implemented on top of messages produced from the API, i.e., there will be an "open-census" translation library that will subscribe to events & transform them into open-census things.

Notes:

- The above list isn't necessarily ordered or sequential. Some things can happen in parallel & some things can be moved up/down.
- RE context formalization, I think that can be addressed at the JS API at first through monkey-patching and/or async-hooks, and then evolve as formalization efforts crystallize.

Does this sound like a fair interpretation of thread so far? Please let me know if I'm missing something.

ofrobots · 2018-01-24T20:56:14Z

@mike-kaufman 👍

mcollina · 2018-01-25T01:54:28Z

I would also note that monkey patching requires a custom loader to work with ESM. I would like to add this as one of the problems that removing monkey patching would solve.

As part of the goals, we should consider the LTS cycle. When opening PRs to 3rd-party libraries, we should be focusing on making sure that the library could run without modifications on old node releases (down to 4 ideally).

Moreover, we should ensure that when turned off, this tracing layer has zero overhead. One of the advantages of monkey patching is that when it's turned off it's off.

mike-kaufman · 2018-01-25T17:20:14Z

@mcollina - Thanks, updated summary above to reflect your comments. Note I changed "zero overhead" to "near zero" in case anyone is being pedantic. With tracing disabled, overhead should be a few instructions.
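
One way the "near zero" goal could look in practice is a cheap subscriber check before any payload is built. This is only a sketch: the `Channel` class, `hasSubscribers`, and `buildPayload` are hypothetical names chosen for illustration.

```javascript
// Hypothetical channel with a cheap "is anyone listening?" check.
class Channel {
  constructor() { this.subscribers = []; }
  get hasSubscribers() { return this.subscribers.length > 0; }
  subscribe(fn) { this.subscribers.push(fn); }
  publish(event) { for (const fn of this.subscribers) fn(event); }
}

const queryChannel = new Channel();
let payloadsBuilt = 0;

function buildPayload(sql) {
  payloadsBuilt += 1; // stands in for potentially costly work
  return { sql, stack: new Error().stack };
}

// Library code: with tracing off, the cost is one property read and a branch.
function query(sql) {
  if (queryChannel.hasSubscribers) {
    queryChannel.publish(buildPayload(sql));
  }
}

query('SELECT 1');                 // tracing off: no payload is built
const builtWhileOff = payloadsBuilt;
queryChannel.subscribe(() => {});  // an agent turns tracing on
query('SELECT 2');                 // tracing on: payload is built once
```

The expensive part (capturing a stack, copying arguments) sits behind the check, so libraries that adopt the API pay almost nothing when no agent is loaded.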

Flarna · 2018-01-25T21:32:20Z

I don't think that only emitting trace events will fit for all usecases. Currently we have to modify the arguments array (e.g. by wrapping the passed callback) in quite some cases.
For some outbound requests (HTTP, message queues) we have to modify outgoing data (e.g. HTTP headers) to get our trace tag transmitted.
For inbound HTTP requests we even modify the response data stream in some cases.

Maybe async hooks can help to avoid the need to wrap callbacks in most cases, as we get the call context via these hooks.

If I compare NodeJS instrumentation with other techs like Java or .NET, the main difference is in my opinion not the access to internals (for Java/.NET byte code is patched). The main difference is that in NodeJs monkey patching is done by a lot of people - not just by a few APM vendors (which are anyway not used in the same application at the same time).

Catching up with changes in internals is definitely not my favorite task. But so far experience has shown that it's harder and more cumbersome to find issues caused by combining modules using monkey patching.

mike-kaufman · 2018-01-25T21:48:59Z

I don't think that only emitting trace events will fit for all usecases.

@Flarna - if you could be more explicit about use cases that aren't handled by emitting "trace events", and any potential workarounds, that would be helpful here. E.g., for adding http headers, can you inject a piece of middleware?

kjin · 2018-01-25T23:18:58Z

@ofrobots and I were discussing this just now -- to summarize, we worked with the following three requirements for getting out of monkeypatching for tracing:

1. Context propagation (to propagate trace context within the application)
2. Information retrieval (to create and populate trace data, such as path/query/etc.)
3. Outgoing message metadata injection (to handle distributed trace context; this refers to injecting HTTP headers as @Flarna mentioned, but also generalized to include other RPC types other than HTTP)

The ecosystem adoption of async_hooks fulfills (1), and the diagnostic-channel module mentioned here addresses (2) with its pub/sub API. However, these APIs alone don't satisfy (3). Injecting a piece of middleware would work somewhat but only applies to server responses, not outgoing requests (on top of that, I would consider injecting middleware a form of monkeypatching).

What we think is that Node core could expose some API where an APM vendor would be able to specify that they want a set of key-value pairs to be added to every outgoing HTTP request upon calling http.request. A second, similar API might exist to specify the same thing but for server responses.
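
A sketch of what such an API might look like. To be clear, `registerOutgoingHeaderProvider` and `applyOutgoingHeaders` are hypothetical; Node core exposes nothing by these names, and the header name is made up. The second function stands in for what `http.request` would do internally before sending.

```javascript
// Hypothetical registry; Node core has no such API.
const headerProviders = [];

function registerOutgoingHeaderProvider(provider) {
  headerProviders.push(provider);
}

// Stand-in for the step http.request would perform internally:
// merge provider-supplied key-value pairs into the request headers.
function applyOutgoingHeaders(options) {
  const headers = Object.assign({}, options.headers);
  for (const provider of headerProviders) {
    Object.assign(headers, provider(options));
  }
  return Object.assign({}, options, { headers });
}

// APM agent side: the provider runs per request, so every request
// can receive its own trace tag.
let counter = 0;
registerOutgoingHeaderProvider(() => {
  counter += 1;
  return { 'x-hypothetical-trace-tag': 'tag-' + counter };
});

const first = applyOutgoingHeaders({ host: 'example.com', headers: { accept: '*/*' } });
const second = applyOutgoingHeaders({ host: 'example.com' });
```

Making the provider a function rather than a static map is a design choice in this sketch: it lets the tag be created during the actual call, which distributed tracing needs.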

Flarna · 2018-01-26T11:26:14Z

@mike-kaufman - I will try to get more concrete by describing some samples. Please note that this does not describe every detail/corner case.

Example 1: Simple outgoing request without tagging; e.g. database requests mongodb.find()
Currently we monkey patch find(). Once our wrapper is called we do the following:

1. check if we should monitor this request or not
2. start a trace to book CPU time on it (name of function and type of this is used here)
3. record operation specific start data to this trace (e.g. hostname, query,... - extracted from this and/or function arguments and/or data captured via earlier, related calls)
4. modify arguments by replacing the user supplied callback (if present) with a wrapped version
5. call original function
6. patch returned promise (if present, depends on (4))
7. sometimes we also capture a stacktrace and attach it to the trace to get deeper code level insight
8. end trace (stops CPU time booking)

Once the wrapped callback is called (or the promise then/catch function) we do similar actions:

1. start a trace linked to above to book CPU time for callback (name of function and type of this is used here)
2. extract end data from arguments (e.g. error object, result size,...) and store it to the trace
3. call original callback
4. end trace (stops CPU time booking)

In case another patched function like mongodb.find() is called within the callback (or later after a setImmediate(),...) the same as above happens. From call context we are able to link them together and split CPU time in case of direct calls.

I could imagine that this use case can be completely covered by events and async_hooks (assuming enough info is passed to the events, e.g. we have begin/end events for all functions,...)
If these high-level events carry some sort of transaction id/context (maybe created by the event consumer to avoid a table lookup), it isn't even necessary to use low-level async_hooks to connect them.
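
A minimal sketch of that idea: the library itself emits begin/end events that share an operation id, and the consumer correlates them with a plain lookup instead of wrapping the callback. The `bus`, the event names, and the `find` function are hypothetical stand-ins, not the mongodb driver.

```javascript
const { EventEmitter } = require('events');

const bus = new EventEmitter(); // hypothetical shared event bus
let nextOperationId = 0;

// Library side: a hypothetical driver that emits begin/end events itself.
function find(filter, callback) {
  const operationId = ++nextOperationId;
  bus.emit('begin', { operationId, name: 'find', filter });
  // A real driver would complete asynchronously; the id-based
  // correlation below works the same way in that case.
  const docs = [{ _id: 1 }, { _id: 2 }];
  bus.emit('end', { operationId, error: null, resultSize: docs.length });
  callback(null, docs);
}

// Consumer side: correlate begin and end through the shared id.
const inFlight = new Map();
const finished = [];

bus.on('begin', (e) => {
  inFlight.set(e.operationId, { name: e.name, filter: e.filter });
});
bus.on('end', (e) => {
  const trace = inFlight.get(e.operationId);
  if (!trace) return;
  inFlight.delete(e.operationId);
  finished.push(Object.assign(trace, { error: e.error, resultSize: e.resultSize }));
});

find({ status: 'active' }, () => {});
```

This covers steps like recording start and end data; booking CPU time for the user's callback would still need a signal around the callback invocation itself.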

Example 2: "Complicated" outgoing request without tagging; e.g. database request via query/stream object
Similar to the above, we patch the API which triggers the request and create a trace including entry data. But instead of wrapping the callback (4) or patching the promise (6) we have to patch the returned query object. Depending on how this query/stream object is designed, we need to monkey patch some APIs and wrap the registered callbacks. Usually we have to store our transaction context in these objects.
When results are issued by these objects we do similar actions as above for the single callback.

Not sure if we are able to track transactional context via async_hooks across such multi-step operations. If not, the trace events emitted by the module could contain a unique identifier for the operation to allow tracking context.

Example 3: Simple outgoing request with tagging; e.g. HTTP request
Approach is similar as above without tagging. But additionally we create a trace tag (unique for this request) at (2) and inject it to the transmitted message.
In case of HTTP this is a dedicated request header we either inject by modifying the options passed to http.request() or by calling req.setHeader() afterwards (a better place would be maybe _storeHeader()).
In case of outbound HTTP request we have to ensure also that we capture all emitted events on request/response objects - there is no single callback.

Besides events to extract data and async_hooks to track context we need some hook here to inject a HTTP header. For other RPC/Messaging protocols a similar hook would be needed. Important here is that the trace tag injected shall be unique to the operation and is created during the actual call (at (2)).

Example 4: Incoming request, e.g. http.Server
In this case we wrap the incoming message handler. In case of http this is done by wrapping the callback registered to the 'request' event of http.Server. Once this function is called we do the following:

1. check if we should monitor this request or not
2. extract the trace tag from incoming data (e.g. our HTTP trace header)
3. start a trace using the tag extracted to book CPU time on it (name of callback and type of this is used here)
4. record operation specific start data to this trace (e.g. hostname, HTTP method,...)
5. depending on concrete request additionally patch request object to allow us to monitor follow-up data (e.g. HTTP form data)
6. patch request object to allow us to wrap other listeners (e.g. 'close', 'error') and get notified when request is ended (e.g. req.end() is called)
7. call original callback
8. end trace (stops CPU time booking)

Once wrapped events or end() is called we do similar actions:

1. start a trace linked to above to book CPU time for callback (name of function and type of this is used here)
2. extract end data from arguments (e.g. error object, status code, response headers,...) and store it to the trace
3. call original callback/function
4. end trace (stops CPU time booking)

With trace events emitted we could for sure cover most or even all of above functionality. Not sure if async_hooks allow to track context for all possible combinations of events/requests on HTTP request/response objects. I would expect that we need at least something on top of it to link the request event and the corresponding resp.end(). We know that there are frameworks out there which do pooling.
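
For reference, a stripped-down sketch of the wrapping described in this example, showing how the 'request' callback and the end of the response are linked through the response object. The header name `x-hypothetical-trace-tag` and the `traces` array are assumptions for illustration. The handler is driven with fake request/response objects so that no socket is needed.

```javascript
const http = require('http');
const { EventEmitter } = require('events');

const traces = [];

// Wrap the 'request' listener: extract the tag, record start data,
// and use the response's 'finish' event to record end data.
function wrapRequestListener(listener) {
  return function tracedListener(req, res) {
    const trace = {
      tag: req.headers['x-hypothetical-trace-tag'] || null,
      method: req.method,
      url: req.url
    };
    res.once('finish', () => {
      trace.statusCode = res.statusCode;
      traces.push(trace);
    });
    return listener.call(this, req, res);
  };
}

const server = http.createServer(wrapRequestListener((req, res) => {
  res.statusCode = 204;
  res.end();
}));

// Drive the handler with fake objects instead of listening on a port.
const fakeReq = {
  method: 'GET',
  url: '/health',
  headers: { 'x-hypothetical-trace-tag': 'abc' }
};
const fakeRes = new EventEmitter();
fakeRes.statusCode = 200;
fakeRes.end = function end() { this.emit('finish'); };

server.emit('request', fakeReq, fakeRes);
```

Pooling frameworks that hand the response to another component complicate this, since the closure over `trace` is then the only link between request and response.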

Example 5: Incoming HTTP request injecting a JS agent
This one is like example 4 but additionally we have to

- parse and potentially modify the HTTP response data
- remove/modify incoming and outgoing HTTP headers
- catch incoming requests from injected JS agent and completely handle them (not visible to real application at all)

I fear that this use case is too heavy and specific to be covered by generic hooks/events....

ofrobots · 2018-01-26T22:19:04Z

@Flarna thanks for the detailed examples. Can you elaborate on what you mean by 'request injecting a JS agent'? I think I understand the rest of your description, but I am not sure what a 'JS agent' is.

mike-kaufman · 2018-01-26T22:23:21Z

Can you elaborate on what you mean by 'request injecting a JS agent'?

I interpreted this as needing to inject a script into a html response, but @Flarna please clarify. :)

danielkhan · 2018-01-26T22:24:42Z

@mike-kaufman, @ofrobots Stepping in here. Yes it's about changing the response by adding a script tag to the body.

tedsuo · 2018-01-26T22:33:26Z

@mhdawson replying in regards to using OpenTracing for C/JS instrumentation.

We are adding support for dynamic loading (opentracing/opentracing-cpp#45) and will be adding a C bridge as the next step. This will allow any API-compliant tracer to be dynamically linked to the nodejs binary, much like what we are doing with envoy, nginx, postgres, etc. This has the advantage of not requiring NodeJS to be tied to a particular tracing implementation or version, as the code is separated. I can start another issue to discuss this.

Qard · 2018-01-27T00:48:58Z

I'm not convinced it's possible to fully avoid monkey-patches. There will always be cases like needing to inspect a connection object of a database driver during a query to include host/port information with contextual data of the trace layer, needing to inspect the socket of an incoming http request to get the remote address data, and many other cases where the patches do a lot more than just naively capturing function arguments. It's very common to need to dig about in the function context or augment things to intercept things like errors as transparently as possible.

edit: had this issue open since yesterday and it hadn't refreshed with any of the posts from the large one from @Flarna on. Oops. 😅

mike-kaufman · 2018-01-27T00:53:06Z

I'm not convinced it's possible to fully avoid monkey-patches. There will always be cases like...

@Qard - I agree in short term. Interesting exercise here will be to take use cases like you and @Flarna describe, and then see if we can get a system in place that provides a path to the required data.

mike-kaufman · 2018-01-27T00:56:47Z

@Flarna - thanks for the detail above. Will be interesting to use your use cases as a function to see if we can get the right API in place. Won't be 100% at first, but should be able to evolve in that direction & the exercise will bring more clarity/definition to the shape of the necessary APIs.

LewisJEllis · 2018-01-27T02:21:33Z

I agree in short term. Interesting exercise here will be to take use cases like you and Flarna describe, and then see if we can get a system in place that provides a path to the required data.

I think part of the difficulty is that given an arbitrarily implemented database driver, we can't predict how to reach (for example) its connection object automatically, and the implementor/maintainer of that driver cannot predict what information they should provide to some new potential tracing hooks/reporting API to satisfy the needs of every tool (APM or otherwise).

I have seen cases of monkeypatch instrumentation tracking state across nested objects or calls in ways that the instrumented module itself doesn't track (or otherwise need to). In these cases the instrumented module itself might have a hard time reporting the data collected by the monkeypatching instrumentation. For example, finding the full route of an express request currently requires monkeypatching every layer in the routing tree and pushing each layer's route fragment into a stack during traversal. Unless I am mistaken, Express does not otherwise keep track of or have a way to easily find this full route. I imagine there are more compelling examples of this difficulty; this is just one I am familiar with.
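
The route-stack technique can be sketched roughly as follows. The `tree` structure and the matching logic are a simplified, hypothetical router, not Express's internals; the point is only the push on entry and pop on backtrack that the instrumentation performs during traversal.

```javascript
// Hypothetical router tree; not Express's internal data structure.
const tree = {
  fragment: '/api',
  children: [
    { fragment: '/users', children: [{ fragment: '/:id', handler: true }] }
  ]
};

// Returns the unmatched remainder of `path`, or null if `fragment` does not match.
function matchFragment(fragment, path) {
  const fragParts = fragment.split('/').filter(Boolean);
  const pathParts = path.split('/').filter(Boolean);
  if (fragParts.length > pathParts.length) return null;
  for (let i = 0; i < fragParts.length; i++) {
    const isParam = fragParts[i].startsWith(':');
    if (!isParam && fragParts[i] !== pathParts[i]) return null;
  }
  return '/' + pathParts.slice(fragParts.length).join('/');
}

// Walk the layers, pushing each matched fragment onto a stack;
// this is the state the instrumented module itself does not keep.
function traverse(layer, path, stack) {
  const rest = matchFragment(layer.fragment, path);
  if (rest === null) return null;
  stack.push(layer.fragment);
  if (layer.handler && rest === '/') return stack.join('');
  for (const child of layer.children || []) {
    const found = traverse(child, rest, stack);
    if (found) return found;
  }
  stack.pop(); // backtrack when no child matched
  return null;
}

const fullRoute = traverse(tree, '/api/users/42', []);
const noRoute = traverse(tree, '/other', []);
```

Here `fullRoute` is the parameterized pattern rather than the concrete URL, which is what a tracing tool typically wants to group requests by.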

Maybe the potential tracing reporting API would allow Express to report the full route piece-by-piece, but then we're still relying on Express maintainers to understand and accept a pull request that does this. Maybe we can count on Express to do such a thing, but can we count on every tracing-relevant module out there to do such a thing? And to collate and report every potential data point that any tool out there might find interesting? And can we count on end-users to use updated versions of all these modules?

I think it could be within the realm of possibility to eliminate the need to monkeypatch Node core libraries, but similar to @Qard I am not convinced that it is possible to avoid monkeypatching 3rd-party ecosystem modules. I think it'll be enough of a challenge to have everyone in the ecosystem propagate asynchronous contexts correctly in the face of various queueing/pooling/promise-caching scenarios, much less spit out sufficient trace data for every tracing/diagnostic use case to fully avoid monkeypatching.

We probably don't want to end up with every APM/diagnostic-tool vendor trying to send PRs to every library they want to trace, trying to have them report some additional piece of diagnostic data that one vendor cares about but another doesn't. I imagine that with the natural competitive inclinations of vendors wanting to provide features/capabilities that others don't, we would either see monkeypatching continue or we might even see ecosystem module maintainers caught in the middle of an awkward "what all should this module report?" tug of war.

I don't mean to just be a naysayer - I think reducing the need for/incidence of monkeypatching would be excellent, and I think the plan/steps Mike laid out above can satisfy a huge portion of use cases and go a very long way toward that goal. I just hesitate at the notion that monkeypatching can be eliminated entirely; despite the downsides, it's practically a feature of JavaScript.

I would be very interested to read about how other languages/platforms handle this sort of need, if anybody is familiar with good resources or examples. I know some other languages have handled asynchronous context tracking brilliantly, so maybe there is also something to be learned in this adjacent area.

danielkhan · 2018-01-29T01:53:50Z

I imagine that with the natural competitive inclinations of vendors wanting to provide features/capabilities that others don't, we would either see monkeypatching continue or we might even see ecosystem module maintainers caught in the middle of an awkward "what all should this module report?" tug of war.

Speaking as a vendor, I would appreciate a generic API. The agent part of an APM product is just a fraction of the value proposition of APM.

I don't think that there is a generic way to instrument every module there is.
I don't even think that this is needed, or maybe we have to reiterate on which problem we are trying to solve.

For me the MongoDB APM vendor API is a good example of how a module vendor provides a way for APM vendors to register callbacks instead of monkey patching.
So as APM vendor I know that this is MongoDB and I have to deliberately register myself there but the instrumentation happens deterministically in a publisher / subscriber way and isn't prone to break when the signature of a function changes.

Flarna · 2018-01-29T07:41:50Z

I'm not convinced it's possible to fully avoid monkey-patches. There will always be cases like needing to inspect a connection object of a database driver during a query to include host/port...

Fully agree here. The trace events have to give more info than just function arguments. Depending on the concrete use case it may be quite some effort to put all this data together.

And can we count on end-users to use updated versions of all these modules?

I fear not - we even see customers insisting on use of NodeJS 0.12. Others use ancient versions of NodeJs 4.

I would be very interested to read about how other languages/platforms handle this sort of need, if anybody is familiar with good resources or examples. I know some other languages have handled asynchronous context tracking brilliantly, so maybe there is also something to be learned in this adjacent area.

As far as I know the approach in Java and .NET is patching bytecode during loading. This requires you to be an "agent module" loaded in a special way - not like in NodeJS.
I would guess more knowledge of internals is needed here compared to NodeJs.
On the other hand you can be quite sure that you are the only one in a process doing something like this - again a major difference to NodeJs where patching of foreign objects is very common (not just for tracing/diagnostics).

mike-kaufman · 2018-01-29T16:26:52Z

RE other runtimes, .NET is now using something called DiagnosticSource, which is similar in function to DiagnosticChannel.

mike-kaufman · 2018-01-29T16:39:11Z

@LewisJEllis - thanks for your comments. A brief response to some of your points:

I don't think we'll see monkey-patching go away overnight; in fact, the plan above has us relying on monkey-patching in the interim until appropriate APIs can get merged into target libraries. Someone doing this work will need to address surfacing the necessary data, and this may very well involve refactoring the target library. E.g., if express has challenges in tracking route info w/out monkey-patching, then we should consider a PR to refactor express to make this a supported scenario.
RE APM vendors all getting on board, each vendor is free to go their own way if they prefer. That said, I think there are enough benefits in what's being proposed to get the vendors + library maintainers to cooperate.
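The interim "phase-in" step could look roughly like the sketch below. Everything in it is an assumption for illustration: `channel` is a hand-rolled stand-in for a shared publish/subscribe API, and `library` is a stand-in for a third-party module that does not publish events yet. A single shim monkey-patches the library and publishes to the channel, so agents only ever subscribe.

```javascript
// Hypothetical shared channel; the names are illustrative, not an existing API.
const subscribers = new Set();
const channel = {
  subscribe(fn) { subscribers.add(fn); },
  publish(message) { for (const fn of subscribers) fn(message); },
  hasSubscribers() { return subscribers.size > 0; },
};

// Stand-in for a third-party library that does not publish events yet.
const library = {
  query(sql) { return `result of ${sql}`; },
};

// Interim shim: wrap the method once and publish on the shared channel.
function patchQuery(target) {
  const original = target.query;
  target.query = function patchedQuery(sql) {
    const result = original.call(this, sql);
    // Near-zero overhead when publication is off: no payload is built.
    if (channel.hasSubscribers()) {
      channel.publish({ name: 'library.query', sql });
    }
    return result;
  };
}

patchQuery(library);

// Agent side: subscribe to the channel, never touch the library directly.
const received = [];
channel.subscribe((message) => received.push(message));
const output = library.query('SELECT 1');
```

Once the library maintainers call `channel.publish` themselves, the shim can be deleted without any change on the subscriber side.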

isaachier · 2018-03-09T16:59:07Z

Even if you want to use OpenCensus, they seem to be interested in implementing the OpenTracing interface. OpenTracing is an interface behind which the actual tracer can do tracer-specific work. Here is the issue regarding the OpenCensus Go client, which applies just as easily to all of the clients: census-instrumentation/opencensus-go#502.

mike-kaufman · 2018-04-04T19:14:46Z

Closing in lieu of #180.

mike-kaufman changed the title ~~Get out of monkey patching~~ Eliminate Monkey Patching for diagnostic instrumentation Jan 24, 2018

mike-kaufman changed the title ~~Eliminate Monkey Patching for diagnostic instrumentation~~ Eliminate Monkey Patching for Diagnostic Instrumentation Jan 24, 2018

This was referenced Apr 2, 2018

Diagnostics Channel - Next steps #180

Closed

mike-kaufman closed this as completed Apr 4, 2018
