Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support APM Server intake API version 2 #465

Merged
merged 5 commits into from Aug 2, 2018
Merged

Conversation

watson
Copy link
Member

@watson watson commented Jul 23, 2018

Closes #356

Checklist

  • Implement code
  • Add tests
  • Fix TAV tests

@watson
Copy link
Member Author

watson commented Jul 23, 2018

This PR should technically be ready for testing with the POC. The checklist above is meant for making it production ready

@watson watson mentioned this pull request Jul 23, 2018
@roncohen
Copy link
Contributor

roncohen commented Jul 23, 2018

would be great to get some benchmarks to see if this implementation is better than the current intake protocol, for example from a memory usage perspective.

lib/agent.js Outdated
agent._apmServer.flush(cb)
})
} else {
process.nextTick(cb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you log something here so we can tell when someone tries to send something without _apmServer set?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

lib/agent.js Outdated
if (this._apmServer) {
this._apmServer.flush(cb)
} else {
process.nextTick(cb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, log something so we know when _apmServer is not set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -50,6 +49,15 @@ function Transaction (agent, name, type) {
}
})

Object.defineProperty(this, 'timestamp', {
get: function () {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use shorthand here. get () {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -112,17 +121,16 @@ Transaction.prototype.buildSpan = function () {
}

if (this.ended) {
this._agent.logger.debug('transaction already ended - cannot build new span %o', {id: this.id})
this._agent.logger.debug('transaction already ended - cannot build new span %o', {id: this.id}) // TODO: Should this be supported in the new API?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When everything is fully streaming, I don't think there's any reason to prevent further spans. It's only useful without streaming to ensure the transaction doesn't stay open forever.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I'm hesitant with this is that we might have a transaction that kicks off some form of background job that gets contextually attached to the transaction that started it. Any spans that's created as a result of this would get associated with the transaction. This will make the span waterfall almost useless as it can suddenly cover a period of days or weeks. Am I being too paranoid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd be user mode queuing though. We won't actually form that link unless we explicitly bind the callback. The logical continuation to the next job would be out-of-band from the request, it'd be triggered by the completion of another job in the queue.

I could see maybe time-based jobs naively using setTimeout for everything, and that maybe being an issue. But that's just not great design and we probably should be drawing attention to it in some way. 🤔

Copy link
Member Author

@watson watson Jul 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah setTimeout and friends are probably the most common way to trigger this.

Question is however if the spans happening after the transaction ends are relevant to the user at all?

@elastic/apm-agent-devs @roncohen @makwarth In the new intake API, we get the ability to record spans that start after the parent transaction ends. Is this something we want to do or should those just be ignored? I.e. are they adding any value to the user or are they just noise?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I.e. are they adding any value to the user or are they just noise?

I think that depends on how the application is making use of that. I'd rather default to not ignore than to ignore these spans.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't ignore them by default, would it be a good idea to have a config option to ignore them?

It could even have 3 settings:

  • Ignore spans started after the transaction ends
  • Ignore spans started before the transaction ends, but ended after
  • Allow all spans

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i understand you're worried about the case where people would set up a recurring periodic job from inside a transaction, because it would mean spans continue to happen in an endless stream. Two ideas:

  • We could experiment in making the assumption that there are two categories of delayed execution process.nextTick()/setImmediate() and setTimeout()/setInterval(). The first one could indicate that the developer intended this to be part of the transaction, while the second category will start a new transaction.
  • We could set a numerical limit to the number of spans that we will consider, after a transaction has ended - or a wall clock timeout which would be the limit to how much time we'll continue to "monitor" a transaction after is has ended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also include a special escape hatch that specifically calls to setTimeout or setInterval after the transaction ends would not continue the context. That way, anything else in the code path will continue, but anything that could potentially be significantly delayed will not.

@watson
Copy link
Member Author

watson commented Jul 24, 2018

To make the scope of this PR as small as possible, I've made it go to a different branch than master. This way we can merge it when we're ok with the current scope of it and add new v2 features in new PR's to the same branch. Then when we're ready to land support for v2 in master, we make a PR that merges the api-v2 branch into master

@watson
Copy link
Member Author

watson commented Jul 24, 2018

To help this PR getting merge, I've created the meta issue #471 and moved all the non-essential todo items from this PR to that

@watson
Copy link
Member Author

watson commented Jul 31, 2018

jenkins, run tav tests please

this._apmServer.flush(cb)
} else {
this.logger.warn(new Error('cannot flush agent before it is started'))
process.nextTick(cb)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worthwhile for the flush callback be able to receive that error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it out as this only occurs if the agent isn't started. So from the outside I want the user to use the module in the same way no matter if the agent is started or not. And normally errors coming from the _apmServer is just logged, so I think it's best to maintain that behavior here as well.

Copy link
Member Author

@watson watson Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I just realized that the callback given to captureError will pass a similar error along in the callback: https://github.com/elastic/apm-agent-nodejs/pull/465/files#diff-7403da6ce2244c65d641fdfcc095be93R331

But I think maybe that should be avoided in captureError. If so the captureError callback will never be called with an error either.

lib/config.js Outdated
@@ -28,7 +28,9 @@ var DEFAULTS = {
verifyServerCert: true,
active: true,
logLevel: 'info',
hostname: os.hostname(),
hostname: os.hostname(), // TODO: Should we just let the http client default to this?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaulting in the http client sounds reasonable to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

if (!payload) return agent.logger.debug('transaction ignored by filter %o', {id: transaction.id})
truncate.transaction(payload)
agent.logger.debug('sending transaction %o', {id: transaction.id})
if (agent._apmServer) agent._apmServer.sendTransaction(payload)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way for agent._apmServer to be null when this._started is true? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No you're right - if this._started === true, then we'll also have an agent._apmServer. Would that be a better check? The current check is guaranteed to always work, whereas the conditions of the other might change in the future right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made that comment, because this check is already in an outer if block checking this._started, so it seemed redundant. Github cuts off the preview just after that line though. 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes... I'll remove it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

architecture: process.arch,
platform: process.platform
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like process, system and runtime stuff no longer exists? I don't see it in the http client PR either. Is that correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can see it's a bit confusing given the current state of the all the in-progress dependencies.

As you can see here, this PR depends on a custom branch of the http client called v2-1 that currently lives on my fork: https://github.com/elastic/apm-agent-nodejs/pull/465/files#diff-b9cfc7f2cdf78a7f4b91a753d10865a2R74

In that branch the metadata responsibilities have been moved to the http client: https://github.com/watson/apm-nodejs-http-client/blob/4f02d7eda173b738a309fdca2e3464a1bbae5652/index.js#L293-L332

But there's no PR for this branch yet as I don't want to make the current http client PR too complicated, so once the the current PR is merged, I'll open a new one for the v2-1 branch 😅

Copy link
Member Author

@watson watson Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just merged the first http client PR and created the next based on my v2-1 branch: elastic/apm-nodejs-http-client#6 - soon it should hopefully be less confusing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Gotcha. I was trying to match up the two PRs to make sure the functionality lined up between the two. 😅

client.request('transactions', {}, body, () => {})
req.pipe(zlib.createGunzip()).pipe(ndjson.parse()).on('data', function (data) {
if (req.method !== 'POST') throw new Error(`Unexpected HTTP method: ${req.method}`)
if (req.url !== '/v2/intake') throw new Error(`Unexpected HTTP url: ${req.url}`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could do the req field validation outside the pipe sequence and data handler. Also, the assert module might be cleaner here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

t.equal(body.errors[0].exception.message, 'with callback')
.on('data-error', function (data) {
t.equal(data.exception.message, 'with callback')
t.end()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does tape work correctly when you explicitly t.end() while using a t.plan(...)?

Copy link
Member Author

@watson watson Aug 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, all it does when reaching the t.end is just fail if the number of assertions doesn't match what's planned. This is a way to validate our expectations of the async execution order. I.e. when we reach this point, we should have made the planned number of assertions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, cool. 👍

@@ -54,7 +49,7 @@ module.exports = function mockAgent (cb) {
sharedInstrumentation._agent = agent
agent._instrumentation = sharedInstrumentation
agent._instrumentation.currentTransaction = null
agent._instrumentation._queue._clear()
agent._apmServer = mockClient(expected, cb || noop) // TODO: Expected will not work here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still accurate? What's the issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think you're right. I think this line is completely irrelevant and can just be removed. I'll update it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

t.ok(payload.start > 0)
t.ok(payload.duration > 0)
assert.stacktrace(t, 'myTest3', __filename, payload.stacktrace, agent)
t.deepEqual(Object.keys(payload), ['transactionId', 'timestamp', 'name', 'type', 'start', 'duration'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't think we need to validate the contents of the object here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

t.end()
process.exit()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this process.exit() need to be here? I'd rather not have these hiding in the tests as they could cause confusion when adding more tests to the file in the future, if you don't notice them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@Qard
Copy link
Contributor

Qard commented Aug 1, 2018

Finally managed to get a complete review it. Looks good, other than a couple more comments.

@Qard
Copy link
Contributor

Qard commented Aug 2, 2018

Approved for v2 branch. Any remaining issues can be fixed in another PR, before merging v2 to master.

@watson watson merged commit af1dd76 into elastic:api-v2 Aug 2, 2018
@watson watson deleted the v2-api branch August 2, 2018 18:09
watson added a commit that referenced this pull request Aug 2, 2018
watson added a commit that referenced this pull request Aug 6, 2018
watson added a commit that referenced this pull request Aug 9, 2018
watson added a commit that referenced this pull request Aug 10, 2018
watson added a commit that referenced this pull request Aug 29, 2018
Qard pushed a commit to Qard/apm-agent-nodejs that referenced this pull request Aug 29, 2018
Qard pushed a commit to Qard/apm-agent-nodejs that referenced this pull request Aug 30, 2018
watson added a commit that referenced this pull request Aug 30, 2018
Qard pushed a commit to Qard/apm-agent-nodejs that referenced this pull request Aug 30, 2018
watson added a commit that referenced this pull request Sep 6, 2018
watson added a commit that referenced this pull request Sep 13, 2018
Qard pushed a commit to Qard/apm-agent-nodejs that referenced this pull request Sep 13, 2018
Qard pushed a commit to Qard/apm-agent-nodejs that referenced this pull request Sep 13, 2018
watson added a commit that referenced this pull request Oct 19, 2018
watson added a commit to watson/apm-agent-nodejs that referenced this pull request Oct 27, 2018
Qard pushed a commit that referenced this pull request Oct 31, 2018
watson added a commit to watson/apm-agent-nodejs that referenced this pull request Nov 6, 2018
watson added a commit that referenced this pull request Nov 6, 2018
Qard pushed a commit that referenced this pull request Nov 8, 2018
watson added a commit that referenced this pull request Nov 10, 2018
watson added a commit to watson/apm-agent-nodejs that referenced this pull request Nov 13, 2018
watson added a commit to watson/apm-agent-nodejs that referenced this pull request Nov 13, 2018
watson added a commit to watson/apm-agent-nodejs that referenced this pull request Nov 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants