
AWS S3 behavior changes when using the auto instrumentation #1609

Closed
zdelagrange opened this issue Jul 21, 2023 · 12 comments · Fixed by open-telemetry/opentelemetry-js#4346
Labels
bug (Something isn't working), priority:p1 (Bugs which cause problems in end-user applications such as crashes, data inconsistencies)

Comments

@zdelagrange

What version of OpenTelemetry are you using?

"@opentelemetry/auto-instrumentations-node": "^0.37.1",
"@opentelemetry/sdk-node": "^0.40.0",

What version of Node are you using?

v16.20.0

What did you do?

When using the recommended Node.js auto-instrumentation setup found here: https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#setup, with a global S3 configuration, the auto-instrumentation causes an error to be thrown when AWS_REGION is set to a region other than us-east-1 or aws-global. The same requests succeed when instrumentation.js is not required.

Reproduction here: https://github.com/zdelagrange/otel-nodejs-aws-bug
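For reference, a minimal sketch of the kind of instrumentation.js setup the linked guide describes (the exporter choice here is illustrative and may differ from the reproduction repo):

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

// Loaded via `node --require ./instrumentation.js app.js`
const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();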

What did you expect to see?

I expected S3 functionality not to change when using the auto instrumentation.

What did you see instead?

S3 functionality changed when using the auto instrumentation, causing a 403/failure to be returned where the same request succeeds without the auto instrumentation:

SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
    at Request.extractError (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/services/s3.js:711:35)
    at Request.callListeners (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:686:14)
    at Request.transition (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:688:12)
    at Request.callListeners (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'SignatureDoesNotMatch',
  region: null,
  time: 2023-07-21T14:49:13.821Z,
  requestId: 'xxx',
  extendedRequestId: 'xxx',
  cfId: undefined,
  statusCode: 403,
  retryable: false,
  retryDelay: 175.4145547667371
}

Additional context

Reproduction here: https://github.com/zdelagrange/otel-nodejs-aws-bug

@zdelagrange added the bug (Something isn't working) label on Jul 21, 2023
@zdelagrange
Author

Tagging component owners @carolabadeer @blumamir

@dyladan
Member

dyladan commented Jul 26, 2023

I think this is a P1 bug. I'll assign that priority for now and the code owner can update it if required.

@dyladan added the priority:p1 (Bugs which cause problems in end-user applications such as crashes, data inconsistencies) label on Jul 26, 2023
@Ankcorn
Contributor

Ankcorn commented Jul 27, 2023

Hey, we ran into this too; it's not isolated to S3. It seems to affect requests that are retried as well.

Kinesis without retry:
https://sandbox.baselime.io/baselime/sandbox/datasets/otel/traces/1d5a0a575ff7e392a8e7f9d47e20c912

Kinesis with retry:
https://sandbox.baselime.io/baselime/sandbox/datasets/otel/traces/2c297889cf56f6d97287ed230852416b

This is using these versions:

 "@opentelemetry/api": "^1.4.1"
  "@opentelemetry/exporter-trace-otlp-http": "^0.40.0"
  "@opentelemetry/instrumentation": "^0.40.0"
  "@opentelemetry/instrumentation-aws-lambda": "^0.35.2"
  "@opentelemetry/instrumentation-aws-sdk": "^0.34.2"
  "@opentelemetry/instrumentation-http": "^0.40.0"
  "@opentelemetry/sdk-trace-base": "^1.14.0"
  "@opentelemetry/sdk-trace-node": "^1.14.0"

From this: https://github.com/baselime/lambda-node-opentelemetry

I am looking into a fix, and if it's helpful I can post anything I find here.

@zdelagrange
Author

zdelagrange commented Aug 9, 2023

Howdy,

Any update on this? Would be good to at least get some triage info from the maintainers if possible...

Would upgrading to AWS SDK v3 fix this?

Thank you!

@ten-ans

ten-ans commented Aug 9, 2023

I ran into this issue during our migration to OpenTelemetry as well; the problem is with instrumentation-http, not the AWS SDK instrumentation specifically.

https://github.com/open-telemetry/opentelemetry-js/blob/a4213183b03664cdf29e622932ea4823a9b8de50/experimental/packages/opentelemetry-instrumentation-http/src/http.ts#L587

The _outgoingRequestFunction function here injects tracing headers into optionsParsed.headers, but that property holds a reference to the same object that the instrumented code passed as the options to Node.js's request or get function, so the headers leak back out into the caller's internal state. This can cause a wide variety of unexpected behavior.

Regarding the specific issue we ran into with the AWS SDK making calls to S3: we have a bucket located in us-east-2, and when we try to interact with it AWS sends back a redirect to a different API endpoint. The v2 AWS SDK then retries its request against this other endpoint, and in the process it needs to update the signature on the request's HTTP headers. An injected traceparent header leaks out into the object that the AWS SDK uses internally to track the request headers, which causes the AWS SDK to include the traceparent header in the retry's request signature. The OpenTelemetry instrumentation then injects a different traceparent on this second request, and that breaks the request signature.

The solution is simple enough: replace the value of optionsParsed.headers with a clone of the object that it points to. We locally hacked in a patch to this effect and it made the problem go away in our environment. Unfortunately it may take a while before I am able to contribute a proper upstream fix for this issue, but if anybody wants to work that description up into a PR against opentelemetry-js in the meantime, please feel free.
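For illustration, a minimal sketch of that cloning idea, using a simplified injection helper rather than the actual instrumentation-http code:

const { context, propagation } = require('@opentelemetry/api');

// Simplified sketch: copy the caller's headers onto a fresh object before
// injecting trace context, so the caller's object is never mutated.
function injectWithoutMutating(options) {
  options.headers = Object.assign({}, options.headers);
  propagation.inject(context.active(), options.headers);
  return options;
}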

@Ankcorn
Contributor

Ankcorn commented Aug 10, 2023

@ten-ans hey, I also followed up on this bug on the AWS end. I was surprised that it only affected the v2 SDK and not v3.

I found that the v2 SDK caches the signature it adds and does not recompute it on retries.

I raised an issue and got a simpler workaround:

AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');

I have tested this and confirmed that it works.

Hope this helps

aws/aws-sdk-js#4472 (comment)
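For anyone applying this in an application, a sketch of where the workaround goes, assuming the v2 aws-sdk package (the bucket and key names are placeholders):

const AWS = require('aws-sdk');

// Tell the v4 signer to exclude the traceparent header from request signing,
// so a traceparent injected by the HTTP instrumentation cannot invalidate
// the signature on retried requests.
AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');

const s3 = new AWS.S3();
s3.getObject({ Bucket: 'my-bucket', Key: 'example.txt' }, (err, data) => {
  if (err) console.error(err);
  else console.log('got %d bytes', data.Body.length);
});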

@carolabadeer
Contributor

Thanks @ten-ans and @Ankcorn for your thorough investigation of this issue!

@Ankcorn from reading through the issue you opened in the AWS SDK repo, it seems that they confirmed it is an issue with the AWS SDK v2 (and is not present in v3). Is the suggested workaround a change that can be added in an application, or an instrumentation change?

@Ankcorn
Contributor

Ankcorn commented Aug 24, 2023

Hey @carolabadeer,

Thanks for taking a look at this :) The PR to add traceparent (aws/aws-sdk-js#4473) has not been looked at yet.

I also don't know if it totally solves the problem; AFAIK Lambda runtimes would still pull in an old and incompatible version of the aws-sdk, which would cause some confusion for new OTel users.

We might be able to fix this by adding the

AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent'); 

to the AWS SDK instrumentation.

Thoughts?
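Hypothetically, the instrumentation-side version of that workaround might look something like the sketch below (patchV2Signer is an illustrative helper, not part of the actual instrumentation):

// Hypothetical helper the aws-sdk instrumentation could run when it patches
// the v2 SDK module: mark traceparent as unsignable so retries keep a valid
// signature even after the header is injected.
function patchV2Signer(awsSdkModule) {
  const unsignable = awsSdkModule.Signers && awsSdkModule.Signers.V4
    ? awsSdkModule.Signers.V4.prototype.unsignableHeaders
    : undefined;
  if (unsignable && !unsignable.includes('traceparent')) {
    unsignable.push('traceparent');
  }
  return awsSdkModule;
}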

@cartersocha

Hey @mhausenblas, is there someone from the AWS SDK team we can coordinate with? @martinkuba is going to spend some time here too.

@trentm
Contributor

trentm commented Nov 25, 2023

The issue -- as @ten-ans pointed out above (#1609 (comment)) -- is that the HttpInstrumentation modifies the headers object passed into http.request() (or http.get()) in-place to add the trace-context header(s):

http.get(someUrl, { headers: { Foo: 'Bar' } }, (res) => { ... });

Often, calling code won't notice this change. However, aws-sdk code that retries keeps a reference to that headers object, so this is what happens:

  1. The first HTTP request is made with, for example, theHeaders = { Foo: 'Bar' }. AWSv4 request signing will include that "Foo" header, and any others in there, to create a signature for the request.
  2. The OTel HTTP instrumentation changes theHeaders to { Foo: 'Bar', traceparent: 'a-traceparent-for-this-first-request' }. The request is sent. It fails for some reason, but is retryable.
  3. The AWS SDK then makes whatever request changes it needs for the retry. theHeaders still has the old traceparent value. The AWS SDK now includes both Foo and traceparent headers in its request signing. This would normally be fine; however, the order is: the aws-sdk uses the old traceparent value for the request signature, then OTel changes the traceparent to a new value for this next request, then the request is sent. The server notices the signature mismatch.

This can be seen in this example:

const { NodeSDK, tracing } = require('@opentelemetry/sdk-node');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const sdk = new NodeSDK({
  serviceName: 'make-http-req',
  spanProcessor: new tracing.SimpleSpanProcessor(new tracing.ConsoleSpanExporter()),
  instrumentations: [
    new HttpInstrumentation(),
  ]
})
sdk.start();

const http = require('http');
const myHeaders = {
  Foo: 'Bar'
}
http.get('http://localhost:8200/', { headers: myHeaders }, (res) => {
  console.log('response: %s %s', res.statusCode, res.headers);
  res.resume();
  res.on('end', () => {
    console.log('response end');
  });
});
console.log('http.get() called: was myHeaders mutated? %s', myHeaders);

Running that shows that myHeaders is mutated:

% node make-http-req.js
http.get() called: was myHeaders mutated? {
  Foo: 'Bar',
  traceparent: '00-35f4aeff468b2067ae1d5dba82c63ed4-0afcfcdb409c2d13-01'
}
response: 200 {
  date: 'Sat, 25 Nov 2023 00:17:39 GMT',
  connection: 'close',
  'transfer-encoding': 'chunked'
}
{
  traceId: '35f4aeff468b2067ae1d5dba82c63ed4',
  parentId: undefined,
  traceState: undefined,
  name: 'GET',
  id: '0afcfcdb409c2d13',
  kind: 2,
  timestamp: 1700871459100000,
  duration: 11929.505,
  attributes: {
    'http.url': 'http://localhost:8200/',
    'http.method': 'GET',
    'http.target': '/',
    'net.peer.name': 'localhost',
    'http.host': 'localhost:8200',
    'net.peer.ip': '127.0.0.1',
    'net.peer.port': 8200,
    'http.status_code': 200,
    'http.status_text': 'OK',
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp'
  },
  status: { code: 0 },
  events: [],
  links: []
}
response end

I got lucky here, because I hit this before in Elastic's APM agent. A longer write-up is here: elastic/apm-agent-nodejs#2134

It makes sense why -- as noted in open-telemetry/opentelemetry-js#3922 -- open-telemetry/opentelemetry-js@4978743 changed the behaviour: the header normalization from before that commit was indirectly creating a clone of the headers object.
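For context, a simplified illustration of why that older normalization behaved like a clone (this is not the removed code, just the shape of the idea):

// Copying header names into a new, lower-cased object incidentally produced
// a fresh object, so later propagation.inject() calls never touched the
// object the caller passed in.
function normalizeHeaders(headers) {
  const normalized = {};
  for (const key of Object.keys(headers)) {
    normalized[key.toLowerCase()] = headers[key];
  }
  return normalized;
}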

possible fix

An alternative to #1813 would be something like this -- again, as @ten-ans suggested above.

diff --git a/experimental/packages/opentelemetry-instrumentation-http/src/http.ts b/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
index 9422bbc9..780f5c32 100644
--- a/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
+++ b/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
@@ -672,6 +672,10 @@ export class HttpInstrumentation extends InstrumentationBase<Http> {

       if (!optionsParsed.headers) {
         optionsParsed.headers = {};
+      } else {
+        // Make a copy of the headers object to avoid mutating an object the
+        // caller might have a reference to.
+        optionsParsed.headers = Object.assign({}, optionsParsed.headers);
       }
       propagation.inject(requestContext, optionsParsed.headers);

@martinkuba Thoughts? Should I make a PR for that?

@trentm
Contributor

trentm commented Dec 5, 2023

Should I make a PR for that?

I've added a PR for the alternative fix here: open-telemetry/opentelemetry-js#4346

@trentm
Contributor

trentm commented Dec 6, 2023

To be clear, this should be fixed with the next release of @opentelemetry/instrumentation-http (any release after 0.45.1).

dyladan added a commit to open-telemetry/opentelemetry-js that referenced this issue Dec 13, 2023
* fix(instrumentation-http): do not mutate given headers object for outgoing http requests (#4346)

  Fixes: open-telemetry/opentelemetry-js-contrib#1609
rdeavila94 pushed a commit to rdeavila94/opentelemetry-js that referenced this issue Jan 3, 2024