Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incompatibility between @opentelemetry/instrumentation-http ^0.35.1 and aws-sdk v2 #3922

Closed
leonardofreitass opened this issue Jun 19, 2023 · 10 comments
Assignees
Labels
bug Something isn't working priority:p1 Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc

Comments

@leonardofreitass
Copy link

leonardofreitass commented Jun 19, 2023

What happened?

Steps to Reproduce

When using @opentelemetry/instrumentation-http version 0.35.1 or higher, whenever aws-sdk v2.x.x retries a request (e.g.: rate limit or timeout happens), the request fails, because aws-sdk adds the traceparent to the SignedHeaders, and as a result AWS throws a InvalidSignatureException 400 back.

Expected Result

Traceparent is not added to SignedHeaders making possible to retry a failed request.

Actual Result

Traceparent is added to SignedHeaders resulting in a HTTP 400 InvalidSignatureException.

Additional Details

This should not be an issue in aws-sdk v3. Also, strange enough, this only happens on request retries. If the request is successful at first attempt, this is not a problem, making this issue even harder to reproduce and debug.

I know this is hard to fix from this package, and a proper fix would probably be aws-sdk to ignore the traceparent header, but this was definitely working prior to version 0.35.1. I managed to consistently reproduce this locally and pinpoint the offender commit to this one: 4978743

For some reason, removing the header normalization breaks request retry on aws-sdk v2. I tried to dig deeper into why this is the case, but I only managed to confirm that this piece of code being removed is definitely the one causing the issue, not why.

If anything, we might want to add somewhere that version 0.35.0 is the latest one where you can safely use with aws-sdk v2, as this is a very hard issue to reproduce and chase down.

OpenTelemetry Setup Code

import { NodeTracerConfig, NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api'
import { Resource } from '@opentelemetry/resources'

import { registerInstrumentations } from '@opentelemetry/instrumentation'
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http'

diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO)

const resource = new Resource({
  'service.name': 'foo-bar-service',
})

const tracerConfig: NodeTracerConfig = {
  resource,
}

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation({
      ignoreIncomingPaths: ['/metrics'],
    }),
  ],
  tracerProvider: new NodeTracerProvider(tracerConfig),
})

package.json

{
  "dependencies": {
    "@opentelemetry/api": "^1.4.1",
    "@opentelemetry/context-async-hooks": "^1.14.0",
    "@opentelemetry/core": "^1.14.0",
    "@opentelemetry/exporter-trace-otlp-http": "^0.40.0",
    "@opentelemetry/instrumentation": "^0.40.0",
    "@opentelemetry/instrumentation-amqplib": "^0.32.5",
    "@opentelemetry/instrumentation-dns": "^0.31.5",
    "@opentelemetry/instrumentation-http": "0.35.0",
    "@opentelemetry/instrumentation-net": "^0.31.4",
    "@opentelemetry/resources": "^1.14.0",
    "@opentelemetry/sdk-trace-base": "^1.0.1",
    "@opentelemetry/sdk-trace-node": "^1.14.0",
    "aws-sdk": "^2.1399.0"
  },
}

Relevant log output

No response

@leonardofreitass leonardofreitass added bug Something isn't working triage labels Jun 19, 2023
@leonardofreitass
Copy link
Author

@pichlermarc pichlermarc added priority:p1 Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc and removed triage labels Jun 21, 2023
@weyert
Copy link
Contributor

weyert commented Jun 22, 2023

Looks like someone suggested this: https://github.com/NativeChat/otel-aws-sdk-retry-patch/tree/main

@mugli
Copy link

mugli commented Jun 23, 2023

@weyert I saw that lib, I would not suggest using that. There are multiple problems:

  • It's only aware of the traceparent header, and does not deal with other propagation headers, like baggage, tracestate etc
  • It's not gonna work for people who are using ESM instead of CJS. (I assume one needs https://github.com/DataDog/import-in-the-middle for that)?

Maybe it'd be better to take hints from that lib and not directly use it.

@dyladan dyladan added up-for-grabs Good for taking. Extra help will be provided by maintainers and removed up-for-grabs Good for taking. Extra help will be provided by maintainers labels Jun 28, 2023
@dyladan dyladan self-assigned this Jun 28, 2023
@Ankcorn
Copy link

Ankcorn commented Jul 27, 2023

I also opened an issue on the aws-sdk-js repo.

They ignore the x-amz-trace-id from the signature signing so it wouldn't be uncalled for to ignore traceparent and others either

@Ankcorn
Copy link

Ankcorn commented Jul 31, 2023

Hey, getting your bump on this issue would be great @leonardofreitass - aws/aws-sdk-js#4472

I have done some digging and essentially the aws-sdk-js caches the signature and doesn't recompute it when a request retries.

This leads to a mismatch between the signature and the new traceparent header.

The js v2 SDK is the only SDK with this caching behaviour as far as I know (I only checked botocore and js v3)

The fix on the AWS end is easy but not ideal for lambda users as they would need to start packaging the aws-sdk in their lambdas to get the update.

Would a work around in the instrumentation-http where we check the request is the AWS SDK V2 and skip injecting the propagation be acceptable here?

@weyert
Copy link
Contributor

weyert commented Jul 31, 2023

I think it's better to not have custom logic AWS in the 'vanilla' instrumentation-http.

@Ankcorn
Copy link

Ankcorn commented Jul 31, 2023

To fix this there needs to be some kind of deny list for propagation as described in #1698

This wouldn't give good defaults to people using opentelemetry outside of vendors who are configuring it for people

@martinkuba
Copy link
Contributor

This looks like a duplicate of open-telemetry/opentelemetry-js-contrib#1609.

There is a workaround by using the following:
AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');

@mugli
Copy link

mugli commented Feb 1, 2024

This PR seems to fix the issue: #4346
which is released in @opentelemetry/instrumentation-http v0.46.0: #4358

@dyladan dyladan closed this as completed Feb 14, 2024
@dyladan
Copy link
Member

dyladan commented Feb 14, 2024

Closed based on #3922 (comment)

Feel free to reopen or open a new issue if that is not correct

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:p1 Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc
Projects
None yet
Development

No branches or pull requests

7 participants