Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Credentials timeout errors are deceptive #1118

Open
elrob opened this issue Mar 29, 2024 · 8 comments
Open

Credentials timeout errors are deceptive #1118

elrob opened this issue Mar 29, 2024 · 8 comments
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@elrob
Copy link
Contributor

elrob commented Mar 29, 2024

Describe the bug

Today, when testing I've been seeing this error:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: TimedOutError(5s), connection: Unknown } })

This is despite the fact that I have this configuration set:

    let aws_sdk_config = aws_config::defaults(BehaviorVersion::latest())
        .retry_config(RetryConfig::standard().with_max_attempts(20))
        .load()
        .await;

Expected Behavior

I assume that this error is transient so should be resolved with retries even if my local network is unstable?
Or based on my RetryConfig, it should have retried 20 times but I assume it didn't based on how quickly the error appears.

Current Behavior

Occasional errors like this which seem to happen immediately rather than retrying up to 20 times.

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Other(None), source: TimedOutError(5s), connection: Unknown } })

Reproduction Steps

Maybe just configure a client as above and run a command with no network connection. The error appears quickly and retries don't seem to be obeyed.

Possible Solution

No response

Additional Information/Context

No response

Version

├── aws-config v1.1.9
│   ├── aws-credential-types v1.1.8
│   │   ├── aws-smithy-async v1.2.0
│   │   ├── aws-smithy-runtime-api v1.2.0
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-sigv4 v1.2.0
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-http v0.60.7
│   │   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-sdk-sso v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-http v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-ssooidc v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-sts v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-xml v0.60.7
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-types v1.1.8 (*)
├── aws-sdk-elasticloadbalancingv2 v1.19.0
│   ├── aws-credential-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-query v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-xml v0.60.7 (*)
│   ├── aws-types v1.1.8 (*)


### Environment details (OS name and version, etc.)

macOS 13.6.3 (22G436)

### Logs

_No response_
@elrob elrob added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 29, 2024
@ysaito1001
Copy link
Collaborator

Hi @elrob, looks like a timeout of 5 seconds kicked in before you observed the desired number of retries. Can you try disabling timeout (using TimeoutConfig::disabled()) and pass it to ConfigLoader::timeout_config to see if the behavior matches your expectation?

@ysaito1001 ysaito1001 removed the needs-triage This issue or PR still needs to be triaged. label Mar 29, 2024
@rcoh
Copy link
Contributor

rcoh commented Apr 1, 2024

This is an interesting one! There is a clue here as to what's going on:

TimedOutError(5s)

the TimedOutError in the connector doesn't have any fields—this is actually the timeout coming from identity cache!

That's the actual problem here! We're trying to load credentials and it's timing out after 5 seconds (this is a pretty sensible timeout for credentials, if it's taking longer than 5 seconds to load credentials, there are other problems!)

The error is a little misleading—it looks like a connector error because we never sent the request (we should fix that), but the timeout here is coming from the credentials chain. That's why it's not retried.

cc @ysaito1001 this appears to be the problematic line: https://github.com/smithy-lang/smithy-rs/blob/e3f0de42db727d5419948b03c7d5a3773b07e34b/rust-runtime/aws-smithy-runtime-api/src/client/orchestrator.rs#L273

We are incorrectly assuming that any dispatch failure is a connector error which isn't the case. The error here was actually a failure to load credentials. We need to figure out how to fix the error type here.

@elrob
Copy link
Contributor Author

elrob commented Apr 2, 2024

@ysaito1001 @rcoh
Thank you!
Yeah, I also thought it might be coming from the credentials provider. There is some annoying flakey malware security software installed on my machine which often causes temporary connection issues.

My assumption remains the same though, I'd expect the credentials provider to retry after a timeout up to the configured number of retries. Or if it isn't the credentials provider, I'd expect the client to retry after a timeout up to the configured number of retries. Perhaps the credentials provider STS client needs a separately configured retry configuration? I actually looked for that but didn't work out how I could configure that.

@ysaito1001
Copy link
Collaborator

Good eyes @rcoh! True, that's definitely something we want to clarify the error type for.

I'd expect the credentials provider to retry after a timeout up to the configured number of retries.

That depends on the underlying retry classifiers (those that implement the ClassifyRetry trait). For instance, if you do not get a response from STS and fail to load credentials, then I don't think it gets retried regardless of the number of retries configured. Can you turn on tracing_subscriber e.g. tracing_subscriber::fmt::init() and see whether an retry occurs or does not retry at all before you error out?

@rcoh
Copy link
Contributor

rcoh commented Apr 4, 2024

Just wanted to clarify some things here. The timeout is on the top level of loading credentials (at the cache level). It is set to 5 seconds (this is configurable), but we don't really recommend messing with it.

Re: ClassifyRetry: Credentials retries do exist, but they're configured specifically for each provider and don't operate the same way as general-purpose retries (nor are they configurable in the same way). They will show up in the logs, however.

This is because credentials taking longer than 5 seconds to load almost always is indicative of other problems or anti patterns.

For example, this can occur if you are simultaneously creating a large number of clients or aws_config instances. There is currently an open issue around credentials being cached per client instead of on the root SdkConfig (but you still should not be creating a large number of clients!)

You can configure this timeout to be longer, but the timeout is almost certainly not the problem here. I would dive deeper into where your credentials are coming from, how often they're being refreshed, etc. — it's also not out of the question that you're triggering an SDK bug around credentials caching!

@elrob
Copy link
Contributor Author

elrob commented Apr 4, 2024

Thanks. I haven't seen this issue appear in EKS but it appeared a few times for me when testing something out on my machine. I can live with it assuming it doesn't appear on EKS in production.

@elrob
Copy link
Contributor Author

elrob commented Apr 4, 2024

I don't seem to see any retries in the logs even with RUST_LOG=trace.
This example happens locally on my machine if I disable my network connection (turn off wifi).
Whether or not this should be retryable is something I'll leave to you to decide.

{"timestamp":"2024-04-04T11:42:33.950270Z","level":"INFO","fields":{"message":"first credentials will be loaded from Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }","base":"Sso { sso_session_name: None, sso_region: \"us-east-1\", sso_start_url: \"REDACTED\", sso_account_id: Some(\"REDACTED\"), sso_role_name: Some(\"REDACTED\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:33.950301Z","level":"INFO","fields":{"message":"which will be used to assume a role","role_arn":"RoleArn { role_arn: \"arn:aws:iam::REDACTED\", external_id: None, session_name: Some(\"sso\") }"},"target":"aws_config::profile::credentials::exec","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.227689Z","level":"WARN","fields":{"message":"failed to load base credentials","error":"an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }))"},"target":"aws_config::profile::credentials","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}
{"timestamp":"2024-04-04T11:42:35.228331Z","level":"WARN","fields":{"message":"provider failed to provide credentials","provider":"Profile","error":"an error occurred while loading credentials: an error occurred while loading credentials: dispatch failure: io error: error trying to connect: dns error: failed to lookup address information: nodename nor servname provided, or not known: dns error: failed to lookup address information: nodename nor servname provided, or not known: failed to lookup address information: nodename nor servname provided, or not known (ProviderError(ProviderError { source: ProviderError(ProviderError { source: DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, ConnectError(\"dns error\", Custom { kind: Uncategorized, error: \"failed to lookup address information: nodename nor servname provided, or not known\" })), connection: Unknown } }) }) }))"},"target":"aws_config::meta::credentials::chain","span":{"name":"lazy_load_identity"},"spans":[{"name":"lazy_load_identity"}]}

@jdisanti
Copy link
Contributor

jdisanti commented Apr 9, 2024

Leaving this open to make the error message less confusing when credentials providers time out (also tracking in smithy-lang/smithy-rs#2950).

@jdisanti jdisanti added the p2 This is a standard priority issue label Apr 9, 2024
@jdisanti jdisanti changed the title Retries not working as expected? Credentials timeout errors are deceptive Apr 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

4 participants