Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upticks of S3 timeout errors after SDK upgrade #1130

Open
xizhem opened this issue Apr 16, 2024 · 4 comments
Open

Upticks of S3 timeout errors after SDK upgrade #1130

xizhem opened this issue Apr 16, 2024 · 4 comments
Labels
bug This issue is a bug. p1 This is a high priority issue

Comments

@xizhem
Copy link

xizhem commented Apr 16, 2024

Describe the bug

Recently after SDK upgrade to aws-s3-sdk(>1.14), we notice upward trend of S3 GET timeout errors in production. We already ruled out the issue from #1118 . In our case, the error message is TransientError due to hitting attempt timeout.

There are correlation with connection timeout setting with the number of errors we've seen.
There are also correlation with the load that we send to S3 to the number of errors.

Expected Behavior

Our timeout setting is as follow:

connection timeout: default to 3.1
attempt timeout: 800ms
operation timeout: 2.6s
total attempts: 3

We expect S3 request to success during this 2.6s.

Current Behavior

SDK did retry 3 times as we check. But still, we timeout after 3 attempts exhausted.

2024-04-16T22:08:15.844Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #1 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 47.57774ms
...(backoff 47.57 then 800ms after)
2024-04-16T22:08:16.693Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #2 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 34.98639ms
...(backoff 34.98 then 800ms after)
2024-04-16T22:08:17.530Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=3 max_attempts=3
2024-04-16T22:08:17.530Z ERROR {redacted} GetObject request failed for key "{redacted}" with error TimeoutError(TimeoutError { source: MaybeTimeoutError { kind: OperationAttempt, duration: 800ms } })
2024-04-16T22:08:17.530Z ERROR {redacted} Get Object request failed: "request has timed out"

Smithy orchestrator typically emit halting line before the TransientError. We couldn't tell whether connection is established successfully within these 800ms or not, as there is no identifier between hyper logs vs. SDK logs.

2024-04-11T21:26:03.775Z DEBUG invoke{service=s3 operation=GetObject}:try_op:try_attempt: aws_smithy_runtime::client::orchestrator: encountered orchestrator error; halting

Reproduction Steps

We ran load test to benchmark S3 client and we found correlation between connection timeout and Timeout errors. The load test is running at max possible of 200 concurrency of S3 gets.

Run 1: 50ms connection timeout

Total errors: 4697.000
Total transactions: 4896.000, Successful S3 transactions 199

Run 2: 400ms connection timeout

Total errors: 3969.000
Total transactions: 9718.000, Successful S3 transactions 5701

Run 3: 1500ms connection timeout

Total errors: 3000.000
Total transactions: 42531.000, Successful S3 transactions 38419

Run 4: 2000ms connection timeout

Total errors: 3201.000
Total transactions: 33302.000,  Successful S3 transactions 29478

Run 5: 3000ms connection timeout

Total errors: 0.000
Total transactions: 128115.000, Successful S3 transactions 118430

Possible Solution

By using linux ss command to observer socket overview. I found that connection created by SDK client to S3 does not have keep alive header. Note 3.5.87.213:https is s3 host as I check from here

ESTAB      0      0       [redacted_ip]:[redacted_port]     3.5.87.213:https users:(("[redacted_user]",pid=3000417,fd=73)) uid:4001 ino:277212367 sk:2b012 cgroup:unreachable:1 <->

whereas a typical connection could look like this, notice the keep alive header:

ESTAB      0      0         [redacted_ip]:[redacted_port]     [redacted_ip]:[redacted_port]    users:(("[redacted_user]",pid=12886,fd=397)) timer:(keepalive,1.580ms,0) uid:99 ino:407978 sk:161ea cgroup:unreachable:1 <->

I suspect this issue is due to inefficient usage of connection reuse down at hyper layer, i.e. previous active connection are closed by S3 randomly due to the lack of header. But I could be wrong.

We also observe this log line appears consistently before the transient error

State { reading: Init, writing: KeepAlive, keep_alive: Busy }

Additional Information/Context

No response

Version

├── aws-config v1.1.10
│   ├── aws-credential-types v1.1.8
│   │   ├── aws-smithy-async v1.2.1
│   │   ├── aws-smithy-runtime-api v1.3.0
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-types v1.1.8
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-runtime v1.1.9
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-sigv4 v1.2.0
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-http v0.60.7
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.9
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-sdk-sso v1.19.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.9 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.2.1
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-http v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.9 (*)
│   ├── aws-sdk-ssooidc v1.19.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.9 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-runtime v1.2.1 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.9 (*)
│   ├── aws-sdk-sts v1.19.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.9 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.2.1 (*)
│   │   ├── aws-smithy-runtime-api v1.3.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-xml v0.60.7
│   │   ├── aws-types v1.1.9 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.2.1 (*)
│   ├── aws-smithy-runtime-api v1.3.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-types v1.1.9 (*)
├── aws-credential-types v1.1.8 (*)
├── aws-lc-rs v1.6.4
│   ├── aws-lc-sys v0.14.1
├── aws-sdk-s3 v1.14.0
│   ├── aws-credential-types v1.1.8 (*)
│   ├── aws-runtime v1.1.9 (*)
│   ├── aws-sigv4 v1.2.0 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-checksums v0.60.7
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-eventstream v0.60.4 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.2.1 (*)
│   ├── aws-smithy-runtime-api v1.3.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-xml v0.60.7 (*)
│   ├── aws-types v1.1.9 (*)
├── aws-sdk-sts v1.19.0 (*)
├── aws-smithy-async v1.2.1 (*)
├── aws-smithy-runtime-api v1.3.0 (*)
├── aws-types v1.1.9 (*)
├── aws-config v1.1.10 (*)
├── aws-sdk-s3 v1.14.0 (*)
├── aws-sdk-sts v1.19.0 (*)
├── aws-types v1.1.9 (*)


### Environment details (OS name and version, etc.)

Linux 5.10.210-178.852.amzn2int.x86_64 #1 SMP Tue Feb 27 17:09:26 UTC 2024 x86_64 GNU/Linux

### Logs

For complete tracing logs, please see internal thread: https://amzn-aws.slack.com/archives/C0188A52Z7X/p1712868708040879?thread_ts=1710278104.906559&cid=C0188A52Z7X 
@xizhem xizhem added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 16, 2024
@xizhem xizhem changed the title S3 Connection timeout setting causing issue Upticks of S3 timeout errors after SDK upgrade Apr 16, 2024
@Velfi Velfi removed the needs-triage This issue or PR still needs to be triaged. label Apr 17, 2024
@Velfi
Copy link
Contributor

Velfi commented Apr 17, 2024

Hey @xizhem, thanks for submitting this issue. I've added it to our backlog.

@Velfi Velfi added the p1 This is a high priority issue label Apr 17, 2024
@ysaito1001
Copy link
Collaborator

I found that connection created by SDK client to S3 does not have keep alive header

To clarify, the SDK does not alter the keep-alive in the header, so this must've been happening somewhere else.

@xizhem
Copy link
Author

xizhem commented Apr 17, 2024

Does SDK uses HTTP1 or 2? In hyper documentation, keep-alive looks like enabled by default? https://docs.rs/hyper/latest/hyper/server/conn/http1/struct.Builder.html#method.keep_alive

@ysaito1001
Copy link
Collaborator

SDK uses hyper defaults, which is HTTP/1.1 and negotiates to HTTP/2 if the servers wishes to do so.

keep-alive looks like enabled by default?

It is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p1 This is a high priority issue
Projects
None yet
Development

No branches or pull requests

3 participants