Describe the bug
Recently, after upgrading the SDK to aws-s3-sdk (>1.14), we noticed an upward trend of S3 GET timeout errors in production. We have already ruled out the issue from #1118. In our case, the error message is TransientError due to hitting the attempt timeout.
There is a correlation between the connection timeout setting and the number of errors we see.
There is also a correlation between the load we send to S3 and the number of errors.
Expected Behavior
Our timeout settings are as follows:
- connection timeout: default (3.1s)
- attempt timeout: 800ms
- operation timeout: 2.6s
- total attempts: 3
We expect the S3 request to succeed within this 2.6s.
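For reference, these settings correspond to a client configured roughly like this. This is a sketch using `aws_config`'s re-exported `TimeoutConfig` and `RetryConfig` builders; exact module paths may vary slightly across SDK versions:

```rust
use std::time::Duration;

use aws_config::retry::RetryConfig;
use aws_config::timeout::TimeoutConfig;
use aws_config::BehaviorVersion;

#[tokio::main]
async fn main() {
    // Sketch of the timeout/retry settings described above.
    let timeouts = TimeoutConfig::builder()
        .connect_timeout(Duration::from_millis(3100)) // SDK default connect timeout
        .operation_attempt_timeout(Duration::from_millis(800))
        .operation_timeout(Duration::from_millis(2600))
        .build();

    let config = aws_config::defaults(BehaviorVersion::latest())
        .timeout_config(timeouts)
        .retry_config(RetryConfig::standard().with_max_attempts(3))
        .load()
        .await;
    let _client = aws_sdk_s3::Client::new(&config);
}
```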
Current Behavior
The SDK did retry 3 times, as we verified. But we still time out after all 3 attempts are exhausted.
2024-04-16T22:08:15.844Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #1 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 47.57774ms
...(a ~47.58ms backoff, then another 800ms attempt)
2024-04-16T22:08:16.693Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #2 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 34.98639ms
...(a ~34.99ms backoff, then another 800ms attempt)
2024-04-16T22:08:17.530Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=3 max_attempts=3
2024-04-16T22:08:17.530Z ERROR {redacted} GetObject request failed for key "{redacted}" with error TimeoutError(TimeoutError { source: MaybeTimeoutError { kind: OperationAttempt, duration: 800ms } })
2024-04-16T22:08:17.530Z ERROR {redacted} Get Object request failed: "request has timed out"
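Note that this schedule only barely fits inside the operation timeout: three 800ms attempts plus the two logged backoffs total roughly 2.48s, just under 2.6s, which is consistent with the attempt timeout (not the operation timeout) being the error we see. A worked check in plain Rust, using the backoff values from the DEBUG lines above:

```rust
// Worst-case wall-clock time for the retry schedule seen in the logs:
// three 800ms attempts separated by the logged backoff delays.
fn total_wall_time_ms(attempt_timeout_ms: f64, backoffs_ms: &[f64]) -> f64 {
    // Each attempt may run up to the attempt timeout; a backoff sits
    // between consecutive attempts, so attempts = backoffs + 1.
    let attempts = (backoffs_ms.len() + 1) as f64;
    attempts * attempt_timeout_ms + backoffs_ms.iter().sum::<f64>()
}

fn main() {
    // Backoff delays taken from the DEBUG log lines above.
    let total = total_wall_time_ms(800.0, &[47.57774, 34.98639]);
    println!("{total:.2}"); // ~2482.56ms, just under the 2600ms operation timeout
}
```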
The Smithy orchestrator typically emits a halting line before the TransientError. We couldn't tell whether the connection was established successfully within these 800ms, as there is no shared identifier between the hyper logs and the SDK logs.
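One way to at least line the two log streams up is to route both through a single subscriber at DEBUG level so their timestamps can be compared directly. A minimal sketch, assuming `tracing-subscriber` with the `env-filter` feature enabled:

```rust
// Sketch: emit hyper's connection-level logs and the smithy runtime's
// retry-level logs through one subscriber with shared timestamps.
fn init_logging() {
    tracing_subscriber::fmt()
        .with_env_filter("aws_smithy_runtime=debug,hyper=debug")
        .init();
}
```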
Reproduction Steps
We ran a load test to benchmark the S3 client and found a correlation between the connection timeout and the timeout errors. The load test runs at a maximum of 200 concurrent S3 GETs.
| Run | Connection timeout | Total errors | Total transactions | Successful S3 transactions |
|-----|--------------------|--------------|--------------------|----------------------------|
| 1   | 50ms               | 4697         | 4896               | 199                        |
| 2   | 400ms              | 3969         | 9718               | 5701                       |
| 3   | 1500ms             | 3000         | 42531              | 38419                      |
| 4   | 2000ms             | 3201         | 33302              | 29478                      |
| 5   | 3000ms             | 0            | 128115             | 118430                     |
Possible Solution
Using the Linux ss command to observe socket state, I found that the connections created by the SDK client to S3 have no keep-alive set. Note that 3.5.87.213:https is the S3 host, as I checked from here.
I suspect this issue is due to inefficient connection reuse down at the hyper layer, i.e. previously active connections are closed by S3 at unpredictable times due to the lack of keep-alive. But I could be wrong.
We also observe that this log line appears consistently before the transient error:
State { reading: Init, writing: KeepAlive, keep_alive: Busy }
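If idle pooled connections are being dropped silently on the S3 side, enabling TCP keepalive on the connector is one possible mitigation. A minimal sketch, assuming hyper 0.14's `HttpConnector::set_keepalive` and the `hyper_014` adapter in `aws_smithy_runtime`; the TLS layer is omitted here and the exact wiring may differ across SDK versions:

```rust
use std::time::Duration;

use aws_smithy_runtime::client::http::hyper_014::HyperClientBuilder;

#[tokio::main]
async fn main() {
    // Hypothetical mitigation sketch: set a TCP keepalive on hyper's
    // connector so idle pooled connections are probed rather than left
    // to be closed silently by the remote end.
    let mut connector = hyper::client::HttpConnector::new();
    connector.set_keepalive(Some(Duration::from_secs(15)));
    connector.enforce_http(false); // TLS wrapping omitted in this sketch

    let http_client = HyperClientBuilder::new().build(connector);

    let config = aws_config::defaults(aws_config::BehaviorVersion::latest())
        .http_client(http_client)
        .load()
        .await;
    let _s3 = aws_sdk_s3::Client::new(&config);
}
```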
### Environment details (OS name and version, etc.)
Linux 5.10.210-178.852.amzn2int.x86_64 #1 SMP Tue Feb 27 17:09:26 UTC 2024 x86_64 GNU/Linux
### Logs
For complete tracing logs, please see internal thread: https://amzn-aws.slack.com/archives/C0188A52Z7X/p1712868708040879?thread_ts=1710278104.906559&cid=C0188A52Z7X