Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable StalledStreamProtection for S3 CopyObject #1141

Open
mlartz opened this issue May 8, 2024 · 13 comments
Open

Disable StalledStreamProtection for S3 CopyObject #1141

mlartz opened this issue May 8, 2024 · 13 comments
Labels
bug This issue is a bug. p1 This is a high priority issue

Comments

@mlartz
Copy link

mlartz commented May 8, 2024

Describe the bug

In running a S3 CopyObject of a fairly large object (2+ GB), which takes some time (30s+) to complete, the Rust SDK runs into the following ThroughputBelowMinimum error, which gets retried a couple of times (3 total attempts) and always fails:

{"level":"DEBUG","fields":{"message":"entering minimum throughput grace period"}, ...
{"level":"DEBUG","fields":{"message":"current throughput: 0 B/s is below minimum: 1 B/s"} ...
...
{"level":"ERROR","fields":{"message":"ResponseError(ResponseError { source: ThroughputBelowMinimum ...

The S3 CopyObject eventually succeeds as I see the copied file in the new location, but the Rust SDK keeps "failing".
I was thinking that this was a timeout, but also see:

{"level":"DEBUG","fields":{"message":"timeout settings for this operation: TimeoutConfig { connect_timeout: Set(3.1s), read_timeout: Disabled, operation_timeout: Disabled, operation_attempt_timeout: Disabled }"}

In talking to the experts, this is a result of StalledStreamProtection, which doesn't seem valid for potentially long-running "silent" operations like S3's CopyObject. Could you disable this protection by default for (at least) S3's CopyObject?

Expected Behavior

Long-running AWS API calls (like S3 CopyObject) do not get terminated unless associated with a timeout configuration.

Current Behavior

Long-running S3 CopyObject API call fails due to StalledStreamProtection

Reproduction Steps

Execute a long-running API call, such as a S3 CopyObject on a large file

Possible Solution

Disable StalledStreamProtection for potentially long-running (and "silent") AWS API calls.

Additional Information/Context

No response

Version

── aws-config v1.1.9
│   ├── aws-credential-types v1.1.8
│   │   ├── aws-smithy-async v1.2.0
│   │   ├── aws-smithy-runtime-api v1.2.0
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-sigv4 v1.2.0
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-http v0.60.7
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8
│   │   │   ├── aws-credential-types v1.1.8 (*)
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-sdk-sso v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8
│   │   │   ├── aws-smithy-async v1.2.0 (*)
│   │   │   ├── aws-smithy-http v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-ssooidc v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-sdk-sts v1.18.0
│   │   ├── aws-credential-types v1.1.8 (*)
│   │   ├── aws-runtime v1.1.8 (*)
│   │   ├── aws-smithy-async v1.2.0 (*)
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-json v0.60.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-runtime v1.1.8 (*)
│   │   ├── aws-smithy-runtime-api v1.2.0 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   │   ├── aws-smithy-xml v0.60.7
│   │   ├── aws-types v1.1.8 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-types v1.1.8 (*)
├── aws-sdk-s3 v1.21.0
│   ├── aws-credential-types v1.1.8 (*)
│   ├── aws-runtime v1.1.8 (*)
│   ├── aws-sigv4 v1.2.0 (*)
│   ├── aws-smithy-async v1.2.0 (*)
│   ├── aws-smithy-checksums v0.60.7
│   │   ├── aws-smithy-http v0.60.7 (*)
│   │   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-eventstream v0.60.4 (*)
│   ├── aws-smithy-http v0.60.7 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.1.8 (*)
│   ├── aws-smithy-runtime-api v1.2.0 (*)
│   ├── aws-smithy-types v1.1.8 (*)
│   ├── aws-smithy-xml v0.60.7 (*)
│   ├── aws-types v1.1.8 (*)
├── aws-smithy-types v1.1.8 (*)
├── aws-smithy-types-convert v0.60.8
│   ├── aws-smithy-types v1.1.8 (*)

Environment details (OS name and version, etc.)

Amazon Linux 2, AWS Lambda

Logs

No response

@mlartz mlartz added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 8, 2024
@xanather
Copy link

xanather commented May 9, 2024

PutObject also has this issue.

Wasted hours on this how did it get through the tests?

Increase priority of this. It occurs over a high latency connection with a large payload.

@xanather
Copy link

xanather commented May 9, 2024

Offender:

config = config.upload_enabled(false);

Seems related to upload only.

@xanather
Copy link

xanather commented May 9, 2024

Workaround:

let config = aws_config::defaults(BehaviorVersion::latest())
        .stalled_stream_protection(StalledStreamProtectionConfig::disabled()).load().await;

I don't like this feature anyway, so ill keep it off and let default timeouts do their job. The consequence of this is that it dramatically increases the chance the upload succeeds but the client still reports a failure.

@Velfi
Copy link
Contributor

Velfi commented May 10, 2024

I'm sorry to hear that this feature hasn't been helpful for you.

Timeouts and stalled stream protection both serve different purposes. We chose to enable both by default because we believe that it provides a better user experience for the majority of users.

Timeouts and stalled stream protection is generally a benefit. However, you've run into the one case where both may not be helpful: requests that are expected to be very long-running. Timeouts set a timer that may be too short for the request you want to send. Stalled stream protection is better for long-running requests because it's only checking that the transfer is still ongoing, but it will close a connection that sees no activity for the grace period (five seconds by default.) Instead of disabling stalled stream protection, I'd advise you to fiddle with the grace_period setting to find a value that's appropriate for your workload.

One more thing: I've written a guide on configuring stalled streams that's available here. Let me know if you'd like me to add anything to it.

@Velfi Velfi removed the needs-triage This issue or PR still needs to be triaged. label May 10, 2024
@Velfi
Copy link
Contributor

Velfi commented May 10, 2024

I think the best thing to do in this case is to automatically disable stalled stream protection for any API that's modeled without the @httpPayload trait.

@Velfi Velfi added the p2 This is a standard priority issue label May 10, 2024
@Velfi
Copy link
Contributor

Velfi commented May 10, 2024

I've added this issue to our backlog.

@xanather
Copy link

I think the implementation is broken, if it was implemented correctly then it would probably make sense.

Looking at

looks like its only tracking the transfer of bytes downstream, not upstream.

This error may occur on any upload that takes longer than the grace period since its not counting those bytes...?

@aajtodd
Copy link
Contributor

aajtodd commented May 13, 2024

Upload throughput is here.

@xanather
Copy link

xanather commented May 14, 2024

What about during a long running upstream request and there were no bytes downstream for a while this is occurring?

@xanather
Copy link

Again, this really isn't something that should be done at the application layer, do other language AWS SDK's support it? TCP keep alive at the transport layer is more applicable and let application layer do its normal timeout.

@xanather
Copy link

xanather commented May 14, 2024

This should be P1. It breaks all upload API's that take longer than the grace periods - always.

@aajtodd aajtodd added p1 This is a high priority issue and removed p2 This is a standard priority issue labels May 14, 2024
github-merge-queue bot pushed a commit to smithy-lang/smithy-rs that referenced this issue May 17, 2024
…lete (#3644)

## Motivation and Context
<!--- Why is this change required? What problem does it solve? -->
<!--- If it fixes an open issue, please link to the issue here -->
* awslabs/aws-sdk-rust#1141
* awslabs/aws-sdk-rust#1146
* awslabs/aws-sdk-rust#1148


## Description
* Disables stalled stream upload protection for requests with an
empty/zero length body.
* Disables stalled stream upload throughput checking once the request
body has been read and handed off to the HTTP layer.


## Testing
Additional integration tests added covering empty bodies and completed
uploads.

Tested SQS issue against latest runtime and can see it works now. The S3
`CopyObject` issue is related to downloads and will need a different
solution.

## Checklist
<!--- If a checkbox below is not applicable, then please DELETE it
rather than leaving it unchecked -->
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the
smithy-rs codegen or runtime crates
- [x] I have updated `CHANGELOG.next.toml` if I made changes to the AWS
SDK, generated SDK code, or SDK runtime crates

----

_By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice._

---------

Co-authored-by: Zelda Hessler <zhessler@amazon.com>
Co-authored-by: ysaito1001 <awsaito@amazon.com>
@Velfi
Copy link
Contributor

Velfi commented May 21, 2024

We just released the runtime crate updates: https://github.com/smithy-lang/smithy-rs/releases/tag/release-2024-05-21
Once a release for the SDK crates runs later today (in about two hours) they'll include this fix.

@ysaito1001
Copy link
Collaborator

Let us know whether today's release has improved the behavior of your use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. p1 This is a high priority issue
Projects
None yet
Development

No branches or pull requests

5 participants