Standard retry backoff calculation may panic #1133

trueb2 · 2024-04-24T14:48:23Z

Describe the bug

The standard retry calculation may cause panic by attempting to convert an invalid float into a Duration.

Expected Behavior

No panics. The operation should fail with an error gracefully.

Current Behavior

thread 'actix-server worker 7' panicked at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/time.rs:762:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::time::Duration::from_secs_f64::panic_cold_display
   3: <aws_smithy_runtime::client::retries::strategy::standard::StandardRetryStrategy as aws_smithy_runtime_api::client::retries::RetryStrategy>::should_attempt_retry
   4: aws_smithy_runtime::client::orchestrator::try_op::{{closure}}::{{closure}}
   5: <aws_smithy_async::future::timeout::Timeout<T,S> as core::future::future::Future>::poll
   6: <aws_smithy_runtime::client::timeout::MaybeTimeoutFuture<InnerFuture> as core::future::future::Future>::poll
   7: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
   8: aws_smithy_runtime::client::orchestrator::invoke_with_stop_point::{{closure}}
   9: aws_sdk_s3::operation::get_object::builders::GetObjectFluentBuilder::send::{{closure}}

Reproduction Steps

Here is an example retry configuration that eventually hit panics on get object requests.

aws-config = { version = "1.1.3", features = ["behavior-version-latest"] }
aws-sdk-s3 = "1.13.0"

      let config: SdkConfig = aws_config::from_env()
          .region(region)
          .endpoint_url(host.clone())
          .retry_config(
              RetryConfig::standard()
                  .with_initial_backoff(Duration::from_millis(1))
                  .with_max_attempts(100)
                  .with_reconnect_mode(
                      aws_sdk_s3::config::retry::ReconnectMode::ReconnectOnTransientError,
                  ),
          )
          .timeout_config(
              TimeoutConfig::builder()
                  .operation_attempt_timeout(Duration::from_secs(180))
                  .operation_timeout(Duration::from_secs(600))
                  .build(),
          )
          .load()
          .await;

      let s3_client: Client = Client::new(&config);

Possible Solution

Err if the duration is invalid before attempting to create the Duration at

aws-sdk-rust/sdk/aws-smithy-runtime/src/client/retries/strategy/standard.rs

Line 150 in 2f2715b

Ok(Duration::from_secs_f64(backoff).min(retry_cfg.max_backoff()))

Additional Information/Context

Timeouts typically happen sporadically over time and depend on external factors. I noticed the panics after many requests ran happily through the common code paths. It is difficult to exercise these code paths in real world tests because 99.999 ... % of requests succeed within N attempts and a reasonable time window.

Version

├── aws-config v1.1.4
│   ├── aws-credential-types v1.1.4
│   │   ├── aws-smithy-async v1.1.4
│   │   ├── aws-smithy-runtime-api v1.1.4
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4
│   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-runtime v1.1.4
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-sigv4 v1.1.4
│   │   │   ├── aws-credential-types v1.1.4 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   │   ├── aws-smithy-http v0.60.4
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4
│   │   │   ├── aws-credential-types v1.1.4 (*)
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-sdk-sso v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4
│   │   │   └── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4
│   │   │   ├── aws-smithy-async v1.1.4 (*)
│   │   │   ├── aws-smithy-http v0.60.4 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-sdk-ssooidc v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-sdk-sts v1.12.0
│   │   ├── aws-credential-types v1.1.4 (*)
│   │   ├── aws-runtime v1.1.4 (*)
│   │   ├── aws-smithy-async v1.1.4 (*)
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-json v0.60.4 (*)
│   │   ├── aws-smithy-query v0.60.4
│   │   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-runtime v1.1.4 (*)
│   │   ├── aws-smithy-runtime-api v1.1.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   │   ├── aws-smithy-xml v0.60.4
│   │   ├── aws-types v1.1.4 (*)
│   ├── aws-smithy-async v1.1.4 (*)
│   ├── aws-smithy-http v0.60.4 (*)
│   ├── aws-smithy-json v0.60.4 (*)
│   ├── aws-smithy-runtime v1.1.4 (*)
│   ├── aws-smithy-runtime-api v1.1.4 (*)
│   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-types v1.1.4 (*)
├── aws-sdk-s3 v1.14.0
│   ├── aws-credential-types v1.1.4 (*)
│   ├── aws-runtime v1.1.4 (*)
│   ├── aws-sigv4 v1.1.4 (*)
│   ├── aws-smithy-async v1.1.4 (*)
│   ├── aws-smithy-checksums v0.60.4
│   │   ├── aws-smithy-http v0.60.4 (*)
│   │   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-smithy-eventstream v0.60.4 (*)
│   ├── aws-smithy-http v0.60.4 (*)
│   ├── aws-smithy-json v0.60.4 (*)
│   ├── aws-smithy-runtime v1.1.4 (*)
│   ├── aws-smithy-runtime-api v1.1.4 (*)
│   ├── aws-smithy-types v1.1.4 (*)
│   ├── aws-smithy-xml v0.60.4 (*)
│   ├── aws-types v1.1.4 (*)
│   ├── aws-config v1.1.4 (*)
│   ├── aws-sdk-s3 v1.14.0 (*)

Environment details (OS name and version, etc.)

macOS 14.3

Logs

thread 'actix-server worker 7' panicked at /rustc/7cf61ebde7b22796c69757901dd346d0fe70bd97/library/core/src/time.rs:762:23:
can not convert float seconds to Duration: value is either too big or NaN
stack backtrace:
   2: core::time::Duration::from_secs_f64::panic_cold_display
   3: <aws_smithy_runtime::client::retries::strategy::standard::StandardRetryStrategy as aws_smithy_runtime_api::client::retries::RetryStrategy>::should_attempt_retry

The text was updated successfully, but these errors were encountered:

trueb2 · 2024-04-24T15:07:35Z

I looked into this a little more. If too many attempts are made, then the backoff will be f64::MAX from

https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L291

f64::MAX then is used immediately causing the panic. The work around for now is to never specify attempts and initial backoffs that could result in f64::MAX calculated.

jdisanti · 2024-04-24T18:05:08Z

Thank you for filing this issue.

ysaito1001 · 2024-05-09T20:59:20Z

Hi @trueb2, we're working on a fix to address this panic issue. In the meantime, we have a question for your use case.

In general, specifying the number of retry attempts to be 100 does not seem to be a normal workflow. What is the motivation behind this number? Is that out of necessity or are you simply testing the SDK's behavior to see what happens with the number of attempts being 100?

trueb2 · 2024-05-13T14:50:15Z

The use case for this is that there can be network failures or outages that are random but eventually should succeed. So the initial retry duration should be very low and many reattempts should be possible. Choosing the appropriate maximum input values that do not induce panic is opaque, so 100 seems a reasonable number of max attempts given that the max backoff is an input and the total operation timeout can bounded by an absolute maximum duration. Exponential backoff does not make sense after a certain number of backoffs, but the other parameters exposed by the SDK should make it possible to avoid panicking or unreasonable delays.

I did find that the high retry attempt limits were out of necessity as I have hit the maximum backoff limit causing this panic multiple times over the last month. I was not testing the SDK's behavior.

…rom a large float (#3621) ## Motivation and Context Avoids panic when [Duration for exponential backoff](https://github.com/smithy-lang/smithy-rs/blob/main/rust-runtime/aws-smithy-runtime/src/client/retries/strategy/standard.rs#L150) could not be created from a large float. ## Description [Duration::from_secs_f64](https://doc.rust-lang.org/core/time/struct.Duration.html#method.from_secs_f64) may panic. This PR switches to use a fallible sibling `Duration::try_from_secs_f64` to avoid panics. If `Duration::try_from_secs_f64` returns an `Err` we fallback to `max_backoff` for subsequent retries. Furthermore, we learned from internal discussion that jitter also needs to be applied to `max_backoff`. This PR updates `calculate_exponential_backoff` to handle all the said business logic in one place. ## Testing - Added a unit test `should_not_panic_when_exponential_backoff_duration_could_not_be_created` - Manually verified reproduction steps provided [in the original PR](awslabs/aws-sdk-rust#1133) <details> <summary>More details</summary> ``` #[tokio::test] async fn repro_1133() { let config = aws_config::from_env() .region(Region::new("non-existing-region")) // forces retries .retry_config( RetryConfig::standard() .with_initial_backoff(Duration::from_millis(1)) .with_max_attempts(100), ) .timeout_config( TimeoutConfig::builder() .operation_attempt_timeout(Duration::from_secs(180)) .operation_timeout(Duration::from_secs(600)) .build(), ) .load() .await; let client: Client = Client::new(&config); let res = client .list_objects_v2() .bucket("bucket-name-does-not-matter") .send() .await; dbg!(res); } ``` Without changes in this PR: ``` ---- repro_1133 stdout ---- thread 'repro_1133' panicked at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/time.rs:741:23: can not convert float seconds to Duration: value is either too big or NaN stack backtrace: ... failures: repro_1133 test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 9 filtered out; finished in 338.18s ``` With changes in this PR: ``` // no panic ---- repro_1133 stdout ---- res = Err( TimeoutError( TimeoutError { source: MaybeTimeoutError { kind: Operation, duration: 600s, }, }, ), ) ``` </details> ## Checklist  - [x] I have updated `CHANGELOG.next.toml` if I made changes to the smithy-rs codegen or runtime crates - [x] I have updated `CHANGELOG.next.toml` if I made changes to the AWS SDK, generated SDK code, or SDK runtime crates ## Appendix <details> <summary>runtime-versioner bug fix</summary> This PR also fixes a limitation in `runtime-versioner audit`. I included the fix in the PR because the issue occurred with special conditions, and we don't get to reproduce it every day. The way the issue manifests is this. 1. We have a branch from the main whose latest release tag at the time was `release-2024-04-30` 2. The main has moved ahead and a new `smithy-rs` release has been made `release-2024-05-08` 3. We perform `git merge main`, pre-commit hooks run, and we then get `audit` failures from `runtime-versioner`: ``` 2024-05-10T16:54:36.434968Z WARN runtime_versioner::command::audit: there are newer releases since 'release-2024-04-30' aws-config was changed and version bumped, but the new version number (1.4.0) has already been published to crates.io. Choose a new version number. aws-runtime was changed and version bumped, but the new version number (1.2.2) has already been published to crates.io. Choose a new version number. aws-smithy-runtime-api was changed and version bumped, but the new version number (1.6.0) has already been published to crates.io. Choose a new version number. aws-smithy-types was changed and version bumped, but the new version number (1.1.9) has already been published to crates.io. Choose a new version number. aws-types was changed and version bumped, but the new version number (1.2.1) has already been published to crates.io. Choose a new version number. Error: there are audit failures in the runtime crates ``` This happens because when the latest `main` is being merged to our branch, `runtime-versioner audit` should use `previous_release_tag` `release-2024-05-08` to perform audit but [pre-commit hooks run the tool](https://github.com/smithy-lang/smithy-rs/blob/main/.pre-commit-hooks/runtime-versioner.sh#L8) using the latest previous release tag that can be traced back from `HEAD` of our branch, which is `release-2024-04-30`. Hence the error. The fix adds an environment variable `SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG` to specify a previous release tag override, in addition to a `--previous-release-tag` command-line argument. Furthermore, the fix has relaxed certain checks in `audit`. Taking our example for instance, when `HEAD` is now behind `release-2024-05-08`, it's OK to fail even if `release-2024-05-08` is not the ancestor of `HEAD` (as stated above, `git merge-base --is-ancestor` does not know that while main is being merged) as long as `release-2024-04-28` (the latest release seen from `HEAD`) is the ancestor of `release-2024-05-08`. ``` release-2024-04-28 release-2024-05-08 │ │ ────────────┼───────────────┬────────────────┼───────x─────► main │ │HEAD │ x │ of x │branch x │ x │ x │ xxxxxxxxxxx │ x │ x git merge main │ x │ ▼ │ └──────────────────► feature branch ``` To use the fix, set the environment variable with the new release tag and perform `git merge main`: ``` ➜ smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ export SMITHY_RS_RUNTIME_VERSIONER_AUDIT_PREVIOUS_RELEASE_TAG=release-2024-05-08 ➜ smithy-rs git:(ysaito/fix-panic-in-exp-backoff) ✗ git merge main ... Running `/Users/awsaito/src/smithy-rs/tools/target/debug/runtime-versioner audit` 2024-05-10T19:32:26.665578Z WARN runtime_versioner::tag: expected previous release to be 'release-2024-04-30', but 'release-2024-05-08' was specified. Proceeding with 'release-2024-05-08'. SUCCESS ``` </summary> ---- _By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice._

ysaito1001 · 2024-05-13T17:45:44Z

Thank you for your response. smithy-lang/smithy-rs#3621 will fix a panic in a way that it should still allow you to retry the desired number of times.

Exponential backoff does not make sense after a certain number of backoffs,

Correct, once an exponential backoff duration becomes large enough to exceed MAX_BACKOFF (by default 20 seconds), the SDK will use that duration for subsequent retries (jitter is applied to MAX_BACKOFF though).

trueb2 · 2024-05-14T20:33:20Z

@ysaito1001 Thanks for your thorough resolution!

ysaito1001 · 2024-05-22T20:48:17Z

The fix is included in release-2024-05-22.

github-actions · 2024-05-22T20:48:34Z

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

trueb2 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Apr 24, 2024

jdisanti added p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Apr 24, 2024

ysaito1001 mentioned this issue May 1, 2024

Fix panic when failing to create Duration for exponential backoff from a large float smithy-lang/smithy-rs#3621

Merged

2 tasks

ysaito1001 self-assigned this May 1, 2024

ysaito1001 added the pending-release This issue will be fixed by an approved PR that hasn't been released yet. label May 13, 2024

ysaito1001 closed this as completed May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Standard retry backoff calculation may panic #1133

Standard retry backoff calculation may panic #1133

trueb2 commented Apr 24, 2024 •

edited

trueb2 commented Apr 24, 2024

jdisanti commented Apr 24, 2024

ysaito1001 commented May 9, 2024

trueb2 commented May 13, 2024 •

edited

ysaito1001 commented May 13, 2024 •

edited

trueb2 commented May 14, 2024

ysaito1001 commented May 22, 2024

github-actions bot commented May 22, 2024

Standard retry backoff calculation may panic #1133

Standard retry backoff calculation may panic #1133

Comments

trueb2 commented Apr 24, 2024 • edited

Describe the bug

Expected Behavior

Current Behavior

Reproduction Steps

Possible Solution

Additional Information/Context

Version

Environment details (OS name and version, etc.)

Logs

trueb2 commented Apr 24, 2024

jdisanti commented Apr 24, 2024

ysaito1001 commented May 9, 2024

trueb2 commented May 13, 2024 • edited

ysaito1001 commented May 13, 2024 • edited

trueb2 commented May 14, 2024

ysaito1001 commented May 22, 2024

github-actions bot commented May 22, 2024

trueb2 commented Apr 24, 2024 •

edited

trueb2 commented May 13, 2024 •

edited

ysaito1001 commented May 13, 2024 •

edited