Add context deadline exceeded to default retryable errors #3086

tdharris · 2024-04-22T20:22:16Z

Describe the solution you'd like

Consider adding `context deadline exceeded` to DEFAULT_RETRYABLE_ERRORS. For background, see the Auto-Retry documentation.

Describe alternatives you've considered

We've considered adding a retryable_errors block to the root terragrunt.hcl config using the auto-retry feature, but we have reservations for two reasons:

Others are likely to encounter this error, and it appears to be a known one, so including it in the Terragrunt defaults would benefit everyone.
A custom retryable_errors list isn't appended to Terragrunt's DEFAULT_RETRYABLE_ERRORS; it replaces them. Overwriting the defaults completely feels awkward and risks falling out of sync with future updates to them, although it is likely the best path forward for us in the interim.

Additional context

I'm not sure if anyone else has had similar experiences, but this appears to be a transient error that we've experienced intermittently with at least the following resources across various providers: helm_release, kubernetes_namespace, aws_s3_bucket. It occurs after the timeout for that resource, sometimes on apply, other times on destroy, and increasing the timeout doesn't have any effect. Simply retrying the invocation has been successful.

This `context deadline exceeded` error is common in Go when a context's deadline expires before an operation completes. I've found related issues by searching provider repos and HashiCorp resources, and the consensus on possible causes seems to be network latency, firewall rules, resource contention, slow I/O, etc. For example, see "Why am I seeing `context deadline exceeded` errors" in the HashiCorp Help Center.


tdharris · 2024-04-25T21:37:25Z

I've chased down another scenario where this occurs and thought it might be useful to share for more context.

Situation

A particular deployment fails with a context deadline exceeded error: the Pod restarts repeatedly and never reaches a healthy state.

Cause

The app can't fetch a certificate from an upstream service because of a "No address associated with hostname" error.
external-dns fails to update the related DNS records due to AWS Route53 throttling.
After several retries, the DNS updates eventually succeed, but by then the deployment has already failed and requires a manual re-apply.

Auto-Retry

We've observed that Terragrunt successfully retries on this error after we add the following block to our terragrunt.hcl:

retryable_errors = [
  # Terragrunt Auto-Retry: https://terragrunt.gruntwork.io/docs/features/auto-retry/
  # Terragrunt DEFAULT_RETRYABLE_ERRORS: https://github.com/gruntwork-io/terragrunt/blob/master/options/auto_retry_options.go
  "(?s).*Failed to load state.*tcp.*timeout.*",
  "(?s).*Failed to load backend.*TLS handshake timeout.*",
  "(?s).*Creating metric alarm failed.*request to update this alarm is in progress.*",
  "(?s).*Error installing provider.*TLS handshake timeout.*",
  "(?s).*Error configuring the backend.*TLS handshake timeout.*",
  "(?s).*Error installing provider.*tcp.*timeout.*",
  "(?s).*Error installing provider.*tcp.*connection reset by peer.*",
  "NoSuchBucket: The specified bucket does not exist",
  "(?s).*Error creating SSM parameter: TooManyUpdates:.*",
  "(?s).*app.terraform.io.*: 429 Too Many Requests.*",
  "(?s).*ssh_exchange_identification.*Connection closed by remote host.*",
  "(?s).*Client\\.Timeout exceeded while awaiting headers.*",
  "(?s).*Could not download module.*The requested URL returned error: 429.*",
  "(?s).*net/http: TLS.*handshake timeout.*",
  # Custom Retryable Errors
  # context deadline exceeded - https://github.com/gruntwork-io/terragrunt/issues/3086
  "(?s).*context deadline exceeded.*",
]
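As a quick sanity check of the custom pattern, the sketch below tests it against sample output with Go's `regexp` package. It assumes the patterns are evaluated with Go regular expression syntax, which is an assumption about Terragrunt's internals. The `isRetryable` helper and the sample error text are hypothetical and are not taken from Terragrunt or from a real log.

```go
package main

import (
	"fmt"
	"regexp"
)

// retryPattern is the custom pattern from the retryable_errors list above.
// The (?s) flag lets "." match newlines, so the pattern can span lines.
const retryPattern = `(?s).*context deadline exceeded.*`

// isRetryable is a hypothetical helper: it reports whether the output
// matches the pattern under Go's regexp syntax.
func isRetryable(pattern, output string) bool {
	return regexp.MustCompile(pattern).MatchString(output)
}

func main() {
	// Made-up multi-line error output, for illustration only.
	output := "Error: creating namespace\n\n" +
		"  with kubernetes_namespace.example,\n" +
		"  context deadline exceeded\n"

	fmt.Println(isRetryable(retryPattern, output))                 // true
	fmt.Println(isRetryable(retryPattern, "Error: access denied")) // false
}
```

The pattern is deliberately broad: any output containing the phrase matches, regardless of which resource or provider produced it.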

We were also hoping for an additive input that appends to Terragrunt's default retryable_errors instead of replacing them.

tdharris added the enhancement New feature or request label Apr 22, 2024

ZachGoldberg assigned denis256 Apr 25, 2024

ZachGoldberg added the terragrunt label Apr 25, 2024 — with Linear
