Add context deadline exceeded to default retryable errors #3086

Open
tdharris opened this issue Apr 22, 2024 · 1 comment
Labels: enhancement, terragrunt

tdharris commented Apr 22, 2024

Describe the solution you'd like

Consider adding `context deadline exceeded` to DEFAULT_RETRYABLE_ERRORS. For additional information, see the Auto-Retry documentation.

Describe alternatives you've considered

We've considered adding a retryable_errors block to the root terragrunt.hcl config using the auto-retry feature, but have reservations for two reasons:

  1. Others are likely to encounter this error too, and it may already be a known one, so including it in the terragrunt defaults would benefit everyone.

  2. A custom retryable_errors list replaces DEFAULT_RETRYABLE_ERRORS in terragrunt rather than appending to it. Overwriting the defaults completely feels awkward, since our copy could fall out of sync with future updates to them, but it is likely the best path forward for us in the interim.

Additional context

I'm not sure if anyone else has had similar experiences, but this appears to be a transient error that we've experienced intermittently with at least the following resources across various providers: helm_release, kubernetes_namespace, aws_s3_bucket. It occurs after the timeout for that resource, sometimes on apply, other times on destroy, and increasing the timeout doesn't have any effect. Simply retrying the invocation has been successful.

This context deadline exceeded error is common in Go when a connection's context times out before an action completes. I've found related issues by searching provider repos and HashiCorp resources, and the consensus on possible causes seems to be network latency, firewall rules, resource contention, slow I/O, etc. For example, see Why am I seeing `context deadline exceeded` errors – HashiCorp Help Center.

tdharris added the enhancement label on Apr 22, 2024
ZachGoldberg added the terragrunt label on Apr 25, 2024 (with Linear)
tdharris commented Apr 25, 2024

I've chased down another scenario where this occurs and thought it might be useful to share for more context.

Situation

A particular deployment fails with a context deadline exceeded error: the Pod restarts repeatedly and never reaches a healthy state.

Cause

  1. The app can't fetch a certificate from an upstream service due to a No address associated with hostname error.

  2. external-dns fails to update the related DNS records due to AWS Route53 throttling.

  3. After several retries, the DNS updates eventually succeed, but by then the deployment has already failed, requiring a manual re-apply.

Auto-Retry

We've observed that terragrunt successfully retries on this error once the following block is added to our terragrunt.hcl:

retryable_errors = [
  # Terragrunt Auto-Retry: https://terragrunt.gruntwork.io/docs/features/auto-retry/
  # Terragrunt DEFAULT_RETRYABLE_ERRORS: https://github.com/gruntwork-io/terragrunt/blob/master/options/auto_retry_options.go
  "(?s).*Failed to load state.*tcp.*timeout.*",
  "(?s).*Failed to load backend.*TLS handshake timeout.*",
  "(?s).*Creating metric alarm failed.*request to update this alarm is in progress.*",
  "(?s).*Error installing provider.*TLS handshake timeout.*",
  "(?s).*Error configuring the backend.*TLS handshake timeout.*",
  "(?s).*Error installing provider.*tcp.*timeout.*",
  "(?s).*Error installing provider.*tcp.*connection reset by peer.*",
  "NoSuchBucket: The specified bucket does not exist",
  "(?s).*Error creating SSM parameter: TooManyUpdates:.*",
  "(?s).*app.terraform.io.*: 429 Too Many Requests.*",
  "(?s).*ssh_exchange_identification.*Connection closed by remote host.*",
  "(?s).*Client\\.Timeout exceeded while awaiting headers.*",
  "(?s).*Could not download module.*The requested URL returned error: 429.*",
  "(?s).*net/http: TLS.*handshake timeout.*",
  # Custom Retryable Errors
  # context deadline exceeded - https://github.com/gruntwork-io/terragrunt/issues/3086
  "(?s).*context deadline exceeded.*",
]
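
In a scenario like the one above, where the underlying problem (DNS throttling) clears only after some time, the retry count and the wait between attempts may matter as much as the pattern itself. As a rough sketch, Terragrunt's auto-retry feature also accepts retry_max_attempts and retry_sleep_interval_sec next to retryable_errors in terragrunt.hcl. The values below are illustrative assumptions, not recommendations, and would need tuning for a given environment:

```hcl
# Sketch only: the numbers are assumptions chosen for illustration.
# Both attributes sit at the top level of terragrunt.hcl, alongside retryable_errors.

# Maximum number of times a command is attempted when its output matches a retryable error.
retry_max_attempts = 5

# Seconds to wait between attempts, giving slow dependencies (e.g. DNS updates) time to settle.
retry_sleep_interval_sec = 60
```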

Also, we were hoping for an additive input that would just append to the existing default terragrunt retryable_errors.
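
Until something additive exists, one possible way to at least keep the copied defaults visibly separate from the custom entries is to hold them in two locals and join them with concat. This is only a sketch: the local names default_retryable_errors and custom_retryable_errors are hypothetical, the defaults list is abbreviated to two entries, and it still has to be kept in sync with Terragrunt's DEFAULT_RETRYABLE_ERRORS by hand.

```hcl
locals {
  # Hypothetical name: a hand-maintained copy of Terragrunt's DEFAULT_RETRYABLE_ERRORS.
  # Abbreviated here; in practice this would hold the full list of defaults.
  default_retryable_errors = [
    "(?s).*Failed to load state.*tcp.*timeout.*",
    "(?s).*net/http: TLS.*handshake timeout.*",
  ]

  # Hypothetical name: project-specific additions only.
  custom_retryable_errors = [
    # context deadline exceeded - https://github.com/gruntwork-io/terragrunt/issues/3086
    "(?s).*context deadline exceeded.*",
  ]
}

# The final list still overrides the built-in defaults; concat only makes the two parts explicit.
retryable_errors = concat(local.default_retryable_errors, local.custom_retryable_errors)
```

With this layout, updating the copied defaults after a Terragrunt upgrade touches only one local, and the custom patterns stay untouched.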
