
Timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s) #1112

Open
t0yv0 opened this issue Oct 23, 2023 · 7 comments
Labels: impact/reliability, kind/bug, service/ecs

Comments

@t0yv0 (Member) commented Oct 23, 2023

What happened?

I'm receiving this error a lot when trying to test examples locally:


Diagnostics:
  pulumi:pulumi:Stack (ecs-node-p-it-antons-mac-nodejs-69324370):
    error: update failed

  aws:ecs:Service (my-service):
    error: 1 error occurred:
        * creating urn:pulumi:p-it-antons-mac-nodejs-69324370::ecs-node::awsx:ecs:FargateService$aws:ecs/service:Service::my-service: 1 error occurred:
        * waiting for ECS service (arn:aws:ecs:us-west-2:616138583583:service/cluster-e6c5e93/my-service-3c9c1de) to reach steady state after creation: timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)

Outputs:
    url: "nginx-lb-f66cb5d-2145136225.us-west-2.elb.amazonaws.com"

Resources:
    + 33 created

Duration: 22m32s

This timeout happens when trying to record example baseline behavior, say for ecs/nodejs/ on AWS 5.42.0 and AWSX 1.x.x, but also when running examples on the latest versions of the dependencies.

I have seen this affect the aws:ecs/service:Service resource through FargateService and other component resource wrappers.

For users affected by this issue, the current workaround per @danielrbradley is to apply a transformation that increases the custom timeout for the ECS service, see #1118 for a fully worked out example.
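For reference, a minimal sketch of that kind of transformation is shown below (illustrative only, not the exact code from #1118): it registers a stack-wide transformation that raises the create timeout on every aws:ecs/service:Service. The 30-minute value is an assumption, not a recommendation.

import * as pulumi from "@pulumi/pulumi";

// Illustrative sketch: bump the create timeout on every aws:ecs/service:Service
// in the stack. The "30m" value is an assumption.
pulumi.runtime.registerStackTransformation((args) => {
    if (args.type === "aws:ecs/service:Service") {
        return {
            props: args.props,
            opts: pulumi.mergeOptions(args.opts, {
                customTimeouts: { create: "30m" },
            }),
        };
    }
    return undefined;
});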

Please upvote this issue if this affects your workflow, and we can consider increasing default timeouts in the AWS provider.

Example

N/A

Output of pulumi about

CLI          
Version      3.86.0
Go Version   go1.21.1
Go Compiler  gc

Host     
OS       darwin
Version  14.0
Arch     x86_64

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/t0yv0
User           t0yv0
Organizations  t0yv0, pulumi
Token type     personal

Pulumi locates its logs in /var/folders/gk/cchgxh512m72f_dmkcc3d09h0000gp/T/ by default

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

t0yv0 added the kind/bug and needs-triage labels on Oct 23, 2023
@t0yv0 (Member, Author) commented Oct 24, 2023

Possibly related:

#300
#391
#354

@t0yv0 (Member, Author) commented Oct 24, 2023

Following the links, I've found this prior art:

pulumi/terraform-provider-aws#59

One possibility here is to raise default timeouts again.

@t0yv0 (Member, Author) commented Oct 24, 2023

I've found that pulumi/terraform-provider-aws#59 has some prior art on editing default timeouts. Perhaps we could increase the values found in https://github.com/hashicorp/terraform-provider-aws/blob/master/internal/service/ecs/service.go#L50

t0yv0 removed the needs-triage label on Oct 24, 2023
@t0yv0 (Member, Author) commented Oct 24, 2023

I'm leaving this in the tracker to accumulate upvotes; if it gains traction, we can circle back to pulumi-aws and increase the default timeouts by patching upstream. For the moment, issues with flaky tests and examples in this repository can be resolved by applying the custom-timeout transformation suggested by @danielrbradley.

t0yv0 added the needs-triage label on Oct 24, 2023
mikhailshilkov added the impact/reliability label and removed the needs-triage label on Oct 26, 2023
@thomas11 (Contributor) commented Nov 1, 2023

Looking into this further in the AWS console, I realized that this isn't really a timeout issue. The container fails to come up due to a configuration issue, and the provider gives up waiting after 20 minutes, so it looks like a timeout.

In this case it was a CloudWatch Logs issue, but presumably other causes can produce the same symptom.

ResourceInitializationError: failed to validate logger args: create stream has been retried 7 times: failed to create Cloudwatch log stream:
RequestError: send request failed caused by: Post "https://logs.undefined.amazonaws.com/": 
dial tcp: lookup logs.undefined.amazonaws.com on 172.31.0.2:53: no such host : exit status 1

We should look into detecting such issues and notifying the user promptly and correctly.
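In the meantime, one way to surface the underlying failure rather than waiting out the 20-minute steady-state timeout is to inspect the service's stopped tasks directly. A minimal diagnostic sketch using the AWS SDK for JavaScript v3 follows; the cluster and service names are placeholders, and this is not part of the provider.

import { ECSClient, ListTasksCommand, DescribeTasksCommand } from "@aws-sdk/client-ecs";

// Diagnostic sketch: print why the service's tasks stopped.
// "my-cluster" and "my-service" are placeholder names.
const ecs = new ECSClient({ region: "us-west-2" });

async function printStoppedTaskReasons(cluster: string, serviceName: string) {
    const { taskArns } = await ecs.send(
        new ListTasksCommand({ cluster, serviceName, desiredStatus: "STOPPED" }),
    );
    if (!taskArns || taskArns.length === 0) {
        console.log("No stopped tasks found.");
        return;
    }

    const { tasks } = await ecs.send(new DescribeTasksCommand({ cluster, tasks: taskArns }));
    for (const task of tasks ?? []) {
        // stoppedReason carries errors like the ResourceInitializationError above.
        console.log(task.taskArn, task.stoppedReason);
    }
}

printStoppedTaskReasons("my-cluster", "my-service").catch(console.error);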

@KyleMoran138 commented Jan 9, 2024

I'm currently experiencing this issue.
I have temporarily resolved it by adding the following configuration to my task definition's container values:

// Create a log group explicitly so the awslogs driver has somewhere to write.
const loggroup = new aws.cloudwatch.LogGroup(
  "testLoggroup",
  {
    name: "testLoggroup",
    retentionInDays: 7,
  }
);

// ...and then, in the container definition of the task definition:
logConfiguration: {
  logDriver: "awslogs",
  options: {
    "awslogs-group": loggroup.name,
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "ecs",
  },
},

If I understand correctly (which I may not, I'm still learning a bunch), it seems like the awsx implementation of the Fargate service needs to change how it handles logConfiguration and log group creation when no logConfiguration is provided.
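For context, here is a hedged sketch of how this workaround can be wired through an awsx FargateService, assuming the container definition accepts a logConfiguration block as the ECS container definition schema does. All names, sizes, the image, and the regions are placeholders, not a verified configuration.

import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";

// Sketch only: create the log group up front and point the awslogs driver at it,
// so the task does not fail to start on a missing log group.
const cluster = new aws.ecs.Cluster("cluster");
const logGroup = new aws.cloudwatch.LogGroup("app-logs", { retentionInDays: 7 });

const service = new awsx.ecs.FargateService("my-service", {
    cluster: cluster.arn,
    taskDefinitionArgs: {
        container: {
            name: "app",
            image: "nginx:latest", // placeholder image
            cpu: 128,
            memory: 512,
            essential: true,
            portMappings: [{ containerPort: 80 }],
            logConfiguration: {
                logDriver: "awslogs",
                options: {
                    "awslogs-group": logGroup.name,
                    "awslogs-region": "us-east-1", // should match the stack's region
                    "awslogs-stream-prefix": "ecs",
                },
            },
        },
    },
});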

@lambdakris commented Feb 2, 2024

I ran into this as well: deployments kept timing out, but when I retried immediately the retry would complete almost instantly, yet the service was still unavailable. I spent a good deal of time thinking it was some network configuration issue, but it turns out the whole thing was due to the task failing to start because of the missing log group.

I think 3 things could be improved here:

  1. Somehow fail fast by detecting the task/log error and reporting it to the user.
  2. Somehow prevent an identical deployment retry from succeeding, since the service is not in fact in a healthy state.
  3. Somehow update the awsx.ecs module with better defaults to prevent the problem (or at least document it).

Frankly, I would prioritize 1 and 2, since they gave me a real sense of "spooky action", making it difficult to reason about how Pulumi works with AWS and eventually leading me to suspect that something was wrong with Pulumi itself.
