
Timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s) #1112

Open
t0yv0 opened this issue Oct 23, 2023 · 7 comments
Labels: impact/reliability, kind/bug, service/ecs

Comments

@t0yv0 (Member) commented Oct 23, 2023

What happened?

I'm receiving this error a lot when trying to test examples locally:


Diagnostics:
  pulumi:pulumi:Stack (ecs-node-p-it-antons-mac-nodejs-69324370):
    error: update failed

  aws:ecs:Service (my-service):
    error: 1 error occurred:
        * creating urn:pulumi:p-it-antons-mac-nodejs-69324370::ecs-node::awsx:ecs:FargateService$aws:ecs/service:Service::my-service: 1 error occurred:
        * waiting for ECS service (arn:aws:ecs:us-west-2:616138583583:service/cluster-e6c5e93/my-service-3c9c1de) to reach steady state after creation: timeout while waiting for state to become 'tfSTABLE' (last state: 'tfPENDING', timeout: 20m0s)

Outputs:
    url: "nginx-lb-f66cb5d-2145136225.us-west-2.elb.amazonaws.com"

Resources:
    + 33 created

Duration: 22m32s

This timeout happens when trying to record example baseline behavior, say for ecs/nodejs/ on AWS 5.42.0 and AWSX 1.x.x, but also when running examples on the latest versions of the dependencies.

I have seen this affect the aws:ecs/service:Service resource through FargateService and other component resource wrappers.

For users affected by this issue, the current workaround per @danielrbradley is to apply a transformation that increases the custom timeout for the ECS service, see #1118 for a fully worked out example.
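For reference, a minimal sketch of that kind of transformation is shown below (illustrative only, not the exact code from #1118): it registers a stack-wide transformation that raises the create timeout on every aws:ecs/service:Service. The 30-minute value is an assumption, not a recommendation.

import * as pulumi from "@pulumi/pulumi";

// Illustrative sketch: bump the create timeout on every aws:ecs/service:Service
// in the stack. The "30m" value is an assumption.
pulumi.runtime.registerStackTransformation((args) => {
    if (args.type === "aws:ecs/service:Service") {
        return {
            props: args.props,
            opts: pulumi.mergeOptions(args.opts, {
                customTimeouts: { create: "30m" },
            }),
        };
    }
    return undefined;
});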

Please upvote this issue if this affects your workflow, and we can consider increasing default timeouts in the AWS provider.

Example

N/A

Output of pulumi about

CLI          
Version      3.86.0
Go Version   go1.21.1
Go Compiler  gc

Host     
OS       darwin
Version  14.0
Arch     x86_64

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/t0yv0
User           t0yv0
Organizations  t0yv0, pulumi
Token type     personal

Pulumi locates its logs in /var/folders/gk/cchgxh512m72f_dmkcc3d09h0000gp/T/ by default

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

t0yv0 added the kind/bug and needs-triage labels on Oct 23, 2023
@t0yv0 (Member, Author) commented Oct 24, 2023

Possibly related:

#300
#391
#354

@t0yv0 (Member, Author) commented Oct 24, 2023

Following the links, I've found this prior art:

pulumi/terraform-provider-aws#59

One possibility here is to raise default timeouts again.

@t0yv0 (Member, Author) commented Oct 24, 2023

I've found that pulumi/terraform-provider-aws#59 has some prior art on editing default timeouts. Perhaps we could increase the values found in https://github.com/hashicorp/terraform-provider-aws/blob/master/internal/service/ecs/service.go#L50

t0yv0 removed the needs-triage label on Oct 24, 2023
@t0yv0 (Member, Author) commented Oct 24, 2023

I'm leaving this in the tracker to accumulate upvotes; if it gains traction, we can circle back to pulumi-aws and increase the default timeouts by patching upstream. For the moment, issues with flaky tests and examples in this repository can be resolved by applying the custom-timeout transformation suggested by @danielrbradley.

t0yv0 added the needs-triage label on Oct 24, 2023
mikhailshilkov added the impact/reliability label and removed the needs-triage label on Oct 26, 2023
@thomas11 (Contributor) commented Nov 1, 2023

Looking into this further in the AWS console, I realized that this isn't really a timeout issue. The container fails to come up due to a configuration issue, and the provider gives up waiting after 20 minutes, so it looks like a timeout.

In this case it was a CloudWatch Logs issue, but presumably other causes can produce the same symptom.

ResourceInitializationError: failed to validate logger args: create stream has been retried 7 times: failed to create Cloudwatch log stream:
RequestError: send request failed caused by: Post "https://logs.undefined.amazonaws.com/": 
dial tcp: lookup logs.undefined.amazonaws.com on 172.31.0.2:53: no such host : exit status 1

We should look into detecting such issues and notifying the user promptly and correctly.
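In the meantime, one way to surface the underlying failure rather than waiting out the 20-minute steady-state timeout is to inspect the service's stopped tasks directly. A minimal diagnostic sketch using the AWS SDK for JavaScript v3 follows; the cluster and service names are placeholders, and this is not part of the provider.

import { ECSClient, ListTasksCommand, DescribeTasksCommand } from "@aws-sdk/client-ecs";

// Diagnostic sketch: print why the service's tasks stopped.
// "my-cluster" and "my-service" are placeholder names.
const ecs = new ECSClient({ region: "us-west-2" });

async function printStoppedTaskReasons(cluster: string, serviceName: string) {
    const { taskArns } = await ecs.send(
        new ListTasksCommand({ cluster, serviceName, desiredStatus: "STOPPED" }),
    );
    if (!taskArns || taskArns.length === 0) {
        console.log("No stopped tasks found.");
        return;
    }

    const { tasks } = await ecs.send(new DescribeTasksCommand({ cluster, tasks: taskArns }));
    for (const task of tasks ?? []) {
        // stoppedReason carries errors like the ResourceInitializationError above.
        console.log(task.taskArn, task.stoppedReason);
    }
}

printStoppedTaskReasons("my-cluster", "my-service").catch(console.error);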

@KyleMoran138 commented Jan 9, 2024

I'm currently experiencing this issue.
I have temporarily resolved it by adding the following configuration to my task definition's container values:

// Create a log group explicitly so the awslogs driver has somewhere to write.
const loggroup = new aws.cloudwatch.LogGroup(
  "testLoggroup",
  {
    name: "testLoggroup",
    retentionInDays: 7,
  }
);

// ...and then, in the container definition of the task definition:
logConfiguration: {
  logDriver: "awslogs",
  options: {
    "awslogs-group": loggroup.name,
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "ecs",
  },
},

If I understand correctly (which I may not, I'm still learning a bunch), it seems like the awsx implementation of the Fargate service needs to change how it handles logConfiguration and log group creation when no logConfiguration is provided.
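For context, here is a hedged sketch of how this workaround can be wired through an awsx FargateService, assuming the container definition accepts a logConfiguration block as the ECS container definition schema does. All names, sizes, the image, and the regions are placeholders, not a verified configuration.

import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";

// Sketch only: create the log group up front and point the awslogs driver at it,
// so the task does not fail to start on a missing log group.
const cluster = new aws.ecs.Cluster("cluster");
const logGroup = new aws.cloudwatch.LogGroup("app-logs", { retentionInDays: 7 });

const service = new awsx.ecs.FargateService("my-service", {
    cluster: cluster.arn,
    taskDefinitionArgs: {
        container: {
            name: "app",
            image: "nginx:latest", // placeholder image
            cpu: 128,
            memory: 512,
            essential: true,
            portMappings: [{ containerPort: 80 }],
            logConfiguration: {
                logDriver: "awslogs",
                options: {
                    "awslogs-group": logGroup.name,
                    "awslogs-region": "us-east-1", // should match the stack's region
                    "awslogs-stream-prefix": "ecs",
                },
            },
        },
    },
});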

@lambdakris commented Feb 2, 2024

I ran into this as well: deployments kept timing out, but when I retried immediately the retry would complete almost instantly, yet the service was still unavailable. I spent a good deal of time thinking it was some network configuration issue, but it turns out the whole thing was due to the task failing to start because of the missing log group.

I think 3 things could be improved here:

  1. Somehow fail fast by detecting the task/log error and reporting it to the user.
  2. Somehow prevent an identical deployment retry from succeeding, since the service is not in fact in a healthy state.
  3. Somehow update the awsx.ecs module with better defaults to prevent the problem (or at least document it).

Frankly, I would prioritize 1 and 2, since they gave me a real sense of "spooky action", making it difficult to reason about how Pulumi works with AWS and eventually leading me to suspect that something was wrong with Pulumi itself.
