timeout while waiting for state to become 'success' (timeout: 2m0s) #780

Closed
erose96 opened this issue Dec 4, 2023 · 26 comments · Fixed by #802 or #807
Comments

@erose96

erose96 commented Dec 4, 2023

#777 attempted to fix this issue but it persists in my environment.

I do not believe this is an issue caused by the rate limit.

Here is the section of the debug log where the error in the title occurs:

2023-12-04T16:13:52.730Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState timeout after 2m0s: timestamp=2023-12-04T16:13:52.730Z
2023-12-04T16:13:52.730Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState starting 30s refresh grace period: timestamp=2023-12-04T16:13:52.730Z
2023-12-04T16:13:57.308Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState timeout after 2m0s: timestamp=2023-12-04T16:13:57.308Z
2023-12-04T16:13:57.308Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState starting 30s refresh grace period: timestamp=2023-12-04T16:13:57.308Z
2023-12-04T16:13:57.524Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState timeout after 2m0s: timestamp=2023-12-04T16:13:57.523Z
2023-12-04T16:13:57.524Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState starting 30s refresh grace period: timestamp=2023-12-04T16:13:57.524Z
2023-12-04T16:14:22.732Z [ERROR] provider.terraform-provider-pagerduty_v3.2.2: WaitForState exceeded refresh grace period: timestamp=2023-12-04T16:14:22.731Z
2023-12-04T16:14:22.732Z [ERROR] vertex "module.{pagerduty_service_name}" error: timeout while waiting for state to become 'success' (timeout: 2m0s)
2023-12-04T16:14:22.733Z [ERROR] vertex "module.{pagerduty_service_name} (expand)" error: timeout while waiting for state to become 'success' (timeout: 2m0s)

The 200 response that occurs right before this indicates the rate limit is not about to be hit:

Ratelimit-Limit: 960
Ratelimit-Remaining: 919
Ratelimit-Reset: 58

The WaitForState messages in the logs make me think it's related to an issue upstream in the terraform-plugin-sdk. A fix was submitted for that issue a few years ago but was never reviewed.

See past issues: #765 #760

@tgoodsell-tempus

It could also be useful to introduce operation timeouts here, for additional control: https://developer.hashicorp.com/terraform/language/resources/syntax#operation-timeouts
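
For reference, a minimal sketch of what such an operation-timeouts block would look like. This assumes the provider's resources exposed the standard timeouts block, which is exactly what this comment proposes adding; the resource and variable names here are hypothetical.

variable "escalation_policy_id" {
  type = string
}

# Sketch only: assumes the PagerDuty provider added support for the
# standard operation-timeouts block on this resource.
resource "pagerduty_service" "example" {
  name              = "example-service"
  escalation_policy = var.escalation_policy_id # hypothetical variable

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}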

@ingwarsw

ingwarsw commented Dec 27, 2023

What's strange about this issue is that (in our case) it works from our personal computers (100% of runs pass) but fails from GitHub Actions (>95% of runs fail).

It also seems that each failure has a "random" number of failed items, so maybe it's related to some PagerDuty rate limiting at the host level or something?
I also just had an even stranger error:

│ Error: Get "https://api.pagerduty.com/users/XXX": read tcp 10.1.0.4:33696->44.237.102.140:443: read: connection reset by peer

Overall this issue is annoying as hell.

@austinpray-mixpanel

> What's strange about this issue is that (in our case) it works from our personal computers (100% of runs pass) but fails from GitHub Actions (>95% of runs fail).

Same here. Our developers apply terraform via a github action and we are seeing the same thing.

@gunzy83

gunzy83 commented Jan 10, 2024

> What's strange about this issue is that (in our case) it works from our personal computers (100% of runs pass) but fails from GitHub Actions (>95% of runs fail).

We have just run into this as well with our first GH Actions deploy, using a scoped OAuth client credential (app) that only this one project uses, for one deployment at a time.

No issues during development on a local machine with multiple deploys and teardowns of the stack, but going to staging and prod with this errored. I retried the staging job twice (the second time after waiting a while and reading issues on GitHub) and then the prod one went through.

It seems I may have gotten lucky on GitHub Actions with a new runner or exit IP... there may be an undocumented IP-address-based limit in play?

@imjaroiswebdev
Contributor

@erose96 are you facing this issue on a local machine or inside a GH Action runner, as @ingwarsw describes? Additionally, could any of you please provide an example of the TF code hitting this issue so I can try to replicate it and come up with a solution ASAP? Thanks in advance, folks!

@ingwarsw

ingwarsw commented Jan 11, 2024

To test whether it's a GH (network) issue, I created a self-hosted runner.
The same pipeline now works in 100% of cases, while with GH runners it fails in 99% of cases (it randomly passes from time to time)...

I will try to create a simple test case based on yesterday's run.
But it should be easy to catch.
In most cases it fails with

pagerduty_tag_assignment

Something like this failed on the second run:

locals {
  teams = {
    "a"  = "aa",
    "a1" = "aa1",
#    "a2" = "aa2",
#    "a3" = "aa3",
#    "a4" = "aa4",
#    "a5" = "aa5",
#    "a6" = "aa6",
#    "a7" = "aa7",
  }
}

import {
  id = "escalation_policies.xxx.yyy"
  to = pagerduty_tag_assignment.test["a"]
}
import {
  id = "escalation_policies.xxx.yyy"
  to = pagerduty_tag_assignment.test["a1"]
}


resource "pagerduty_tag" "tf_managed" {
  label = "test-me"
}

resource "pagerduty_team" "tf_teams" {
  for_each    = local.teams
  name        = each.key
  description = each.value
}

resource "pagerduty_tag_assignment" "test" {
  for_each = local.teams
  tag_id      = pagerduty_tag.tf_managed.id
  entity_type = "teams"
  entity_id   = pagerduty_team.tf_teams[each.key].id
}

provider "pagerduty" {
  token      = var.pagerduty_api_token
  user_token = var.pagerduty_user_api_token
}

variable "pagerduty_api_token" {
  type        = string
  description = "api token for pagerduty"
}

variable "pagerduty_user_api_token" {
  type        = string
  description = "api user token for pagerduty"
}

output "test" {
  value = pagerduty_tag_assignment.test
}

@austinpray-mixpanel

> To test whether it's a GH (network) issue, I created a self-hosted runner. The same pipeline now works in 100% of cases, while with GH runners it fails in 99% of cases (it randomly passes from time to time)...

We are also testing moving our Terraform actions to self-hosted runners and are monitoring to see if the timeouts go away.

@gunzy83

gunzy83 commented Jan 11, 2024

@ingwarsw legend, you just saved me from testing a self-hosted runner.

I have even seen an error on this:

data "pagerduty_vendor" "datadog" {
  name = "Datadog"
}

which fails after 5 minutes of spinning. I have only seen this on GitHub Actions; local machines work 100% of the time.

@imjaroiswebdev
Contributor

Hey folks! I prepared this repository to try to replicate the error, and after several attempts (new commits and Actions re-runs) I can tell you I haven't had success 😅

If I have captured correctly what you have all been noting, the repository meets the following conditions for trying to reproduce the error:

  • Terraform code project using PagerDuty TF provider.
  • Terraform code is executed inside the GH Action Runner.
  • I used PD Tags as @ingwarsw said.

On top of that, I added verbose (secured) logging to debug the error and eventually find out what's going on.

As you have been pointing out, locally the TF plan/apply works flawlessly, and even in TF Cloud runners too (I ran the test just in case).

Therefore, I would really appreciate it if any of you could submit a few PRs to help me replicate this error and find the culprit 🙏🏽. I'll do my best to stay tuned and promptly merge your PRs until we reproduce the error and hopefully catch the bug in the logs. Thanks in advance for your help and patience.

@austinpray-mixpanel

@imjaroiswebdev can you try adding a bunch of user/team lookups? We suspect that our PagerDuty schedule definitions cause a cascade of requests, since each user has to be looked up by email and so on.

@austinpray-mixpanel

austinpray-mixpanel commented Jan 12, 2024

Here's a sanitized example of how we define teams and schedules.

locals {
  team = "DevInfra"
  members = [
    "bogus1@pagerduty.com",
    "bogus2@pagerduty.com",
    "bogus3@pagerduty.com",
    "bogus4@pagerduty.com",
    "bogus5@pagerduty.com",
  ]
  start = "2023-11-27T14:30:00-07:00"
  manager = "bogus1@pagerduty.com"
}

resource "pagerduty_team" "default" {
  name = local.team
}

data "pagerduty_user" "team" {
  for_each = toset(local.members)
  email    = each.key
}

data "pagerduty_user" "manager" {
  email = local.manager
}

resource "pagerduty_schedule" "default" {
  name      = "${local.team} schedule"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "${local.team} Ops Leads"
    start                        = local.start
    rotation_virtual_start       = local.start
    rotation_turn_length_seconds = 60 * 60 * 24 * 7
    users                        = [for member in local.members : data.pagerduty_user.team[member].id]
  }
  teams = [pagerduty_team.default.id]
}

resource "pagerduty_escalation_policy" "default" {
  name  = "${local.team} Escalation Policy"
  teams = [pagerduty_team.default.id]

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.default.id
    }
  }
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "user_reference"
      id   = data.pagerduty_user.manager.id
    }
  }
}

edit: PR imjaroiswebdev/pd-tfprovider-issue-780-experiment#1

@imjaroiswebdev
Contributor

Hey @austinpray-mixpanel, thank you very much for your help; however, this configuration wasn't enough to replicate the error 😩

@gunzy83

gunzy83 commented Jan 13, 2024

> Hey @austinpray-mixpanel, thank you very much for your help; however, this configuration wasn't enough to replicate the error 😩

We appreciate the effort, @imjaroiswebdev. Are you able to check internally whether there is any rate limiting at the host/IP level in addition to the new rate-limiting rules published publicly last year? That may explain this issue better than a standard reproduction.

I have only had 1 of 5 new deployments fail since I posted; however, that job was failing repeatedly on the pagerduty_vendor data source until I waited another hour to retry. Our account is small with very little API use so far (we are not hitting the documented limits), but this kind of random flakiness will kill any notion of packaging PagerDuty service config with app deployment code if we want reliable automated deploys.

@imjaroiswebdev
Contributor

I was finally able to reproduce the issue here; I decided to re-run the job until it failed because of this, and I believe last time I simply didn't try enough times. I just wanted to update you all to let you know I'm researching further into this with other engineering teams to catch the culprit and get back to you with a solution, a workaround, or something 💪🏽

@erose96
Author

erose96 commented Jan 19, 2024

@imjaroiswebdev sorry for the late reply. I run into the issue when running from an Azure DevOps Microsoft-hosted agent (similar to a GH runner). The issue has not presented itself locally.

I see someone else already provided code, but here's what I'm running:

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    pagerduty = {
      source = "pagerduty/pagerduty"
    }
  }
}

resource "pagerduty_service" "tsc_pagerduty_service" {
  name                    = "[TF] ${var.service_name}"
  description             = "[Managed by Terraform] - ${var.pagerduty_description}"
  auto_resolve_timeout    = var.pagerduty_auto_resolve_timeout
  acknowledgement_timeout = var.pagerduty_acknowledgement_timeout
  escalation_policy       = var.pagerduty_escalation_policy_id
  alert_creation          = "create_alerts_and_incidents"


  incident_urgency_rule {
    type    = var.pagerduty_incident_urgency == "high" ? "constant" : "use_support_hours"
    urgency = var.pagerduty_incident_urgency == "high" ? "high" : ""

    dynamic "during_support_hours" {
      for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
      content {
        type    = "constant"
        urgency = "high"
      }
    }

    dynamic "outside_support_hours" {
      for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
      content {
        type    = "constant"
        urgency = "low"
      }
    }
  }

  dynamic "support_hours" {
    for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
    content {
      type         = "fixed_time_per_day"
      time_zone    = "America/New_York"
      days_of_week = ["1", "2", "3", "4", "5"]
      start_time   = "09:00:00"
      end_time     = "17:00:00"
    }
  }

  dynamic "scheduled_actions" {
    for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
    content {
      type       = "urgency_change"
      to_urgency = "high"

      at {
        type = "named_time"
        name = "support_hours_start"
      }
    }
  }
}

resource "pagerduty_service_integration" "tsc_pagerduty_azure_service_integration" {
  name    = "Microsoft Azure"
  vendor  = var.pagerduty_microsoft_azure_vendor_id
  service = pagerduty_service.tsc_pagerduty_service.id
}

resource "pagerduty_slack_connection" "tsc_pagerduty_slack_connection" {
  source_id         = pagerduty_service.tsc_pagerduty_service.id
  source_type       = "service_reference"
  workspace_id      = var.slack_workspace_id
  channel_id        = var.slack_channel_id
  notification_type = "responder"
  config {
    events = [
      "incident.triggered",
      "incident.escalated",
      "incident.resolved",
      "incident.priority_updated",
      "incident.responder.added",
      "incident.responder.replied",
      "incident.status_update_published",
      "incident.reopened"
    ]
    priorities = ["*"]
  }
}

resource "azurerm_monitor_action_group" "tsc_pagerduty_action_group" {
  name                = "${trim(var.service_name,":<>+/&%?@")} PagerDuty Action Group"
  resource_group_name = var.action_group_resource_group_name
  short_name          = "PD${var.pagerduty_incident_urgency}${substr(var.service_name, 0, 5)}"

  webhook_receiver {
    name                    = "PagerDuty"
    service_uri             = "https://events.pagerduty.com/integration/${pagerduty_service_integration.tsc_pagerduty_azure_service_integration.integration_key}/enqueue"
    use_common_alert_schema = true
  }

  lifecycle {
    ignore_changes = [
      tags["Environment"],
      tags["CostCenter"],
      tags["Product"],
      tags["lastModified"],
      tags["lastModifiedBy"]
    ]
  }
}

output "pagerduty_service_integration_id" {
  value = pagerduty_service_integration.tsc_pagerduty_azure_service_integration.id
}

output "tsc_pagerduty_action_group_id" {
  value = azurerm_monitor_action_group.tsc_pagerduty_action_group.id
}

I then invoke the module like so:

module "tsc_services_action_group_high" {
  source                           = "../modules/pagerduty-action-group"
  for_each                         = toset(distinct(local.tscServices))
  service_name                     = "${each.value} - High"
  pagerduty_description            = "These are the high urgency alerts for ${each.value}"
  pagerduty_escalation_policy_id   = local.it_system_engineers_escalation_policy_id
  pagerduty_incident_urgency       = "high"
  action_group_resource_group_name = azurerm_resource_group.tscmonitoring_live.name
}

tscServices is an array of our service names.

At the moment I am working around this by breaking out my monitoring Terraform into separate workspaces and then importing the Azure Action Groups as data objects in the other workspaces (as sketched below). Using data objects for the PagerDuty services themselves still leads to the same timeouts, unfortunately; thankfully, in my setup we don't need to alter the PD services very often. But it would be extremely useful to be able to do so, to allow my team to rename things, update dependencies, etc. as they wish.
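
For illustration, a minimal sketch of that data-object pattern in the downstream workspace, assuming the monitoring workspace already created the action group; the names here are hypothetical and just mirror the naming scheme from the module above.

variable "action_group_resource_group_name" {
  type = string
}

# Look up the action group managed by the separate monitoring workspace
# instead of creating it here. The action group name is hypothetical.
data "azurerm_monitor_action_group" "tsc_pagerduty_high" {
  name                = "MyService - High PagerDuty Action Group"
  resource_group_name = var.action_group_resource_group_name
}

output "tsc_pagerduty_high_action_group_id" {
  value = data.azurerm_monitor_action_group.tsc_pagerduty_high.id
}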

@imjaroiswebdev
Contributor

Thank you so much for all your help; it was very valuable for figuring this out.

This is not an issue affecting GH runners exclusively, as @erose96 detected; the reason for this issue is that the TF provider's API client doesn't have a configured timeout for API calls.

A patch solving this will be released next Monday, Jan 22nd. Again, thanks for all your support and patience, folks.

@ingwarsw

ingwarsw commented Jan 24, 2024

@imjaroiswebdev I think we should reopen this issue...

I just tested v3.5.0 and still have the same issue.

- Installing pagerduty/pagerduty v3.5.0...
- Installed pagerduty/pagerduty v3.5.0 (signed by a HashiCorp partner, key ID 027C6DD1F0707B45)
...
│ Error: timeout while waiting for state to become 'success' (timeout: 2m0s)
│
│   with module.pd_service_backend_query.pagerduty_service_integration.c_pd_service_events_integration["backend_query"],
│   on modules/c-pd-service/main.tf line 52, in resource "pagerduty_service_integration" "c_pd_service_events_integration":
│   52: resource "pagerduty_service_integration" "c_pd_service_events_integration" {
│

A second run produced a second set of "random" errors:

│ Error: Get "https://api.pagerduty.com/users/PWXXXX": read tcp 10.1.0.14:58282->52.36.64.228:443: read: connection reset by peer
│
│   with pagerduty_tag_assignment.tag_user_team["memberof_cloud"],
│   on pd_users.tf line 34, in resource "pagerduty_tag_assignment" "tag_user_team":
│   34: resource "pagerduty_tag_assignment" "tag_user_team" {
│
╵
╷
│ Error: timeout while waiting for state to become 'success' (timeout: 2m0s)
│
│   with module.pd_service_infra_tools.pagerduty_slack_connection.c_pd_service_slack_integration["infra_tools"],
│   on modules/c-pd-service/main.tf line 60, in resource "pagerduty_slack_connection" "c_pd_service_slack_integration":
│   60: resource "pagerduty_slack_connection" "c_pd_service_slack_integration" {

@ioSpark

ioSpark commented Jan 24, 2024

I can confirm that it looks like this issue is still present (maybe worse, since the timeout has been lowered).

Error: timeout while waiting for state to become 'success' (timeout: 30s)

  with pagerduty_schedule.redacted,
  on redacted.tf line 354, in resource "pagerduty_schedule" "redacted":
 354: resource "pagerduty_schedule" "redacted" {

Is the issue here the lack of retries? I'm not too bothered about the timeout itself, but it doesn't appear that the provider/HTTP client attempts a retry, so it seems like any network hiccup during a run would be enough to fail the Terraform run. (It seems that some resources have retries baked in and others do not, though none at the network level.)

@imjaroiswebdev
Contributor

Yes @ingwarsw, reopening for further investigation and tests. Please stay tuned; I'll get back to you ASAP with a patch or an ETA for it.

@tgoodsell-tempus

I'm also seeing 5xx errors on the new version for some endpoints, such as:

│ Error: GET API call to https://app.pagerduty.com/integration-slack/workspaces/ID-REDACTED/connections/ID-REDACTED failed: 502 Bad Gateway
│ 
│   with module.network_monitor.module.pagerduty_slack_connection.pagerduty_slack_connection.this,
│   on .terraform/modules/network_monitor.pagerduty_slack_connection/modules/pagerduty/slack_connection/main.tf line 1, in resource "pagerduty_slack_connection" "this":
│    1: resource "pagerduty_slack_connection" "this" {
│ 
╵

@tgoodsell-tempus

> I'm also seeing 5xx errors on the new version for some endpoints, such as:
>
> │ Error: GET API call to https://app.pagerduty.com/integration-slack/workspaces/ID-REDACTED/connections/ID-REDACTED failed: 502 Bad Gateway
> │
> │   with module.network_monitor.module.pagerduty_slack_connection.pagerduty_slack_connection.this,
> │   on .terraform/modules/network_monitor.pagerduty_slack_connection/modules/pagerduty/slack_connection/main.tf line 1, in resource "pagerduty_slack_connection" "this":
> │    1: resource "pagerduty_slack_connection" "this" {
> │
> ╵

My issue is likely related to https://status.pagerduty.com/incident_details/PTUPX96

@imjaroiswebdev
Contributor

imjaroiswebdev commented Jan 25, 2024

Hey folks! I encourage you to upgrade to v3.5.2; hopefully, the issue is finally addressed. Again, I want to thank you all for your patience and for providing helpful error outputs that made it easier to figure out how to solve this.

@tgoodsell-tempus the issue you were experiencing was due to a partial outage with the Slack integration at that moment; however, as far as I know, it should be working as usual again.

Feel free to re-open this thread if any form of this error continues to appear.

@ingwarsw

@imjaroiswebdev I have run our pipeline 5 times and it didn't fail once... so we can consider it a big success 🥇

@imjaroiswebdev
Contributor

Great! Thank you so much for the feedback, @ingwarsw. Much appreciated 🎉

@tanguyantoine

Upgrading to 3.6.0 fixed the issue. Thank you!

@erose96
Author

erose96 commented Feb 1, 2024

Working perfectly using v3.7.0, thank you @imjaroiswebdev and everyone else who helped solve this issue!
