timeout while waiting for state to become 'success' (timeout: 2m0s) #780

Closed
erose96 opened this issue Dec 4, 2023 · 26 comments · Fixed by #802 or #807
Comments

@erose96

erose96 commented Dec 4, 2023

#777 attempted to fix this issue but it persists in my environment.

I do not believe this is an issue caused by the rate limit.

Here is the section of the debug log where the error in the title occurs:

2023-12-04T16:13:52.730Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState timeout after 2m0s: timestamp=2023-12-04T16:13:52.730Z
2023-12-04T16:13:52.730Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState starting 30s refresh grace period: timestamp=2023-12-04T16:13:52.730Z
2023-12-04T16:13:57.308Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState timeout after 2m0s: timestamp=2023-12-04T16:13:57.308Z
2023-12-04T16:13:57.308Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState starting 30s refresh grace period: timestamp=2023-12-04T16:13:57.308Z
2023-12-04T16:13:57.524Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState timeout after 2m0s: timestamp=2023-12-04T16:13:57.523Z
2023-12-04T16:13:57.524Z [WARN]  provider.terraform-provider-pagerduty_v3.2.2: WaitForState starting 30s refresh grace period: timestamp=2023-12-04T16:13:57.524Z
2023-12-04T16:14:22.732Z [ERROR] provider.terraform-provider-pagerduty_v3.2.2: WaitForState exceeded refresh grace period: timestamp=2023-12-04T16:14:22.731Z
2023-12-04T16:14:22.732Z [ERROR] vertex "module.{pagerduty_service_name}" error: timeout while waiting for state to become 'success' (timeout: 2m0s)
2023-12-04T16:14:22.733Z [ERROR] vertex "module.{pagerduty_service_name} (expand)" error: timeout while waiting for state to become 'success' (timeout: 2m0s)

The 200 response that occurs right before this indicates the rate limit is not about to be hit:

Ratelimit-Limit: 960
Ratelimit-Remaining: 919
Ratelimit-Reset: 58

The WaitForState messages in the logs make me think it's related to an issue upstream in the terraform-plugin-sdk. A fix was submitted for that issue a few years ago but was never reviewed.

See past issues: #765 #760

@tgoodsell-tempus

It could also be useful to introduce operation timeouts here, for additional control: https://developer.hashicorp.com/terraform/language/resources/syntax#operation-timeouts
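
For reference, a minimal sketch of what such an operation-timeouts block would look like. This assumes the provider's resources exposed the standard timeouts block, which is exactly what this comment proposes adding; the resource and variable names here are hypothetical.

variable "escalation_policy_id" {
  type = string
}

# Sketch only: assumes the PagerDuty provider added support for the
# standard operation-timeouts block on this resource.
resource "pagerduty_service" "example" {
  name              = "example-service"
  escalation_policy = var.escalation_policy_id # hypothetical variable

  timeouts {
    create = "5m"
    update = "5m"
    delete = "5m"
  }
}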

@ingwarsw

ingwarsw commented Dec 27, 2023

What's strange about this issue is that (in our case) it works from our personal computers (100% of runs pass) but fails from GitHub Actions (>95% of runs fail).

It also seems that each failure has a "random" number of failed items, so maybe it's related to some PagerDuty rate limiting at the host level or something?
I also just had an even stranger error:

│ Error: Get "https://api.pagerduty.com/users/XXX": read tcp 10.1.0.4:33696->44.237.102.140:443: read: connection reset by peer

Overall this issue is annoying as hell.

@austinpray-mixpanel

> What's strange about this issue is that (in our case) it works from our personal computers (100% of runs pass) but fails from GitHub Actions (>95% of runs fail).

Same here. Our developers apply terraform via a github action and we are seeing the same thing.

@gunzy83

gunzy83 commented Jan 10, 2024

> What's strange about this issue is that (in our case) it works from our personal computers (100% of runs pass) but fails from GitHub Actions (>95% of runs fail).

We have just run into this as well with our first GH Actions deploy, using a scoped OAuth client credential (app) that only this one project uses, for one deployment at a time.

No issues during development on a local machine with multiple deploys and teardowns of the stack, but going to staging and prod with this errored. I retried the staging job twice (the second time after waiting a while and reading issues on GitHub) and then the prod one went through.

It seems I may have gotten lucky on GitHub Actions with a new runner or exit IP... there may be an undocumented IP-address-based limit in play?

@imjaroiswebdev
Contributor

@erose96 are you facing this issue on a local machine or inside a GH Action runner, as @ingwarsw describes? Additionally, could any of you please provide an example of the TF code hitting this issue so I can try to replicate it and come up with a solution ASAP? Thanks in advance, folks!

@ingwarsw

ingwarsw commented Jan 11, 2024

To test whether it's a GH (network) issue, I created a self-hosted runner.
The same pipeline now works in 100% of cases, while with GH runners it fails in 99% of cases (it randomly passes from time to time)...

I will try to create a simple test case based on yesterday's run.
But it should be easy to catch.
In most cases it fails with

pagerduty_tag_assignment

Something like this failed on the second run:

locals {
  teams = {
    "a"  = "aa",
    "a1" = "aa1",
#    "a2" = "aa2",
#    "a3" = "aa3",
#    "a4" = "aa4",
#    "a5" = "aa5",
#    "a6" = "aa6",
#    "a7" = "aa7",
  }
}

import {
  id = "escalation_policies.xxx.yyy"
  to = pagerduty_tag_assignment.test["a"]
}
import {
  id = "escalation_policies.xxx.yyy"
  to = pagerduty_tag_assignment.test["a1"]
}


resource "pagerduty_tag" "tf_managed" {
  label = "test-me"
}

resource "pagerduty_team" "tf_teams" {
  for_each    = local.teams
  name        = each.key
  description = each.value
}

resource "pagerduty_tag_assignment" "test" {
  for_each = local.teams
  tag_id      = pagerduty_tag.tf_managed.id
  entity_type = "teams"
  entity_id   = pagerduty_team.tf_teams[each.key].id
}

provider "pagerduty" {
  token      = var.pagerduty_api_token
  user_token = var.pagerduty_user_api_token
}

variable "pagerduty_api_token" {
  type        = string
  description = "api token for pagerduty"
}

variable "pagerduty_user_api_token" {
  type        = string
  description = "api user token for pagerduty"
}

output "test" {
  value = pagerduty_tag_assignment.test
}

@austinpray-mixpanel

> To test whether it's a GH (network) issue, I created a self-hosted runner. The same pipeline now works in 100% of cases, while with GH runners it fails in 99% of cases (it randomly passes from time to time)...

We are also testing moving our Terraform actions to self-hosted runners and are monitoring to see if the timeouts go away.

@gunzy83

gunzy83 commented Jan 11, 2024

@ingwarsw legend, you just saved me from testing a self-hosted runner.

I have even seen an error on this:

data "pagerduty_vendor" "datadog" {
  name = "Datadog"
}

which fails after 5 minutes of spinning. I have only seen this on GitHub Actions; local machines work 100% of the time.

@imjaroiswebdev
Contributor

Hey folks! I prepared this repository to try to replicate the error, and after several attempts (new commits and Actions re-runs) I can tell you I haven't had success 😅

If I have captured correctly what you have all been noting, the repository meets the following conditions for trying to reproduce the error:

  • Terraform code project using PagerDuty TF provider.
  • Terraform code is executed inside the GH Action Runner.
  • I used PD Tags as @ingwarsw said.

On top of that, I added verbose (secured) logging to debug the error and eventually find out what's going on.

As you have been pointing out, locally the TF plan/apply works flawlessly, and even in TF Cloud runners too (I ran the test just in case).

Therefore, I would really appreciate it if any of you could submit a few PRs to help me replicate this error and find the culprit 🙏🏽. I'll do my best to stay tuned and promptly merge your PRs until we reproduce the error and hopefully catch the bug in the logs. Thanks in advance for your help and patience.

@austinpray-mixpanel

@imjaroiswebdev can you try adding a bunch of user/team lookups? We suspect that our PagerDuty schedule definitions cause a cascade of requests, since each user has to be looked up by email and so on.

@austinpray-mixpanel

austinpray-mixpanel commented Jan 12, 2024

Here's a sanitized example of how we define teams and schedules.

locals {
  team = "DevInfra"
  members = [
    "bogus1@pagerduty.com",
    "bogus2@pagerduty.com",
    "bogus3@pagerduty.com",
    "bogus4@pagerduty.com",
    "bogus5@pagerduty.com",
  ]
  start = "2023-11-27T14:30:00-07:00"
  manager = "bogus1@pagerduty.com"
}

resource "pagerduty_team" "default" {
  name = local.team
}

data "pagerduty_user" "team" {
  for_each = toset(local.members)
  email    = each.key
}

data "pagerduty_user" "manager" {
  email = local.manager
}

resource "pagerduty_schedule" "default" {
  name      = "${local.team} schedule"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "${local.team} Ops Leads"
    start                        = local.start
    rotation_virtual_start       = local.start
    rotation_turn_length_seconds = 60 * 60 * 24 * 7
    users                        = [for member in local.members : data.pagerduty_user.team[member].id]
  }
  teams = [pagerduty_team.default.id]
}

resource "pagerduty_escalation_policy" "default" {
  name  = "${local.team} Escalation Policy"
  teams = [pagerduty_team.default.id]

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.default.id
    }
  }
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "user_reference"
      id   = data.pagerduty_user.manager.id
    }
  }
}

edit: PR imjaroiswebdev/pd-tfprovider-issue-780-experiment#1

@imjaroiswebdev
Contributor

Hey @austinpray-mixpanel, thank you very much for your help; however, this configuration wasn't enough to replicate the error 😩

@gunzy83

gunzy83 commented Jan 13, 2024

> Hey @austinpray-mixpanel, thank you very much for your help; however, this configuration wasn't enough to replicate the error 😩

We appreciate the effort, @imjaroiswebdev. Are you able to check internally whether there is any rate limiting at the host/IP level in addition to the new rate-limiting rules published publicly last year? That may explain this issue better than a standard reproduction.

I have only had 1 of 5 new deployments fail since I posted; however, that job was failing repeatedly on the pagerduty_vendor data source until I waited another hour to retry. Our account is small with very little API use so far (we are not hitting the documented limits), but this kind of random flakiness will kill any notion of packaging PagerDuty service config with app deployment code if we want reliable automated deploys.

@imjaroiswebdev
Contributor

I was finally able to reproduce the issue here; I decided to re-run the job until it failed because of this, and I believe last time I simply didn't try enough times. I just wanted to update you all to let you know I'm researching further into this with other engineering teams to catch the culprit and get back to you with a solution, a workaround, or something 💪🏽

@erose96
Author

erose96 commented Jan 19, 2024

@imjaroiswebdev sorry for the late reply. I run into the issue when running from an Azure DevOps Microsoft-hosted agent (similar to a GH runner). The issue has not presented itself locally.

I see someone else already provided code, but here's what I'm running:

terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    pagerduty = {
      source = "pagerduty/pagerduty"
    }
  }
}

resource "pagerduty_service" "tsc_pagerduty_service" {
  name                    = "[TF] ${var.service_name}"
  description             = "[Managed by Terraform] - ${var.pagerduty_description}"
  auto_resolve_timeout    = var.pagerduty_auto_resolve_timeout
  acknowledgement_timeout = var.pagerduty_acknowledgement_timeout
  escalation_policy       = var.pagerduty_escalation_policy_id
  alert_creation          = "create_alerts_and_incidents"


  incident_urgency_rule {
    type    = var.pagerduty_incident_urgency == "high" ? "constant" : "use_support_hours"
    urgency = var.pagerduty_incident_urgency == "high" ? "high" : ""

    dynamic "during_support_hours" {
      for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
      content {
        type    = "constant"
        urgency = "high"
      }
    }

    dynamic "outside_support_hours" {
      for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
      content {
        type    = "constant"
        urgency = "low"
      }
    }
  }

  dynamic "support_hours" {
    for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
    content {
      type         = "fixed_time_per_day"
      time_zone    = "America/New_York"
      days_of_week = ["1", "2", "3", "4", "5"]
      start_time   = "09:00:00"
      end_time     = "17:00:00"
    }
  }

  dynamic "scheduled_actions" {
    for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
    content {
      type       = "urgency_change"
      to_urgency = "high"

      at {
        type = "named_time"
        name = "support_hours_start"
      }
    }
  }
}

resource "pagerduty_service_integration" "tsc_pagerduty_azure_service_integration" {
  name    = "Microsoft Azure"
  vendor  = var.pagerduty_microsoft_azure_vendor_id
  service = pagerduty_service.tsc_pagerduty_service.id
}

resource "pagerduty_slack_connection" "tsc_pagerduty_slack_connection" {
  source_id         = pagerduty_service.tsc_pagerduty_service.id
  source_type       = "service_reference"
  workspace_id      = var.slack_workspace_id
  channel_id        = var.slack_channel_id
  notification_type = "responder"
  config {
    events = [
      "incident.triggered",
      "incident.escalated",
      "incident.resolved",
      "incident.priority_updated",
      "incident.responder.added",
      "incident.responder.replied",
      "incident.status_update_published",
      "incident.reopened"
    ]
    priorities = ["*"]
  }
}

resource "azurerm_monitor_action_group" "tsc_pagerduty_action_group" {
  name                = "${trim(var.service_name,":<>+/&%?@")} PagerDuty Action Group"
  resource_group_name = var.action_group_resource_group_name
  short_name          = "PD${var.pagerduty_incident_urgency}${substr(var.service_name, 0, 5)}"

  webhook_receiver {
    name                    = "PagerDuty"
    service_uri             = "https://events.pagerduty.com/integration/${pagerduty_service_integration.tsc_pagerduty_azure_service_integration.integration_key}/enqueue"
    use_common_alert_schema = true
  }

  lifecycle {
    ignore_changes = [
      tags["Environment"],
      tags["CostCenter"],
      tags["Product"],
      tags["lastModified"],
      tags["lastModifiedBy"]
    ]
  }
}

output "pagerduty_service_integration_id" {
  value = pagerduty_service_integration.tsc_pagerduty_azure_service_integration.id
}

output "tsc_pagerduty_action_group_id" {
  value = azurerm_monitor_action_group.tsc_pagerduty_action_group.id
}

I then invoke the module like so:

module "tsc_services_action_group_high" {
  source                           = "../modules/pagerduty-action-group"
  for_each                         = toset(distinct(local.tscServices))
  service_name                     = "${each.value} - High"
  pagerduty_description            = "These are the high urgency alerts for ${each.value}"
  pagerduty_escalation_policy_id   = local.it_system_engineers_escalation_policy_id
  pagerduty_incident_urgency       = "high"
  action_group_resource_group_name = azurerm_resource_group.tscmonitoring_live.name
}

tscServices is an array of our service names.

At the moment I am working around this by breaking out my monitoring Terraform into separate workspaces and then importing the Azure Action Groups as data objects in the other workspaces (as sketched below). Using data objects for the PagerDuty services themselves still leads to the same timeouts, unfortunately; thankfully, in my setup we don't need to alter the PD services very often. But it would be extremely useful to be able to do so, to allow my team to rename things, update dependencies, etc. as they wish.
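
For illustration, a minimal sketch of that data-object pattern in the downstream workspace, assuming the monitoring workspace already created the action group; the names here are hypothetical and just mirror the naming scheme from the module above.

variable "action_group_resource_group_name" {
  type = string
}

# Look up the action group managed by the separate monitoring workspace
# instead of creating it here. The action group name is hypothetical.
data "azurerm_monitor_action_group" "tsc_pagerduty_high" {
  name                = "MyService - High PagerDuty Action Group"
  resource_group_name = var.action_group_resource_group_name
}

output "tsc_pagerduty_high_action_group_id" {
  value = data.azurerm_monitor_action_group.tsc_pagerduty_high.id
}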

@imjaroiswebdev
Contributor

Thank you so much for all your help; it was very valuable for figuring this out.

This is not an issue affecting GH runners exclusively, as @erose96 detected; the reason for this issue is that the TF provider's API client doesn't have a configured timeout for API calls.

A patch solving this will be released next Monday, Jan 22nd. Again, thanks for all your support and patience, folks.

@ingwarsw

ingwarsw commented Jan 24, 2024

@imjaroiswebdev I think we should reopen this issue...

I just tested v3.5.0 and still have the same issue.

- Installing pagerduty/pagerduty v3.5.0...
- Installed pagerduty/pagerduty v3.5.0 (signed by a HashiCorp partner, key ID 027C6DD1F0707B45)
...
│ Error: timeout while waiting for state to become 'success' (timeout: 2m0s)
│
│   with module.pd_service_backend_query.pagerduty_service_integration.c_pd_service_events_integration["backend_query"],
│   on modules/c-pd-service/main.tf line 52, in resource "pagerduty_service_integration" "c_pd_service_events_integration":
│   52: resource "pagerduty_service_integration" "c_pd_service_events_integration" {
│

A second run produced a second set of "random" errors:

│ Error: Get "https://api.pagerduty.com/users/PWXXXX": read tcp 10.1.0.14:58282->52.36.64.228:443: read: connection reset by peer
│
│   with pagerduty_tag_assignment.tag_user_team["memberof_cloud"],
│   on pd_users.tf line 34, in resource "pagerduty_tag_assignment" "tag_user_team":
│   34: resource "pagerduty_tag_assignment" "tag_user_team" {
│
╵
╷
│ Error: timeout while waiting for state to become 'success' (timeout: 2m0s)
│
│   with module.pd_service_infra_tools.pagerduty_slack_connection.c_pd_service_slack_integration["infra_tools"],
│   on modules/c-pd-service/main.tf line 60, in resource "pagerduty_slack_connection" "c_pd_service_slack_integration":
│   60: resource "pagerduty_slack_connection" "c_pd_service_slack_integration" {

@ioSpark

ioSpark commented Jan 24, 2024

I can confirm that it looks like this issue is still present (maybe worse, since the timeout has been lowered).

Error: timeout while waiting for state to become 'success' (timeout: 30s)

  with pagerduty_schedule.redacted,
  on redacted.tf line 354, in resource "pagerduty_schedule" "redacted":
 354: resource "pagerduty_schedule" "redacted" {

Is the issue here the lack of retries? I'm not too bothered about the timeout itself, but it doesn't appear that the provider/HTTP client attempts a retry, so it seems like any network hiccup during a run would be enough to fail the Terraform run. (It seems that some resources have retries baked in and others do not, though none at the network level.)

@imjaroiswebdev
Contributor

Yes @ingwarsw, reopening for further investigation and tests. Please stay tuned; I'll get back to you ASAP with a patch or an ETA for it.

@tgoodsell-tempus

I'm also seeing 5xx errors on the new version for some endpoints, such as:

│ Error: GET API call to https://app.pagerduty.com/integration-slack/workspaces/ID-REDACTED/connections/ID-REDACTED failed: 502 Bad Gateway
│ 
│   with module.network_monitor.module.pagerduty_slack_connection.pagerduty_slack_connection.this,
│   on .terraform/modules/network_monitor.pagerduty_slack_connection/modules/pagerduty/slack_connection/main.tf line 1, in resource "pagerduty_slack_connection" "this":
│    1: resource "pagerduty_slack_connection" "this" {
│ 
╵

@tgoodsell-tempus

> I'm also seeing 5xx errors on the new version for some endpoints, such as:
>
> │ Error: GET API call to https://app.pagerduty.com/integration-slack/workspaces/ID-REDACTED/connections/ID-REDACTED failed: 502 Bad Gateway
> │
> │   with module.network_monitor.module.pagerduty_slack_connection.pagerduty_slack_connection.this,
> │   on .terraform/modules/network_monitor.pagerduty_slack_connection/modules/pagerduty/slack_connection/main.tf line 1, in resource "pagerduty_slack_connection" "this":
> │    1: resource "pagerduty_slack_connection" "this" {
> │
> ╵

My issue is likely related to https://status.pagerduty.com/incident_details/PTUPX96

@imjaroiswebdev
Contributor

imjaroiswebdev commented Jan 25, 2024

Hey folks! I encourage you to upgrade to v3.5.2; hopefully, the issue is finally addressed. Again, I want to thank you all for your patience and for providing helpful error outputs that made it easier to figure out how to solve this.

@tgoodsell-tempus the issue you were experiencing was due to a partial outage with the Slack integration at that moment; however, as far as I know, it should be working as usual again.

Feel free to re-open this thread if any form of this error continues to appear.

@ingwarsw

@imjaroiswebdev I have run our pipeline 5 times and it didn't fail once... so we can consider it a big success 🥇

@imjaroiswebdev
Contributor

Great! Thank you so much for the feedback, @ingwarsw. Much appreciated 🎉

@tanguyantoine

Upgrading to 3.6.0 fixed the issue. Thank you!

@erose96
Author

erose96 commented Feb 1, 2024

Working perfectly using v3.7.0, thank you @imjaroiswebdev and everyone else who helped solve this issue!
