
second destroy (after successful first) returns 'The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created.' #32126

Closed
jaffel-lc opened this issue Oct 31, 2022 · 16 comments · Fixed by #32208

@jaffel-lc

Terraform Version

Terraform v1.3.3
on linux_amd64
+ provider registry.terraform.io/hashicorp/aws v4.34.0

Terraform Configuration Files

variable "external_acm" {
  type        = bool
  default     = false
}

variable "lb_certificate_arn" {
  type        = string
  default     = ""
  description = "If running HTTPS then a valid certificate arn must be provided."
}

variable "route53_zone_id" {
  type        = string
  default     = null
  description = "Route53 Zone ID for the domain served by this load balancer"
}

resource "aws_acm_certificate_validation" "certval" {
  count                   = (var.external_acm || var.lb_certificate_arn != "" || var.route53_zone_id == null) ? 0 : 1
  certificate_arn         = aws_acm_certificate.cert[0].arn
  validation_record_fqdns = [aws_route53_record.cert_validation[0].fqdn]

  lifecycle {
    create_before_destroy = true
  }
}


Debug Output

https://gist.github.com/jaffel-lc/78be590f8fddae0426adbffba844f374

Expected Behavior

The second destroy should complete without error (just like the first) and without destroying anything.

Actual Behavior

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

╷
│ Error: Invalid count argument
│
│   on ../certificate.tf line 30, in resource "aws_acm_certificate_validation" "certval":
│   30:   count                   = (var.external_acm || var.lb_certificate_arn != "" || var.route53_zone_id == null) ? 0 : 1
│
│ The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how
│ many instances will be created. To work around this, use the -target argument to
│ first apply only the resources that the count depends on.
╵

Steps to Reproduce

  1. terraform init
  2. terraform apply
  3. terraform destroy
  4. terraform destroy

Additional Context

No response

References

No response

@jaffel-lc jaffel-lc added bug new new issue not yet triaged labels Oct 31, 2022
@jaffel-lc
Author

While this appears to be a harmless error, the erroneously failing terraform destroy command's exit status is 1, which makes our build/test pipeline report failure from its last (global cleanup) step, despite having successfully destroyed all of the resources in the previous step.

@apparentlymart
Member

Hi @jaffel-lc! Thanks for reporting this.

I think what's going on here is that Terraform creates a normal plan as a precursor to creating a destroy plan, because a normal plan refreshes the previous run state and can therefore detect if something has already been destroyed and so doesn't need to be "re-destroyed". However, a normal plan also needs to expand all resource blocks that have repetition arguments, so it can run into this problem if the repetition depends on something that hasn't been created yet, in this case because you've literally just destroyed it.

If I'm right about the cause then the good news is that we changed the approach to that in #32051 for an unrelated reason, and so as of the next release Terraform will internally use a refresh-only plan for that initial refreshing step. Terraform didn't behave this way before just because the destroy feature has been around longer than the possibility of refresh-only plans, and so it was implemented in terms of the primitives that were available at the time.

That change was backported into the v1.3 branch and so will be included in the forthcoming v1.3.4 release. Once that's out (which should be in the next week or so), could you give that a try and see if the problem still occurs? Thanks!

@jaffel-lc
Author

That is great news.
I'll test again once I see that 1.3.4 is out.

@jaffel-lc jaffel-lc reopened this Nov 8, 2022
@jaffel-lc
Author

jaffel-lc commented Nov 8, 2022

Tested.

Now I am getting several "invalid index" errors.

Here is one example:

│ Error: Invalid index                                                                                                                                                                 
│                                                                                                                                                                                       
│   on .terraform/modules/t3_task1/main.tf line 5, in locals:
│    5:   log_group_name = var.log_group_name != "" ? var.log_group_name : aws_cloudwatch_log_group.task-log-group[0].name
│      ├────────────────
│      │ aws_cloudwatch_log_group.task-log-group is empty tuple
│
│ The given key does not identify an element in this collection value: the collection has no elements.                 

[updated error formatting for readability]

@apparentlymart
Member

Thanks @jaffel-lc!

It seems that Terraform is noticing that there are now no instances of aws_cloudwatch_log_group.task-log-group and so is rejecting this access of element zero as invalid.

I would agree that this doesn't seem right, but I'm also not really sure what Terraform ought to do instead here. It is true that there is not an index zero in the state, but there is still presumably an index zero declared in the configuration. It's weird to evaluate something in the configuration against the current state rather than the desired state, but in this context Terraform isn't actually building a desired state and so it can't refer to that.

With that said, it does seem like there's an opportunity to improve this case, but it's not clear to me exactly what change is valid to make here while still allowing the refresh phase prior to destroy to work. We'll need to think more about that before deciding how to proceed here.

In the meantime I think unfortunately the most viable strategy with today's Terraform is to somehow avoid running terraform destroy a second time. One possible answer to that would be to run terraform show -json and count how many resource instances are listed in the resulting JSON representation of the state; if you find none then you know that everything has already been destroyed and can skip running terraform destroy again.
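For example, a minimal shell sketch of that check (not from the thread; it assumes jq is installed, and the exact JSON layout may vary by Terraform version):

#!/usr/bin/env sh
# Count resource instances still tracked in state; only run the second
# destroy when something is actually left.
instances=$(terraform show -json | jq '[.. | objects | .resources? // empty | .[]] | length')

if [ "${instances:-0}" -eq 0 ]; then
  echo "State is already empty; skipping the second destroy."
else
  terraform destroy -auto-approve
fi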

@jaffel-lc
Author

With that said, it does seem like there's an opportunity to improve this case, but it's not clear to me exactly what change is valid to make here while still allowing the refresh phase prior to destroy to work. We'll need to think more about that before deciding how to proceed here.

One thought is that an error caused by outputs could have a different exit value than one caused by e.g. a syntax error, and then our automations could choose to treat that specific exit code as not-an-error.

Another is that Terraform could skip looking up an output value if its originating resource is about to be destroyed, or does not exist, since the output will be cleared anyway.

We have a cleanup task that runs a second destroy as a workaround for failures we have encountered when ASGs, ECS services, and/or ECS clusters do not complete their destroys before Terraform times out waiting for them. The second destroy usually clears the resources, or manages to resync the state file.

@jaffel-lc
Author

But it occurs to me now that I could wrap the resource[0].name reference in a try() call.
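For example, a minimal sketch of that pattern applied to the local from the error above (illustrative only; the empty-string fallback is an assumption, not code from the module):

locals {
  # try() falls back to "" when task-log-group has no instances,
  # e.g. during the pre-destroy refresh.
  log_group_name = var.log_group_name != "" ? var.log_group_name : try(aws_cloudwatch_log_group.task-log-group[0].name, "")
}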

@jbardin
Member

jbardin commented Nov 10, 2022

I'm going to take a look into this because it's very similar to some related destroy time errors. The problem generally arises while attempting to refresh the instances, which during destroy is primarily used to ensure providers have the most recent values for their configuration, and to remove any instances which may have already been deleted. In the meantime, a better workaround may be to use -refresh=false to skip this process, which is probably not very useful to begin with if the destroy operations are run back-to-back.
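For example, a back-to-back cleanup could skip the refresh on the second pass (illustrative sketch, not from the thread):

terraform destroy -auto-approve
terraform destroy -refresh=false -auto-approve   # skip the pre-destroy refresh that triggers the error

Note that without the refresh, instances that were already deleted outside of Terraform are not pruned from the plan first.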

@jbardin jbardin self-assigned this Nov 10, 2022
@MattJeanes

MattJeanes commented Nov 10, 2022

This has hit us too on 1.3.4; downgrading to 1.2.9 appears to work for now. Fingers crossed for a proper fix soon! 😄

Interestingly, I found that downgrading to 1.3.2 initially appeared to fix the issue for one case I was seeing, but it then failed in other cases, whereas 1.2.9 seems to work properly.

@sbocinec

Now I am getting several "invalid index" errors.
The given key does not identify an element in this collection value: the collection has no elements.

Spent half of my day today investigating our integration test suite failing with the following error after upgrading from 1.2.9 to 1.3.4:

Error: Invalid index

  on main.tf line 49, in locals:
  49:     ? module.network[0].vpc_id
    ├────────────────
    │ module.network is empty tuple

The given key does not identify an element in this collection value: the collection has no elements.

I was trying to hunt down the issue, and now I see the culprit is indeed in the destroy fixes in 1.3.x causing these nasty bugs. I hope this is going to be fixed soon 🤞

@RichardGTsl

Hi,
I've just run into an "empty tuple" problem as well. In my case, it prevents a second destroy from cleaning up if the first destroy left some resources behind.

In my case the resource that triggers the fault is https://github.com/terraform-aws-modules/terraform-aws-sqs/blob/cf30bb3498d39969590e4d47bbce56b02f1dc9a5/main.tf#L30. This is the line that fails:
arn = aws_sqs_queue.this[0].arn
I couldn't find a way to work around this[0] being undefined.

This is only broken in 1.3.4 which suggests that there's a fundamental change in behavior between 1.3.3 and 1.3.4. The other versions I tested were 1.3.3, 1.3.2, 1.3.0, 1.2.9 and 1.2.1.

cheers,
Richard

@sbocinec

In my case, even the first destroy fails if the root module uses a child module that uses a data source and terraform destroy is executed with -refresh=false.

In our use case we always destroy without refreshing to avoid data source failures, and with 1.3.4 it fails consistently, even on the first destroy, with this error.

@jbardin
Member

jbardin commented Nov 16, 2022

@sbocinec, Thanks, that would be a different issue, and is unlikely to be affected by any fix to this one. The current problem happens during the pre-destroy refresh which is skipped with -refresh=false. If you have an example you could post in a new issue, it would be helpful.

@timblaktu

timblaktu commented Nov 17, 2022

Happy to join your ranks, fellows. Like @sbocinec I've been troubleshooting this issue for days now. I was first focusing on the changes I had made in forks of the modules I'm using from my root module (eks_blueprints, terraform-aws-eks, eks_blueprints_kubernetes_addons). After I fixed a few unrelated bugs, I started seeing the 'the collection has no elements' error pattern at destroy time, which I fixed by applying this pattern to each case that popped up, whack-a-mole style:

  # Harden all references to variable-length list resources using a splat expression
  # and coalescelist() to prevent "the collection has no elements" errors.
  # Before: node_security_group_id = local.create_node_sg ? aws_security_group.node[0].id : var.node_security_group_id
  node_security_group_id = local.create_node_sg ? coalescelist(aws_security_group.node[*].id, [""])[0] : var.node_security_group_id
  # The splat expression yields an empty list when there are no instances, and
  # coalescelist(any_list, [""]) always returns a list with at least one element,
  # so indexing with [0] can no longer fail.

After "hardening" all the direct references to the first element of variable-length resource lists in my modules, I then started seeing the same pattern in the outer modules I hadn't made any changes to.

Finally convinced "It's you, not me" I was able to find this issue. Happy to be here. :-)

What can I do to help?

I've inferred from the comments of @apparentlymart and @jbardin that this "variable-length list hardening" should be unnecessary, and that this misbehavior is caused by the new "destroy-time refresh-only plan" feature in 1.3.4.

For now I will try @jbardin's recommendation of passing -refresh=false arg to all terraform destroy calls.

Nota Bene

Terraform Destroy Sequence

In my case, and for many other people I've seen hitting this issue who are using the eks_blueprints module, there is not a single terraform destroy call. A sequence of targeted terraform destroy calls, followed by a final non-targeted terraform destroy call, is required to "stage" the top-down destruction of the layers of infrastructure (and applications) that Terraform is managing. This is recommended by the eks_blueprints project, and I can endorse the approach when managing k8s clusters with Terraform: there are numerous race conditions that can occur if you do not stage things like this. Tools like terragrunt, and scripts orchestrated by GNU Make, make this easier.

Same Misbehavior Reported in EKS Blueprints

This failure mode is being tracked here for eks_blueprints.

Same Misbehavior Reported in terraform-aws-eks

Here is a closed three-year-old issue in terraform-aws-eks which I thought the core devs/contributors @apparentlymart and @jbardin may find interesting, since it's the same misbehavior. The changeset that resolved that issue is where I got the idea for my "variable-length list hardening" patch above.

Module Forks/Branches Used to Test my Patches

My root module currently uses this fork/branch of eks_blueprints, which was originally created to add support for the Crossplane helm and terraform providers. I had validated all its functionality and was about to submit a PR when I started noticing "collection has no elements" errors. (I upgraded to 1.3.4 somewhere in that process.) My eks_blueprints fork crossplane-helm-provider branch uses my terraform-aws-eks fork 568-redux branch, which also contains these hardening patches.

@timblaktu

timblaktu commented Nov 17, 2022

@jbardin in my first test, using -refresh=false for every terraform destroy call prevents the "collection has no elements" errors I was seeing just prior with the same code without this arg. :-)

However :-( there remains a problem with not refreshing during destroy: after successfully completing my destroy sequence, data sources are not destroyed and remain in the state. I've tried to work around this by appending YADS (Yet Another Destroy Stage) to the end of my sequence to run a normal, non-targeted, with-refresh, terraform destroy. However, these data sources were still not destroyed:

main  | 2022-11-17T15:54:05.033699600Z Terraform state for workspace pr-tim-usw1 now contains:

    data.aws_availability_zones.available
    data.aws_caller_identity.current
    data.aws_iam_policy_document.managed_ng_assume_role_policy
    data.aws_region.current
    data.aws_secretsmanager_secret_version.cluster_unsealed["cluster-unsealed-argocd2022-gitlab-repo-cred"]
    data.aws_secretsmanager_secrets.cluster_unsealed
    data.kubectl_path_documents.sealed_secrets
    module.eks_blueprints.data.aws_caller_identity.current
    module.eks_blueprints.data.aws_iam_policy_document.eks_key
    module.eks_blueprints.data.aws_iam_session_context.current
    module.eks_blueprints.data.aws_partition.current
    module.eks_blueprints.data.aws_region.current
    module.eks_blueprints_kubernetes_addons.data.aws_caller_identity.current
    module.eks_blueprints_kubernetes_addons.data.aws_partition.current
    module.eks_blueprints_kubernetes_addons.data.aws_region.current
    module.eks_blueprints.module.aws_eks.data.aws_caller_identity.current
    module.eks_blueprints.module.aws_eks.data.aws_default_tags.current
    module.eks_blueprints.module.aws_eks.data.aws_iam_policy_document.assume_role_policy[0]
    module.eks_blueprints.module.aws_eks.data.aws_partition.current
    module.eks_blueprints_kubernetes_addons.module.crossplane[0].data.aws_iam_policy_document.s3_policy
    module.eks_blueprints.module.aws_eks.module.kms.data.aws_caller_identity.current
    module.eks_blueprints.module.aws_eks.module.kms.data.aws_partition.current

The complete destroy sequence I'm using is:

echo "Commencing destroy sequence, using -refresh=false to work around https://github.com/hashicorp/terraform/issues/32126..."
echo "Destroying eks_blueprints_kubernetes_addons..."
time terraform destroy -refresh=false -target="module.eks_blueprints_kubernetes_addons" -input=false -auto-approve -compact-warnings ${TF_EXTRA_ARGS:+"${TF_EXTRA_ARGS}"}
echo "Destroying eks_blueprints..."
time terraform destroy -refresh=false -target="module.eks_blueprints" -input=false -auto-approve -compact-warnings ${TF_EXTRA_ARGS:+"${TF_EXTRA_ARGS}"}
echo "Clean up remaining resources in terraform state with a non-targeted destroy..."
echo "    ${RED}Note: because we're using -refresh=false to work around destroy-time misbehaviors, data sources will not get cleaned up here${RESET}"
time terraform destroy -refresh=false -input=false -auto-approve -compact-warnings ${TF_EXTRA_ARGS:+"${TF_EXTRA_ARGS}"}
echo "Finally, clean up any remaining objects in terraform state (should be only data sources at this point) with a non-targeted destroy WITH -refresh=true..."
time terraform destroy -input=false -auto-approve -compact-warnings ${TF_EXTRA_ARGS:+"${TF_EXTRA_ARGS}"}

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 18, 2022