[ISSUE] Removing a resource now causes a cycle. Terraform plan fails #1648

Open
nihil0 opened this issue Sep 30, 2022 · 7 comments
Labels
bug Something isn't working

Comments

nihil0 commented Sep 30, 2022

This issue started after I updated my Terraform state with terraform state replace-provider databrickslabs/databricks databricks/databricks.

My Terraform code is structured as follows:

I have a module called workspace which handles all the steps involved in deploying an MWS workspace. I then invoke the module:

module "dev_ws" {
  source = "git::ssh://****/terraform-modules.git//workspace"

  providers = {
    databricks = databricks.mws
  }

  databricks_account_id = var.databricks_account_id
  account_password      = var.account_password
  workspace_function    = "dev"
  vpc_id                = "***"
  pvt_subnet_1_id       = "***"
  pvt_subnet_2_id       = "***"
  security_group_id     = "***"
}

I then create a new provider based on this workspace

provider "databricks" {
  alias    = "dev_ws"
  host     = module.dev_ws.workspace_url
  username = var.root_user_name
  password = var.account_password
}

I create a cluster

resource "databricks_cluster" "kadp_job_cluster" {
    ...
}
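For context, a minimal cluster definition of this kind might look as follows (a sketch only; every attribute value here is a placeholder, not the original configuration):

resource "databricks_cluster" "kadp_job_cluster" {
  provider                = databricks.dev_ws
  cluster_name            = "kadp-job-cluster"
  spark_version           = "11.3.x-scala2.12" // placeholder runtime version
  node_type_id            = "i3.xlarge"        // placeholder node type
  num_workers             = 1
  autotermination_minutes = 20
}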

I then run plan and apply.

Some time ago, I updated the provider namespace to databricks/databricks. Now, when I try to remove the cluster (e.g., by commenting out the code), I get the following error when running terraform plan:

Error: Cycle: module.dev_ws.aws_iam_role.cross_account_role, module.dev_ws.aws_iam_role_policy.this, module.dev_ws.databricks_mws_credentials.this, module.dev_ws.aws_s3_bucket.root_storage_bucket, module.dev_ws.databricks_mws_storage_configurations.this, databricks_cluster.kadp_job_cluster (destroy), databricks_permissions.kadp_cluster (destroy), module.dev_ws.databricks_mws_networks.this, module.dev_ws.databricks_mws_workspaces.this, module.dev_ws.output.workspace_url (expand), provider["registry.terraform.io/databricks/databricks"].dev_ws

However, I can't see any cycles when I run terraform graph -type=apply -draw-cycles | grep red.

I am not entirely sure whether this is due to the latest version of the provider or to the fact that the state file was modified with terraform state replace-provider databrickslabs/databricks databricks/databricks.

Output of terraform -version

Terraform v1.3.0
on linux_amd64
+ provider registry.terraform.io/databricks/databricks v1.3.1
+ provider registry.terraform.io/hashicorp/aws v4.32.0
+ provider registry.terraform.io/hashicorp/time v0.7.2

Your version of Terraform is out of date! The latest version
is 1.3.1. You can update by downloading from https://www.terraform.io/downloads.html

jschra commented Oct 3, 2022

I am experiencing the same issue.

Similar to the OP's situation, I run two modules: one that (1) provisions a workspace and one that (2) provisions users, clusters, and other resources within the workspace provisioned by module 1.

For me, the cycle error occurs when I try to remove a user from my deployment. I work with two user groups, developers and frontend. Based on variables I pass at runtime, the configuration adds and/or removes users from these groups.

This worked perfectly fine over the last few months. What I have discovered, however, is that whether it breaks depends on the Terraform version used for my deployments. With Terraform 1.2.3 (which I have been using for a while), my configurations work just fine; when I switch to Terraform 1.3.0 or higher, I run into the cycle error.

EDIT: I checked from which Terraform version this cycle error starts occurring, and it is indeed 1.3.0. With Terraform 1.2.9, everything still runs perfectly fine.

Below you can find the configurations I am running (pruned for relevance), along with example input:

databricks_users.tf

locals {
  workspace_directories = ["shared", "playground"]
  developers            = var.databricks_users.developers
  frontend              = var.databricks_users.frontend
}

########################################################################################
#                                                                                      #
#                                  Developers                                          #
#                                                                                      #
########################################################################################

# Create developers group
resource "databricks_group" "developers" {
  display_name               = "developers"
  allow_cluster_create       = true
  allow_instance_pool_create = false
  databricks_sql_access      = true
  workspace_access           = true
}

# Create developers
resource "databricks_user" "developer" {
  for_each = local.developers

  user_name    = each.key
  display_name = each.value

  # No access by default, only through groups
  allow_cluster_create       = false
  allow_instance_pool_create = false
  databricks_sql_access      = false
}

# Add users to developers group
resource "databricks_group_member" "developers" {
  depends_on = [databricks_user.developer]
  for_each   = databricks_user.developer

  group_id  = databricks_group.developers.id
  member_id = each.value.id
}

# Create shared directories and give permissions to developers group
resource "databricks_directory" "this" {
  for_each = toset(local.workspace_directories)

  path = "/${each.key}"
}

resource "databricks_permissions" "folder_usage" {
  for_each = databricks_directory.this

  directory_path = each.value.path
  depends_on     = [databricks_directory.this]

  access_control {
    group_name       = databricks_group.developers.display_name
    permission_level = "CAN_MANAGE"
  }
}

########################################################################################
#                                                                                      #
#                                   Frontend                                           #
#                                                                                      #
########################################################################################

# Create frontend group
resource "databricks_group" "frontend" {
  display_name               = "frontend"
  allow_cluster_create       = false
  allow_instance_pool_create = false
  databricks_sql_access      = false
  workspace_access           = false
}


# Create frontend users
resource "databricks_user" "frontend" {
  for_each = local.frontend

  user_name    = each.key
  display_name = each.value

  # No access by default, only through groups
  allow_cluster_create       = false
  allow_instance_pool_create = false
  databricks_sql_access      = false
}

# Add users to frontend group
resource "databricks_group_member" "frontend" {
  depends_on = [databricks_user.frontend]
  for_each   = databricks_user.frontend

  group_id  = databricks_group.frontend.id
  member_id = each.value.id
}

########################################################################################
#                                                                                      #
#                                 Group rights                                         #
#                                                                                      #
########################################################################################

# Provide rights to generate PATs
resource "databricks_permissions" "token_usage" {
  authorization = "tokens"

  access_control {
    group_name       = databricks_group.developers.display_name
    permission_level = "CAN_USE"
  }

  access_control {
    group_name       = databricks_group.frontend.display_name
    permission_level = "CAN_USE"
  }
}

# Manage cluster rights
resource "databricks_permissions" "cluster_access" {
  for_each = databricks_cluster.this

  cluster_id = each.value.cluster_id
  access_control {
    group_name       = databricks_group.developers.display_name
    permission_level = "CAN_MANAGE"
  }

  dynamic "access_control" {
    for_each = each.key == "sql_cluster" ? [true] : []

    content {
      group_name       = databricks_group.frontend.display_name
      permission_level = "CAN_RESTART"
    }
  }
}

variables.tfvars

databricks_users = {
  "developers" : { "jschra@deloitte.nl" : "Jorik Schra" },
  "frontend"   : {}
}

If I run this on 1.2.3 and remove my work email address from the var file, it works perfectly fine. However, from 1.3.0 onwards, Terraform throws a cycle error for whatever reason. The error looks as follows:

[screenshot of the Terraform cycle error]

Any help in this matter would be highly appreciated, as all my automated builds currently run 1.3.0 (part of the default runtime) and I'd like to have this functionality (removing users) working with that version as well.

Version outputs

1.2.3 setup

Terraform v1.2.3
on darwin_amd64
+ provider registry.terraform.io/databricks/databricks v1.4.0
+ provider registry.terraform.io/hashicorp/aws v4.33.0
+ provider registry.terraform.io/hashicorp/random v3.4.3
+ provider registry.terraform.io/hashicorp/template v2.2.0
+ provider registry.terraform.io/hashicorp/time v0.8.0
+ provider registry.terraform.io/microsoft/azuredevops v0.2.2

Your version of Terraform is out of date! The latest version
is 1.3.1. You can update by downloading from https://www.terraform.io/downloads.html

1.3.0 setup (cycle error)

Terraform v1.3.0
on darwin_amd64
+ provider registry.terraform.io/databricks/databricks v1.4.0
+ provider registry.terraform.io/hashicorp/aws v4.33.0
+ provider registry.terraform.io/hashicorp/random v3.4.3
+ provider registry.terraform.io/hashicorp/template v2.2.0
+ provider registry.terraform.io/hashicorp/time v0.8.0
+ provider registry.terraform.io/microsoft/azuredevops v0.2.2

nkvuong added the "bug" label on Oct 5, 2022
nkvuong (Contributor) commented Oct 7, 2022

Looking through the Terraform changelogs, the only relevant change is hashicorp/terraform#31917.

@nihil0 @jschra I assume you both have this pattern in your code:

provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = var.databricks_account_username
  password = var.databricks_account_password
}

// create the workspace using mws

provider "databricks" {
  host  = module.e2.workspace_url
  token = module.e2.token_value
}

// create other resources within the workspace

If that is the case, then the new cycle-detection logic in Terraform somehow links the resources under the two providers and ends up with a cycle.

A simple check is to replace the host & token for the provider with variables/constants and see if that resolves the issue.
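A minimal sketch of that check (var.workspace_url and var.workspace_token are hypothetical variables supplied out of band instead of being read from the module outputs):

provider "databricks" {
  alias = "dev_ws"
  // no reference back to the workspace module, so its resources are not linked to this provider
  host  = var.workspace_url   // e.g. "https://my-dev-workspace.cloud.databricks.com"
  token = var.workspace_token // a PAT provided outside of this configuration
}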

@michaelvonderbecke

We are having the same issue as above: we use a module that creates a workspace, pass its host URL and token outputs to the provider block for the module that configures the workspace, and get a cycle error any time we try to remove a user. With Terraform 1.3.0 and 1.3.1 we also got cycle errors when trying to make other changes, like removing an instance profile permission. With Terraform 1.3.2, some of the cycle errors (such as removing the instance profile permission) no longer occur, but removing a user still generates the error.

I did test the suggestion above and hard-coded the token and host URL in the provider block, and this removed the cycle error (although it also caused some strange update-in-place changes I wasn't expecting on IAM policy statement blocks generated through Databricks data sources).

That being said, I don't know that this is a solution for us, as it would require heavy rework of our entire CI/CD pipeline, and I'm not even sure how we would pass a sensitive value like a token between two separate deployments without storing it somewhere insecure in between. Right now the token output is marked sensitive, so it isn't recorded anywhere; recording it between the workspace deployment and the workspace configuration would already be significantly less secure, since we'd have to allow outputting it in the first place. Alternatively, we would need two separate deployments with human intervention in between, so someone could generate a PAT and store it as a GitHub secret (and in that case we would have to ensure the PAT is always valid before deploying the workspace-configuration Terraform).
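For reference, the kind of sensitive token output being passed between the two modules might look like this (a sketch only; the resource name "pat" and the output name are placeholders, not our actual code):

// inside the workspace module: create a PAT and expose it as a sensitive output
resource "databricks_token" "pat" {
  comment = "token for workspace configuration"
}

output "token_value" {
  value     = databricks_token.pat.token_value
  sensitive = true
}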


jschra commented Oct 7, 2022

@nkvuong, actually no, that is not exactly the same pattern I use.

Yes, I do use two separate modules to (1) provision a workspace and (2) provision resources within it, but I do not use a token generated in module 1 to access the workspace. For both providers, I use the email address and password of the Databricks account, only changing the URL between the steps (the accounts URL for module 1, the workspace URL for module 2).

But other than that, yes: I chain the two modules so everything runs in one deployment, passing the workspace URL from module 1 to module 2. Hence using hardcoded credentials would require substantial refactoring of our platform setup, similar to @michaelvonderbecke.
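For clarity, a minimal sketch of the pattern described above (module and variable names are illustrative, not our actual code):

// Account-level provider used by the workspace module
provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = var.databricks_account_username
  password = var.databricks_account_password
}

// Workspace-level provider: same credentials, host taken from module 1's output
provider "databricks" {
  alias    = "workspace"
  host     = module.workspace.workspace_url
  username = var.databricks_account_username
  password = var.databricks_account_password
}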

nihil0 (Author) commented Oct 9, 2022

@nkvuong: I don't use the exact same pattern, but a similar one. In my case, removing the reference to the account-level provider solved the cycle issue. @jschra: I can also confirm that it works fine when I use 1.2.9. I'll document both workarounds here for future reference:

Workaround 1: Use Terraform < 1.3.0
The problem originated from a change in Terraform 1.3.0 (hashicorp/terraform#31917, linked above). Pinning the version to 1.2.9 avoids the cycles.
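A minimal way to pin this in the configuration itself (the "~> 1.2.9" constraint here is just an example of staying below 1.3.0):

terraform {
  // stay on the 1.2.x series until the upstream cycle regression is fixed
  required_version = "~> 1.2.9"
}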

Workaround 2: Remove dependency between account-level and workspace-level providers

// Account level provider
provider "databricks" {
  alias    = "mws"
  host     = "https://accounts.cloud.databricks.com"
  username = "***"
  password = var.account_password
}

// Create workspace here with module. Module outputs workspace URL
module "dev_ws" {
  providers = {
      databricks = databricks.mws
  }
 ...
}

// Workspace level provider
provider "databricks" {
  alias    = "dev_ws"
  // BEFORE: host = module.dev_ws.workspace_url
  host     = "my-dev-workspace.cloud.databricks.com"
  username = "***"
  password = var.account_password
}

I am going with Workaround 1 as it is the least disruptive. However, these are both workarounds, and I would only consider this issue resolved when cycles no longer appear when I run terraform plan with Terraform 1.3 after removing resources. I'm not sure whether this will be fixed in Terraform itself or in the Databricks provider.

nkvuong (Contributor) commented Oct 10, 2022

@nihil0 - this is an issue with the Terraform binary itself, as similar errors have also been reported for other providers: hashicorp/terraform#31843

@jschwellnus92

I also had this same issue. I can confirm that updating from Terraform 1.3.0 to 1.3.6 resolved it for me.
