
Defer to lazy client initialization to prevent dialing localhost #1078

Open
wants to merge 1 commit into main

Conversation

@pfault commented Nov 29, 2020

Description

Defers initialization of the actual client until first use, to prevent dialing http://localhost when credentials are only configured late.
Fixes cases where parameters such as config_path that are generated during the apply would otherwise lead to an empty default config being used.
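
As a minimal illustration of the affected pattern (hypothetical resource and module names), where config_path only becomes known during the same apply:

# Kubeconfig written by another resource during the same apply,
# so the file does not exist yet when the provider is first configured.
resource "local_file" "kubeconfig" {
  filename = "${path.module}/kubeconfig.yaml"
  content  = module.cluster.kubeconfig   # hypothetical module output generated during apply
}

provider "kubernetes" {
  config_path = local_file.kubeconfig.filename
}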

Acceptance tests

  • Have you added an acceptance test for the functionality being added?
  • Have you run the acceptance tests on this branch?

Output from acceptance testing:

$ make testacc TESTARGS='-run=TestProvider'

==> Checking that code complies with gofmt requirements...
rm -rf /home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers/.terraform /home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers/.terraform.lock.hcl || true
mkdir /home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers/.terraform
mkdir -p /tmp/.terraform.d/localhost/test/kubernetes/9.9.9/linux_amd64 || true
go clean -cache
go build -o /tmp/.terraform.d/localhost/test/kubernetes/9.9.9/linux_amd64/terraform-provider-kubernetes_9.9.9_linux_amd64
cd /home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers && TF_CLI_CONFIG_FILE=/home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers/.terraformrc TF_PLUGIN_CACHE_DIR=/home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers/.terraform terraform init -upgrade

Initializing the backend...

Initializing provider plugins...
- Finding localhost/test/kubernetes versions matching "9.9.9"...
- Installing localhost/test/kubernetes v9.9.9...
- Installed localhost/test/kubernetes v9.9.9 (unauthenticated)

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
TF_CLI_CONFIG_FILE=/home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers/.terraformrc TF_PLUGIN_CACHE_DIR=/home/pfault/git/terraform-provider-kubernetes/kubernetes/test-infra/external-providers/.terraform TF_ACC=1 go test "/home/pfault/git/terraform-provider-kubernetes/kubernetes" -v -run=TestProvider -timeout 120m
=== RUN   TestProvider
--- PASS: TestProvider (0.03s)
=== RUN   TestProvider_impl
--- PASS: TestProvider_impl (0.00s)
=== RUN   TestProvider_configure
2020/11/29 16:10:27 [DEBUG] Trying to load configuration from file
2020/11/29 16:10:27 [DEBUG] Configuration file is: test-fixtures/kube-config.yaml
2020/11/29 16:10:27 [DEBUG] Using custom current context: "gcp"
2020/11/29 16:10:27 [DEBUG] Using overidden context: api.Context{LocationOfOrigin:"", Cluster:"", AuthInfo:"", Namespace:"", Extensions:map[string]runtime.Object(nil)}
2020/11/29 16:10:27 [INFO] Successfully initialized config
--- PASS: TestProvider_configure (0.00s)
PASS
ok  	github.com/hashicorp/terraform-provider-kubernetes/kubernetes	0.067s

Release Note

Release note for CHANGELOG:

NONE

References

#758
#1028

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

@ghost added the size/M label on Nov 29, 2020
@hashicorp-cla commented Nov 29, 2020

CLA assistant check
All committers have signed the CLA.

@alexsomesan (Member)

Thanks for taking a shot at this issue! The solution looks good and seems to follow the same pattern we applied to the Kubernetes alpha provider.

I'll be testing this a bit to see if I can come up with an automated test scenario.

@alexsomesan (Member)

This looks good.

We're going to hold off merging for a bit. Some wider ranging credentials changes are in flight here: #1052. Once those merge, we can go ahead and merge this too. I think it's easier in that order since this is smaller and easier to rebase if it runs into conflicts.

@mroche89

Any movement on this? #1052 was merged a month ago

@pduchnovsky commented Mar 3, 2021

Is this going to be merged soon? We're waiting on this fix in order to actually use Terraform for Kubernetes deployments with GPUs on GKE (the older provider version doesn't work with the "nvidia.com/gpu" taint).

@alexsomesan

Base automatically changed from master to main March 23, 2021 15:53
@petewilcock

@alexsomesan +1 request to see this merged after @pfault can resolve merge conflicts.

This bug appears to surface in any workflow that leads to cluster recreation, so it is particularly painful for anyone iterating on development in this way (per: #1028 (comment)).

@Karthik-13

Hey guys, we are facing this problem as well.
Any ETA on when this will be merged into the provider?

@pduchnovsky

Please, for the love of terraform, merge this already.

@jw-maynard

@pfault Do you have time to fix the merge conflicts?

@ermik commented Aug 27, 2021

Happy to resolve conflicts for this, just let me know how to collaborate. As a simple route, I can create a new PR by forking the PR branch. cc @flavioab

@pfault (Author) commented Aug 31, 2021

Rebased onto main, fixing the merge conflict.

@jw-maynard

@alexsomesan @flavioab can we get this merged?

@RichiCoder1

Not to bump, but bump. Currently stuck on v1 due to the way this blocks migration and iteration.

@DutchEllie

What the other person above me said. I don't wanna bump, but uh... Bump.

This thing needs to get merged.

@ermik commented Oct 5, 2021

Pinging @alexsomesan and @jrhouston 👋🏻

@dambros commented Nov 4, 2021

Please merge this. It has been driving me nuts on a daily basis

@raven-wing

I'm pretty sure there is a reason why it is still not merged, but...
Could anyone share what that reason is?

@Errox commented Nov 25, 2021

When can we expect the merge?

@alexsomesan (Member)

I think the effect of this change is focused around the removal of this part:

	if cfg == nil {
		// This is a TEMPORARY measure to work around https://github.com/hashicorp/terraform/issues/24055
		// IMPORTANT: this will NOT enable a workaround of issue: https://github.com/hashicorp/terraform/issues/4149
		// IMPORTANT: if the supplied configuration is incomplete or invalid
		///IMPORTANT: provider operations will fail or attempt to connect to localhost endpoints
		cfg = &restclient.Config{}
	}

The rest of the code shuffle looks like it wouldn't make a difference in behaviour.

However, as those comments note, that part was added as a workaround. It's likely that removing it will cause side effects which are hard to predict now.

To move this conversation forward, we would need some solid reproduction configurations. If anyone is willing to contribute some that would demonstrate how this change makes an impact, we can move from there.

@raven-wing

I'm afraid it would not be easy to provide a solid reproduction configuration, since this error only happens "sometimes"...

Nonetheless, our configuration is copied from the HashiCorp example:

provider "kubernetes" {
host = data.azurerm_kubernetes_cluster.default.kube_config.0.host
client_certificate = base64decode(data.azurerm_kubernetes_cluster.default.kube_config.0.client_certificate)
client_key = base64decode(data.azurerm_kubernetes_cluster.default.kube_config.0.client_key)
cluster_ca_certificate = base64decode(data.azurerm_kubernetes_cluster.default.kube_config.0.cluster_ca_certificate)
}

The point is that the cluster is applied from a different root directory. We use AKS.

What "helps" to trigger this error is running the azurerm_kubernetes_cluster apply and the cluster setup from different machines.

@dskecse commented Dec 24, 2021

Hey @alexsomesan @flavioab, any ETA here? Could this be merged any time soon, please?

@ebauman commented Jan 18, 2022

Could we please get this merged sometime soon?

@yellowmegaman commented Feb 1, 2022

@apparentlymart @radeksimko sorry for bothering you, but you guys are awesome, and this renders the kubernetes provider unusable for multi-cluster setups when you change some cluster property (e.g. disk_size for a node).

Maybe you can somehow influence this ? 🥲

@apparentlymart (Member)

I don't have sufficient knowledge about Kubernetes or this provider to weigh in on when and how this could be merged, I'm afraid.

Seeing this does make me think a bit about the solution though: Terraform Core should be sending an unknown value to the provider in the situation where the configuration is derived from unknown values from elsewhere, which can in principle allow a provider to differentiate between totally unset vs. not yet known.

However, I believe the old SDK hides that detail when you access the configuration through its API, making it seem like the unknown values are not set at all. If backward compatibility is a concern for this change (I'm not sure) then it might be worth considering how the SDK could potentially expose information about whether particular arguments are known and then have this special new behavior occur in the unknown case, leaving the unset case behaving however it currently does. (Sounds like it currently defaults to localhost in that case.)

@ScratZa commented Feb 2, 2022

bump

@yellowmegaman

(quoting @apparentlymart's reply above)

Thank you for your input! Too bad the root cause seems to be a lot deeper.

For anyone who stumbles upon this issue, please note that this code alone can break the kubernetes provider right now, at version 2.7.1:

provider "google" {
  project = "yourproject"
  region  = "us-central1-a"
}

data "google_client_config" "default" {}

resource "google_container_cluster" "hello" {
  name               = "hello"
  location           = "us-central1-a"
  initial_node_count = 1
}

provider "kubernetes" {
  host                   = "https://${google_container_cluster.hello.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(google_container_cluster.hello.master_auth[0].cluster_ca_certificate)
}

resource "kubernetes_namespace" "traefik" {
  metadata {
    name = "traefik"
  }
}

Just change the initial_node_count parameter and it fails with the message about localhost.
(Same thing happens with a data source in the middle.)

At this point I don't think the kubernetes provider can be used without some extraordinary operational detours (e.g. separate management of the cluster and its k8s resources, separate states, and a remote state data source).

@orgads commented Feb 2, 2022

In my case, if I run (adapted to your example) terraform apply -target google_container_cluster.hello, and then terraform apply, it works as expected.

Can you check if it works for you?

@yellowmegaman

-target google_container_cluster.hello

Yes, it does work.
I was also considering this option, but in that case I'll need an external tf state checker mechanism and an additional pipeline step for container cluster changes.

@orgads commented Feb 2, 2022

Well, it's clearly a workaround, not a solution.

@apparentlymart (Member)

FWIW, in other cases where folks have employed that workaround it's typically only been needed for initial creation and not for subsequent updates, because the argument in question stays stable after the object has been created. Again, I don't know this particular situation well enough to say for certain, but I'd guess that the hostname of a cluster would stay constant for the life of that cluster and so the extra run with -target would only be needed on the first iteration to get that cluster created, and then after that you can use a normal workflow for ongoing maintenance.

@yellowmegaman commented Feb 3, 2022

(quoting @apparentlymart above)

Well, actually, in the example above the first run isn't a problem at all; on the contrary, the subsequent runs are the problem if you change any cluster parameter that triggers its recreation.

A depends_on parameter for providers (hashicorp/terraform#2430) could also help here.

If we create outputs for the example configuration above, exposing the cluster endpoint and CA cert, this is what I get after the initial creation and after the update with -target specified.
Outputs:

output "hello_endpoint" {
  value = google_container_cluster.hello.endpoint
}
output "hello_ca" {
  value = google_container_cluster.hello.master_auth[0].cluster_ca_certificate
}

First run:

hello_ca = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVMVENDQXBXZ0F3SUJBZ0lSQUpIWTFUOVZVbVA1TE5VN3ZBckxtbDB3RFFZSktvWklodmNOQVFFTEJRQXcKTHpFdE1Dc0dBMVVFQXhNa09ESmtPVEUwWVdRdE56RmxOeTAwWW1FeUxUaGpabUV0TW1RMVl6Z3lOelExTURBMQpNQ0FYRFRJeU1ESXdNekEzTVRFd01sb1lEekl3TlRJd01USTNNRGd4TVRBeVdqQXZNUzB3S3dZRFZRUURFeVE0Ck1tUTVNVFJoWkMwM01XVTNMVFJpWVRJdE9HTm1ZUzB5WkRWak9ESTNORFV3TURVd2dnR2lNQTBHQ1NxR1NJYjMKRFFFQkFRVUFBNElCandBd2dnR0tBb0lCZ1FEYzBlZW1QMms2YVM5Mi9kLzh2dkhTLzZlVUR4T0djRjRoMUJiVApnb09FSjZhWjlLVFB4Zkdldms5eFBFcEVxMGhIMlVtYVVSKzk2ZG9UTWgzMzZGSTBmeUYrWk5ka2lGcTNjazJwCkhlTEtsUGlvcmFPZWF4MEYya204d1ZmQTJuTE5JZzk0SGF2d3RZRHQ5L0hEVnFVNjNKWmpPVzZPYVBQNEppU3gKTGNUbmo4TzZVZnExTnJTTU80bU5mV1M0aCs5RE1ZSXNDZlM0TXczS0lKK3E4R2hGVGpjVHZiTEdyU01NKzRiSQoybm5oMGxyMFk2Vzh4YS9KeU55WEljNWtoVEYwb3pqUzByUUZ3WFhqeXYxRmQxZWRaOUlRR243N0F2MHJqWVQ2CmZzTWZBSTlFOE5NanRJVjByN1NpYVI3eVhtaE9GUEhTcmNOQ0EzdUJCMG5LdjRpN0hPUXo4WElWQUlOK0oyVUYKYys2dW5yZmxtQlhHc1dWZko3dUNFWW41VmNzN2t3Z21JRnRMa0dZS2syclBzN2x2MFhNZ3d6MXFrdzVLUEIweApHTlprTll6dzA4dS9oYkw4d0loellmclpFZ2RHQS9zM2J1N0ZoWlh3OHpBc3JCSGd0R0UxTlZHaWZ0cEQzQ3JPClBxTzBkUXVQcFJIUlkveUNuMTl5eXQxMElaVUNBd0VBQWFOQ01FQXdEZ1lEVlIwUEFRSC9CQVFEQWdJRU1BOEcKQTFVZEV3RUIvd1FGTUFNQkFmOHdIUVlEVlIwT0JCWUVGTldJL21rQ3djbS9NaVIzVkZuQmxtZU5aM1pDTUEwRwpDU3FHU0liM0RRRUJDd1VBQTRJQmdRQ1Fqc2pnNUdtN1hsOXg4RXR6bmpWRlFYaTRCY1FaQnQyTXowdzV4bWhlCktLaDZaNjRJQ2UxSEt5bFlLbXVrbDJhMVRZNEdqd2xsanNLT2lqUXpnWFZVTng5bjF2UFB1WktFQ1p3OVY1L3kKRTBQWHEzaE5paTd1QVJXbkR3TUlMaFl1TTVZaGtRa0VRUlNJdFdrc0p0Mko4ajNsNHRoQnRzUzFENVlySjczVwpmdHMxU241czlSWngvQ1JDeUhFaWZMUTZKcDZ1czBtTmY5SHFUWXkrZngyY1E5Z0RXMEFod2FZbStCVFZzckdzCkY5YWNaZEFOWllxTDdPL3QvWFJmR1hQWERUSFFsV2NMdVNCc2FOdjdQeEtKckExcjUzVEZpNDViZVZud0pIa2MKQXhXNHlseCtUNzk1TFBKell4NUY5T3EyTjBialZ2R01WWHViSE8wSXFkZkZSU1kxMStvWDQyNFVzN2ZzU3hRNQpUR2ZFSXE2eDJycjVyTmNac2VkeHMwOC8zbHcrRnZnS3ZUU3MzbXVTbDVWZlN0ZW4waHl1M0hkWU5TVm94S1ltCk54NWpEMU85YlJ0MktHN1ZzNXFjYm1aL2YyWEhtTk5UVC9sTmlBUzY0R1U0MFRIaWJmQ3NQQ2RDM0M2OVRXNEUKUUVZa3g1SFptS2lGRy9CYS8vaFVxaFE9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
hello_endpoint = "34.67.118.138"

Changed initial_node_count to 2 and specified -target=google_container_cluster.hello:

hello_ca = "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUVMVENDQXBXZ0F3SUJBZ0lSQUo0cmEwZTEwa2s3dEF3SEpIN01VMzB3RFFZSktvWklodmNOQVFFTEJRQXcKTHpFdE1Dc0dBMVVFQXhNa01XUmpZbUk1WmpndFptTTJOUzAwTWpNMkxUZ3hOR1F0TURVNE9XUXhPR1ZoTmpaaQpNQ0FYRFRJeU1ESXdNekEzTVRnd01sb1lEekl3TlRJd01USTNNRGd4T0RBeVdqQXZNUzB3S3dZRFZRUURFeVF4ClpHTmlZamxtT0MxbVl6WTFMVFF5TXpZdE9ERTBaQzB3TlRnNVpERTRaV0UyTm1Jd2dnR2lNQTBHQ1NxR1NJYjMKRFFFQkFRVUFBNElCandBd2dnR0tBb0lCZ1FETFREUmhIdW1VUkhIU3hQUlFjcVZBek41NElSeGMzNUpEV1VOMwpna1ExS1cweXBZZVhnc1R6bmxsVjBvQnNFcXlkTGRnVjJ2RWZha0FCQTU4TWMvZ2VpSWpMcVIrVHppM2ZMWFFjCmZBanV4dTMrcGdYUHNNR2VIdlNsWmppbGdTa3VFMEN0VC9Ca3lnUFhwTjlJL2l4OEx0UVZ4Q0sxOHgwc1Vxc2EKMzQ3Rmc0dUxla0VJV09PUkthOWVlZmpkZXhrV3Fhdi9zZjJ6d2pMb0VnSUZXUEd0QkhJRmdTOHhGMVlhcGdJcQpsWElLWkNzZDc3RWhkaXNvREVBUEY5VmRPRVY2d3pjZHFxMmNtOUhYc29oWlJuYzYvVW1PTEJuSVRDRGlJQmZVCnNwYkNKK2Z5YkhNU3pLbmJualppLytnSjEreXIvQnNGaTdyRFJzb2UwTThQOUJmeXNRVHJTMjJVWUQ1WUlWMUgKbEFuT1lxanoyOVp6VHNvdzlxWG5NclNOa2J4MzA2RDlPdStTOFRJRzdQVC9PQWFWekJLdnJuVnJFYkJ5OEgvagprK0JUd3BBbVh6d1ZoRUg3amlSRU9vUURrcmFOQmxNbDJrYmJjVGhXY2g5elJlNnJKVFh4N2wzRVhvZUZ5Wk0wCjh1R3d2VEdHVkZ2VUFFTDdmbGVXMzBRTk5tOENBd0VBQWFOQ01FQXdEZ1lEVlIwUEFRSC9CQVFEQWdJRU1BOEcKQTFVZEV3RUIvd1FGTUFNQkFmOHdIUVlEVlIwT0JCWUVGQ3pIYW8wSjZEbHh4OHFuQWxoRThqUDVRKzY0TUEwRwpDU3FHU0liM0RRRUJDd1VBQTRJQmdRQkZMY1J4UThJeTI5dml6TEZoaUp5SzhseVovUU5sR0d0U2VwbStSczVZClI3dUc2RDdYbk1yUXpxRmlwbldQY3JpUGRjeWdSV3A1YUlPdWFlL1k0QmlobnJZc1U2OUNkWXo4dCs1a2owVGIKSHpJb0F2WWQrc01RSjVRSFZOLythS042dGZ4U0dkNzFzcVZOY2JONms2cWZpbGt4UHZuMXlJbnlmb0FHczh5dApudnJzdVdFUXBQUkFQVDJ3LzkvZVo4by9BbDRLL2JDT0ppUGdUWXJnbFh1NGZVWTJabmxGSm9Dd2JKak95dHBLCmh2WmVuOXdNU3U2QS9va1hKZ1JCVWhpMFEvWThNU3B2M05uVGtLTjVOWnRkdjgzem1vSXNub1pKTHdxcG95OHkKcXlKYTBmYk9YbzdEa3dja2Y4Y3BpWXE3c3hVY2JvNE0wMmxSVHExM3VvTFkxL0x5bFoxSSs1VlBkN2xaKzdpbQpoQzUyeDVHMzl3NEU0RG1BRTV5dEE0U0UvTGFzSWoxZXh5NDRkQmtEeXdjSWpySVdURUZNQlV4RkZEK1h1WUJuCklscFpkMkVUWkJnNFRIME5JbVQxWmM1MUd2azJDYWRRSTRwd243WmVrOEhONW5lRHlnYm0xd3hlUTczV0l0KzAKbW5NMzBQZUJYTGFsNGtqM1ZzYmdwdzA9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K"
hello_endpoint = "35.188.61.80"

So the kubernetes cluster was re-created.
At initial creation the provider had no trouble waiting for the new cluster to appear and using its endpoint and CA to authenticate, but at re-creation time it doesn't wait and defaults to localhost instead. @pfault do you think the PR in question can help with exactly this?

I was also hoping to circumvent this by using exec authentication for the kubernetes provider, but I guess this won't help, since even if it worked (#1176) it would only get the old endpoint and CA at cluster recreation time.

@jw-maynard

@apparentlymart As @yellowmegaman points out, creation is fine; it's anything that reconfigures the cluster that seems to cause the problem. I use the word reconfigures instead of recreates because in my case the cluster is not actually recreated.

We use EKS for our k8s clusters, and if I try to update the cluster version this problem appears. I'm not sure what black magic AWS pulls behind the curtains when you upgrade an EKS cluster, but it seems to trigger this dialing of localhost instead of using the cluster credentials.

As a workaround I go into the EKS console in AWS and upgrade my cluster from there, then I go to terraform and change the EKS version number in our terraform code after the fact. If I do this I circumvent the problem in my use case. It's obviously not a scalable or convenient solution, but we have few enough EKS clusters that we can manage.

@yellowmegaman

Just noticed that the terraform plan -target=xxx approach also has some enormous drawbacks. If you filter out all kubernetes-provider-related resources, terraform plan won't pick up new ones.

Even if we split the CI/CD pipeline into "everything" vs. kubernetes-based lists of targets, it won't be enough to create new infrastructure. It won't be IaC anymore, because of the number of manual actions required.

Almost the same goes for the terraform_remote_state data source: you'll need separate repos to push (in the correct order) the infra and k8s workload configurations to avoid conflicts. This way, adding a new cluster with a terraform-managed k8s workload also becomes a great hassle.
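
For reference, a minimal sketch of what that remote-state pattern would look like in the cluster-consuming configuration (assuming a GCS backend; the bucket and output names are hypothetical):

data "google_client_config" "default" {}

data "terraform_remote_state" "cluster" {
  backend = "gcs"            # assumed backend; any backend works
  config = {
    bucket = "my-tf-state"   # hypothetical bucket
    prefix = "cluster"
  }
}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.cluster.outputs.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(data.terraform_remote_state.cluster.outputs.cluster_ca_certificate)
}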

At this point null_resources don't seem like a last resort anymore.

Running out of ideas on how to use this provider with any resources subject to change.

@alexsomesan @radeksimko @dak1n1 @jrhouston do you have bandwidth to review/merge this PR please?

@DanielHabenicht commented Feb 4, 2022

Running out of ideas on how to use this provider with any resources subject to change.

We have swallowed the bitter pill of splitting the configuration into two deployments and then using data sources to get the modified clusters, effectively decoupling them. (One terraform run for setting/updating the clusters and one for configuring the clusters with this provider)
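
A minimal sketch of the provider setup in the second deployment, looking the cluster up via a data source instead of a resource (GKE as in the earlier example; the cluster name and location are hypothetical):

data "google_client_config" "default" {}

data "google_container_cluster" "hello" {
  name     = "hello"           # hypothetical cluster created by the first deployment
  location = "us-central1-a"
}

provider "kubernetes" {
  host                   = "https://${data.google_container_cluster.hello.endpoint}"
  token                  = data.google_client_config.default.access_token
  cluster_ca_certificate = base64decode(data.google_container_cluster.hello.master_auth[0].cluster_ca_certificate)
}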

@ajostergaard

@DanielHabenicht we use the strategy described here and at the moment it works well and we have no reason to split as you have done.

However, we are worried that we are missing something that will trap us later. I assume you tried the pattern above so, if you don't mind, what was it exactly that you couldn't achieve with that pattern that caused you to split into two deployments? Thanks in advance!

@DanielHabenicht

(quoting @ajostergaard above)

It was just a hassle to manually set all the targets (as Kubernetes is not the only thing we deploy), so we opted for a clean split.

@FalconerTC

Any update here?

@@ -226,7 +226,7 @@ type KubeClientsets interface {
 }
 
 type kubeClientsets struct {
-	config *restclient.Config
+	config *clientcmd.ClientConfig

This should not be a pointer because ClientConfig is an interface.

@mrwonko commented Mar 25, 2022

I've been debugging this and noticed that for me, the provider logs

[WARN] Invalid provider configuration was supplied. Provider operations likely to fail: invalid configuration: more than one authentication method found for ; found [token basicAuth], only one is allowed

You can only see this with TF_LOG=DEBUG or above.

Apparently the provider silently recovers from this (basically ignoring the error) by using a default configuration (which connects to localhost).

Further investigation revealed that I had set the KUBE_USER environment variable in addition to having an auth token configured. Unsetting KUBE_USER resolved the issue for me.

But I think this PR intended to address a different root cause.

@DorukAkinci

This is a very important bug that needs to be addressed. Is there any update on this issue?

@danielllek commented Nov 24, 2022

Just hit the bug and lost plenty of time to find the cause :(

@xRDHS commented Jan 12, 2023

Same @danielllek

@streamnsight

I hit the bug with TF 1.0, but it seems to be gone in v1.2.x.
The reason for this issue is the refresh stage:

  • In the first plan, refresh happens but the k8s resources do not exist, so nothing happens.
  • On the first apply, same thing: nothing happens on refresh, so tf moves on.
  • On a change to any config, running plan or apply will attempt a refresh first, but the DATASOURCES are not updated until later, so any k8s provider that makes use of dynamic cluster info (id, cert, exec command) will fail there. The fix at this point is to use -refresh=false.

In tf v1.2.x, it seems like the datasource refresh now happens before provider configuration, so the data is available.
Using kubernetes 2.18.1 and tf 1.2.0, I have not seen this issue anymore.

I looked at the code in depth, and this PR would not help, because the provider is initialized by calling the ConfigureContextFunc anyway, so delaying the config does not make any difference if the datasource data is not there.

@RichiCoder1

This is maybe a quirk of the module in question, but this issue is still being reported with v1.3.9 and the latest providers: terraform-aws-modules/terraform-aws-eks#2501

@streamnsight

@RichiCoder1 See my reply on the issue: the failing example code does not use a datasource to configure the provider, so it may indeed still fail. Using the data source lookup pattern, it should work with tf 1.2.

@chamini2 commented Mar 14, 2023

@streamnsight, I create the Kubernetes cluster in the same module and then access it and provision kubernetes manifests in it.

Can you help me understand how I would change to use data instead of resource in that case so I can avoid this issue?

@streamnsight

@chamini2 just follow the examples
https://github.com/hashicorp/terraform-provider-kubernetes/tree/main/_examples

@jw-maynard

@streamnsight this solution won't work if the cluster doesn't already exist in AWS.

@chamini2 commented Mar 14, 2023

(quoting @streamnsight above)

The only difference I see from my solution is the explicit depends_on in

depends_on = [module.gke-cluster]

I will add that and see if I don't get any more errors.
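
For context, a minimal sketch of how that depends_on could be combined with a data source lookup (hypothetical names; not necessarily exactly what the linked examples do):

data "google_container_cluster" "default" {
  name       = "my-gke-cluster"      # hypothetical name and location
  location   = "us-central1-a"
  depends_on = [module.gke-cluster]  # ensure the cluster module is applied before this lookup
}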
