
provider cache becomes ineffective with 1.4.0-alpha release #32205

Closed
tjstansell opened this issue Nov 11, 2022 · 16 comments
Labels
bug, new (new issue not yet triaged), v1.4 (issues, primarily bugs, reported against v1.4 releases), working as designed (confirmed as reported and closed because the behavior is intended)

Comments

@tjstansell

Terraform Version

Terraform v1.4.0-alpha20221109
on linux_amd64

Terraform Configuration Files

Enable the global provider cache by setting TF_PLUGIN_CACHE_DIR.
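For example (the cache path below is only an example; any writable directory works):

# either via the environment for the shell that runs terraform/terragrunt
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"

# or equivalently via the CLI configuration file (~/.terraformrc)
plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"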

Debug Output

Not really applicable.

Expected Behavior

The global provider cache should be used when initializing a terraform repository. We use terragrunt to initialize and run hundreds of terraform repositories simultaneously and rely on the global cache to optimize this process. Because of the lack of concurrency support, we ensure the provider cache is fully populated first; once it has been populated, we rely on terraform being able to initialize by simply reading from the cache.

This is what 1.3.x shows and is what is expected:

Initializing provider plugins...
- Finding latest version of opsgenie/opsgenie...
- Finding latest version of jfrog/artifactory...
- Finding latest version of sumologic/sumologic...
- Finding latest version of hashicorp/template...
- Finding latest version of hashicorp/aws...
- Finding latest version of hashicorp/random...
- Finding latest version of hashicorp/vault...
- Finding latest version of hashicorp/external...
- Finding latest version of hashicorp/kubernetes...
- Finding latest version of cloudflare/cloudflare...
- Finding latest version of hashicorp/cloudinit...
- Finding latest version of okta/okta...
- Finding latest version of hashicorp/helm...
- Using cloudflare/cloudflare v3.27.0 from the shared cache directory
- Using okta/okta v3.38.0 from the shared cache directory
- Using hashicorp/helm v2.7.1 from the shared cache directory
- Using hashicorp/template v2.2.0 from the shared cache directory
- Using hashicorp/random v3.4.3 from the shared cache directory
- Using sumologic/sumologic v2.19.2 from the shared cache directory
- Using hashicorp/aws v4.38.0 from the shared cache directory
- Using hashicorp/vault v3.10.0 from the shared cache directory
- Using hashicorp/external v2.2.3 from the shared cache directory
- Using hashicorp/kubernetes v2.15.0 from the shared cache directory
- Using hashicorp/cloudinit v2.2.0 from the shared cache directory
- Using opsgenie/opsgenie v0.6.10 from the shared cache directory
- Using jfrog/artifactory v6.19.0 from the shared cache directory

Actual Behavior

Terraform tries to re-populate each provider in the global cache with the exact same file. Because this process is not concurrency-safe, it causes problems when we use terragrunt and have multiple terraform initializations running simultaneously, leading to checksum errors and failures.

With 1.4.0-alpha, initialization of every repository shows:

Initializing provider plugins...
- Finding latest version of hashicorp/external...
- Finding latest version of jfrog/artifactory...
- Finding latest version of sumologic/sumologic...
- Finding latest version of okta/okta...
- Finding latest version of hashicorp/cloudinit...
- Finding latest version of hashicorp/helm...
- Finding latest version of hashicorp/aws...
- Finding latest version of opsgenie/opsgenie...
- Finding latest version of hashicorp/kubernetes...
- Finding latest version of hashicorp/vault...
- Finding latest version of hashicorp/random...
- Finding latest version of cloudflare/cloudflare...
- Finding latest version of hashicorp/template...
- Installing hashicorp/random v3.4.3...
- Installed hashicorp/random v3.4.3 (unauthenticated)
- Installing okta/okta v3.38.0...
- Installed okta/okta v3.38.0 (unauthenticated)
- Installing hashicorp/helm v2.7.1...
- Installed hashicorp/helm v2.7.1 (unauthenticated)
- Installing hashicorp/aws v4.38.0...
- Installed hashicorp/aws v4.38.0 (unauthenticated)
- Installing hashicorp/vault v3.10.0...
- Installed hashicorp/vault v3.10.0 (unauthenticated)
- Installing opsgenie/opsgenie v0.6.10...
- Installed opsgenie/opsgenie v0.6.10 (unauthenticated)
- Installing hashicorp/kubernetes v2.15.0...
- Installed hashicorp/kubernetes v2.15.0 (unauthenticated)
- Installing cloudflare/cloudflare v3.27.0...
- Installed cloudflare/cloudflare v3.27.0 (unauthenticated)
- Installing hashicorp/template v2.2.0...
- Installed hashicorp/template v2.2.0 (unauthenticated)
- Installing hashicorp/external v2.2.3...
- Installed hashicorp/external v2.2.3 (unauthenticated)
- Installing jfrog/artifactory v6.19.0...
- Installed jfrog/artifactory v6.19.0 (unauthenticated)
- Installing sumologic/sumologic v2.19.2...
- Installed sumologic/sumologic v2.19.2 (unauthenticated)
- Installing hashicorp/cloudinit v2.2.0...
- Installed hashicorp/cloudinit v2.2.0 (unauthenticated)

Steps to Reproduce

Initialize two separate terraform repositories with the global provider cache enabled. Both will install the same providers into the cache rather than re-using them.
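For example, with two sibling working directories that declare the same providers (directory names are illustrative):

$ export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
$ (cd repoA && terraform init)   # both 1.3.x and 1.4.0-alpha populate the cache here
$ (cd repoB && terraform init)   # 1.3.x uses the shared cache; 1.4.0-alpha re-installs the same providers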

Additional Context

No response

References

@tjstansell added the bug and new labels on Nov 11, 2022
@joe-a-t

joe-a-t commented Nov 12, 2022

Just want to extremely strongly 👍 this. We also have a ton of parallelization in our automation, and from a purely timing perspective we are looking at 30-60+ seconds added to every single terraform init that we run, which really adds up across hundreds of different Terraform root modules. There's also a lot more flakiness due to network issues, especially at higher levels of parallelization, so we see a large increase in failed builds that need to be retried.

This change is a HUGE performance hit for us, which is extremely painful when we are already struggling with the performance of Terraform and have a number of open issues to improve it, like #29503 and #31964. It seems extremely disproportionate to cause this much of a performance hit out of a desire to solve an issue that someone ran into for the first time more than two years after the relevant code/behavior was released in 0.14.

@apparentlymart
Member

Hi @tjstansell. Thanks for reporting this.

The output here suggests that you are working in a repository without a dependency lock file, because I see it announcing that it is querying for the latest version instead of selecting a version from the lock file.

If you run terraform init again once the lock file has been created, it should then use the cache directory. You can make that permanent by checking the lock file into your version control system, as terraform init suggests after creating the file for the first time.
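Concretely, the recommended flow is roughly:

$ terraform init                  # first run creates .terraform.lock.hcl and fills the cache
$ git add .terraform.lock.hcl     # commit the lock file so future runs select versions from it
$ git commit -m "Add dependency lock file"
$ terraform init                  # subsequent runs use the shared cache directory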

@joe-a-t

joe-a-t commented Nov 15, 2022

Hey @apparentlymart, we deliberately do not use a dependency lock file because we have over 5k root modules. Having to update 5k+ dependency lock files on a regular basis is either a TON of toil, or we would have to build a new set of automation tooling and allow said tooling to push directly to master, which is not something our security and compliance teams are OK with given our required repo branch protection rules.

The only scenario in which a dependency lock file would become useful to us is if we could set it at a top-level directory or a global spot on the machine and have it be inherited by all of the child directories; at least that way we would be talking about keeping 1-30 spots updated instead of 5k+.

We have previously discussed this in other issues and with Omar/Korinne, and were given assurances that we could safely just ignore the lockfile if we found it to be an impediment. We are also not the only people that do this, as evidenced by #31856 asking to disable the warning when Terraform generates the lockfile.

@apparentlymart added the working as designed label on Nov 15, 2022
@tjstansell
Author

Can't the cache include some form of metadata so terraform can determine the proper hash values (from the original zips it was installed from) and the cache can still be used during an initial run as well? Even just a $file.hash file that stores the hash of the source zip, matching what the lock file expects? We actually have our own script that takes all of the providers baked into a local mirror directory in our docker image and pre-installs them into the cache before terraform even runs the first time. It would be easy to stuff the hash of the .zip file in there as it installs.
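Roughly, that pre-install script does something like the following (a simplified sketch; the mirror path and filenames are only examples, not our actual layout):

MIRROR=/usr/local/share/terraform/providers      # local mirror baked into the docker image (example path)
find "$MIRROR" -type f -name 'terraform-provider-*' | while read -r bin; do
  rel="${bin#"$MIRROR"/}"                        # e.g. registry.terraform.io/hashicorp/aws/4.38.0/linux_amd64/terraform-provider-aws_v4.38.0_x5
  mkdir -p "$TF_PLUGIN_CACHE_DIR/$(dirname "$rel")"
  cp "$bin" "$TF_PLUGIN_CACHE_DIR/$rel"
done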

As with @joe-a-t, we deliberately do not check in lock files for our repos. Maintaining those across all of our repos would be a full time job.

@joemiller

joemiller commented Dec 7, 2022

This change effectively makes terragrunt expensive to use, possibly prohibitively so. We only have ~380 terragrunt stacks (not 5k+! 😨), and doing init across all of them without the benefit of the cache means hundreds of GB of disk space (and bandwidth) consumed by duplicated provider binaries.

$ time TERRAGRUNT_TFPATH=/Users/joe/git/hashicorp/terraform/terraform terragrunt run-all init --terragrunt-parallelism=8 --terragrunt-ignore-dependency-order --terragrunt-ignore-dependency-errors -lock=false

4580.15s user / 1923.02s system / 329% cpu 32:52.57 total / 0 amem / 145684 maxmem
$ du -shx .
196G    .
$ git ls-tree -r --name-only head | grep terragrunt.hcl | wc -l
     382

Bandwidth usage during the 32-minute init: [network graph screenshot]

@grimm26

grimm26 commented Jan 13, 2023

WAD? Seriously? I guess so, since the whole dependency lockfile design is definitely not a "one size fits all" solution and makes terraform decidedly more difficult to use for many of us. Please add a way to just disable that feature as has been asked for in issues like #29760 and #31533. Please stop forcing us to endure spurious warnings and workarounds.

@wyardley

wyardley commented Jan 27, 2023

This is very disappointing. I'm glad there's at least some kind of temporary workaround available, but all the comments in the docs about silently discarding that flag later mean this is really just kicking the can down the road.

I know it's been said that Terraform "needs" the lockfile, but since it can initialize without one, it clearly doesn't really. Clearly there are plenty of people who, for various reasons, find checking in the lockfile to VCS to be a neutral or a negative; I really still haven't seen a good explanation why it's impossible to come up with a solution that more gracefully accommodates these ways of working.

Correct me if I'm wrong, but I also haven't really seen any of the accommodations/improvements mentioned in #31533 and similar issues happen yet.

@nitrocode

We also do not check in the lockfile and do not plan to since we find locking providers in a required_providers block in versions.tf a lot more palatable.

Please do not force the requirement of checking in the lockfile.
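For clarity, the pinning style being described looks like this in versions.tf (the provider and version here are only an example):

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "4.38.0"
    }
  }
}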

@apparentlymart
Member

This has already been addressed in #32494. Those who use a non-standard workflow without checking in the lock file can opt in to the previous broken behavior using a CLI configuration setting, which then behaves the same as 1.3. There will be more information on this in the v1.4 upgrade guide once v1.4.0 final is released.

required_providers is for describing what your modules are compatible with and is not a suitable substitute for the lock file because it cannot track the expected checksums for a provider. You can choose to take that risk if you wish, but I think it's important to understand that it is a risk and make sure you're mitigating that risk some other way.

This new behavior is better for those who use Terraform as recommended, by checking their lock file into version control, and so the opt-out is for the minority of folks who use a non-standard workflow (including those who spoke up here). If you don't wish to adopt the recommended workflow then you should set the new CLI configuration setting once you upgrade.
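For example, the opt-in should look roughly like this in the CLI configuration file once v1.4.0 ships (the setting name comes from #32494; check the v1.4 upgrade guide for the authoritative details):

# ~/.terraformrc
plugin_cache_dir                            = "$HOME/.terraform.d/plugin-cache"
plugin_cache_may_break_dependency_lock_file = true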

@tjstansell
Author

Thanks @apparentlymart. Is there a proposed discussion or issue that we can follow that is meant to work through the reasons why we currently do not check in a lock file and therefore do not follow your recommended workflow? If you intend to eventually remove this workaround, I'd like to be involved in the discussions to work through the requirements of that.

@wyardley

wyardley commented Mar 6, 2023

@tjstansell some discussions that may be of interest to you:
#27888
#29760
#31533
as well as some of the issues linked from those.

@tjstansell
Author

Thanks @wyardley. Yeah, I've been looking at those, but I was kinda hoping that Hashicorp had a more consolidated effort somewhere, rather than one spread across various issues...

@dee-kryvenko

dee-kryvenko commented Mar 8, 2023

Just my 2 cents.

I would strongly dismiss as invalid use cases all the comments saying people face problems because they don't use lock files. I am sorry, but in a day and age when supply chain attacks have become the new normal and SBOM initiatives are finally gaining traction left and right, there can't be an excuse to even mention this as something someone would seriously consider doing on a real production system. It is bizarre to me that there are still people so stuck in the past that they are still adding lock files to .gitignore. You have to understand: not using the lock file just sweeps the problem under the rug. It feels like toil to have to update these lock files? Is it not toil to fix random unexpected failures every day because some change was released upstream? One is controlled and automatable; the other is chaotic, unpredictable firefighting.

Regardless, the "fix" in 1.4 is still bizarre to me too. If I have a cache locally, why would Terraform even need to reach out remotely for TOFU? What is wrong with trusting the local cache? I am not aware of any other dependency management system that does something like this; why is Hashicorp failing to copy well-established best practices? This is a rhetorical question; I am trying to highlight that the "lock" implementation is messed up in its entirety, not just the lock generation. There shouldn't be a separation of lock metadata per architecture to begin with. A lock file is part of the configuration, and in the current implementation it leaks system implementation details into user space. The current implementation is missing an abstraction layer. Terraform could, for instance, cache the list of checksums per architecture along with the arch-specific binary, and use a checksum of checksums as a lock file to store in git. It could also use cryptographic signatures (which it already does, just not for this purpose) and store the fingerprint in the lock file. Even a naive approach of caching all architectures in the local cache would be more efficient and less wasteful than the current implementation. There are multiple ways to achieve this and remove the need to reach out externally for lock generation, which is the actual root cause of all of these associated problems.
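To illustrate the "checksum of checksums" idea with a rough sketch (the file names are made up):

$ sha256sum checksums_linux_amd64.txt checksums_darwin_arm64.txt checksums_windows_amd64.txt \
    | sha256sum | cut -d' ' -f1 > provider.fingerprint   # one arch-independent value to commit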

@joe-a-t

joe-a-t commented Mar 9, 2023

@dee-kryvenko, I don't want to start a big sidetrack here, but there are multiple ways to prevent supply chain attacks. A lockfile is not the only one, and in fact it interferes with some other methods that are arguably superior, like local caches pre-loaded with the blessed versions of providers instead of pulling them from the internet. Not using a lockfile does not in and of itself mean that you are vulnerable to the attacks or downsides your post described.

@dee-kryvenko

"arguably superior like having local caches that are pre-loaded with the blessed versions of providers"

OMG... somehow this conversation is starting to remind me of the MPLS vs SD-WAN debates back in the day. And how do you rebuild that cache? Is that what you do with everything else, like Go, pip, gems, npm, etc.? And what guarantees the integrity of that local cache?

@github-actions

github-actions bot commented Apr 9, 2023

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked as resolved and limited conversation to collaborators on Apr 9, 2023