Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

providercache: Ignore lock-mismatching global cache entries #32129

Merged
merged 1 commit into from Nov 4, 2022

Conversation

apparentlymart
Copy link
Member

@apparentlymart apparentlymart commented Nov 1, 2022

When we originally introduced the trust-on-first-use checksum locking mechanism in v0.14, we had to make some tricky decisions about how it should interact with the pre-existing optional read-through global cache of provider packages:

The global cache conflicts with the checksum locking because if the needed provider is already in the cache then Terraform skips installing the provider from upstream and therefore misses the opportunity to capture the signed checksums published by the provider developer. We can't use the signed checksums to verify a cache entry because the origin registry protocol is still using the legacy ziphash scheme and that is only usable for the original zipped provider packages and not for the unpacked-layout cache directory. Therefore we decided to prioritize the existing cache directory behavior at the expense of the lock file behavior, making Terraform produce an incomplete lock file in that case.

Now that we've had some real-world experience with the lock file mechanism, we can see that the chosen compromise was not ideal because it causes terraform init to behave significantly differently in its lock file update behavior depending on whether or not a particular provider is already cached. By robbing Terraform of its opportunity to fetch the official checksums, Terraform must generate a lock file that is inherently non-portable, which is problematic for any team which works with the same Terraform configuration on multiple different platforms.

This change addresses that problem by essentially flipping the decision so that we'll prioritize the lock file behavior over the provider cache behavior. Now a global cache entry is eligible for use if and only if the lock file already contains a checksum that matches the cache entry. This means that the first time a particular configuration sees a new provider it will always be fetched from the configured installation source (typically the origin registry) and record the checksums from that source.

On subsequent installs of the same provider version already locked, Terraform will then consider the cache entry to be eligible and skip re-downloading the same package.

This intentionally makes the global cache mechanism subordinate to the lock file mechanism: the lock file must be populated in order for the global cache to be effective. For those who have many separate configurations which all refer to the same provider version, they will need to re-download the provider once for each configuration in order to gather the information needed to populate the lock file, whereas before they would have only downloaded it for the first configuration using that provider.

This should therefore remove the most significant cause of folks ending up with incomplete lock files that don't work for colleagues using other platforms, at the expense of bypassing the cache for the first use of each new package with each new configuration. This tradeoff seems reasonable because otherwise such users would inevitably need to run terraform providers lock separately anyway, and that command always bypasses the cache. Although this change does decrease the hit rate of the cache, if we subtract the never-cached downloads caused by "terraform providers lock" then this is a net benefit overall, and does the right thing by default without the need to run a separate command.

@apparentlymart
Copy link
Member Author

I can see that this has broken an assumption being made by one of the end-to-end tests that didn't run in my dev environment because I was only running the "offline" subset of the tests. I'll see about updating that tomorrow so that it properly models the two cases of this new behavior, both with and without an existing lock file.

When we originally introduced the trust-on-first-use checksum locking
mechanism in v0.14, we had to make some tricky decisions about how it
should interact with the pre-existing optional read-through global cache
of provider packages:

The global cache essentially conflicts with the checksum locking because
if the needed provider is already in the cache then Terraform skips
installing the provider from upstream and therefore misses the opportunity
to capture the signed checksums published by the provider developer. We
can't use the signed checksums to verify a cache entry because the origin
registry protocol is still using the legacy ziphash scheme and that is
only usable for the original zipped provider packages and not for the
unpacked-layout cache directory. Therefore we decided to prioritize the
existing cache directory behavior at the expense of the lock file behavior,
making Terraform produce an incomplete lock file in that case.

Now that we've had some real-world experience with the lock file mechanism,
we can see that the chosen compromise was not ideal because it causes
"terraform init" to behave significantly differently in its lock file
update behavior depending on whether or not a particular provider is
already cached. By robbing Terraform of its opportunity to fetch the
official checksums, Terraform must generate a lock file that is inherently
non-portable, which is problematic for any team which works with the same
Terraform configuration on multiple different platforms.

This change addresses that problem by essentially flipping the decision so
that we'll prioritize the lock file behavior over the provider cache
behavior. Now a global cache entry is eligible for use if and only if the
lock file already contains a checksum that matches the cache entry. This
means that the first time a particular configuration sees a new provider
it will always be fetched from the configured installation source
(typically the origin registry) and record the checksums from that source.

On subsequent installs of the same provider version already locked,
Terraform will then consider the cache entry to be eligible and skip
re-downloading the same package.

This intentionally makes the global cache mechanism subordinate to the
lock file mechanism: the lock file must be populated in order for the
global cache to be effective. For those who have many separate
configurations which all refer to the same provider version, they will
need to re-download the provider once for each configuration in order to
gather the information needed to populate the lock file, whereas before
they would have only downloaded it for the _first_ configuration using
that provider.

This should therefore remove the most significant cause of folks ending
up with incomplete lock files that don't work for colleagues using other
platforms, and the expense of bypassing the cache for the first use of
each new package with each new configuration. This tradeoff seems
reasonable because otherwise such users would inevitably need to run
"terraform providers lock" separately anyway, and that command _always_
bypasses the cache. Although this change does decrease the hit rate of the
cache, if we subtract the never-cached downloads caused by
"terraform providers lock" then this is a net benefit overall, and does
the right thing by default without the need to run a separate command.
@crw crw linked an issue Nov 4, 2022 that may be closed by this pull request
@apparentlymart apparentlymart merged commit d0a35c6 into main Nov 4, 2022
@apparentlymart apparentlymart deleted the f-provider-global-cache-checksum-mismatch branch November 4, 2022 23:18
@github-actions
Copy link

github-actions bot commented Nov 4, 2022

Reminder for the merging maintainer: if this is a user-visible change, please update the changelog on the appropriate release branch.

@joe-a-t
Copy link

joe-a-t commented Nov 9, 2022

I am extremely concerned about this change. It sounds like you are saying that this will result in the global provider cache being entirely ignored if people don't use checksum lockfiles in all of their root modules. We deliberately do not use those lockfiles so it sounds like we will take a huge performance hit as a result of this change since the cache would always be ignored for us. Is that understanding correct?

@joe-a-t
Copy link

joe-a-t commented Nov 9, 2022

Also, there are some custom providers we use that do not have an upstream to pull from, we have to pre-populate them in the cache. If I am reading this correctly, all of those will break if this change gets merged. I strongly believe that this would wreck so much chaos on our workflow (which I don't think is unique) that it would be entirely inappropriate to do without a major version release.

@apparentlymart
Copy link
Member Author

apparentlymart commented Nov 9, 2022

Yes, it is correct that with this change using lock files will be mandatory if you want to use the global plugin cache. We have merged this early in the v1.4 development period to allow plenty of time to refine it based on feedback.

This should not break local (non-registry) modules if you use them in the documented way, by placing them into a local filesystem mirror directory instead of in the global cache. Terraform itself is the only process allowed to write into the global cache directory.

The alpha release today is one vehicle for that feedback; I'd love if you'd try the alpha and share your experiences in a new issue. Thanks!

@tjstansell
Copy link

So, this completely breaks our use of the cache in a terragrunt world where we're initializing hundreds of terraform repositories. We pre-populate the provider cache once and then in the terraform 1.3 world, every repo we initializes shows it's Using hashicorp/aws v4.38.0 from the shared cache directory. With 1.4.0-alpha, every repo shows

- Installing hashicorp/aws v4.38.0...
- Installed hashicorp/aws v4.38.0 (unauthenticated)

This effectively negates the use of the cache for us.

@tjstansell
Copy link

I'll add that even though all of this is running inside a custom docker image with a local mirror directory the providers come from (note the unauthenticated above), it's still re-copying the exact same provider every time into the cache directory. Since this process is not concurrency-safe, we end up with corrupted files or invalid checksums, not to mention the fact it adds significant delay when this is happening for hundreds of terraform repositories.

@tjstansell
Copy link

Sorry, I'll open a new ticket with this feedback.

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 12, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Terraform lock file irresolvable problem
4 participants