Backup with transit seal method and revoked token silently fails #13130

Open
laugmanuel opened this issue Nov 12, 2021 · 8 comments


@laugmanuel

Describe the bug
We use Raft as our storage backend.
We also use transit sealing against a secondary Vault instance to provide auto-unsealing for our primary Vault installed in Kubernetes. The token we use for that is created by an init-container and is only valid for a few minutes.
Until recently, this setup worked fine for us. The pods got unsealed automatically and the backups were present and valid (could be successfully restored).

Probably due to #12388, this behaviour changed!
Creating a backup using vault operator raft snapshot save <snapshot file> results in an error regarding the SHA256SUMS.sealed file. Using the API endpoint, we can successfully download the snapshot without any error.
In both cases the snapshot file gets created and appears to contain data:

  • the file size makes sense (it's not only a few bytes but matches old backups)
  • using file <snapshot file> the backup is recognized as gzip compressed data

However, the backup cannot be restored, and Vault complains about a Load error in the UI. Restoring using the CLI also fails. If I try to unpack the backup using gzip, I get unexpected end of file -> it looks like the backup file is corrupted.
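
The manual gzip check described above could be automated roughly as follows. This is only a sketch: it assumes the snapshot is a gzip-compressed tar archive (as the gzip+tar extraction mentioned in this issue suggests), and the function name `snapshot_is_readable` is made up for illustration. It only proves the archive is complete and readable, not that Vault will accept it on restore.

```python
import gzip
import tarfile
import zlib


def snapshot_is_readable(path):
    """Return True if `path` decompresses fully as gzip and parses as tar.

    Hypothetical helper: it checks archive integrity only and says nothing
    about whether Vault can restore the snapshot.
    """
    try:
        # "r:gz" lets tarfile decompress while it walks the archive.
        with tarfile.open(path, mode="r:gz") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    # Read every member to the end so truncation surfaces.
                    while extracted.read(1024 * 1024):
                        pass
        # Read the raw gzip stream to the end as well, which also
        # validates the gzip trailer (CRC and length).
        with gzip.open(path, "rb") as stream:
            while stream.read(1024 * 1024):
                pass
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        # A truncated stream typically shows up as EOFError or ReadError.
        return False
    return True
```

A scheduled backup job could call this right after saving the snapshot and alert when it returns False. An actual test restore remains the only full check.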

If I extend the lifetime of the unseal token, the backup gets created and can be restored successfully!
The docs do not mention that the transit token used in the Vault config/env variables must still be valid for a backup to succeed!

To Reproduce
Steps to reproduce the behavior:

  1. Set up a Vault with transit sealing
  2. Issue a new token on the Vault providing the Transit engine
  3. Unseal the new Vault using that token
  4. Wait for the token to be revoked / revoke it manually
  5. Create a backup using API or UI
  6. Try to restore that exact backup

Expected behavior
Either a valid backup file (one that can be extracted using gzip+tar and restored) should be created despite the warning about the SHA256SUMS.sealed file,
OR the creation of the backup should fail hard without any file being created.

If someone uses the API to create the backup but does not regularly test the restore, they have no way to see that the backup file is corrupted.

Also, the docs about raft snapshotting should mention that the seal configuration (including the token) must be valid for the backup to fully work.

Environment:

  • Vault Server Version (retrieve with vault status): 1.8.3, 1.8.4
  • Vault CLI Version (retrieve with vault version): 1.8.3, 1.8.4
  • Server Operating System/Architecture: Linux

Vault server configuration file(s):

ui = true
disable_mlock = true
log_level = "Info"
log_format = "json"

api_addr = "http://localhost:8200"
cluster_addr = "http://localhost:8201"

listener "tcp" {
  address = "[::]:8200"
  cluster_address = "[::]:8201"

  tls_disable = 1
}

seal "transit" {
  token = "<token>" # this token is the problem
  key_name = "vault-transit"
  mount_path = "transit/"
  address = "https://transit-providing-vault:8200"
}

storage "raft" {
  path = "/data"

  retry_join {
    leader_api_addr = "http://localhost:8200"
  }
}

Additional context
The docs need a notice about the token used for transit. The docs and the how-to guides only say to create a new token and put it in the config/env variable. Such a setup would also break after the default lifetime of 32d:

@hsimon-hashicorp added the storage/raft and bug labels Nov 12, 2021
@hsimon-hashicorp
Contributor

Hi @laugmanuel - were you testing your snapshot restores previously? In #12388, the changes were made to expose broken seals that are resulting in unusable snapshots. Prior to the changes, the snapshot creation would appear to be successful, but the snapshots could not be restored. If you could let us know, I'd appreciate it. :)

@laugmanuel
Author

Hi @hsimon-hashicorp ,
yes, we did test the restores previously and they were successful. However, I do not remember whether this was tested with snapshots created manually/automatically shortly after unsealing the Vault or with snapshots from the scheduled backup.
I can try to reproduce this with a Vault version prior to the mentioned change and report back.

Nevertheless, the other points regarding docs and serving a broken backup through API and UI are still valid 😉

@hsimon-hashicorp
Contributor

When a snapshot is initiated via the API, a success is returned immediately upon the snapshot starting to stream. The snapshot is not buffered on the server, because the size of the snapshot is unknown. So, the snapshot API request returns a "success", starts to stream, and then if at some point the seal isn't available, the snapshot will be broken. This is why testing restores is a critical part of any backup process.
Additionally, #13078 may help with this, to make detecting seal issues easier and faster. Let me know if this answers your questions about the API. I'll ask @taoism4504 for assistance re: docs.
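
Because the API reports success as soon as streaming starts, a client-side guard is one way to avoid keeping a broken file. The following is a rough sketch, not anything Vault ships: `save_verified` is a hypothetical helper, `chunks` stands for whatever iterable of bytes the HTTP client yields for the snapshot response, and the check assumes the snapshot is gzip-compressed as described earlier in this issue.

```python
import gzip
import os
import tempfile
import zlib


def save_verified(chunks, destination):
    """Write streamed snapshot bytes to a temp file and keep them only if
    the gzip stream is complete. Returns True when `destination` was written.
    """
    directory = os.path.dirname(os.path.abspath(destination))
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    ok = False
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
        # Decompress the whole stream; a truncated one raises EOFError.
        total = 0
        with gzip.open(temp_path, "rb") as stream:
            while True:
                block = stream.read(1024 * 1024)
                if not block:
                    break
                total += len(block)
        if total == 0:
            # An empty download decompresses without error, so reject it here.
            raise EOFError("snapshot stream was empty")
        # Atomic rename: the destination never holds a partial file.
        os.replace(temp_path, destination)
        ok = True
    except (EOFError, OSError, zlib.error):
        ok = False
    finally:
        if not ok and os.path.exists(temp_path):
            os.remove(temp_path)
    return ok
```

Writing to a temporary file in the same directory and renaming only after the check means a failed seal mid-stream leaves no file behind, which matches the "hard fail" behaviour requested above. It does not replace testing restores.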

@hsimon-hashicorp added the docs and waiting-for-response labels and removed the bug label Nov 15, 2021
@laugmanuel
Author

I've tested with Vault 1.8.5 and Vault 1.7.4 (which, according to the changelog, does not contain the above fix). In both cases, the snapshot was valid and restorable with a valid token and became broken after the token expired.
So I guess the backups were broken with earlier versions after all.

For us, I fixed it temporarily by issuing a token with a relatively long lifetime (based on an approle which overrides the default ttl of 32d).
I will experiment with periodic tokens for transit because the transit seal provider seems to have a refresh feature (disable_renewal = "false") for the token?! https://www.vaultproject.io/docs/configuration/seal/transit#disable_renewal

@hsimon-hashicorp
Contributor

Hi @taoism4504 - we were discussing this today - this might be good to clarify and expand in the snapshot and restore documentation with regards to token longevity and not breaking snapshots. :)

@laugmanuel
Author

Hi @hsimon-hashicorp , what's the status on this?
Using periodic tokens together with disable_renewal = "false" works fine for me; so does using a token with very long TTL. Just wondering if docs will be modified - otherwise we can close this.
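
For anyone who wants to monitor the seal token instead of finding out through a failed backup, a small check along these lines might help. It is a sketch under assumptions: `token_data` is taken to be the `data` object of a token lookup response, with the `ttl`, `expire_time`, `renewable` and `period` fields as I understand them; the function name and the one-hour threshold are made up.

```python
def seal_token_warnings(token_data, min_ttl_seconds=3600):
    """Return a list of warnings for the token used by the transit seal.

    Hypothetical helper. Assumes `token_data` mirrors the "data" object of
    a token lookup, where a missing or null "expire_time" means the token
    does not expire.
    """
    warnings = []
    if token_data.get("expire_time") is None:
        # Non-expiring token: nothing to warn about.
        return warnings
    ttl = token_data.get("ttl", 0)
    if ttl < min_ttl_seconds:
        warnings.append(
            "token expires in %d seconds; snapshots taken later may be broken" % ttl
        )
    if not token_data.get("renewable", False):
        warnings.append("token is not renewable, so seal-side renewal cannot help")
    elif not token_data.get("period"):
        warnings.append("token is renewable but not periodic; it still hits its max TTL")
    return warnings
```

An empty list corresponds to the two setups reported as working here: a periodic token that gets renewed, or a token that is still far from expiry.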

@bendem

bendem commented Aug 29, 2023

We had this problem today: the token in the config for the auto-unseal had expired. We renewed the token, updated the config, and reloaded Vault (using kill -HUP), but the snapshot still failed with the same error until we actually restarted all our nodes. Is the transit token not reloaded on SIGHUP?

@hsimon-hashicorp
Contributor

Pinging @schavis for docs update. Thanks @laugmanuel!
