
Mount table corrupted after GCS rate limiting #7455

Open · bharanin opened this issue Sep 9, 2019 · 3 comments
Labels
bug, storage/gcs, waiting-for-response

Comments

bharanin commented Sep 9, 2019

Summary

Under heavy load, Vault encounters GCS rate limiting on the mount table object (core/mounts) and occasionally corrupts data. It appears to insert duplicate entries in the table.

Log Snippet

vault: [ERROR] core: failed to persist mount table: error="1 error occurred:"  

vault: * error closing connection: googleapi: Error 429: The total number of changes to the object bucket/core/mounts exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded"

[ERROR] core: failed to persist mount table: error="1 error occurred:

[ERROR] core: failed to remove entry from mounts table: error="1 error occurred:

The cluster continues to operate normally after these errors until a leader election needs to take place. At that time, we see the following:

[ERROR] core: failed to mount entry: path=org/octopus/transit/ error="cannot mount under existing mount "org/octopus/transit/""

Because of this, no instance can become active and the cluster is unavailable. We’re not aware of any way to recover the storage after this error occurs and have resorted to restoring from backups (luckily this has been in our lab/test environment).

Other Details

  • Corruption doesn’t appear to happen every time rate limiting is encountered, which makes this hard to reproduce.
  • Per the GCS docs for this error, the rate limiting kicks in at roughly one write per second on a given object (https://cloud.google.com/storage/docs/key-terms#immutability); a small probe that hammers a single object is sketched after this list.
  • We’ve encountered this three times in our lab/test environment. In the most recent case, we made a copy of the storage to verify it was indeed corrupted, but were able to leave the active node in place to gather more data. The mount table as printed via vault read sys/mounts does not show a duplicate entry; however, when trying to unseal a separate cluster pointed at the copied storage, we get the cannot mount under existing mount error.
  • We’re using the default max_parallel setting.
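To make the rate-limit part concrete, here is a minimal Go sketch of a standalone probe (not Vault code; the BUCKET environment variable and object name are placeholders) that writes the same GCS object faster than once per second until 429s appear. It only reproduces the rateLimitExceeded errors, not the mount-table corruption, and client-side retries are disabled so the 429s surface instead of being absorbed:

```go
// Probe the per-object mutation rate limit (~1 write/sec) documented by GCS
// by repeatedly overwriting a single object, mimicking frequent writes to
// one storage key such as core/mounts.
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"os"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/googleapi"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Same object on every iteration; disable retries so 429s are visible.
	obj := client.Bucket(os.Getenv("BUCKET")).Object("rate-limit-probe").
		Retryer(storage.WithPolicy(storage.RetryNever))

	for i := 0; i < 50; i++ {
		w := obj.NewWriter(ctx)
		fmt.Fprintf(w, "write %d at %s\n", i, time.Now().Format(time.RFC3339Nano))
		err := w.Close()

		var gerr *googleapi.Error
		if errors.As(err, &gerr) && gerr.Code == 429 {
			log.Printf("write %d: rateLimitExceeded (429)", i)
		} else if err != nil {
			log.Printf("write %d: %v", i, err)
		}
		time.Sleep(100 * time.Millisecond) // ~10 writes/sec, well above the ~1/sec limit
	}
}
```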
michelvocks added the bug and storage/gcs labels Nov 5, 2019
hsimon-hashicorp (Contributor) commented

Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

fcrespofastly commented

@hsimon-hashicorp 👋🏻 I'm seeing similar issues in 1.15.4, also related to:

#23635

We've been rate limited on the core/seal-config object:

log:   | \t* error closing connection: googleapi: Error 429: The object $BUCKET/core/seal-config exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429., rateLimitExceeded

I think this led to lease corruption: GCS started returning 503s, and in our 3-replica Vault cluster this happened on all nodes at the same time, so the cluster got completely sealed. Since we use auto-unseal, we had to fix it by pointing Vault at a backup GCS backend. By the way, advice on a better way to fix this would be appreciated (perhaps deleting the leases?):

   log: 2024-04-25T11:35:45.074Z [ERROR] expiration: error restoring leases:
   error=
   | failed to read lease entry auth/kubernetes/login/LEASE_ID 1 error occurred:
   | \t* error closing connection: googleapi: got HTTP response code 503 with body: Service Unavailable
Other than that, I could spot several other 503s on other kinds of operations, plus context deadline exceeded and context canceled errors.

We're living in a pretty dangerous situation at the moment, so any help is appreciated!
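For completeness, here is a minimal Go sketch of the retry knobs the official GCS Go client exposes for exactly these 429/503 responses (capped exponential backoff configured per object handle). This is not Vault's backend code and not a fix for the sealed cluster, just an illustration with placeholder bucket and object names:

```go
// Configure per-object retries with exponential backoff so transient
// 429/503 responses are absorbed instead of surfacing immediately.
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

func main() {
	// Bound the total retry budget with a context deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// RetryAlways allows retrying even non-idempotent writes; the backoff
	// starts at 500ms and is capped at 30s between attempts.
	obj := client.Bucket("my-bucket").Object("core/seal-config").Retryer(
		storage.WithPolicy(storage.RetryAlways),
		storage.WithBackoff(gax.Backoff{
			Initial:    500 * time.Millisecond,
			Max:        30 * time.Second,
			Multiplier: 2,
		}),
	)

	w := obj.NewWriter(ctx)
	if _, err := w.Write([]byte(`{"example":"payload"}`)); err != nil {
		log.Fatal(err)
	}
	// Close flushes the upload; it fails only once the context deadline
	// is exhausted by repeated retryable errors.
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```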

fcrespofastly commented

Somewhat related / potential improvement:

#26673
