
Mount table corrupted after GCS rate limiting #7455

Open · bharanin opened this issue Sep 9, 2019 · 3 comments
Labels
bug, storage/gcs, waiting-for-response

Comments

bharanin commented Sep 9, 2019

Summary

Under heavy load, Vault encounters GCS rate limiting on the mount table object (core/mounts) and occasionally corrupts data. It appears to insert duplicate entries in the table.

Log Snippet

vault: [ERROR] core: failed to persist mount table: error="1 error occurred:"  

vault: * error closing connection: googleapi: Error 429: The total number of changes to the object bucket/core/mounts exceeds the rate limit. Please reduce the rate of create, update, and delete requests., rateLimitExceeded"

[ERROR] core: failed to persist mount table: error="1 error occurred:

[ERROR] core: failed to remove entry from mounts table: error="1 error occurred:

The cluster continues to operate normally after these errors until a leader election needs to take place. At that time, we see the following:

[ERROR] core: failed to mount entry: path=org/octopus/transit/ error="cannot mount under existing mount "org/octopus/transit/""

Because of this, no instance can become active and the cluster is unavailable. We’re not aware of any way to recover the storage after this error occurs and have resorted to restoring from backups (luckily this has been in our lab/test environment).

Other Details

  • Corruption doesn’t appear to happen every time rate limiting is encountered, which makes this hard to reproduce.
  • Per the GCS docs for this error, the rate limiting kicks in at roughly one write per second on a given object (https://cloud.google.com/storage/docs/key-terms#immutability); a small probe that hammers a single object is sketched after this list.
  • We’ve encountered this three times in our lab/test environment. In the most recent case, we made a copy of the storage to verify it was indeed corrupted, but were able to leave the active node in place to gather more data. The mount table as printed via vault read sys/mounts does not show a duplicate entry; however, when trying to unseal a separate cluster pointed at the copied storage, we get the cannot mount under existing mount error.
  • We’re using the default max_parallel setting.
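To make the rate-limit part concrete, here is a minimal Go sketch of a standalone probe (not Vault code; the BUCKET environment variable and object name are placeholders) that writes the same GCS object faster than once per second until 429s appear. It only reproduces the rateLimitExceeded errors, not the mount-table corruption, and client-side retries are disabled so the 429s surface instead of being absorbed:

```go
// Probe the per-object mutation rate limit (~1 write/sec) documented by GCS
// by repeatedly overwriting a single object, mimicking frequent writes to
// one storage key such as core/mounts.
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"os"
	"time"

	"cloud.google.com/go/storage"
	"google.golang.org/api/googleapi"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Same object on every iteration; disable retries so 429s are visible.
	obj := client.Bucket(os.Getenv("BUCKET")).Object("rate-limit-probe").
		Retryer(storage.WithPolicy(storage.RetryNever))

	for i := 0; i < 50; i++ {
		w := obj.NewWriter(ctx)
		fmt.Fprintf(w, "write %d at %s\n", i, time.Now().Format(time.RFC3339Nano))
		err := w.Close()

		var gerr *googleapi.Error
		if errors.As(err, &gerr) && gerr.Code == 429 {
			log.Printf("write %d: rateLimitExceeded (429)", i)
		} else if err != nil {
			log.Printf("write %d: %v", i, err)
		}
		time.Sleep(100 * time.Millisecond) // ~10 writes/sec, well above the ~1/sec limit
	}
}
```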
michelvocks added the bug and storage/gcs labels Nov 5, 2019
hsimon-hashicorp (Contributor) commented

Hi folks! Is this still an issue in newer versions of Vault? Please let me know so I can bubble it up accordingly. Thanks!

fcrespofastly commented

@hsimon-hashicorp 👋🏻 I'm seeing similar issues in 1.15.4, also related to:

#23635

We've been rate limited on the core/seal-config object:

log:   | \t* error closing connection: googleapi: Error 429: The object $BUCKET/core/seal-config exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate. See https://cloud.google.com/storage/docs/gcs429., rateLimitExceeded

I think this led to lease corruption: GCS started returning 503s, and in our 3-replica Vault cluster this happened on all nodes at the same time, so the cluster got completely sealed. Since we use auto-unseal, we had to fix it by pointing Vault at a backup GCS backend. By the way, advice on a better way to fix this would be appreciated (perhaps deleting the leases?):

   log: 2024-04-25T11:35:45.074Z [ERROR] expiration: error restoring leases:
   error=
   | failed to read lease entry auth/kubernetes/login/LEASE_ID 1 error occurred:
   | \t* error closing connection: googleapi: got HTTP response code 503 with body: Service Unavailable
Other than that, I could spot several other 503s on other kinds of operations, plus context deadline exceeded and context canceled errors.

We're living in a pretty dangerous situation at the moment, so any help is appreciated!
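For completeness, here is a minimal Go sketch of the retry knobs the official GCS Go client exposes for exactly these 429/503 responses (capped exponential backoff configured per object handle). This is not Vault's backend code and not a fix for the sealed cluster, just an illustration with placeholder bucket and object names:

```go
// Configure per-object retries with exponential backoff so transient
// 429/503 responses are absorbed instead of surfacing immediately.
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/storage"
	"github.com/googleapis/gax-go/v2"
)

func main() {
	// Bound the total retry budget with a context deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// RetryAlways allows retrying even non-idempotent writes; the backoff
	// starts at 500ms and is capped at 30s between attempts.
	obj := client.Bucket("my-bucket").Object("core/seal-config").Retryer(
		storage.WithPolicy(storage.RetryAlways),
		storage.WithBackoff(gax.Backoff{
			Initial:    500 * time.Millisecond,
			Max:        30 * time.Second,
			Multiplier: 2,
		}),
	)

	w := obj.NewWriter(ctx)
	if _, err := w.Write([]byte(`{"example":"payload"}`)); err != nil {
		log.Fatal(err)
	}
	// Close flushes the upload; it fails only once the context deadline
	// is exhausted by repeated retryable errors.
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```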

fcrespofastly commented

Somewhat related / potential improvement:

#26673
