Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identity: Immediately issues JWTs with rotated keys causing stampeding herds #9201

Closed
evanj opened this issue Jun 11, 2020 · 2 comments
Closed

Comments

@evanj
Copy link

evanj commented Jun 11, 2020

When rotating keys, Vault's Identity engine creates a new key and immediately starts using it for new JWTs. This means that all applications that are validating Vault JWTs must refresh their keys immediately, which causes a stampeding herd of requests for public keys. Instead, it should create a new key long before it starts using it. This spreads the refresh load on Vault over time. It also makes applications that use these keys less linked to Vault being available. If the keys are rotated early, then clients can tolerate Vault downtime.

My personal policy suggestion: Vault should always publish 3 keys: the next key, the current key, and the previous one. This is more than strictly necessary, but means that clients can just refresh every rotation_period and always be sure to get the keys that are needed.

Related minor issues:

  • Vault should probably randomize the Cache-Control header's max-age parameter that it returns to clients to some interval after the next key will be available. Today, it returns the same expiration time to all clients. Again, this causes a thundering herd at the same moment.

  • Vault returns Cache-Control: no-store between when one key has expired and the next has been generated. This means a "well-behaved" client will need to hit vault on every request to get public keys! I observed two rotations one where this period of time lasted around 2 minutes. Generating the "next" key before it is needed will also mean Vault can always return some "reasonable" max-age, limiting the amount of requests clients need to make to read keys.

Details

I wrote a program that requests JWTs from Vault every minute, and loads the public keys and prints the Cache-Control header. This is what I observed at the rotation boundary:

Before expiration

2020/06/10 23:06:40 vault issued token for role=... kid=d4171f46-17ed-882f-9e2d-c0de753048ac ttl=3600
2020/06/10 23:06:40 public keys Cache-Control:max-age=28 kids:17e89607-4e00-24c1-4465-85d8971d0b83,37ff698e-ae60-7c9b-10f4-597578d7b848,77b0a6a6-7501-40c0-474b-89ade66f7ac2,d4171f46-17ed-882f-9e2d-c0de753048ac

The first line shows that it issued a token using kid=d4171f46-17ed-882f-9e2d-c0de753048ac with a TTL of 3600 seconds. The second link shows the KIDs that were in the public document. This response is only cachable for 28 seconds.

After expiration

2020/06/10 23:07:40 vault issued token for role=... kid=d4171f46-17ed-882f-9e2d-c0de753048ac ttl=3600
2020/06/10 23:07:40 public keys Cache-Control:no-store kids:17e89607-4e00-24c1-4465-85d8971d0b83,37ff698e-ae60-7c9b-10f4-597578d7b848,77b0a6a6-7501-40c0-474b-89ade66f7ac2,d4171f46-17ed-882f-9e2d-c0de753048ac

This shows Vault is still issuing new tokens with kid= kid=d4171f46-17ed-882f-9e2d-c0de753048ac, but has not generated new keys yet. The Cache-Control header says that client must not store this response at all.

2020/06/10 23:08:40 vault issued token for role=... kid=c1c52c48-6ff2-b7d2-5b74-43cb8a7d7b56 ttl=3600
2020/06/10 23:08:40 public keys Cache-Control:max-age=77368 kids:37ff698e-ae60-7c9b-10f4-597578d7b848,77b0a6a6-7501-40c0-474b-89ade66f7ac2,c1c52c48-6ff2-b7d2-5b74-43cb8a7d7b56,d4171f46-17ed-882f-9e2d-c0de753048ac

Vault is now issuing JWTs with the new kid=c1c52c48-6ff2-b7d2-5b74-43cb8a7d7b56. This key now appears in the public list, and kid=17e89607-4e00-24c1-4465-85d8971d0b83 was removed. The max-age is back to a reasonable amount of time.

Environment:

  • Vault Server Version (retrieve with vault status): 1.4.1
@kalafut
Copy link
Contributor

kalafut commented Apr 23, 2021

@evanj Sorry for the delay getting back to you. These are excellent points and will be reviewed in tandem with #11438 .

@austingebauer
Copy link
Member

@evanj - Thanks again for opening this issue. We ended up addressing a number of the concerns here in #12414. There is some additional context in #11438. I'm going to close this issue now. Feel free to reopen if you feel we've missed anything important!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants