Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix early rotation for roles with WALs, handle multiple WALs per role #28

Merged
merged 23 commits into from Sep 21, 2021

Conversation

tomhjp
Copy link
Contributor

@tomhjp tomhjp commented Sep 10, 2021

Early rotation of credentials

Bug and repro description

When the plugin finds a WAL for a role on initialisation, it (understandably) assumes it needs to rotate the password for the role ASAP. However, the plugin also assumes there can be at most 1 WAL per role at any time, and that's not true. We can have arbitrarily old and stale WALs in storage generated by failures during manual attempts to set up or rotate a role. These only ever get a chance to be cleaned up one at a time for each initialisation the plugin goes through. As a result, we can see early password rotations occur when the plugin runs initialisation.

To illustrate, here is a scenario to reproduce an early rotation:

vault write openldap/config \
    url=ldaps://vault-ad-test.example.com\
    binddn="CN=vault-ad-admin,CN=Users,DC=example,DC=com" \
    bindpass="some-password" \
    schema=ad

# Ensure the AD server is down so that this command fails
# No storage entry gets created for the role itself, but it will leave a WAL behind in storage for the role anyway
vault write openldap/static-role/mary \
    dn="CN=mary.smith,CN=Users,DC=example,DC=com" \
    username="mary.smith" \
    rotation_period="48h"

# To observe the dangling WAL, check the sys/raw endpoint, which should return a single WAL
MOUNT_UUID=$(vault read -format=json sys/mounts | jq -r '.data["openldap/"].uuid')
vault list sys/raw/logical/$MOUNT_UUID/wal

# Now ensure the AD server is up again
# This command should succeed, in which case create and then quickly delete a WAL, but leave the original WAL untouched
vault write openldap/static-role/mary \
    dn="CN=mary.smith,CN=Users,DC=example,DC=com" \
    username="mary.smith" \
    rotation_period="48h"

# Still 1 WAL
vault list sys/raw/logical/$MOUNT_UUID/wal

# Read the creds
vault read openldap/static-cred/mary

# Reload the plugin, and on initialisation, it will see the first WAL we created and immediately rotate the role, even though it's already been rotated very recently.
vault write sys/plugins/reload/backend mounts=openldap/

# Read the creds again, and you should see they've been rotated
vault read openldap/static-cred/mary
# WAL has been cleared
vault list sys/raw/logical/$MOUNT_UUID/wal

Fix

There are a few meaningful changes:

  1. Logging updates
  2. Ensure a maximum of 1 WAL per role. There are some areas where we lose track of an existing WAL ID and end up creating an extra WAL for the role instead
  3. Use the new password stored in the WAL when available. Previously we generated a new password on every attempt. This ensures we perform fewer rotations when we retry after failed attempts.
  4. Delete stale WALs from storage as we load them. Without the above bug fix, this would be a pretty effective mitigation, but as it is, the main motivation now is to avoid a resource leak.
  5. Discard WALs that have a last rotation time of 0. They should only be possible when they exist as a hangover from a previous failed attempt to create a role in Vault, and so we should be able to always safely ignore them.

rotation.go Show resolved Hide resolved
path_rotate.go Outdated Show resolved Hide resolved
Jim Kalafut and others added 9 commits September 14, 2021 15:18
* Respect previous WAL's new password
* Ensure only 1 WAL per role
* Reinstate finding a WAL => rotate immediately
* Remove patch check that would stop manual rotations when popping from the queue
* Add a warning to manual rotation response when rotation not immediately successful
* Remove re-storing of mount config when rotating unrelated roles
* Discard all WALs with a previous rotation time of 0
* Remove deleted WAL IDs from queue items
* Delete unused struct fields
@tomhjp
Copy link
Contributor Author

tomhjp commented Sep 14, 2021

Rebased against master to resolve some conflicts. The branch this PR was based off was originally off an old release, but my plan for now is to merge to master and then cherry-pick into release branches

@tomhjp tomhjp marked this pull request as ready for review September 14, 2021 17:24
path_static_roles.go Outdated Show resolved Hide resolved
rotation.go Show resolved Hide resolved
rotation.go Outdated Show resolved Hide resolved
rotation.go Show resolved Hide resolved
rotation.go Outdated Show resolved Hide resolved
@kalafut
Copy link
Contributor

kalafut commented Sep 15, 2021

Can we consider renaming the function pathRotateCredentialsUpdate to pathRotateRootCredentialsUpdate. I'm continually confused, thinking this one is the non-root one since surely "root" would be called out.

* Take exclusive lock before reading config
* Validate input before creating output struct
* Comment updates
Copy link
Member

@calvn calvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of smaller follow-up comments, but the overall implementation looks good!

rotation.go Outdated Show resolved Hide resolved
rotation.go Outdated Show resolved Hide resolved
rotation.go Outdated Show resolved Hide resolved
Copy link
Member

@briankassouf briankassouf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor things, otherwise the changes look good

merr = multierror.Append(merr, err)
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to do the WAL delete before the role deletion, that way we have an avenue to retry the operation if the WAL cleanup fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried implementing this, but I think I'd prefer to leave WAL cleanup as a best effort, because although it has its own consistency problems, the other order does as well. If we delete WALs before the role and fail, we'll need to push the item back onto the queue, which itself could fail. In practice though, deleting roles and WALs both rely on the same storage layer, so at least these partial failure scenarios should be very rare.

We could address the inconsistency of leaving WALs lying around fairly robustly by keying WALs on a UUID whose lifetime is attached to a role's storage entry instead of keying WALs on the role name. That obviously wouldn't be a backport candidate though. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, let's leave it as is for now and revisit any improvements in a subsequent PR

rotation.go Show resolved Hide resolved
@tomhjp tomhjp merged commit e86105e into master Sep 21, 2021
@tomhjp tomhjp deleted the fix-early-rotation-on-init branch September 21, 2021 17:14
tomhjp added a commit that referenced this pull request Sep 21, 2021
* Add more WAL logging
* Fix early rotation for roles with WALs, handle multiple WALs per role
* Respect previous WAL's new password
* Ensure only 1 WAL per role
* Add a warning to manual rotation response when rotation not immediately successful
* Remove re-storing of mount config when rotating unrelated roles
* Discard all WALs with a previous rotation time of 0
* Remove deleted WAL IDs from queue items
* Delete unused struct fields
* Switch from warning to error to correct HTTP status code from 400 -> 500
* Delete WALs on failed role creation or role deletion
* Take exclusive lock before reading config
* Fix manual rotate not respecting WAL ID
* Add tests for processing of stored WALs
* Remove unnecessary multierror
* Add last check on newPassword
tomhjp added a commit that referenced this pull request Sep 21, 2021
* Add more WAL logging
* Fix early rotation for roles with WALs, handle multiple WALs per role
* Respect previous WAL's new password
* Ensure only 1 WAL per role
* Add a warning to manual rotation response when rotation not immediately successful
* Remove re-storing of mount config when rotating unrelated roles
* Discard all WALs with a previous rotation time of 0
* Remove deleted WAL IDs from queue items
* Delete unused struct fields
* Switch from warning to error to correct HTTP status code from 400 -> 500
* Delete WALs on failed role creation or role deletion
* Take exclusive lock before reading config
* Fix manual rotate not respecting WAL ID
* Add tests for processing of stored WALs
* Remove unnecessary multierror
* Add last check on newPassword
tomhjp added a commit that referenced this pull request Sep 21, 2021
* Add more WAL logging
* Fix early rotation for roles with WALs, handle multiple WALs per role
* Respect previous WAL's new password
* Ensure only 1 WAL per role
* Add a warning to manual rotation response when rotation not immediately successful
* Remove re-storing of mount config when rotating unrelated roles
* Discard all WALs with a previous rotation time of 0
* Remove deleted WAL IDs from queue items
* Delete unused struct fields
* Switch from warning to error to correct HTTP status code from 400 -> 500
* Delete WALs on failed role creation or role deletion
* Take exclusive lock before reading config
* Fix manual rotate not respecting WAL ID
* Add tests for processing of stored WALs
* Remove unnecessary multierror
* Add last check on newPassword
calvn pushed a commit that referenced this pull request Sep 21, 2021
…#31)

* Add more WAL logging
* Fix early rotation for roles with WALs, handle multiple WALs per role
* Respect previous WAL's new password
* Ensure only 1 WAL per role
* Add a warning to manual rotation response when rotation not immediately successful
* Remove re-storing of mount config when rotating unrelated roles
* Discard all WALs with a previous rotation time of 0
* Remove deleted WAL IDs from queue items
* Delete unused struct fields
* Switch from warning to error to correct HTTP status code from 400 -> 500
* Delete WALs on failed role creation or role deletion
* Take exclusive lock before reading config
* Fix manual rotate not respecting WAL ID
* Add tests for processing of stored WALs
* Remove unnecessary multierror
* Add last check on newPassword
calvn pushed a commit that referenced this pull request Sep 21, 2021
…#30)

* Add more WAL logging
* Fix early rotation for roles with WALs, handle multiple WALs per role
* Respect previous WAL's new password
* Ensure only 1 WAL per role
* Add a warning to manual rotation response when rotation not immediately successful
* Remove re-storing of mount config when rotating unrelated roles
* Discard all WALs with a previous rotation time of 0
* Remove deleted WAL IDs from queue items
* Delete unused struct fields
* Switch from warning to error to correct HTTP status code from 400 -> 500
* Delete WALs on failed role creation or role deletion
* Take exclusive lock before reading config
* Fix manual rotate not respecting WAL ID
* Add tests for processing of stored WALs
* Remove unnecessary multierror
* Add last check on newPassword
calvn pushed a commit that referenced this pull request Sep 21, 2021
…#29)

* Add more WAL logging
* Fix early rotation for roles with WALs, handle multiple WALs per role
* Respect previous WAL's new password
* Ensure only 1 WAL per role
* Add a warning to manual rotation response when rotation not immediately successful
* Remove re-storing of mount config when rotating unrelated roles
* Discard all WALs with a previous rotation time of 0
* Remove deleted WAL IDs from queue items
* Delete unused struct fields
* Switch from warning to error to correct HTTP status code from 400 -> 500
* Delete WALs on failed role creation or role deletion
* Take exclusive lock before reading config
* Fix manual rotate not respecting WAL ID
* Add tests for processing of stored WALs
* Remove unnecessary multierror
* Add last check on newPassword
@tomhjp tomhjp mentioned this pull request Oct 18, 2021
@jasonodonnell jasonodonnell removed their request for review August 9, 2022 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants