Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: tolerate partial restore failure from persistent cache #12718

Merged
merged 4 commits into from Oct 8, 2021

Conversation

tomhjp
Copy link
Contributor

@tomhjp tomhjp commented Oct 4, 2021

Agent will continue with the restore process after encountering any errors, and then log any errors. If exit_on_err is false it will continue after that.

I've written some unit testing, but I haven't thought of a good way to introduce errors in an integration or manual testing environment yet. Although I think the unit testing does have decent coverage.

@vercel vercel bot temporarily deployed to Preview – vault October 5, 2021 13:22 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook October 5, 2021 13:22 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook October 5, 2021 13:31 Inactive
@vercel vercel bot temporarily deployed to Preview – vault October 5, 2021 13:31 Inactive
Comment on lines -1 to -3
+```release-note:improvement
+auth/kubernetes: validate JWT against the provided role on alias look ahead operations
+```
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing a small syntax error in the previous changelog

@tomhjp tomhjp marked this pull request as ready for review October 5, 2021 13:40
@tomhjp tomhjp requested review from a team, tvoran, imthaghost and benashz October 5, 2021 13:40
@@ -969,56 +970,68 @@ func (c *LeaseCache) Flush() error {
// tokens first, since restoring a lease's renewal context and watcher requires
// looking up the token in the cachememdb.
func (c *LeaseCache) Restore(ctx context.Context, storage *cacheboltdb.BoltStorage) error {
var errors *multierror.Error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be worth looking into down the road for how we define custom errors.
https://github.com/uber-go/guide/blob/master/style.md#error-types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which bit of the guide are you referring to? The only bit I see that refers to multiple errors is wrapping, and that's not appropriate here because the errors we're collecting at this level are "peers" rather than forming part of a stack. In the case that there are errors, we want to log a message about what failed and optionally continue (if exit_on_err is false in the cache config).

Copy link
Member

@calvn calvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One meta-comment that I want to leave here is that this could make it trickier for us to implement VAULT-1096 (ordered restores) since we could get into a state where the dependency graph is incomplete, say if we failed to restore a parent. For instance, what happens to the lifecycle of its child tokens?

@vercel vercel bot temporarily deployed to Preview – vault October 7, 2021 13:35 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook October 7, 2021 13:35 Inactive
Copy link
Contributor

@benashz benashz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tomhjp
Copy link
Contributor Author

tomhjp commented Oct 8, 2021

since we could get into a state where the dependency graph is incomplete, say if we failed to restore a parent. For instance, what happens to the lifecycle of its child tokens?

Good question - I've addressed this in #12765. If we fail to restore a parent, we need to ensure we don't block waiting for it forever, and when we process that child lease, we won't be able to restore it just as before.

@tomhjp tomhjp merged commit 67069f0 into main Oct 8, 2021
@tomhjp tomhjp deleted the agent-persistent-cache-restore branch October 8, 2021 10:30
@tvoran tvoran added this to the 1.9 milestone Oct 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants