
Handle out of order events #5071

Merged
20 commits merged on Jun 4, 2024
Conversation

faisal-memon
Contributor

@faisal-memon faisal-memon commented Apr 15, 2024

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality
Events based cache

Description of change
Events can come out of order from sql. If we skip over an event, retry it later.

Which issue this PR fixes
fixes #5021

@azdagron azdagron added this to the 1.9.5 milestone Apr 18, 2024
@@ -115,28 +122,62 @@ func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCa

seenMap := map[string]struct{}{}
for _, event := range resp.Events {
// If there is a gap in the event log, log the missed events for later processing
if event.EventID != a.lastRegistrationEntryEventID+1 {
Contributor Author

Updated to check for zero

a.missedRegistrationEntryEvents[i] = struct{}{}
}
}

// Skip fetching entries we've already fetched this call
if _, seen := seenMap[event.EntryID]; seen {
a.lastRegistrationEntryEventID = event.EventID
Contributor Author

We get the list of events in ascending order

@faisal-memon faisal-memon marked this pull request as ready for review May 1, 2024 21:39
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@azdagron azdagron modified the milestones: 1.9.5, 1.10.0 May 7, 2024
pkg/common/telemetry/server/datastore/event.go (4 outdated review threads, resolved)
Comment on lines 322 to 335
var lastEventID uint
missedRegistrationEntryEvents := make(map[uint]time.Time)
resp, err := ds.ListRegistrationEntriesEvents(ctx, &datastore.ListRegistrationEntriesEventsRequest{})
if err != nil {
return 0, nil, err
}
for _, event := range resp.Events {
if event.EventID != lastEventID+1 && lastEventID != 0 {
for i := lastEventID + 1; i < event.EventID; i++ {
missedRegistrationEntryEvents[i] = clk.Now()
}
}
lastEventID = event.EventID
}
Collaborator

Is this required? Aren't we getting all the entries after a restart?

Contributor Author

We need to look back in the event history to see if there are any gaps, because an event might commit after we fetch all the entries.

Contributor

@MarcosDY Yes, it is required. When a long-running write transaction that is modifying SPIRE entries doesn't commit because it is hung, it leaves gaps in the auto-increment keys of the event table. By tracking which event IDs in the event table are missing, we can then go back and explicitly poll the skipped event IDs prior to scanning the table for new entry event IDs. This is the best that can be done at the moment to ensure that we don't skip event IDs when other transactions push the last seen event ID beyond those that are stuck in hung database transactions.

func buildAttestedNodesCache(ctx context.Context, ds datastore.DataStore, clk clock.Clock, cache *authorizedentries.Cache) (uint, map[uint]time.Time, error) {
// Gather any events that may have been skipped during restart
var lastEventID uint
missedAttestedNodeEvents := make(map[uint]time.Time)
Collaborator

Same question as I asked for entries.

Contributor Author

We need to look back in the event history to see if there are any gaps, because an event might commit after we fetch all the entries.

Contributor

@MarcosDY Yes, it is required. When a long-running write transaction that is modifying SPIRE entries doesn't commit because it is hung, it leaves gaps in the auto-increment keys of the event table. By tracking which event IDs in the event table are missing, we can then go back and explicitly poll the skipped event IDs prior to scanning the table for new entry event IDs. This is the best that can be done at the moment to ensure that we don't skip event IDs when other transactions push the last seen event ID beyond those that are stuck in hung database transactions.

require.NoError(t, err)
require.NotNil(t, ef)

assert.Contains(t, ef.missedRegistrationEntryEvents, uint(2))
Collaborator

Can you use equal here, so we are sure there are no unexpected results?

Contributor

Could you suggest the alternative approach you intend?

Comment on lines +233 to +234
assert.Contains(t, ef.missedAttestedNodeEvents, uint(2))
assert.Contains(t, ef.missedAttestedNodeEvents, uint(3))
Collaborator

Can you use equal here, so we are sure there are no unexpected results?

Contributor

Faisal checked for equality of the count of returned items on line 231. Note that the actual items can come in any order (due to SQL and the lack of an ORDER BY statement here), so it is correct to check that each item is contained in the collection instead of assuming some sort of index-based equality.

Collaborator
@MarcosDY May 22, 2024

It is a map, so order does not matter when comparing maps; with that, equal will guarantee that you have 2 entries and that they are the expected entries.

Contributor

Could you suggest the alternative approach you intend?

Member

assert.ElementsMatch is useful when you don't need exact equality but want to make sure that the set of elements is equal, independent of ordering.

Member

And yeah, what Marcos said works. Map equality checks are already order-independent.

@edwbuck
Contributor

edwbuck commented May 17, 2024

@faisal-memon Can you push a signed copy? This is failing the DCO checks.

@edwbuck
Contributor

edwbuck commented May 17, 2024

@MarcosDY @faisal-memon @azdagron If you have time, let's see if we can get this reviewed early enough that it doesn't upset the 1.10.0 release.

Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@faisal-memon
Contributor Author

@faisal-memon Can you push a signed copy? This is failing the DCO checks.

DCO fixed.

faisal-memon and others added 7 commits May 22, 2024 16:12
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Co-authored-by: Marcos Yacob <marcosyacob@gmail.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Co-authored-by: Marcos Yacob <marcosyacob@gmail.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@faisal-memon
Contributor Author

@MarcosDY All comments addressed

Signed-off-by: Faisal Memon <fymemon@yahoo.com>
CacheReloadInterval string `hcl:"cache_reload_interval"`
EventsBasedCache bool `hcl:"events_based_cache"`
PruneEventsOlderThan string `hcl:"prune_events_older_than"`
SQLTransactionTimeout string `hcl:"sql_transaction_timeout"`
Collaborator

Is it really required? Do you have a use case where a timeout distinct from the one used for pruning is useful?

log.Info("Building event-based in-memory entry cache")
cache, lastRegistrationEntryEventID, lastAttestedNodeEventID, err := buildCache(ctx, ds, clk)
cache, receivedFirstRegistrationEntryEvent, lastRegistrationEntryEventID, missedRegistrationEntryEvents, receivedFirstAttestedNodeEvent, lastAttestedNodeEventID, missedAttestedNodeEvents, err := buildCache(ctx, ds, clk)
Collaborator

This is too much for a return; it is better to return a struct here, or to reduce the scope of buildCache.

}

entry, err := api.RegistrationEntryToProto(commonEntry)
Collaborator

Any update on this comment?

defer a.mu.Unlock()

for eventID, eventTime := range a.missedRegistrationEntryEvents {
if time.Since(eventTime) > a.pruneEventsOlderThan {
Member

This should use a.clk (to get Now() and then do a Sub)

func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCache(ctx context.Context) error {
// Pocess events skipped over previously
Member

Suggested change
// Pocess events skipped over previously
// Process events skipped over previously

Comment on lines +233 to +234
assert.Contains(t, ef.missedAttestedNodeEvents, uint(2))
assert.Contains(t, ef.missedAttestedNodeEvents, uint(3))
Member

assert.ElementsMatch is useful when you don't need exact equality but want to make sure that the set of elements is equal, independent of ordering.

defer a.mu.Unlock()

for eventID, eventTime := range a.missedRegistrationEntryEvents {
if time.Since(eventTime) > a.sqlTransactionTimeout {
Member

Suggested change
if time.Since(eventTime) > a.sqlTransactionTimeout {
if a.clk.Now().Sub(eventTime) > a.sqlTransactionTimeout {

defer a.mu.Unlock()

for eventID, eventTime := range a.missedAttestedNodeEvents {
if time.Since(eventTime) > a.sqlTransactionTimeout {
Member

Suggested change
if time.Since(eventTime) > a.sqlTransactionTimeout {
if a.clk.Now().Sub(eventTime) > a.sqlTransactionTimeout {

@@ -51,6 +51,9 @@ const (

// This is the default amount of time events live before they are pruned
defaultPruneEventsOlderThan = 12 * time.Hour

// This is the default SQL transaction timeout. This value matches MYSQL's default.
Member

From my understanding, the Postgres default is the larger of the two. For safety's sake, should the default be the larger value?

@@ -115,32 +157,85 @@ func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCa

seenMap := map[string]struct{}{}
for _, event := range resp.Events {
// If there is a gap in the event stream, log the missed events for later processing
Member

How long can this gap realistically get? Would it be better to track this via ranges instead of individually?

Contributor

There is no known limit on the gap, because the gap will contain all of the event ids that are stuck in a transaction. With the bulk transaction APIs, this might be more than just an item or two. That said, we also support single item updates, so stuck transactions on single item updates would only contain one event id.

Also, since the locking on the event id SQL sequence is on a per-row request, and not on a per-transaction request, the numbering of the ids is not guaranteed to be a single range, for example a request to update 4 items might easily contain ids (1001, 1003, 1004, 1006) if some other in-flight update managed to snag 1002 and 1005.

If we want to shift to ranges, it would be a blind optimization on behavior we haven't even started to address. I would argue that we should add metrics (as blocked issues #4836 #4837 and #4720 suggest) and then decide if the routine needs further optimization.

Keep in mind that all polling for event ids will only occur on a stuck transaction, which according to Uber, occurs ~0.02% of the time, and may occur more or less frequently depending on database performance and event update patterns. Additional optimization might not make sense for such a small frequency of occurrences, but fixing the logical errors of missing an event is a very high priority.

Member

That's fair. I can buy the argument that using ranges might be premature at this point without some further insight.

Signed-off-by: Marcos Yacob <marcos.yacob@hpe.com>
@MarcosDY MarcosDY merged commit 09e0e36 into spiffe:main Jun 4, 2024
33 checks passed
Successfully merging this pull request may close these issues.

Server instances using events-based entry cache occasionally have sticky, stale entries
4 participants