
Handle out of order events #5071

Merged
20 commits merged on Jun 4, 2024
Conversation

faisal-memon
Contributor

@faisal-memon faisal-memon commented Apr 15, 2024

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality
Events based cache

Description of change
Events can come out of order from sql. If we skip over an event, retry it later.

Which issue this PR fixes
fixes #5021

@azdagron azdagron added this to the 1.9.5 milestone Apr 18, 2024
@@ -115,28 +122,62 @@ func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCa

seenMap := map[string]struct{}{}
for _, event := range resp.Events {
// If there is a gap in the event log, log the missed events for later processing
if event.EventID != a.lastRegistrationEntryEventID+1 {
Contributor Author

Updated to check for zero

a.missedRegistrationEntryEvents[i] = struct{}{}
}
}

// Skip fetching entries we've already fetched this call
if _, seen := seenMap[event.EntryID]; seen {
a.lastRegistrationEntryEventID = event.EventID
Contributor Author

We get the list of events in ascending order

@faisal-memon faisal-memon marked this pull request as ready for review May 1, 2024 21:39
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@azdagron azdagron modified the milestones: 1.9.5, 1.10.0 May 7, 2024
pkg/common/telemetry/server/datastore/event.go (4 outdated review threads, resolved)
Comment on lines 322 to 335
var lastEventID uint
missedRegistrationEntryEvents := make(map[uint]time.Time)
resp, err := ds.ListRegistrationEntriesEvents(ctx, &datastore.ListRegistrationEntriesEventsRequest{})
if err != nil {
return 0, nil, err
}
for _, event := range resp.Events {
if event.EventID != lastEventID+1 && lastEventID != 0 {
for i := lastEventID + 1; i < event.EventID; i++ {
missedRegistrationEntryEvents[i] = clk.Now()
}
}
lastEventID = event.EventID
}
Collaborator

Is this required? Aren't we getting all the entries after a restart?

Contributor Author

We need to look back in the event history to see if there are any gaps, because an event might commit after we fetch all the entries.

Contributor

@MarcosDY Yes, it is required. When a long-running write transaction that is modifying SPIRE entries doesn't commit because it is hung, it leaves gaps in the auto-increment keys of the event table. By tracking which event IDs in the event table are missing, we can then go back and explicitly poll the skipped event IDs prior to scanning the table for new entry event IDs. This is the best that can be done at the moment to ensure that we don't skip event IDs when other transactions push the last seen event ID beyond those that are stuck in hung database transactions.

func buildAttestedNodesCache(ctx context.Context, ds datastore.DataStore, clk clock.Clock, cache *authorizedentries.Cache) (uint, map[uint]time.Time, error) {
// Gather any events that may have been skipped during restart
var lastEventID uint
missedAttestedNodeEvents := make(map[uint]time.Time)
Collaborator

Same question as I asked for entries.

Contributor Author

We need to look back in the event history to see if there are any gaps, because an event might commit after we fetch all the entries.

Contributor

@MarcosDY Yes, it is required. When a long-running write transaction that is modifying SPIRE entries doesn't commit because it is hung, it leaves gaps in the auto-increment keys of the event table. By tracking which event IDs in the event table are missing, we can then go back and explicitly poll the skipped event IDs prior to scanning the table for new entry event IDs. This is the best that can be done at the moment to ensure that we don't skip event IDs when other transactions push the last seen event ID beyond those that are stuck in hung database transactions.

require.NoError(t, err)
require.NotNil(t, ef)

assert.Contains(t, ef.missedRegistrationEntryEvents, uint(2))
Collaborator

Can you use equal here, so we are sure there are no unexpected results?

Contributor

Could you suggest the alternative approach you intend?

Comment on lines +233 to +234
assert.Contains(t, ef.missedAttestedNodeEvents, uint(2))
assert.Contains(t, ef.missedAttestedNodeEvents, uint(3))
Collaborator

Can you use equal here, so we are sure there are no unexpected results?

Contributor

Faisal checked for equality of the count of returned items on line 231. Note that the actual items can come in any order (due to SQL and the lack of an ORDER BY statement here), so it is correct to check that each item is contained in the collection instead of assuming some sort of index-based equality.

Collaborator
@MarcosDY May 22, 2024

It is a map, so order does not matter when comparing maps; with that, equal will guarantee that you have 2 entries and that they are the expected entries.

Contributor

Could you suggest the alternative approach you intend?

Member

assert.ElementsMatch is useful when you don't need exact equality but want to make sure that the set of elements is equal, independent of ordering.

Member

And yeah, what Marcos said works. Map equality checks are already order-independent.

@edwbuck
Contributor

edwbuck commented May 17, 2024

@faisal-memon Can you push a signed copy? This is failing the DCO checks.

@edwbuck
Contributor

edwbuck commented May 17, 2024

@MarcosDY @faisal-memon @azdagron If you have time, let's see if we can get this reviewed early enough that it doesn't upset the 1.10.0 release.

Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@faisal-memon
Contributor Author

@faisal-memon Can you push a signed copy? This is failing the DCO checks.

DCO fixed.

faisal-memon and others added 7 commits May 22, 2024 16:12
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Co-authored-by: Marcos Yacob <marcosyacob@gmail.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Co-authored-by: Marcos Yacob <marcosyacob@gmail.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@faisal-memon
Contributor Author

@MarcosDY All comments addressed

Signed-off-by: Faisal Memon <fymemon@yahoo.com>
CacheReloadInterval string `hcl:"cache_reload_interval"`
EventsBasedCache bool `hcl:"events_based_cache"`
PruneEventsOlderThan string `hcl:"prune_events_older_than"`
SQLTransactionTimeout string `hcl:"sql_transaction_timeout"`
Collaborator

Is it really required? Do you have a use case where a timeout distinct from the one used for pruning is useful?

log.Info("Building event-based in-memory entry cache")
cache, lastRegistrationEntryEventID, lastAttestedNodeEventID, err := buildCache(ctx, ds, clk)
cache, receivedFirstRegistrationEntryEvent, lastRegistrationEntryEventID, missedRegistrationEntryEvents, receivedFirstAttestedNodeEvent, lastAttestedNodeEventID, missedAttestedNodeEvents, err := buildCache(ctx, ds, clk)
Collaborator

This is too much for a return; it is better to return a struct here, or to reduce the scope of buildCache.

}

entry, err := api.RegistrationEntryToProto(commonEntry)
Collaborator

Any update on this comment?

defer a.mu.Unlock()

for eventID, eventTime := range a.missedRegistrationEntryEvents {
if time.Since(eventTime) > a.pruneEventsOlderThan {
Member

This should use a.clk (to get Now() and then do a Sub)

func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCache(ctx context.Context) error {
// Pocess events skipped over previously
Member

Suggested change
// Pocess events skipped over previously
// Process events skipped over previously

Comment on lines +233 to +234
assert.Contains(t, ef.missedAttestedNodeEvents, uint(2))
assert.Contains(t, ef.missedAttestedNodeEvents, uint(3))
Member

assert.ElementsMatch is useful when you don't need exact equality but want to make sure that the set of elements is equal, independent of ordering.

defer a.mu.Unlock()

for eventID, eventTime := range a.missedRegistrationEntryEvents {
if time.Since(eventTime) > a.sqlTransactionTimeout {
Member

Suggested change
if time.Since(eventTime) > a.sqlTransactionTimeout {
if a.clk.Now().Sub(eventTime) > a.sqlTransactionTimeout {

defer a.mu.Unlock()

for eventID, eventTime := range a.missedAttestedNodeEvents {
if time.Since(eventTime) > a.sqlTransactionTimeout {
Member

Suggested change
if time.Since(eventTime) > a.sqlTransactionTimeout {
if a.clk.Now().Sub(eventTime) > a.sqlTransactionTimeout {

@@ -51,6 +51,9 @@ const (

// This is the default amount of time events live before they are pruned
defaultPruneEventsOlderThan = 12 * time.Hour

// This is the default SQL transaction timeout. This value matches MYSQL's default.
Member

From my understanding, the Postgres default is the larger of the two. For safety's sake, should the default be the larger value?

@@ -115,32 +157,85 @@ func (a *AuthorizedEntryFetcherWithEventsBasedCache) updateRegistrationEntriesCa

seenMap := map[string]struct{}{}
for _, event := range resp.Events {
// If there is a gap in the event stream, log the missed events for later processing
Member

How long can this gap realistically get? Would it be better to track this via ranges instead of individually?

Contributor

There is no known limit on the gap, because the gap will contain all of the event ids that are stuck in a transaction. With the bulk transaction APIs, this might be more than just an item or two. That said, we also support single item updates, so stuck transactions on single item updates would only contain one event id.

Also, since the locking on the event id SQL sequence is on a per-row request, and not on a per-transaction request, the numbering of the ids is not guaranteed to be a single range, for example a request to update 4 items might easily contain ids (1001, 1003, 1004, 1006) if some other in-flight update managed to snag 1002 and 1005.

If we want to shift to ranges, it would be a blind optimization on behavior we haven't even started to address. I would argue that we should add metrics (as blocked issues #4836 #4837 and #4720 suggest) and then decide if the routine needs further optimization.

Keep in mind that all polling for event ids will only occur on a stuck transaction, which according to Uber, occurs ~0.02% of the time, and may occur more or less frequently depending on database performance and event update patterns. Additional optimization might not make sense for such a small frequency of occurrences, but fixing the logical errors of missing an event is a very high priority.

Member

That's fair. I can buy the argument that using ranges might be premature at this point without some further insight.

Signed-off-by: Marcos Yacob <marcos.yacob@hpe.com>
@MarcosDY MarcosDY merged commit 09e0e36 into spiffe:main Jun 4, 2024
33 checks passed
Successfully merging this pull request may close these issues.

Server instances using events-based entry cache occasionally have sticky, stale entries
4 participants