
Server instances using events-based entry cache occasionally have sticky, stale entries #5021

Open
rturner3 opened this issue Mar 26, 2024 · 25 comments · May be fixed by #5071
Labels: priority/urgent (Issue is approved and must be completed in the assigned milestone)

@rturner3
Collaborator

We've seen this error log being emitted regularly in our deployments of SPIRE since enabling the events-based cache: Entry not found or not authorized.

We've seen this error get repeatedly logged for a few entries (~0.01% of total entries). I noticed this log gets emitted for a long time, for days on end. Looking at a few examples, the pattern I've seen is that one server instance is serving an entry from its cache that no longer exists in the database. This causes agents to pick up this entry when they round-robin to this bad server instance (let's call it instance A).

Then, when the agent round-robins to the next server instance B, it asks for an X.509-SVID to be signed for this old entry ID. Server instance B denies the signing request because the entry ID doesn't match any entry in its cache, then logs the aforementioned error message.

There doesn't appear to be any way to mitigate this other than restarting the server to force a refresh of its entry cache.

  • Version: 1.9.0
  • Platform: Linux
  • Subsystem: Server

Relevant Server config:

server {
    experimental {
        events_based_cache = true
        cache_reload_interval = "1s"
        prune_events_older_than = "6h"
    }
}
@faisal-memon
Contributor

Please assign this to me

@faisal-memon
Contributor

@rturner3 One theory I have for this is that, upon getting a delete event, we read from the database and get back a valid entry. Is it possible we are reading from a replica that has stale data?

@rturner3
Collaborator Author

We do configure read-only replicas in SPIRE Server

@azdagron
Member

To be explicit, this is configured via "ro_connection_string", correct?

@rturner3
Collaborator Author

To be explicit, this is configured via "ro_connection_string", correct?

Yes

@sorindumitru
Contributor

I think this is somewhat similar to #4329 (or at least shares its root cause), just that now it affects events, which leads to more visible issues. I think the conclusion there was that we need the serializable isolation level.

@sorindumitru
Contributor

I also noticed that during startup we fetch the latest event ID and then list the entries/agents in two independent transactions:

lastEventID, err := ds.GetLatestRegistrationEntryEventID(ctx)

Should this happen in a single transaction to guarantee a consistent view of the world?
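
For illustration only, a minimal sketch of what reading both in one transaction could look like. The transaction helper and method names here (begin, ReadTx, ListRegistrationEntries, and so on) are hypothetical stand-ins, not the actual SPIRE datastore API:

package entrycache

import "context"

// Entry is a placeholder for the registration entry type.
type Entry struct{ ID string }

// ReadTx is a hypothetical read-only transaction handle.
type ReadTx interface {
	GetLatestRegistrationEntryEventID(ctx context.Context) (uint, error)
	ListRegistrationEntries(ctx context.Context) ([]*Entry, error)
	Commit() error
	Rollback() error
}

// buildInitialCache captures the latest event ID and the entry list in a
// single transaction so both reflect the same snapshot of the database.
func buildInitialCache(ctx context.Context, begin func(context.Context) (ReadTx, error)) (uint, []*Entry, error) {
	tx, err := begin(ctx)
	if err != nil {
		return 0, nil, err
	}
	defer tx.Rollback() // harmless after a successful Commit

	lastEventID, err := tx.GetLatestRegistrationEntryEventID(ctx)
	if err != nil {
		return 0, nil, err
	}
	entries, err := tx.ListRegistrationEntries(ctx)
	if err != nil {
		return 0, nil, err
	}
	return lastEventID, entries, tx.Commit()
}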

@faisal-memon
Contributor

@sorindumitru

I also noticed that during startup we fetch the latest event ID and then list the entries/agents in two independent transactions:

lastEventID, err := ds.GetLatestRegistrationEntryEventID(ctx)

Should this happen in a single transaction to guarantee a consistent view of the world?

@sorindumitru In that case I think it's OK that it's not behind a transaction. Since the events table is basically a queue, if we get some events in between the two calls we'll do some extra reads to the database, but things will still be consistent.

@faisal-memon
Contributor

@rturner3 Found an issue in the code below that could lead to a dropped event:

node, err := a.ds.FetchAttestedNode(ctx, event.SpiffeID)
if err != nil {
	return err
}
a.lastAttestedNodeEventID = event.EventID
if node == nil {
	a.cache.RemoveAgent(event.SpiffeID)
	continue
}
selectors, err := a.ds.GetNodeSelectors(ctx, event.SpiffeID, datastore.RequireCurrent)
if err != nil {
	return err
}

If the call to GetNodeSelectors() fails, we mark the event as processed and never retry it (see the sketch at the end of this comment). If that is the case, you would see this error message:

a.log.WithError(err).Error("Failed to update entry cache")

Is that happening in your deployment?
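
For reference, a rough sketch (not the actual patch) of the reordering that would avoid dropping the event: only advance lastAttestedNodeEventID after the event has been fully applied. It reuses the field and method names from the snippet quoted above and elides the surrounding loop and cache update:

node, err := a.ds.FetchAttestedNode(ctx, event.SpiffeID)
if err != nil {
	return err
}
if node == nil {
	a.cache.RemoveAgent(event.SpiffeID)
	a.lastAttestedNodeEventID = event.EventID
	continue
}
selectors, err := a.ds.GetNodeSelectors(ctx, event.SpiffeID, datastore.RequireCurrent)
if err != nil {
	// Returning here without touching lastAttestedNodeEventID means the
	// event is retried on the next cache reload instead of being dropped.
	return err
}
// ... apply node and selectors to the cache, then mark the event processed:
a.lastAttestedNodeEventID = event.EventID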

@rturner3
Collaborator Author

I see some Failed to update entry cache log messages, all with the error datastore-sql: driver: bad connection. Unfortunately, I only have logs from the last 24 hours, and the logs I see are from just one of our deployments, but this is affecting all our deployments.

I do however have metrics data for all our deployments going back the past 40 days, and I don't see any failed calls to get node selectors (I checked this metric).

In any case, this logic does look suspect if we did ever fail to fetch node selectors since lastAttestedNodeEventID is already updated before we fetch the node selectors and properly process the agent event. But from what I can tell from our logs/metrics, it's not the specific issue we're seeing.

@rturner3
Collaborator Author

rturner3 commented Apr 2, 2024

A couple observations from recent logs:

  • Of the cases I've seen, the Entry not found or not authorized error message is only logged for one server instance in an HA deployment
  • I've seen this pattern correlated with workload entries that have expired and been pruned by the server
  • This seems to only affect a very small percentage of agents that are authorized to these entries. In all the cases I saw, it was just one authorized agent that kept requesting an SVID for the old, pruned entry
  • The affected agents otherwise appear healthy by other agent metrics
  • The affected agents I looked at all recently came up for the first time (within the past few days)
  • When searching by the entry_id field in agent logs for the stale entries which have been pruned, I see the agent is repeatedly logging the message Creating X509-SVID with the stale entry ID in the entry_id field (code)

We also have the agent LRU cache feature enabled.

Just thinking out loud, is there some edge case we might be hitting where:

  • An entry X is pruned/deleted in the server
  • An agent comes up for the first time, asks for its authorized entries from server instance A that still has X in its cache
  • The agent then tries to fetch the first SVID for X, but fails because the agent round-robins requests between server instances, and the other server instances don't have X in their entry cache
  • Agent indefinitely retries fetching SVID for X because it never got the first SVID to put in its cache

It still doesn't fully explain to me why we see the Entry not found or not authorized message for all but one instance of the server though 🤔

@amartinezfayo added this to the 1.9.3 milestone Apr 4, 2024
@amartinezfayo added the priority/backlog and priority/urgent labels and removed the triage/in-progress and priority/backlog labels Apr 4, 2024
@amartinezfayo modified the milestones: 1.9.3, 1.9.4 Apr 4, 2024
@edwbuck
Contributor

edwbuck commented Apr 5, 2024

Mentioning #4985 as this may be a blocker. There have been two additional changes to the event cache proposed in PRs; can we verify whether those address some or all of the suspected issue described here?

@amartinezfayo modified the milestones: 1.9.4, 1.9.5 Apr 6, 2024
@faisal-memon
Contributor

faisal-memon commented Apr 9, 2024

I think we found the root cause for this. The problem is with how SQL auto_increment works within transactions. Look at the code below, where registration entries are created, and let's say we have 3 threads: 2 writers and 1 reader.

if err = ds.withWriteTx(ctx, func(tx *gorm.DB) (err error) {
	registrationEntry, err = lookupSimilarEntry(ctx, ds.db, tx, entry)
	if err != nil {
		return err
	}
	if registrationEntry != nil {
		existing = true
		return nil
	}
	registrationEntry, err = createRegistrationEntry(tx, entry)
	if err != nil {
		return err
	}
	return createRegistrationEntryEvent(tx, registrationEntry.EntryId)
}); err != nil {

  • Thread 1 goes through to create an entry. It writes the entry and creates the event with event ID 100 and puts it in the transaction buffer, but hasn't committed yet.
  • Thread 2 goes through to create a separate entry. It writes the entry and creates the event with event ID 101 and puts it in the transaction buffer.
  • Thread 2 commits the transaction with event ID 101.
  • Thread 3 requests all the latest events; it gets up to 101 but not event 100.
  • Thread 1 commits with event ID 100.
  • Thread 3 requests the latest events > 101, skipping over 100.

SQL doesn't guarantee auto_increment values are committed in order across transactions, as it's a two-step process of first securing the next index and then committing. It's kind of like a read-modify-write cycle.

The problem is more visible with the prune registration entries functionality, as the transaction there spans a loop of multiple deletes and event creates, allowing a bigger window for this race condition to happen.

What we need to solve this is an always-increasing index in SQL, or more generally a way to get "what's new" since the last read.
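
To make the failure mode concrete, here is an illustrative sketch (hypothetical names, not the SPIRE implementation) of a poller that only asks for IDs greater than the last one it saw, which is exactly where an out-of-order commit gets lost:

package events

import "context"

type Event struct{ ID uint }

type DB interface {
	// ListEventsAfter returns committed events with ID > afterID, in ascending order.
	ListEventsAfter(ctx context.Context, afterID uint) ([]Event, error)
}

type Poller struct {
	db          DB
	lastEventID uint
}

func (p *Poller) Poll(ctx context.Context) ([]Event, error) {
	events, err := p.db.ListEventsAfter(ctx, p.lastEventID)
	if err != nil {
		return nil, err
	}
	for _, e := range events {
		// If event 101 commits before event 100, this poll sees 101 and
		// advances lastEventID past it. When 100 finally commits, the next
		// poll only asks for IDs > 101, so event 100 is never observed.
		p.lastEventID = e.ID
	}
	return events, nil
}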

@faisal-memon
Contributor

A couple of solutions to this problem:

  1. Replay events from now until some time in the past every time a cache reload is triggered. The logic for this would be pretty simple, but it could also trigger lots of unnecessary database reads.
  2. Save any skipped events along with a timestamp during each cache reload operation. Then, during each subsequent cache reload operation, try to fetch the skipped events and remove them from the list if successful. Remove them from the list regardless after some amount of time has passed; this can be the same as the current prune_events_older_than, as the event would be pruned by that point anyway.
  3. Use transaction commit timestamps. Only available in Postgres, as far as I know.

Options 1 and 2 both depend on a "transaction timeout": in the case where the server crashes in the middle of a transaction, the skipped event will never be written. This is a database configurable, which I think is 8 hours by default on MySQL.

As this is not a common occurrence, I think option 2 would work best.
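
A rough sketch of what option 2 could look like, assuming hypothetical names throughout (this is not the actual fix): remember the IDs in any gap, retry them on each reload, and give up once they are older than the prune window.

package events

import (
	"context"
	"time"
)

// SkippedEventTracker remembers event IDs that were skipped over and retries
// them until they either show up or age out.
type SkippedEventTracker struct {
	skipped map[uint]time.Time // event ID -> when the gap was first observed
	maxAge  time.Duration      // e.g. the prune_events_older_than setting
}

func NewSkippedEventTracker(maxAge time.Duration) *SkippedEventTracker {
	return &SkippedEventTracker{skipped: make(map[uint]time.Time), maxAge: maxAge}
}

// NoteGap records every missing ID between the previously seen event and the
// current one.
func (t *SkippedEventTracker) NoteGap(prevID, currID uint, now time.Time) {
	for id := prevID + 1; id < currID; id++ {
		if _, ok := t.skipped[id]; !ok {
			t.skipped[id] = now
		}
	}
}

// Retry attempts to fetch each skipped event; fetch reports whether the event
// now exists in the database.
func (t *SkippedEventTracker) Retry(ctx context.Context, now time.Time, fetch func(context.Context, uint) (bool, error)) error {
	for id, seenAt := range t.skipped {
		if now.Sub(seenAt) > t.maxAge {
			delete(t.skipped, id) // the event would have been pruned by now anyway
			continue
		}
		found, err := fetch(ctx, id)
		if err != nil {
			return err
		}
		if found {
			delete(t.skipped, id)
		}
	}
	return nil
}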

@edwbuck
Contributor

edwbuck commented Apr 9, 2024

@faisal-memon and I have been discussing various solutions.

*** Faisal prefers a solution where we don't process entries more than once. To implement this, he proposed a skipped_entry list. I then refined this approach to ensure it provably does not mis-process entries.

  1. A skipped_entries collection will be maintained; it holds any entry IDs that might exist in the future but could not be processed as they don't exist yet. Initially, this list is empty.
  2. A last_polled_entry value will be maintained; it holds the last processed EntryID value.
  3. For each "polling loop":
     A. For each element in the skipped_entries collection, perform the following:
        1. If the skipped_entry was recorded more than two "transaction timeout" periods ago, remove the skipped_entry from the skipped_entries collection without performing any action.
        2. If the skipped_entry was recorded less than two "transaction timeout" periods ago, query the events table for the skipped_entry by the skipped_entry EntryID.
           a. If the skipped_entry was detected, process the skipped_entry and remove it from the skipped_entries collection.
           b. If the skipped_entry was not detected, take no action. It will eventually be removed when it expires in a subsequent polling loop, at step 4.A.1.
     B. Maintain a last_entry value, initialized to the last_polled_entry value.
     C. Poll for new elements with EntryIDs exceeding the last_polled_entry value. For each new element:
        1. The new element's EntryID is compared to the last_entry value. If the element's EntryID is greater than the last_entry value, all missing entry ID values between the two will be added to the skipped_entries list and the last_entry value is updated to the element's EntryID.
        2. If the element's EntryID is less than the last_entry value, remove the element's EntryID from the skipped_entries list.
        3. The condition where the element's EntryID is equal to the last_entry value is not required to be handled, as EntryIDs are primary keys.
     D. After all new elements are processed, update the last_polled_entry value to the last_entry value.

The algorithm can be simplified conceptually by these features:

  • The "last processed entry ID" only increases, putting any skipped items into a skipped_entries collection.
  • The skipped_entries collection only has one means of insertion, by having the likely entry ID skipped during the advancement of the "last processed entry ID" progression.
  • The skipped_entries collection has two means of removal:
    • The first means of removal is that the entry is detected and processed elsewhere, either in the pre-batch looping over the skipped entries or in the in-batch processing for out-of-order items that exist in the same batch query.
    • The second means of removal is that it is not detected within the transaction timeout plus a buffer (the above algorithm suggests a buffer equal to the transaction timeout, for a check window of 2 × transaction_timeout).

*** If Faisal deems the above algorithm too complex, we will likely pivot to the next complete solution:

  1. Rather than poll for new Entry IDs from the last known Entry ID, we will track the time of the last known Entry ID.
  2. Subsequent polling updates will then poll from "NOW() minus (transaction_timeout + buffer)", ensuring that we capture all open transactions that might have since completed. Processing of the entries will proceed as it already does (see the sketch below).
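
As a sketch only (hypothetical names, and relying on the idempotency assumption noted below), the time-window variant could look like this:

package events

import (
	"context"
	"time"
)

type Event struct {
	ID        uint
	CreatedAt time.Time
}

type DB interface {
	// ListEventsCreatedAfter returns committed events created after the given time.
	ListEventsCreatedAfter(ctx context.Context, since time.Time) ([]Event, error)
}

// PollWindow re-reads every event inside the window rather than tracking the
// last seen ID, so late-committing transactions are eventually observed.
func PollWindow(ctx context.Context, db DB, txTimeout, buffer time.Duration, apply func(Event) error) error {
	since := time.Now().Add(-(txTimeout + buffer))
	events, err := db.ListEventsCreatedAfter(ctx, since)
	if err != nil {
		return err
	}
	for _, e := range events {
		// apply must be idempotent: events inside the window are re-processed
		// on every poll until they age out of it.
		if err := apply(e); err != nil {
			return err
		}
	}
	return nil
}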

The algorithm above depends on the current assumption that all entries can be processed idempotently, which was a goal of the original event processing system. Note that this assumes all registration entry events remain idempotent going forward.

If neither of the two solutions is deemed sufficient, the remaining solutions are still possibly viable:

  • Use of client clocks. The entry is written with a timestamp from the client's notion of time (ideally, the client's UNIX clock). The server then maintains a vector clock of the "processed time", where it tracks each client's time independently, and the client becomes responsible for writing an ever-increasing number.

  • Use of a "entry event processed" table, where each server that processes an event would document that it processed the event. This would work, but complicate querying for new events, as each server would then query for all entry events that didn't join to a "processed" record for that server. Such a query would involve full table scans, as well as the requisite logic to write the "event was processed" record for each server's processing of each event.

*** The following solutions are deemed non-viable, with the reasoning given below:

  • Use of a row's "create" timestamp

The problem of capturing the auto-increment EntryID at the beginning of the transaction might seem fixable by capturing the row creation time instead; however, even with the time the row was written to the database captured, the same problem remains. Rows written to the database in a second session might not commit without significant delay. Such a delay commits these rows in an out-of-order time sequence, even if, after the commit, their timestamps correctly indicate the ordering of their creation. When the event polling fails to read these rows because they are not yet committed, it will not go backwards in time to capture the skipped events.

Additionally, while Postgres contains support for row-creation capture through the non-standard SQL function CLOCK_TIMESTAMP(), which upon each row's processing will immediately resolve the system clock time, MySQL does not support similar functionality.

NOW(), CURRENT_TIME, CURRENT_TIME(), and other variants of time-capturing variables and functions across all databases are pinned to the transaction start time, and are cached such that subsequent calls within the same transaction resolve to the same time.

  • Attempts to write the final commit time as the update time

This suggestion by Faisal would solve the event polling issues by ensuring that we only read transactions in the order in which they close, making it impossible to skip events. But such a solution is impossible, because by the time the transaction closes, the rows that are part of the transaction have already been written to the temporary data structures that comprise the commit. Rewriting these rows effectively means holding the transaction open to do the "commit close" timestamp update, which takes time and delays the commit close. This creates a chicken-and-egg problem, as it is impossible to both close the transaction and update the rows in zero time.

  • Attempts to track a per-server "needs to be processed" value that is removed as events are processed by that server.

This suggestion comes from the mistaken assumption that the server count and server identity are stable, neither of which is a requirement for the product. SPIRE servers may join or leave the cluster at any time, and it is impossible for the SPIRE client to predict when a joining server will perform its initial cache population. Since it cannot know when the cache is initialized, it follows that it cannot predict whether the "needs to be processed" value is required.

@faisal-memon
Contributor

I was able to create a unit test to reproduce this failure and have a prototype fix here: https://github.com/faisal-memon/spire/pull/344/files

@edwbuck
Contributor

edwbuck commented Apr 12, 2024

@faisal-memon would you do a Merge Request for your branch, and if it is incomplete, please mark it as in-progress?

@edwbuck
Contributor

edwbuck commented Apr 12, 2024

This is most likely the last open issue that blocks #4985

@faisal-memon
Contributor

@faisal-memon would you do a Merge Request for your branch, and if it is incomplete, please mark it as in-progress?

Will do once #5042 merges. My branch is based on top of that one.

@edwbuck
Contributor

edwbuck commented Apr 15, 2024

blocks #4985

@azdagron
Member

FYI, #5042 was merged this morning.

@faisal-memon linked a pull request Apr 15, 2024 that will close this issue
@MarcosDY modified the milestones: 1.9.5, 1.10.0 Apr 16, 2024
@azdagron modified the milestones: 1.10.0, 1.9.5 Apr 18, 2024
@faisal-memon
Contributor

The algorithm posted above will work great for covering skipped events during the polling loop. It doesn't cover the case of a server restart when there is a skipped event that hasn't resolved. So upon restart we have to get a list of all the events and look for any gaps. This leads to the issue below:

  1. Server 1 starts event 20 but doesn't finish
  2. Server 2 creates event 21
  3. Events < 20 get pruned
  4. Server 3 starts up at event 21
  5. Event 20 completes

In this scenario, event 20 will never get processed by Server 3.
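
For illustration (hypothetical helper, not the actual code), the startup gap scan mentioned above amounts to walking the sorted event IDs and recording any holes; as the scenario shows, it cannot recover event 20 once everything below it has already been pruned:

package events

// FindGaps returns every missing ID between consecutive entries of eventIDs,
// which is assumed to be sorted in ascending order as returned by the database.
func FindGaps(eventIDs []uint) []uint {
	var gaps []uint
	for i := 1; i < len(eventIDs); i++ {
		for id := eventIDs[i-1] + 1; id < eventIDs[i]; id++ {
			gaps = append(gaps, id)
		}
	}
	return gaps
}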

@azdagron
Member

This seems like a viable race condition. If there is a large gap between event 19 and 20, events < 20 could be quite old and in danger of pruning.

@edwbuck
Contributor

edwbuck commented Apr 23, 2024

The algorithm posted above will work great for covering skipped events during the polling loop. It doesn't cover the case of a server restart when there is a skipped event that hasn't resolved. So upon restart we have to get a list of all the events and look for any gaps. This leads to the issue below:

  1. Server 1 starts event 20 but doesn't finish
  2. Server 2 creates event 21
  3. Events < 20 get pruned
  4. Server 3 starts up at event 21
  5. Event 20 completes

In this scenario, event 20 will never get processed by Server 3.

I'm following this step by step:

  1. Server 1 starts event 20 but doesn't finish
     • The auto-increment index has issued 20, and the next index will be 21.
  2. Server 2 creates event 21
     • The auto-increment index has issued 21, and the next index will be 22.
     • By virtue of the write being completed, the registration entry table is updated with the entry for event 21.
     • By virtue of the write being completed, the events table is updated: 21 is written, and the next item to read is 21.
  3. Events < 20 get pruned
     • Events 1 through 19, if they existed, are removed.
  4. Server 3 starts up at event 21
     • The server starts with the database table holding all changes except the uncommitted one (20). The server will read record 21 as the first record.
  5. Event 20 completes
     • The database table is updated after the initial server load, and the next record Server 3 reads will be 22 or greater, skipping record 20.

While I fully agree with you that this condition can exist, I'd still trade a known bug in steady-state running systems for an error that has a rare chance of happening on server startup. Please proceed with closing the steady-state problem of not keeping these entries synced. Startup issues like this will be investigated after we close the current issue.

@azdagron modified the milestones: 1.9.5, 1.10.0 May 7, 2024
@faisal-memon
Contributor

Opened #5151 to track the race issue referenced in #5021 (comment)
