
Exponential wait time increase if Raft Snapshot fails #1664

Open
otoolep opened this issue Feb 1, 2024 · 6 comments

otoolep commented Feb 1, 2024

Started with #1643

8.x introduced a degenerate behaviour that was not present in the 7.x series. It goes like this:

  • rqlite uses the default 5-second busy timeout on all connections to the database (both read-only and read-write). This is the default in the underlying SQLite driver, and has never changed.
  • a client initiates a read of the database that takes longer than 5 seconds
  • simultaneously the Raft snapshotting process attempts to checkpoint the WAL using TRUNCATE. This is blocked by the reader, due to the way SQLite works. The TRUNCATE will wait up to 5 seconds before giving up due to the long-running read (assuming the read doesn't complete before that 5 seconds). The snapshot fails. Snapshotting will be re-attempted next time.
  • but while the snapshotting process is running, writes to rqlite are blocked, due to the nature of Raft.

This is not really solvable in a fundamental way, not without wild ideas like two copies of the SQLite database sitting under rqlite. The benefit of 8.x is way more efficient snapshotting generally. The cost is that reads can block writes.

I'm not happy with writes being blocked this long by long-running reads. I don't like it because in SQLite reads do not block writes when the WAL is enabled (except while a RESTART or TRUNCATE checkpoint is running). So I would like to minimize the amount of time a long-running read can block writes, at least initially.

I think it would be better if the snapshotting process waited up to 500ms the first time, and doubled that wait time on every failure. Next time it would wait 1 sec, then 2, then 4, and so on. This could be achieved by changing the busy timeout on the connection used for checkpointing (the read-write connection), just before the checkpoint operation is attempted. It would revert to 5 seconds (so existing write timeouts are otherwise maintained) before the checkpoint function returns. And if checkpointing failed, rqlite would remember to double the timeout for the next attempt. Once snapshotting succeeds, the timeout returns to 500ms.

That said, I'm not sure this is strictly better. At the end of the day snapshotting has to eventually happen, and every time a snapshot is missed, the next snapshot, if it succeeds, will take longer. Perhaps a more aggressive growth factor might help (quadrupling, for example).
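
A minimal sketch of that backoff, with hypothetical names; `checkpoint` stands in for whatever function actually attempts the TRUNCATE checkpoint with the given timeout:

```go
package snapshot

import "time"

const initialCheckpointWait = 500 * time.Millisecond

type checkpointer struct {
	wait time.Duration // timeout for the next checkpoint attempt
}

func (c *checkpointer) attempt(checkpoint func(timeout time.Duration) error) error {
	if c.wait == 0 {
		c.wait = initialCheckpointWait
	}
	if err := checkpoint(c.wait); err != nil {
		c.wait *= 2 // failed: wait twice as long next time (500ms, 1s, 2s, 4s, ...)
		return err
	}
	c.wait = initialCheckpointWait // succeeded: drop back to the short wait
	return nil
}
```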

cc @Tjstretchalot for any opinion.


otoolep commented Feb 1, 2024

Actually, it wouldn't need to change the busy timeout. Just exec the checkpoint statement (PRAGMA wal_checkpoint(TRUNCATE)) with a timeout -- the SQLite driver supports that via context objects.
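
For illustration, a minimal sketch of that, assuming standard database/sql and a SQLite driver that honours context cancellation (e.g. mattn/go-sqlite3); the function name is made up, and it could serve as the checkpoint helper assumed in the sketch above:

```go
package snapshot

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

// checkpointTruncate attempts a TRUNCATE checkpoint, giving up after timeout.
func checkpointTruncate(db *sql.DB, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	_, err := db.ExecContext(ctx, "PRAGMA wal_checkpoint(TRUNCATE)")
	return err
}
```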

otoolep changed the title from "Exponential wait time increase if Snapshot fails" to "Exponential wait time increase if Raft Snapshot fails" on Feb 1, 2024

otoolep commented Feb 1, 2024

An alternative idea, which might be more elegant, would be to use channels (or condition variables) to signal when there are Store.Request() or Store.Query() functions active. Whenever the number of those requests goes from non-zero to zero, a window has opened up in which a snapshot could take place. Use that window to block further read requests, check if a snapshot is needed due to WAL size, and do it if necessary.
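
A minimal sketch of the signalling, with hypothetical names; Store.Request() and Store.Query() would call begin() and end(), and the snapshotting goroutine would call waitForWindow():

```go
package store

import "sync"

// activity tracks in-flight Request/Query calls.
type activity struct {
	mu     sync.Mutex
	cond   *sync.Cond
	active int
}

func newActivity() *activity {
	a := &activity{}
	a.cond = sync.NewCond(&a.mu)
	return a
}

func (a *activity) begin() {
	a.mu.Lock()
	a.active++
	a.mu.Unlock()
}

func (a *activity) end() {
	a.mu.Lock()
	a.active--
	if a.active == 0 {
		a.cond.Broadcast() // a snapshot window just opened
	}
	a.mu.Unlock()
}

// waitForWindow blocks until no reads are in flight; the caller would then
// hold off new reads, check the WAL size, and snapshot if needed.
func (a *activity) waitForWindow() {
	a.mu.Lock()
	for a.active != 0 {
		a.cond.Wait()
	}
	a.mu.Unlock()
}
```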


Tjstretchalot commented Feb 2, 2024

I like the alternative idea. The only additional part I would add is some hint that clients can see, indicating whether a snapshot is desired right now, perhaps as an addition to one of the existing endpoints. For example, adding "snapshot_overdue_ms": 123 to store in /db/info, which is how many milliseconds overdue snapshotting is. This could also be the metric used to decide whether to stop accepting new queries to the instance to allow snapshotting to complete, e.g., a setting like "if the WAL snapshot is more than 5s overdue, return 503 to all new none- or weak-level queries to the instance".

This would be useful both for analytics and for potentially deferring non-urgent requests, and would guarantee snapshotting completes.
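
A minimal sketch of what such a gate might look like at the HTTP layer; the function and parameter names are hypothetical, with the overdue value coming from the store and the threshold being operator-configurable:

```go
package httpd

import (
	"net/http"
	"time"
)

// rejectIfSnapshotOverdue writes a 503 and returns true when snapshotting is
// more than threshold overdue and the request is a none- or weak-level read.
func rejectIfSnapshotOverdue(w http.ResponseWriter, level string, overdue, threshold time.Duration) bool {
	if overdue > threshold && (level == "none" || level == "weak") {
		http.Error(w, "snapshot overdue, retry on another node", http.StatusServiceUnavailable)
		return true
	}
	return false
}
```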

Tjstretchalot commented Feb 2, 2024

Building on that, if additional 503s that aren't opt-in aren't desirable, it could instead be accomplished as a precondition on the request, similar to freshness. I can't think of a good name for it, but e.g.

POST /db/query?level=none&freshness=5m&only_if_snapshot_overdue_less_than=5s

This would allow the 503s to be opt-in, which could be friendlier for environments where client retries aren't robustly implemented (e.g., short-term projects, one-off scripts, etc.).
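
A sketch of how the server might read that precondition, assuming it takes a Go-style duration string as in the URL above; the names are illustrative:

```go
package httpd

import (
	"net/http"
	"time"
)

// snapshotOverduePrecondition extracts the proposed opt-in precondition from
// the query string. ok is false when the client did not supply one.
func snapshotOverduePrecondition(r *http.Request) (limit time.Duration, ok bool, err error) {
	v := r.URL.Query().Get("only_if_snapshot_overdue_less_than")
	if v == "" {
		return 0, false, nil
	}
	limit, err = time.ParseDuration(v) // e.g. "5s"
	return limit, err == nil, err
}
```

The handler would then return 503 only when the precondition was supplied and the current overdue time meets or exceeds the limit, so clients that don't opt in are unaffected.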


otoolep commented Feb 2, 2024

Yeah, I like that. Expose the information, so it's there for diagnostics and power users if needed.


Tjstretchalot commented Feb 2, 2024

Additional context I'm relinking here, from my interpretation of the SQLite forum post you made (https://sqlite.org/forum/forumpost/d6ac6bf752): it's not just one read query that slows it down, it's this scenario

req6 |           xxxxx
req5 |         xxxxx
req4 |       xxxxx
req3 |     xxxxx
req2 |   xxxxx
req1 | xxxxx
walL | 111112233445566   
     -----------------
       time ->

reqN = incoming none-level request N
walL = WAL snapshotting is prevented due to lock,
       indicating the oldest owner of the lock
       (1,2,3,4,5,6 = request number)

which, as I understand it, can be resolved in only one of two ways:

  • TRUNCATE tries to take an exclusive lock on the WAL, which will prevent reads/writes going through. Your doubling of the time for this lock, assuming the lock strategy is fair, means eventually TRUNCATE will last longer than any reads while it is running and hence succeed:
req6 |           BBxxxxx
req5 |         BBBBxxxxx
req4 |       xxxxx    
req3 |     Bxxxxx
req2 |   xxxxx
req1 | xxxxx
walL | 11111223344 55566   
snap |  F  BF  BBBS  F
     -------------------
       time ->

reqN = incoming none-level request N; x means running,
       B means blocked
walL = WAL snapshotting is prevented due to lock,
       indicating the oldest owner of the lock
       (1,2,3,4,5,6 = request number)
snap = WAL snapshot attempt; B = blocked, F = fail,
       S = success

Depending on the system this could be pretty decent; you don't get failures on your none-queries, but they might take a very long time to respond because they spend most of the request time blocked in the extreme case.

  • Incoming requests start being rejected in order to allow snapshotting:
req6 |           F
req5 |         F
req4 |       xxxxx
req3 |     F
req2 |   F
req1 | xxxxx
walL | 11111 44444    
snap |  FRRRS  FRRS
     -------------------
       time ->

reqN = incoming none-level request N; x means running,
       F means failed due to snapshotting
walL = WAL snapshotting is prevented due to lock,
       indicating the oldest owner of the lock
       (1,2,3,4,5,6 = request number)
snap = WAL snapshot attempt; F = fail, R = causing rejections
       to new queries, S = success

Depending on the system this could be pretty decent; if you have a large enough cluster, presumably there is at least one node not snapshotting at any given time, and so standard retrying on other nodes will find you a place to put your request. Requests are fast in one sense: they either instantly fail, or are dominated by the actual time required for the query.

Personally, I prefer the latter since it allows for more control later, e.g., prioritizing queries, since rqlite is managing the failures rather than SQLite's locks, which treat all queries equally.
