
Build WAL "compact list" as writes occur -- and reduce Snapshotting time #1643

Open
otoolep opened this issue Jan 29, 2024 · 13 comments

otoolep commented Jan 29, 2024

Today Snapshotting requires that the entire WAL file be scanned and that the last version of each database page be identified in the WAL file. Call this list of pages the "compact list". Building it can be time-consuming and memory-intensive (though in practice it usually isn't memory-intensive) -- and writes are blocked during this time.

It might be better in practice (though the amount of data being scanned is the same in either case) to incrementally build this list every time a write takes place to the SQLite database. Then, at Snapshot time, the "compact list" would already be available to the Snapshot process, shortening the actual Snapshot time. Each write would take slightly longer, but the extra work might be quick since the portion of the SQLite WAL being read should be in the OS cache (it was just written by SQLite, after all). Anyway, performance testing will tell us the answer.

Once Snapshotting is complete, the list is reset, and the cycle starts again.
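
A minimal sketch of the idea in Go, assuming the documented SQLite WAL layout (the compactList type and its bookkeeping are illustrative only, not rqlite's implementation):

```go
// Illustrative sketch only -- not rqlite's implementation. Assumes the
// documented SQLite WAL layout: a 32-byte file header, then frames made up
// of a 24-byte frame header followed by one page of data. Salt and checksum
// handling are omitted for brevity.
package wal

import (
	"encoding/binary"
	"os"
)

type compactList struct {
	pageSize   int64
	nextOffset int64            // start of the next unscanned frame
	latest     map[uint32]int64 // page number -> offset of its latest frame
}

func newCompactList(pageSize int64) *compactList {
	return &compactList{pageSize: pageSize, nextOffset: 32, latest: make(map[uint32]int64)}
}

// update is called after each write; it scans only the newly appended frames.
func (c *compactList) update(f *os.File) {
	hdr := make([]byte, 24)
	frameSize := 24 + c.pageSize
	for {
		if _, err := f.ReadAt(hdr, c.nextOffset); err != nil {
			return // no complete frame header at this offset yet
		}
		pgno := binary.BigEndian.Uint32(hdr[0:4]) // a frame header begins with the page number
		c.latest[pgno] = c.nextOffset
		c.nextOffset += frameSize
	}
}

// reset is called once Snapshotting is complete and the WAL has been reset.
func (c *compactList) reset() {
	c.nextOffset = 32 // just past the WAL file header
	c.latest = make(map[uint32]int64)
}
```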

cc @Tjstretchalot

otoolep changed the title from 'Build WAL "compact index" as writes occur -- and reduce Snapshotting time' to 'Build WAL "compact list" as writes occur -- and reduce Snapshotting time' on Jan 29, 2024

otoolep commented Jan 29, 2024

Corrected initial comment.

otoolep commented Jan 29, 2024

At Snapshot time the system would then have a list of (offset, length) pairs in the WAL file it needs to copy to the snapshot store, and could use io.CopyN to do it. As a further optimization, perhaps look for contiguous areas within the WAL that need copying and copy each such run as a single operation.
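
A hedged sketch of that copy step in Go (walRegion, mergeContiguous, and copyCompactList are made-up names for illustration, not rqlite's actual snapshot code): merge contiguous (offset, length) regions, then seek and io.CopyN each merged run.

```go
// Hedged sketch (helper names are illustrative, not rqlite's API): merge
// contiguous (offset, length) regions of the WAL, then copy each merged run
// to the snapshot store with a single io.CopyN call.
package wal

import (
	"io"
	"os"
	"sort"
)

type walRegion struct {
	Offset int64
	Length int64
}

// mergeContiguous sorts regions by offset and merges runs that touch.
func mergeContiguous(regions []walRegion) []walRegion {
	sort.Slice(regions, func(i, j int) bool { return regions[i].Offset < regions[j].Offset })
	var merged []walRegion
	for _, r := range regions {
		if n := len(merged); n > 0 && merged[n-1].Offset+merged[n-1].Length == r.Offset {
			merged[n-1].Length += r.Length // contiguous: extend the previous run
			continue
		}
		merged = append(merged, r)
	}
	return merged
}

// copyCompactList copies each merged region from the WAL file to dst.
func copyCompactList(walPath string, dst io.Writer, regions []walRegion) error {
	wal, err := os.Open(walPath)
	if err != nil {
		return err
	}
	defer wal.Close()

	for _, r := range mergeContiguous(regions) {
		if _, err := wal.Seek(r.Offset, io.SeekStart); err != nil {
			return err
		}
		if _, err := io.CopyN(dst, wal, r.Length); err != nil {
			return err
		}
	}
	return nil
}
```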

Tjstretchalot commented Jan 29, 2024

I'm trying to understand this a little better, so I'm going to restate some background knowledge for my own reference:

SQLite can be run in write-ahead-log (WAL) mode. In WAL mode there are three operations: read, write, and checkpoint. There are three files: the main database file, the WAL file, and the WAL-index file. The index can presumably be recovered from the WAL file and thus isn't important for persistence, just for read performance as the WAL grows large.

The sqlite database file format is unchanged in WAL mode.

The WAL file itself consists of a header followed by zero or more frames. The format contains checksums and salts for integrity verification. It is essential that these salts be checked, as resetting the WAL is done by overwriting the header, which causes the frames to fail integrity verification and thus be skipped.
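
For reference, a minimal sketch of walking a WAL file along those lines (the validFrames helper is hypothetical, and SQLite's cumulative checksum verification and commit-frame tracking are omitted here):

```go
// Minimal sketch (assumed helper, not part of any library) of walking a WAL
// file per the documented format -- 32-byte header, then frames of a 24-byte
// frame header plus one page -- and skipping frames whose salts don't match
// the file header, as happens after a WAL reset.
package wal

import (
	"encoding/binary"
	"io"
	"os"
)

// validFrames returns the number of frames whose salts match the WAL header.
func validFrames(walPath string) (int, error) {
	f, err := os.Open(walPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	hdr := make([]byte, 32)
	if _, err := io.ReadFull(f, hdr); err != nil {
		return 0, err
	}
	pageSize := int64(binary.BigEndian.Uint32(hdr[8:12]))
	salt1 := binary.BigEndian.Uint32(hdr[16:20])
	salt2 := binary.BigEndian.Uint32(hdr[20:24])

	count := 0
	frameHdr := make([]byte, 24)
	offset := int64(32)
	for {
		if _, err := f.ReadAt(frameHdr, offset); err != nil {
			return count, nil // end of WAL
		}
		// Frames written before the last WAL reset carry stale salts and
		// must be ignored.
		if binary.BigEndian.Uint32(frameHdr[8:12]) != salt1 ||
			binary.BigEndian.Uint32(frameHdr[12:16]) != salt2 {
			return count, nil
		}
		count++
		offset += 24 + pageSize
	}
}
```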

When a read operation is applied, it operates as if against the regular database, except every time it goes to read a page, it first checks whether a newer version of that page was available in the WAL at the time the read started. If so, it uses that version instead.

When a write operation is applied, it determines what pages are changed (taking into account the changes within the WAL, just like a reader) and writes the new versions of those pages to the WAL.

When a checkpoint operation is applied, the original database is updated to incorporate changes in the WAL, and the WAL is updated to invalidate the ones that were checkpointed.

There are four levels at which the WAL can be checkpointed (a sketch of invoking them via PRAGMA wal_checkpoint follows the list):

  • PASSIVE: this is basically just prayer; checkpoint as much of the WAL as possible, but never block on readers or writers; if blocked, stop early and do nothing more
  • FULL: block until there are no writers and all readers are on the latest snapshot, then block writers and checkpoint, then release writers. This should always work, but it will block for as long as the longest read/write in progress when it initiates. A timeout can be set for this blocking part, and if that timeout triggers it is effectively a failure (the checkpoint didn't do anything)
  • RESTART: same as FULL, but after checkpointing keep blocking writers until readers are using the new database, then release writers (this allows the next writer to rewrite the header, invalidating all the entries in the WAL)
  • TRUNCATE: same as RESTART, but also truncate the WAL file before unblocking writers
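
A small, hedged sketch of invoking these checkpoint levels from Go, assuming the mattn/go-sqlite3 driver (an assumption about the driver; rqlite's own database layer may differ). PRAGMA wal_checkpoint reports whether the checkpoint was blocked, how many frames were in the WAL, and how many were checkpointed:

```go
// Hedged sketch, assuming the mattn/go-sqlite3 driver (an assumption; not
// how rqlite itself drives SQLite).
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "example.db?_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound how long the blocking modes (FULL/RESTART/TRUNCATE) will wait.
	if _, err := db.Exec("PRAGMA busy_timeout = 5000"); err != nil {
		log.Fatal(err)
	}

	// The mode can be PASSIVE, FULL, RESTART, or TRUNCATE, as described above.
	var busy, walFrames, checkpointed int
	err = db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE)").Scan(&busy, &walFrames, &checkpointed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("busy=%d wal_frames=%d checkpointed=%d\n", busy, walFrames, checkpointed)
}
```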

NOTE: When a checkpoint is desired in sqlite, reading is at its slowest because the WAL is at its largest. If doing passive checkpoints or including a timeout, reads will continuously get slower and slower, making it progressively more likely that the passive checkpoint will fail or the full block timeout will trigger. In other words, if the first checkpoint fails, the next one is more likely to fail, without limit.


Now, in addition, we want to do raft log compaction (section 7 of the Raft paper). Raft log compaction in its simplest, conceptual form is:

  • (block writes, happens naturally since raft)
  • Ensure the database is up-to-date (checkpointed); this is the snapshot
  • Distribute the snapshot until a majority of voters have it
  • Now safe to delete raft log entries locally that are from before the snapshot

In order for snapshotting to be faster, rqlite uses the WAL to avoid repeatedly distributing the entire database when only a small portion (or possibly even nothing) has changed.

Currently, this is done in rqlite as follows:

  • Block writes and strong reads (via raft)
  • Copy the WAL file to memory, efficiently handling duplicated pages
  • Checkpoint (TRUNCATE) with the standard busy timeout
  • If checkpointing succeeds, distribute the compressed WAL. Now you're good to delete raft log entries from before the partial snapshot. Otherwise, on the next loop, fall back to the simple compaction described above

The issue I was running into: the truncate would time out (1), and I believe heartbeats would stop while snapshotting (2). I'm not as confident about (2) from the code; that's just what it appeared like.

You're proposing reading from the WAL file after each write to get the new pages, essentially maintaining the compact list in memory for distribution. I don't see how this would help with checkpointing though, unless you did something like blocking none-level readers, but then you would run into the same timeout issue as before? You can't actually apply the changes until those readers are no longer potentially mid-read of older versions. However, I suppose what you could do is clean up the raft log entries even though they haven't been checkpointed yet. That is certainly nifty.


Possibility 1: the none/weak read levels already imply very limited guarantees, so none-level readers could be directed to ignore the WAL. This is a super-extreme version of non-freshness though, since you would definitely be stale and it wouldn't correct until the snapshot. It might be pretty error-prone, but it is possible, and it only partly mitigates the problem: at least readers wouldn't get slower when snapshots are missed, but it does nothing about queries that were too slow anyway. I don't love this solution.

Possibility 2: ideally there'd be a way to time out SQL queries, so that slow ones could be killed after a timeout if necessary for snapshotting and snapshotting could then proceed, but I don't think there's any way to do this in sqlite. I think it would be good enough to print log messages whenever a snapshot is blocked, containing all the ongoing none/weak-level reads, which should be pretty doable? I don't think this would require too much overhead, but it could be hidden behind an opt-in flag. I would certainly use this, and it would probably solve the issue. From there, a flag in the query like &delayable, indicating I want rqlite to return a 503 (or 412 Precondition Failed might make more sense) if snapshotting needs to occur soon, would still allow slow queries to run without disturbing anything.

Tjstretchalot commented Jan 29, 2024

The more I think about it, the more I like the idea of logging the blocking queries / flagging queries that may block snapshotting. It'd help detect slow queries without arbitrary thresholds (e.g., seconds taken), and if the response included why it was blocked (e.g., {"error": "snapshotting"}), I could use polling or a keyword argument to decide whether weak queries should fall back to none. Furthermore, rather than having to choose a magic number of seconds/pages before snapshotting starts blocking requests, you could just detect whenever WAL snapshotting fails and, instead of falling back to a full snapshot, block only delayable queries, which would minimize interruptions, and fall back to actually blocking reads later (which would imply some slow queries are not marked delayable).

The reason I think delayable is nice is that I'm fairly certain all or most of my somewhat-slow queries are non-urgent background tasks, or ones I could optimize further if I knew which were the particular issues (which is partly on me; I could have better logging for spotting slow queries).

otoolep commented Jan 29, 2024

If doing passive checkpoints or including a timeout, reads will continuously get slower and slower, making it progressively more likely that the passive checkpoint will fail or the full block timeout will trigger. In other words, if the first checkpoint fails, the next one is more likely to fail, without limit

Interesting point, hadn't thought about that, but "without limit"? Even your own system shows it doesn't happen without limit. It just fails occasionally. You're just not seeing the process recover unless you run with INFO level logging. But yes, the large WALs could cause reads to be slower.

Distribute the snapshot until a majority of voters have it

Nothing to do with Raft snapshot. Snapshotting happens on every node, individually. They are standalone processes, not correlated across the cluster.

I don't see how this would help with checkpointing though

Yeah, that's because your mental model of what is going on isn't right. No worries, rqlite is a little complex in spots. But it's actually simpler than you probably realise.

I do think your point about the growing WAL being a cause of slow reads is an interesting one -- the interaction between the growing WAL and slower reads. But there is an easy way to check that: just monitor the maximum size of the WAL file between snapshots. This is key. That information is exposed via /status and /debug/vars, or it could be checked by simply monitoring the file system. If it's getting too big, consider snapshotting more often via -raft-snap-int.
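
For example, a trivial (hypothetical) file-system watcher for the WAL size; the path below is illustrative, and the same information is available from /status and /debug/vars:

```go
// Hypothetical monitoring sketch: track the maximum size the SQLite WAL file
// reaches between snapshots by polling the file system. The path is
// illustrative only.
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	const walPath = "/path/to/node/data/db.sqlite-wal" // illustrative path
	var maxSize int64
	for range time.Tick(time.Second) {
		info, err := os.Stat(walPath)
		if err != nil {
			continue // the WAL may briefly not exist right after a TRUNCATE checkpoint
		}
		if info.Size() > maxSize {
			maxSize = info.Size()
			log.Printf("new maximum WAL size: %d bytes", maxSize)
		}
	}
}
```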

otoolep commented Jan 29, 2024

raft-snap-int sets the interval at which rqlite checks if a Snapshot is needed. It's pretty harmless to set it on the lower side, because if there is nothing to do then the check does almost nothing (writes are blocked during the check I reckon, but again it's fast -- check the Raft log for the number of new entries since the last snapshot, and check the size of the WAL. A millisecond maybe?). But if you set it too high, then the WAL might get quite large before it's snapshotted.

Tjstretchalot commented Jan 29, 2024

Interesting point, hadn't thought about that, but "without limit"? Even your own system shows it doesn't happen without limit. It just fails occasionally. You're just not seeing the process recover unless you run with INFO level logging. But yes, the large WALs could cause reads to be slower.

Yes sorry, I meant it conceptually. In theory, there's nothing that stops it getting slower and slower. In practice, it's probably not getting 100x larger. At the largest points from quickly grepping, it was around 3000 pages (vs the configured 1000 pages). I can see this since it logs the number of pages when it fails.

Nothing to do with Raft snapshot. Snapshotting happens on every node, individually. They are standalone processes, not correlated across the cluster.

Thank you for the correction! This is why I restated my background knowledge: to make sure we're talking about the same thing. I got this confused with InstallSnapshot, which wouldn't need to be called for just standard snapshotting.

The fact that it's individual makes it really surprising to me that it doesn't tend towards more self-recovery; is it possible all the nodes will tend to hit the "snapshot required" check at the same time due to it being deterministic? It might reduce the chance of issues if the snapshots were spread out a bit. However, I know other parts of raft use random durations to spread things out, so this is probably not the issue.

Does raft-snap-int change the requirements for snapshotting or just the interval that they are checked? If the issue is slow queries blocking it, setting a lower snapshot interval without mitigating anything else would just increase the amount of cpu/memory uselessly spent computing the compressed WAL only to discard it when the truncate fails, and increase intracluster communication due to leadership changes since the node appears to become unresponsive for 5s when the snapshot fails AFAIK.

Tjstretchalot commented Jan 29, 2024

I don't see how this would help with checkpointing though

Yeah, that's because your mental model of what is going on isn't right. No worries, rqlite is a little complex in spots. But it's actually simpler than you probably realise.

I still don't see how snapshotting being individual helps with the fact you cannot checkpoint the WAL to the sqlite database file while readers are reading from the sqlite database file, as you could be changing the pages that the reader is actively using. Is the idea behind this in-memory compact list related to checkpointing the WAL to the sqlite database file, or is it exclusively for compacting the raft log? The reason I assumed it was the former is that that's the part timing out; it computes the in-memory compressed WAL fine.

In other words, if there's a slow query that's been running for 10s right now, reading from the sqlite file, then regardless of whether you have a compact list available in memory, you still won't be able to write those pages to the sqlite file until that SQL query is over. And that's the part timing out right now.

otoolep commented Jan 29, 2024

is it possible all the nodes will tend to hit the "snapshot required" check at the same time due to it being deterministic?

No, they have thought of that: https://pkg.go.dev/github.com/hashicorp/raft#Config

```go
// SnapshotInterval controls how often we check if we should perform a
// snapshot. We randomly stagger between this value and 2x this value to avoid
// the entire cluster from performing a snapshot at once. The value passed
// here is the initial setting used. This can be tuned during operation using
// ReloadConfig.
SnapshotInterval time.Duration
```

Does raft-snap-int change the requirements for snapshotting or just the interval that they are checked?

Just the interval. You may also wish to change -raft-snap-wal-size

intracluster communication due to leadership changes since the node appears to become unresponsive for 5s when the snapshot fails AFAIK

I am not sure the connection is as clear as that. I'm not saying it isn't, but the Raft source is the only thing that can tell us. Empirically it may appear to be true, I accept.

I still don't see how snapshotting being individual helps with the fact you cannot checkpoint the WAL to the sqlite database file while readers are reading from the sqlite database file, as you could be changing the pages that the reader is actively using.

I know you don't. :-) It'll be easier in person.

otoolep commented Jan 29, 2024

At the largest points from quickly grepping, it was around 3000 pages (vs the configured 1000 pages). I can see this since it logs the number of pages when it fails.

Right, that's just not that much larger. In my experience a 3x increase is just not something that can push a system over the edge, unless it was fairly close already. That's only going from about 4MB to 12MB (at the default 4KB page size). Those are not huge numbers by modern standards. Most decent software is designed with that sort of range in mind. Now 10x, or 100x, that might be different.

otoolep commented Jan 30, 2024

@Tjstretchalot -- good to talk to you yesterday. You might like to upgrade to the latest version I released this morning, as it uses memory a bit more efficiently during the snapshotting process.

otoolep commented Jan 30, 2024

Also, as discussed, it might make sense for you to decrease the interval at which Raft checks to see if snapshotting is required. Maybe try 5s.

otoolep commented Feb 1, 2024

Thinking about this has triggered a question on the SQLite forum: https://sqlite.org/forum/forumpost/d6ac6bf752
