Build WAL "compact list" as writes occur -- and reduce Snapshotting time #1643
Corrected initial comment.
At Snapshot time the system would then have a list of (offset, length) pairs in the WAL file that it needs to copy to the snapshot store, and could copy just those regions.
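For illustration, here's a minimal Go sketch (not rqlite's implementation; the region type, file names, and offsets are all hypothetical) of consuming such an (offset, length) list at Snapshot time, copying only the listed byte ranges out of the WAL:

```go
package main

import (
	"io"
	"os"
)

// region is one entry of the hypothetical "compact list": where the latest
// version of a page lives in the WAL file.
type region struct {
	off, length int64
}

// copyRegions copies each listed WAL byte range into the snapshot writer.
func copyRegions(walPath string, regions []region, snap io.Writer) error {
	wal, err := os.Open(walPath)
	if err != nil {
		return err
	}
	defer wal.Close()

	for _, r := range regions {
		// SectionReader reads exactly [off, off+length) from the WAL file.
		sr := io.NewSectionReader(wal, r.off, r.length)
		if _, err := io.Copy(snap, sr); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	out, err := os.Create("snapshot.bin")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Hypothetical offsets: two 4 KB pages whose latest frames start at
	// bytes 56 and 4176 of the WAL (32-byte file header, 24-byte frame headers).
	regions := []region{{off: 56, length: 4096}, {off: 4176, length: 4096}}
	if err := copyRegions("db.sqlite-wal", regions, out); err != nil {
		panic(err)
	}
}
```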
I'm trying to understand this a little better, so I'm going to restate some background knowledge for my own reference:

SQLite can be run in write-ahead-log (WAL) mode. In WAL mode there are three operations: read, write, and checkpoint. There are three files: the main database file, the WAL file, and the WAL-index file. The index can presumably be recovered from the WAL file, so it isn't important for persistence, just for read performance as the WAL grows large. The SQLite database file format itself is unchanged in WAL mode. The WAL file consists of a header followed by zero or more frames. The format contains checksums and salts for integrity verification. It is essential that these salts be checked, because resetting the WAL is done by overwriting the header, which causes the old frames to fail integrity verification and thus be skipped.

When a read operation is applied, it operates as if against the regular database, except that every time it goes to read a page, it first checks whether a newer version of that page was available in the WAL at the time the read started. If so, it uses that version instead. When a write operation is applied, it determines which pages are changed (taking into account the changes within the WAL, just like a reader) and appends the new versions of those pages to the WAL. When a checkpoint operation is applied, the original database file is updated to incorporate changes in the WAL, and the WAL is updated to invalidate the frames that were checkpointed. There are four levels at which the WAL can be checkpointed:

- PASSIVE: checkpoint as many frames as possible without waiting for any readers or writers.
- FULL: wait until there is no writer and all readers are on the latest snapshot, then checkpoint all frames.
- RESTART: like FULL, but afterwards also wait until no readers are using the WAL, so the next writer restarts the WAL from the beginning.
- TRUNCATE: like RESTART, but additionally truncate the WAL file to zero bytes.
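For concreteness, here is a small Go sketch (not rqlite code) of reading the WAL and frame headers described above. The field offsets follow the documented SQLite WAL file format, and the file name is just an example:

```go
// Sketch only: a 32-byte file header, then frames of a 24-byte header
// plus one page of data, per the documented SQLite WAL format.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("db.sqlite-wal") // example WAL file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	hdr := make([]byte, 32)
	if _, err := f.ReadAt(hdr, 0); err != nil {
		panic(err)
	}
	pageSize := binary.BigEndian.Uint32(hdr[8:12])
	salt1 := binary.BigEndian.Uint32(hdr[16:20])
	salt2 := binary.BigEndian.Uint32(hdr[20:24])
	fmt.Printf("page size %d, salts %08x %08x\n", pageSize, salt1, salt2)

	// The first frame header starts right after the file header.
	fh := make([]byte, 24)
	if _, err := f.ReadAt(fh, 32); err != nil {
		return // empty WAL
	}
	pgno := binary.BigEndian.Uint32(fh[0:4])
	// A frame is only valid if its salts match the current file header;
	// after a WAL reset the header salts change and old frames fail this check.
	valid := binary.BigEndian.Uint32(fh[8:12]) == salt1 &&
		binary.BigEndian.Uint32(fh[12:16]) == salt2
	fmt.Printf("frame for page %d, valid=%v (%d bytes of page data follow)\n",
		pgno, valid, pageSize)
}
```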
NOTE: When a checkpoint is desired in SQLite, reading is at its slowest because the WAL is at its largest. If doing passive checkpoints, or blocking checkpoints with a timeout, reads will continuously get slower and slower, making it progressively more likely that the passive checkpoint will fail or the blocking checkpoint's timeout will trigger. In other words, if the first checkpoint fails, the next one is more likely to fail, without limit.

Now, in addition, we want to do Raft log compaction (section 7 of the Raft paper). Raft log compaction in its simplest, conceptual form is: snapshot the current state of the state machine, persist that snapshot to stable storage, then discard the log entries already covered by the snapshot.
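As a concrete illustration of those checkpoint levels, here is a sketch that issues one via PRAGMA wal_checkpoint and inspects the result row. It assumes the mattn/go-sqlite3 driver and is not how rqlite itself performs checkpoints:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver for this sketch
)

func main() {
	db, err := sql.Open("sqlite3", "db.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Modes, weakest to strongest: PASSIVE, FULL, RESTART, TRUNCATE.
	// The pragma returns one row: busy (non-zero if the checkpoint was
	// blocked), total frames in the WAL, and frames checkpointed.
	var busy, logFrames, checkpointed int
	err = db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE)").
		Scan(&busy, &logFrames, &checkpointed)
	if err != nil {
		log.Fatal(err) // blocking modes may also fail outright, e.g. SQLITE_BUSY
	}
	if busy != 0 {
		log.Printf("checkpoint blocked: %d of %d frames checkpointed",
			checkpointed, logFrames)
	}
}
```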
In order for snapshotting to be faster, rqlite uses the WAL to avoid having to repeatedly distribute the entire database when only a small portion of it (or possibly even nothing) has changed. Currently, this is done in rqlite as follows:
Issue I was running into: the truncate would time out (1), and I believe heartbeats would stop while snapshotting (2). I'm not as confident about (2) from the code; that's just what it appeared like.

You're proposing reading from the WAL file after each write to get the new pages, essentially maintaining the list in memory for distribution. I don't see how this would help with checkpointing though, unless you did something like block none-level readers, but then you would run into the same timeout issue as before? You can't actually apply the changes until those readers aren't potentially mid-read of older versions. However, I suppose what you could do is clean up the Raft log entries even though the changes haven't been checkpointed yet. That is certainly nifty.

Possibility 1: the none/weak read level already implies very limited guarantees, so none-level readers could be directed to ignore the WAL. This is a super extreme version of non-freshness though, since you would definitely be stale and it wouldn't correct until the snapshot. It might be pretty error-prone, but it is possible; it only partly mitigates the problem (at least readers aren't getting slower when snapshots miss) and does nothing about queries that were too slow anyway. I don't love this solution.

Possibility 2: ideally there'd be a way to time out SQL queries, so you could kill slow ones after a timeout if necessary and then proceed with snapshotting, but I don't think there's any way to do this in SQLite. I think it would be good enough to print log messages whenever a snapshot is blocked, containing all the ongoing none/weak level reads, which should be pretty doable? I don't think this would require too much overhead, and it could be hidden behind an opt-in flag. I would certainly use this and it would probably solve the issue. From there, a flag in the query marking it as delayable could build on that.
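A rough sketch of what that Possibility 2 logging could look like (all names are hypothetical, not rqlite APIs): keep a registry of in-flight reads and dump it whenever a snapshot or checkpoint attempt gives up:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// query records one in-flight read.
type query struct {
	sql   string
	start time.Time
}

// inflight tracks queries that are currently executing so they can be
// reported if a snapshot/checkpoint is blocked. Hypothetical sketch only.
type inflight struct {
	mu      sync.Mutex
	nextID  int
	queries map[int]query
}

func newInflight() *inflight {
	return &inflight{queries: make(map[int]query)}
}

// begin registers a query and returns a func to call when it finishes.
func (f *inflight) begin(sqlText string) (done func()) {
	f.mu.Lock()
	id := f.nextID
	f.nextID++
	f.queries[id] = query{sql: sqlText, start: time.Now()}
	f.mu.Unlock()
	return func() {
		f.mu.Lock()
		delete(f.queries, id)
		f.mu.Unlock()
	}
}

// dump would be called when a snapshot/checkpoint attempt gives up.
func (f *inflight) dump() {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, q := range f.queries {
		fmt.Printf("blocking read running for %v: %s\n", time.Since(q.start), q.sql)
	}
}

func main() {
	reg := newInflight()
	done := reg.begin("SELECT * FROM big_table")
	defer done()
	reg.dump() // pretend a snapshot just failed to acquire the checkpoint
}
```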
The more I think about it, the more I like the idea of logging the blocking queries / flagging queries that may block snapshotting. It'd help detect slow queries without arbitrary thresholds (e.g., seconds taken), and it could also return why a query was blocked in the response.

The reason I think the delayable flag is nice is that I'm fairly certain all or most of my somewhat slow queries are non-urgent background tasks, or ones I could optimize further if I knew which were the particular problems (which is partly on me; I could have better logging for spotting slow queries).
Interesting point, I hadn't thought about that, but "without limit"? Even your own system shows it doesn't happen without limit; it just fails occasionally. You're just not seeing the process recover unless you run with INFO-level logging. But yes, large WALs could cause reads to be slower.
Nothing to do with Raft snapshot. Snapshotting happens on every node, individually. They are standalone processes, not correlated across the cluster.
Yeah, that's because your mental model of what is going on isn't right. No worries, rqlite is a little complex in spots, but it's actually simpler than you probably realise. I do think your point about the growing WAL being a cause of slow reads is an interesting one -- the interaction between the growing WAL and slower reads. But there is an easy way to check that -- just monitor the maximum size of the WAL file between snapshots. This is key. That information is exposed via /status and /debug/vars, or could be checked by simply monitoring the file system. And if it's getting too big, consider snapshotting more often.
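For the file-system route, a trivial sketch (the WAL path is assumed) that tracks the maximum WAL size between snapshots by polling:

```go
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	const walPath = "db.sqlite-wal" // assumed path to the node's WAL file
	var maxSize int64
	for range time.Tick(time.Second) {
		fi, err := os.Stat(walPath)
		if err != nil {
			continue // WAL may not exist right after a TRUNCATE checkpoint
		}
		if fi.Size() > maxSize {
			maxSize = fi.Size()
			log.Printf("new max WAL size: %d bytes", maxSize)
		}
	}
}
```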
Yes, sorry, I meant it conceptually. In theory, there's nothing that stops it getting slower and slower. In practice, it's probably not getting 100x larger. At the largest points, from quickly grepping the logs, it was around 3000 pages (vs the configured 1000 pages). I can see this since it logs the number of pages when the checkpoint fails.
Thank you for the correction! This is why I restated my background knowledge: to make sure we're talking about the same thing. I had gotten this confused with something else.

The fact that it's individual makes it really surprising to me that it doesn't tend towards more self-recovery; is it possible all the nodes will tend to hit the "snapshot required" check at the same time due to it being deterministic? It might mitigate the chances of issues if the snapshots were spread out a bit. However, I know other parts of Raft use random durations to spread things out, so this is probably not the issue. Does the snapshot check get staggered at all?
I still don't see how snapshotting being individual helps with the fact that you cannot checkpoint the WAL into the SQLite database file while readers are reading from that file, as you could be changing the pages the reader is actively using. Is the idea behind this in-memory compact list related to checkpointing the WAL into the SQLite database file, or exclusively to compacting the Raft log? The reason I assumed it was the former is that that's the part timing out; it computes the in-memory compressed WAL fine.

In other words, if there's a slow query that's been running for 10s right now, reading from the SQLite file, then regardless of whether you have a compact list available in memory, you still won't be able to write those changes to the SQLite file until that SQL query is over. And that's the part timing out right now.
No, they have thought of that: https://pkg.go.dev/github.com/hashicorp/raft#Config
Just the interval. You may also wish to change
I am not sure the connection is as clear as that. I'm not saying it isn't, but the Raft source is the only thing that can tell us. Empirically it may appear to be true, I accept.
I know you don't. :-) It'll be easier in person.
Right, that's just not that much larger. In my experience a 3x increase is not something that can push a system over the edge unless it was fairly close already. That's only going from 4MB to 12MB; those are not huge numbers by modern standards. Most decent software is designed with that sort of range in mind. Now 10x, or 100x -- that might be different.
@Tjstretchalot -- good to talk to you yesterday. You might like to upgrade to the latest version I released this morning, as it uses memory a bit more efficiently during the snapshotting process.
Also, as discussed, it might make sense for you to decrease the interval on which Raft checks to see if snapshotting is required. Maybe try 5s.
Thinking about this has triggered a question on the SQLite forum: https://sqlite.org/forum/forumpost/d6ac6bf752
Today Snapshotting requires that the entire WAL file be scanned, and that the last version of each database page be identified in the WAL file; call this list of latest page versions the "compact list". Building it can be time-consuming and memory-intensive (though in practice it usually isn't memory-intensive) -- and writes are blocked during this time.
It might be better in practice (though the amount of data being scanned is the same in either case) to incrementally build this list every time a write takes place to the SQLite database. Then, at Snapshot time, the "compact list" would already be available to the Snapshot process, shortening the actual Snapshot time. Each write would take slightly longer, but the extra work might be quick since the portion of the SQLite WAL being read should be in the OS cache (it was just written by SQLite, after all). Anyway, performance testing will tell us the answer.
Once Snapshotting is complete, the list is reset, and the cycle starts again.
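A sketch of how that incremental building might look (hypothetical names, not a proposed rqlite implementation; salt and checksum verification are omitted, frame-layout constants follow the documented SQLite WAL format, and resetting the scan offset assumes the WAL is truncated at Snapshot time): after each write, scan only the frames appended since the last scan, fold them into the compact list, and hand the list to the Snapshot process when asked.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const (
	walHdrLen   = 32 // SQLite WAL file header
	frameHdrLen = 24 // per-frame header
)

type region struct{ off, length int64 }

// compactList maps a database page number to the WAL location of its latest
// version, built incrementally as writes occur.
type compactList struct {
	walPath  string
	pageSize int64
	next     int64 // offset of the first frame not yet examined
	pages    map[uint32]region
}

func newCompactList(walPath string, pageSize int64) *compactList {
	return &compactList{walPath: walPath, pageSize: pageSize,
		next: walHdrLen, pages: map[uint32]region{}}
}

// update is called after each write; it scans only newly appended frames.
func (c *compactList) update() error {
	f, err := os.Open(c.walPath)
	if err != nil {
		return err
	}
	defer f.Close()

	hdr := make([]byte, frameHdrLen)
	for {
		if _, err := f.ReadAt(hdr, c.next); err != nil {
			return nil // no more complete frames
		}
		pgno := binary.BigEndian.Uint32(hdr[0:4])
		c.pages[pgno] = region{off: c.next + frameHdrLen, length: c.pageSize}
		c.next += frameHdrLen + c.pageSize
	}
}

// take hands the current list to the Snapshot process and resets the cycle,
// assuming the Snapshot also checkpoints/truncates the WAL.
func (c *compactList) take() map[uint32]region {
	out := c.pages
	c.pages = map[uint32]region{}
	c.next = walHdrLen
	return out
}

func main() {
	cl := newCompactList("db.sqlite-wal", 4096)
	if err := cl.update(); err != nil { // in practice, called after every write
		panic(err)
	}
	fmt.Printf("compact list holds %d pages\n", len(cl.take()))
}
```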
cc @Tjstretchalot