Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect missing records in WAL recovery #12488

Open
ajkr opened this issue Mar 29, 2024 · 7 comments
Open

Detect missing records in WAL recovery #12488

ajkr opened this issue Mar 29, 2024 · 7 comments

Comments

@ajkr
Copy link
Contributor

ajkr commented Mar 29, 2024

Current methods can only detect missing records in certain scenarios.

One scenario not covered is clean truncation, which is when the WAL is truncated to an exact record boundary. The simplest case is truncating a WAL to be empty and losing all its records.

Another scenario not covered is corruption with zero bytes. Zero bytes at the WAL tail naturally occur when pre-allocating WAL space, so LevelDB chose to treat such records as valid. But this causes us to miss cases where the WAL tail is unintentionally corrupted with zero bytes.

The solution I was thinking of was: "We could start requiring consecutive WALs to have consecutive sequence numbers. To prevent disableWAL=true writes from getting in the way, we could begin every WAL with a dummy entry with sequence number consecutive to that of the final record in the previous WAL." (#12467 (comment)).

Other ideas are welcome.

@ajkr
Copy link
Contributor Author

ajkr commented Mar 29, 2024

Another way could be to append a marker record to the active WAL before switching it. Recovery would only continue to recover a newer WAL after seeing such a marker record. Unlike the existing code for writing dummy entries, which is constrained to writing to the new WAL as it is part of DB open, the new code should be able to mutate a WAL right before switching over to a new one.

@cbi42
Copy link
Member

cbi42 commented Apr 1, 2024

The second approach sounds cleaner since it does not require consecutive sequence numbers in consecutive WALs. We may still need to put some special value in the marker record to tell it apart from an old marker record in a reused WAL.

@ajkr
Copy link
Contributor Author

ajkr commented Apr 1, 2024

I guess we can use the recyclable record format for this marker record - https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format#the-recyclable-record-format. It has a log number that we can match against the log number in the filename.

@ajkr
Copy link
Contributor Author

ajkr commented Apr 1, 2024

The second approach sounds cleaner since it does not require consecutive sequence numbers in consecutive WALs.

For the first approach, a dummy record with a new type at the start of the WAL could require recovery to have reached a specific sequence number before continuing. The feature would be opt-in to not break downgrade compatibility.

For the second approach, how would we know that an existing WAL should end with an ending marker, as the WAL format is unversioned? It feels like we would need a new record type at the start anyways to indicate that. It may be worthwhile if we make this new record type carry a version number.

@cbi42
Copy link
Member

cbi42 commented Apr 1, 2024

It feels like we would need a new record type at the start anyways to indicate that.

Makes sense.

For the dummy record, if it contains the size of the previous WAL file, it seems this feature provides a better protection compared to track_and_verify_wals_in_manifest for non-active WALs.

@nkg-
Copy link
Contributor

nkg- commented Apr 4, 2024

@ajkr : Another case, where some write is not fsynced to file, and missing completely from the file. That should cause missing records from WAL. Would that be caught today. Atleast I have seen (based on corruption logs), we check for crc checksum per record, and truncated records (which are a result of partial writes).

@ywave620
Copy link
Contributor

where some write is not fsynced to file, and missing completely from the file

@nkg- I think this only results from write with option.sycn=false, so it is expected to both DB user and DB developer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants