non-latest WAL log corruption after EBS snapshot restore in AWS #12467

Open
tonyxuqqi opened this issue Mar 21, 2024 · 27 comments

tonyxuqqi commented Mar 21, 2024

Version: RocksDB 6.29

Expected behavior

With kPointInTimeRecovery, non-latest WAL log files should not be corrupted after an EBS restore.

Actual behavior

A non-latest WAL file is corrupted after the EBS restore.
We also noticed that, after enabling track_and_verify_wals_in_manifest, some WAL files' sizes in the MANIFEST do not match the actual ones: the actual size on disk is bigger than the size recorded in the MANIFEST.

"[2024/03/08 14:14:16.534 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:818] /var/lib/wal/rocksdb/15557239.log: dropping 32768 bytes; Corruption: checksum mismatch
[2024/03/08 14:14:16.534 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:818] /var/lib/wal/rocksdb/15557239.log: dropping 64430 bytes; Corruption: error in middle of record
[2024/03/08 14:14:16.534 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:1140] Point in time recovered to log #15557239 seq #296119395244
[2024/03/08 14:14:16.534 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:890] Recovering log #15557242 mode 2
[2024/03/08 14:14:16.535 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:896] /var/lib/wal/rocksdb/15557242.log: dropping 36031876 bytes
[2024/03/08 14:14:16.535 +00:00][4][INFO] [db/db_impl/db_impl_open.cc:890] Recovering log #15557245 mode 2
[2024/03/08 14:14:16.536 +00:00][4][WARN] [db/db_impl/db_impl_open.cc:896] /var/lib/wal/rocksdb/15557245.log: dropping 35957673 bytes
"

Steps to reproduce the behavior

Run a heavy write workload, take an EBS snapshot, and restore it on another machine. Then open the RocksDB instance.
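
For reference, a minimal sketch of the relevant open-time options (simplified, not our actual code; paths are placeholders and all other options are omitted):

#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  // Recover up to the last consistent point instead of failing on a corrupt tail.
  options.wal_recovery_mode = rocksdb::WALRecoveryMode::kPointInTimeRecovery;
  // Record synced WAL sizes in the MANIFEST so mismatches become detectable.
  options.track_and_verify_wals_in_manifest = true;
  // Placeholder paths standing in for the restored EBS volume.
  options.wal_dir = "/var/lib/wal/rocksdb";

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/var/lib/db/rocksdb", &db);
  assert(s.ok());
  delete db;
  return 0;
}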


ajkr commented Mar 22, 2024

Could you please provide the setting you have for recycle_log_file_num? Additionally, if you have the full contents of the OPTIONS file used to write the DB, that could potentially give even more clues.


nkg- commented Mar 22, 2024

TiKV uses kPointInTimeRecovery and kAbsoluteConsistency, so WAL recycling is disabled, i.e. recycle_log_file_num=0.

As for WriteOptions, I only see three options ever being set (see the sketch after this list):

  • sync: varies across writes; set to true for writes that change internal storage topology (e.g. sharding changes) and false for normal application writes
  • no_slowdown: false (the default; it is sometimes set to true, though we're not sure whether it was during this repro)
  • disable_wal: false
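
For concreteness, a minimal sketch of the two write shapes described above (illustrative only; the keys and helper name are made up):

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Illustrative only: the two kinds of writes described in the list above.
void WriteExamples(rocksdb::DB* db) {
  // Normal application write: WAL enabled, no fsync on the write path.
  rocksdb::WriteOptions normal;
  normal.sync = false;
  normal.disableWAL = false;

  // Critical write (e.g. a storage topology change): fsync the WAL before returning.
  rocksdb::WriteOptions critical;
  critical.sync = true;

  db->Put(normal, "normal_key", "normal_value");
  db->Put(critical, "topology_key", "topology_value");
}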


ajkr commented Mar 22, 2024

Thanks for the info.

Does TiKV also use an external syncer thread, that is, setting manual_wal_flush==true + a loop of FlushWAL(true /* sync */)?

Did your LOG file contain any occurrences of "WAL Sync error"?

Do you still have 15557239.log? I can't tell from the error message, but it would be interesting to know if this error is around the end of the log file.
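
For clarity, the external syncer pattern I mean looks roughly like this (a sketch, not TiKV code; the polling interval and function name are made up):

#include <atomic>
#include <chrono>
#include <thread>

#include "rocksdb/db.h"

// Sketch of an "external syncer": the DB is opened with
// DBOptions::manual_wal_flush = true, and this loop runs in its own thread,
// flushing buffered WAL data and fsyncing it on a fixed interval.
void WalSyncerLoop(rocksdb::DB* db, std::atomic<bool>& stop) {
  while (!stop.load()) {
    rocksdb::Status s = db->FlushWAL(/*sync=*/true);
    if (!s.ok()) {
      // A real syncer would surface or handle a sync failure here.
      break;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}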


nkg- commented Mar 22, 2024

Does TiKV also use an external syncer thread, that is, setting manual_wal_flush==true + a loop of FlushWAL(true /* sync */)?

Great question. I think @tonyxuqqi mentioned that TiKV does trigger FlushWAL manually, but I can't find a code reference. @tonyxuqqi, can you confirm?

Did your LOG file contain any occurrences of "WAL Sync error"?

Yeah, we had checked earlier. Unfortunately, no "sync error"; that would have been the perfect smoking gun.

Do you still have 15557239.log? I can't tell from the error message, but it would be interesting to know if this error is around the end of the log file.

We don't have the log file right now, but we get tens of repros per week. How can you tell whether the corruption is in the middle or at the end? FWIW, what I saw was:

  • 359 records were missing (comparing the last readable seqnum in the corrupt WAL with the first seqnum in the next WAL).
  • The corrupt WAL had 76k extra bytes compared to its size in the MANIFEST. We have Track WAL in MANIFEST enabled (https://github.com/facebook/rocksdb/wiki/Track-WAL-in-MANIFEST). Is it possible the MANIFEST doesn't get updated with every corresponding WAL fsync?


ajkr commented Mar 22, 2024

This is good info. I think 359 missing records should mean the corruption was near the tail. It might be the number of records written between when the last FlushWAL(true /* sync */) finished and when the WAL ended.

There is one bug fix I see that could be relevant: #10560

I think you said you hit this problem even before turning on WAL tracking? WAL tracking is a good feature, but it needs important bug fixes that are probably not present in version 6.29. #10185 sounds like a possible fix for the symptom you described.


nkg- commented Mar 22, 2024

I think you said you hit this problem even before turning on WAL tracking?

Yes. We enabled it to debug the corruption, so the corruption was there before.

Both fixes seem very relevant. #10560 in particular seems to match @tonyxuqqi's hypothesis that we were discussing earlier this week. I can try patching that one to fix the corruption.

And #10185 to fix the MANIFEST tracking. We indeed see hundreds of size mismatches per hour that don't correspond to corrupt WALs.

@mittalrishabh

@ajkr, can RocksDB start writing to a new WAL before previous WALs are fsynced?


ajkr commented Mar 22, 2024

Hi, yes, it can. When the next WAL sync is requested, though, we fsync in order from the oldest unsynced WAL up to the active WAL.


mittalrishabh commented Mar 22, 2024

I think that doesn't mean the latest WAL file is written only after the old WAL files; the filesystem can flush blocks independently. Since the filesystem's flushing behavior may not guarantee sequential writing of WAL files, is it safe to assume that the tail of a non-latest WAL file is always completely written in case of an unclean shutdown?


ajkr commented Mar 23, 2024

Yes, you are right. As you described it, there is a valid scenario where point-in-time recovery encounters "Corruption: checksum mismatch" and does not proceed to recover the most recent log.

However, this scenario should not be enough to trigger "Corruption: SST file is ahead of WALs in CF lock". There is another fsync procedure to prevent that: we fsync the WALs whose data was flushed before recording the SST file in the MANIFEST.


Thanks, that makes sense.

@mittalrishabh

Hi Andrew, I have put some comments on PR #10560.


ajkr commented Mar 26, 2024

Replied there; feel free to follow up there if I missed something.


mittalrishabh commented Mar 26, 2024

@ajkr We already have #10185 in our TiKV branch. It appears that PingCAP cherry-picked a few changes after forking from the 6.29 release; I have attached a screenshot of the cherry-picked changes for your reference.
Based on the current code, it seems that there might be a bug in tracking WAL size in the MANIFEST, or there could be missing fsync calls.
I am cherry-picking the other change, #10560, to see if it resolves the issue.

[Screenshot: list of cherry-picked changes in the TiKV branch]


nkg- commented Mar 26, 2024

+1 to what @mittalrishabh said. We are cherry-picking the fsync fix #10560 and will update here if it fixes the WAL corruption.

For the WAL size mismatch with the MANIFEST, as @mittalrishabh said, that fix is already included, and we still see lots of mismatches (every RocksDB instance shows about one mismatch per hour). So we're wondering whether there is another bug in WAL size tracking. @ajkr: is there anything we can provide to help you debug that?


ajkr commented Mar 27, 2024

some WAL files' sizes in the MANIFEST do not match the actual ones: the actual size on disk is bigger than the size recorded in the MANIFEST.

Do you have an example error message? I am not sure if I'm reading something backwards or looking in the wrong place. The code I am looking at is this:

rocksdb/db/wal_edit.cc, lines 198 to 205 at 1856734:

if (log_file_size < wal_meta.GetSyncedSizeInBytes()) {
  std::stringstream ss;
  ss << "Size mismatch: WAL (log number: " << log_number
     << ") in MANIFEST is " << wal_meta.GetSyncedSizeInBytes()
     << " bytes , but actually is " << log_file_size << " bytes on disk.";
  s = Status::Corruption(ss.str());
  break;
}

For that check at least, actual WAL size bigger than the manifest size should be tolerated.


nkg- commented Mar 27, 2024

@ajkr: our size mismatch detection is not based on the check you referenced above. Rather, we have a script that runs every 2 minutes on every storage node (running RocksDB); it checks the current WALs (except the latest one) and compares the actual file size with the file size in the MANIFEST.

So a few questions:

  1. Does the above comparison make sense, i.e. does a mismatch point to something unexpected?
  2. More specifically, does this size mismatch point to any corruption, or to a missing fsync?
  3. Or is a mismatch between the actual file size and the size in the MANIFEST somewhat expected, since the code doesn't update the size in the MANIFEST on every file fsync?

In general, we just want to know whether a size mismatch points to anything nefarious (like another bug) at all.


ajkr commented Mar 27, 2024

The short answer is, no, what you're checking is not guaranteed.

Longer answer: the synced size recorded in the MANIFEST is just a lower bound on the latest synced offset. The actual latest synced offset (and thus file size) may be larger. It was originally built like this because we also recorded the synced size for the active WAL, which is continually growing. Currently track_and_verify_wals_in_manifest only takes effect for inactive WALs, so I agree it is confusing that we don't simply record a final size after a final sync. I hope that we bring this feature back for the active WAL again some day, though, so I don't want to specialize it for the case of inactive WALs.
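
To make that concrete, the only invariant a monitoring script can rely on is the same one the verification code above checks, namely that the MANIFEST value is a lower bound (a hypothetical helper, not RocksDB code):

#include <cstdint>

// Hypothetical monitoring check consistent with what the MANIFEST guarantees:
// the recorded synced size is only a lower bound on the actual synced offset,
// so the only red flag is the on-disk file being smaller than that bound.
bool LooksSuspicious(uint64_t on_disk_size, uint64_t manifest_synced_size) {
  return on_disk_size < manifest_synced_size;
}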


nkg- commented Mar 27, 2024

Currently track_and_verify_wals_in_manifest only takes effect for the inactive WALs so I agree it is confusing that we don't simply record a final size after a final sync.

Just confirming: even for inactive WALs (what we called non-latest in the OP), size matching is not guaranteed. Would it 'eventually' match after your fix in https://github.com/facebook/rocksdb/pull/10560/files#diff-d9341fbe2a5d4089b93b22c5ed7f666bc311b378c26d0786f4b50c290e460187R1541? It seems so, since we don't stop tracking WALs until the presync size and the flushed size are equal. I understand they might NOT match for a short duration, between finishsync (line 1547) and addWAL (line 1539) in the next call to markWALSynced, which might be a few seconds.

I hope that we bring this feature back for active WAL again some day, though, so don't want to specialize it for the case of inactive WALs.

What do you mean by special-casing for inactive WALs? Writing both preSyncSize and postSyncSize seems reasonable for all WALs.


ajkr commented Mar 28, 2024

Yes I agree the MANIFEST synced size should eventually match the actual synced size (with that fix).

What do you mean by special casing for inactive WALs

Currently this feature records pre-sync sizes upon sync completing for inactive WALs only. Say a WAL is synced N times in its lifetime. It is active the first (N-M) times a sync completes, and inactive for the final M sync completions (I think M must be either 1 or 2...). This feature records the final M syncs to the MANIFEST, which feels odd. My intuition is that the feature should either (a) record all N syncs to the MANIFEST, or (b) record only the final 1 sync to the MANIFEST. (b) is what I meant would special case the feature for inactive WALs, but I hope we do (a) instead.

@mittalrishabh

@ajkr, we are trying to reproduce this issue in our setup. For faster reproduction, instead of doing a full recovery, we are only doing a WAL dump and relying on the "Corruption: checksum mismatch" message to see if it is reproducible.

Based on the missing-fsync bug #10560, it seems that it could lead to missing records too. What would the error message be if RocksDB detects missing records?


ajkr commented Mar 29, 2024

There isn't a specific error message for missing records because we don't have a way to detect that in general. (Extra detail about how we're thinking about doing it:) we could start requiring consecutive WALs to have consecutive sequence numbers. To prevent disableWAL=true writes from getting in the way, we could begin every WAL with a dummy entry whose sequence number is consecutive to that of the final record in the previous WAL.

Lacking that, do you have a way to check for holes in the recovered data? Verifying that in our test tooling is actually how I found that bug in the first place: https://rocksdb.org/blog/2022/10/05/lost-buffered-write-recovery.html. I think applications storing consensus logs in RocksDB may be able to detect a hole using their log sequence numbers.
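
For example, an application whose records carry consecutive log indices could check for holes with something like this (a hypothetical helper, not a RocksDB API):

#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical application-level hole check: if records are assigned
// consecutive log indices (as in a consensus log), a gap in the indices that
// survive recovery reveals records lost from the middle of the history.
bool HasHole(const std::vector<uint64_t>& recovered_indices) {
  std::optional<uint64_t> prev;
  for (uint64_t idx : recovered_indices) {
    if (prev && idx != *prev + 1) {
      return true;  // records between *prev and idx are missing
    }
    prev = idx;
  }
  return false;
}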

@mittalrishabh

Got it. In TiDB, writes from multiple Raft logs are asynchronously written to RocksDB, so I don't think we can use them to detect holes.
Is there a GitHub issue open for detecting missing records? I think there could be other solutions to detect them.


ajkr commented Mar 29, 2024

There isn't a specific error message for missing records because we don't have a way to detect that in general

There are some ways that can catch recovery not recovering far enough. By "far enough", I mean recovery should at least reach the last write that completed before the last successful WAL sync started. At the RocksDB level, the main detection would be the "SST file is ahead of WALs" error message. At the application level, this scenario could probably be detected more easily than a hole. I recall in Cockroach it caused some sort of Raft log panic.


nkg- commented Apr 5, 2024

Going back to the original issue: I'm still wondering whether the corruption we see is indeed fixed by #10560. It's a bit hard for us to confirm, since we can only repro in production, and that happens rarely (2% of disk snapshots across all storage nodes had a single corrupt WAL).

I have a few questions:

  1. For "Ensure writes to WAL tail during FlushWAL(true /* sync */) will be synced" (#10560), what is the trigger scenario for the bug? Does it need an explicit FlushWAL (public API) call? For our use case we just do Put, with sync=false by default and sync=true for some critical writes. Can the race condition occur if a WAL append and a WAL switch (I'm assuming the switch is triggered by a memtable flush) both happen between FlushWAL(true /* sync */)'s sync operations and its call to MarkLogsSynced()?
  2. If there is a missing fsync, wouldn't it cause a completely missing record? We always see a partial record, and it's not intuitive to me what causes a partial record in a WAL as opposed to a missing record.
  3. Under what conditions does RocksDB do a WAL swap? Possible reasons: i) the memtable is full, ii) an explicit memtable flush call. Anything else?
  4. In our test environment, we just reproed another corruption (let me know if we should file a separate issue):
    "Corruption detected in log file Corruption: missing start of fragmented record(2)". Can you share how it differs from "Corruption: checksum mismatch", and whether it has different trigger conditions? The repro conditions for this corruption were:
    a) FlushWAL every 100ms,
    b) a reduced memtable size (500 KB), triggering a memtable flush/WAL swap every 100ms under our write traffic.
    FWIW, #10560 ("Ensure writes to WAL tail during FlushWAL(true /* sync */) will be synced") fixed this corruption under the same repro conditions.


nkg- commented Apr 5, 2024

A couple of other questions:

  1. Let's say we have the fix from #10560. In the case of manual fsync, is it possible to have a write in a non-active WAL that is not fsynced until FlushWAL(sync=true) or Put(sync=true) is called? Or does a WAL swap (during a memtable flush) guarantee that the swapped WAL is fsynced with all of its writes?
  2. A more basic question: when the memtable is full and flushed, why do we swap the WAL, as opposed to size-based WAL rotation?


ajkr commented Apr 11, 2024

Going back to the original issue: I'm still wondering whether the corruption we see is indeed fixed by #10560. It's a bit hard for us to confirm, since we can only repro in production, and that happens rarely (2% of disk snapshots across all storage nodes had a single corrupt WAL).

I have a few questions:

  1. For "Ensure writes to WAL tail during FlushWAL(true /* sync */) will be synced" (#10560), what is the trigger scenario for the bug? Does it need an explicit FlushWAL (public API) call? For our use case we just do Put, with sync=false by default and sync=true for some critical writes. Can the race condition occur if a WAL append and a WAL switch (I'm assuming the switch is triggered by a memtable flush) both happen between FlushWAL(true /* sync */)'s sync operations and its call to MarkLogsSynced()?

OK, I revisited #10560. I think it cannot be triggered by a workload that simply writes with varying values for WriteOptions::sync. That is because we atomically perform the following steps: (1) appending to the active WAL, (2) syncing the WALs (including inactive ones with unsynced data at the tail), and (3) recording the synced WALs. ((2) and (3) are skipped when WriteOptions::sync == false.) The way those steps are atomic is by doing them all in our write group leader.

#10560's race condition required writes to happen between (2) and (3) causing buggy logic in (3) to incorrectly treat an inactive WAL as fully synced when it actually wasn't. FlushWAL(true /* sync */) does (2) and (3) in a non write group leader thread, which made it vulnerable.

That doesn't mean the workload needs an explicit FlushWAL(true /* sync */) to trigger the bug, however. There are also internal callers of FlushWAL(true /* sync */), and potentially other things internally that do (2) and (3) in a non write group leader thread.

Based on that criterion, other vulnerable APIs I found were SyncWAL(), GetLiveFilesStorageInfo(), CreateNewBackup*(), and CreateCheckpoint().
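
To illustrate the vulnerable shape (a sketch, not a test from this issue): any code path where steps (2) and (3) run outside the write group leader while foreground writes continue could hit the pre-#10560 race, e.g.:

#include <thread>

#include "rocksdb/db.h"

// Sketch: FlushWAL(true /* sync */) performs steps (2) and (3) in its own
// thread while a concurrent Put performs step (1). Before #10560, writes
// landing between (2) and (3) could leave an inactive WAL marked as fully
// synced even though its tail was never fsynced.
void VulnerableShape(rocksdb::DB* db) {
  std::thread syncer([db] { db->FlushWAL(/*sync=*/true); });
  db->Put(rocksdb::WriteOptions(), "key", "value");  // step (1), racing
  syncer.join();
}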

By the way, are you able to share more about what options you use? From the OPTIONS file I'd be interested in manual_wal_flush, enable_pipelined_write, wal_bytes_per_sync, two_write_queues, and track_and_verify_wals_in_manifest. If there are other WriteOptions set besides sync, those could be interesting too.
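
For reference, these are the DBOptions fields in question; the defaults shown are my recollection for recent releases and may differ in 6.29. Please report the values from your OPTIONS file:

#include "rocksdb/options.h"

// The knobs asked about above, shown with (what I believe are) their defaults.
void OptionsOfInterest(rocksdb::DBOptions& opts) {
  opts.manual_wal_flush = false;                   // external syncer pattern if true
  opts.enable_pipelined_write = false;             // separate WAL/memtable write stages
  opts.wal_bytes_per_sync = 0;                     // background incremental WAL syncing
  opts.two_write_queues = false;                   // dedicated WAL write queue
  opts.track_and_verify_wals_in_manifest = false;  // the WAL tracking feature discussed here
}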
