
Corrupted indexes #149

Closed
staltz opened this issue Apr 11, 2021 · 22 comments

Comments

@staltz
Member

staltz commented Apr 11, 2021

This is a placeholder for more details that I'll post later, because I'm on mobile.

We found a bug in Manyverse where opening a thread for msg root X would actually open msg Y, a completely unrelated msg. We also had other spooky issues, like some threads not appearing on the public feed despite being in the local log, or a msg I published myself sometimes not showing up on the public feed.

I confirmed that this was caused by jitdb or ssb-db2 by deleting the indexes folder and letting it reindex. After that, the bugs disappeared.

Thankfully, I kept a copy of the corrupted indexes folder. I also kept a copy of the newly built good indexes folder, and both of these should match the same log, because I turned airplane mode on and had no new replication. This way, we can do a raw diff on the corrupted-vs-good indexes folders and discover what's going on.
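
Something like this minimal sketch (Node.js, the folder paths are just placeholders) could do the raw diff:

const fs = require('fs');
const path = require('path');

// Byte-compare every index file that exists in both folders and report the
// first differing byte offset, plus any file missing from the corrupted copy.
function diffIndexFolders(goodDir, corruptDir) {
  for (const name of fs.readdirSync(goodDir)) {
    const corruptPath = path.join(corruptDir, name);
    if (!fs.existsSync(corruptPath)) {
      console.log(`${name}: missing in corrupted folder`);
      continue;
    }
    const good = fs.readFileSync(path.join(goodDir, name));
    const corrupt = fs.readFileSync(corruptPath);
    if (good.equals(corrupt)) continue;
    const len = Math.min(good.length, corrupt.length);
    let i = 0;
    while (i < len && good[i] === corrupt[i]) i++;
    console.log(`${name}: sizes ${good.length} vs ${corrupt.length}, first diff at byte ${i}`);
  }
}

diffIndexFolders('./indexes-good', './indexes-corrupted');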

If I had to guess, I would say there was a bug in atomically, and the core indexes like offset and timestamp began accumulating errors as the file got corrupted.

@arj03
Member

arj03 commented Apr 11, 2021

Good that you saved the indexes. I'd be very interested in taking a look.

@arj03
Member

arj03 commented Apr 11, 2021

The vote & author indexes are the same. Seq & timestamp are also ok after loading and checking. Post does have 2 differences, at indexes 421 and 422. The rest is the same. I'll do some more debugging and post my tool for looking at the files.

@arj03
Member

arj03 commented Apr 11, 2021

Some more debugging info mostly for myself:

value_content_root_.index (isRoot)

bitset length diff
missing in corrupt [
  1055962, 1055963, 1055964, 1055965, 1055966,
  1055968, 1055969, 1055972, 1055974, 1055975,
  1055976, 1055977, 1055978, 1055979, 1055980,
  1055981, 1055983, 1055985, 1055987, 1055988,
  1055991, 1055992, 1055993, 1055994, 1055995,
  1055997, 1055998, 1055999, 1056000, 1056001,
  1056002, 1056003, 1056004, 1056005, 1056006,
  1056007, 1056008, 1056010, 1056012, 1056013,
  1056015, 1056016, 1056017, 1056018, 1056020,
  1056022, 1056023, 1056024, 1056026, 1056027,
  1056028, 1056029, 1056030, 1056031, 1056032,
  1056033, 1056034, 1056036, 1056041, 1056043
]
missing in ok [ 13505, 13506, 13507, 13508 ]

value_content_root__map.32prefixmap (hasRoot)

missing keys in corrupt []
missing keys in ok []
missing values in corrupt [ 13505, 13506, 13507, 13508 ]
missing values in ok []

value_content_type_post.index

missing in corrupt [ 13503, 13505, 13506, 13507, 13508 ]

value_content_contact__map.32prefixmap

missing keys in corrupt []
missing keys in ok []
missing values in corrupt [
   8084,  4180,  4181,  3924,
   7985, 11822, 16399,  4200,
   8044,  3926, 11818, 16398,
  11819,  7996,  8119,  8038,
   8052,  4185,  8091
]
missing values in ok []

Fork, vote link (both prefix maps) ok. Pub (bitset) also ok.

@Barbarrosa
Contributor

Barbarrosa commented Apr 12, 2021

I wonder if this could happen due to the log losing data rather than the indexes. async-append-only-log relies on random-access-file, which writes the file using fs.write. Maybe the indexes could be made more reliable by calling fs.fdatasync to flush the log before the indexes are written. I'm not sure what would be involved in supporting that in a browser context; it looks like there are several storage types to check for read/write consistency. An easier solution might be to discard or ignore old indexes based on the starting size of the log.
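
Roughly what I mean, as a sketch (logFd and persistIndexes are hypothetical stand-ins, not the real code):

const fs = require('fs');

// Force the log's data to disk before persisting the indexes that reference it,
// so an index can never describe log bytes that were lost in a crash.
function flushLogThenSaveIndexes(logFd, persistIndexes, cb) {
  fs.fdatasync(logFd, (err) => {
    if (err) return cb(err); // the log is not durable yet, so leave the indexes alone
    persistIndexes(cb);      // safe: the indexes only describe flushed log data
  });
}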

The kind of bug I'm thinking of would follow these steps:

  1. Start writing the log
  2. Start writing the indexes
  3. The log and indexes get saved to the virtual filesystem, not synced to disk
  4. The OS syncs the indexes to disk
  5. The app or OS encounters problems before the log gets saved to disk
  6. The app boots up and starts writing the log again, making the indexes point to "valid" locations in the file that contain the wrong information

@staltz
Member Author

staltz commented Apr 12, 2021

@Barbarrosa Thanks for the suggestion, I wouldn't have thought of that myself; I didn't even know about fs.fdatasync.

I think the difficulty in fixing this issue is reproducibility. We have hard evidence that it happens, but no way of knowing for sure whether a candidate fix is effective. In my opinion we should look for a reproduction first, maybe by simulating abrupt closing of the ssb instance and the associated filesystem operations.
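
As a rough sketch of the kind of harness I mean (writer.js and verifyIndexes are hypothetical, just to illustrate):

const { spawn } = require('child_process');

// Repeatedly run a process that appends messages, SIGKILL it at a random moment,
// then reopen the database and check that the indexes still agree with the log.
function crashOnce(cb) {
  const child = spawn('node', ['writer.js'], { stdio: 'inherit' });
  setTimeout(() => child.kill('SIGKILL'), 50 + Math.random() * 500);
  child.on('exit', () => cb());
}

function run(iterations, verifyIndexes, done) {
  if (iterations === 0) return done();
  crashOnce(() => {
    verifyIndexes((err) => {
      if (err) return done(err); // corruption reproduced
      run(iterations - 1, verifyIndexes, done);
    });
  });
}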

@staltz
Member Author

staltz commented Apr 12, 2021

More discoveries as I debug the evidence on my end too (I can use log.bipf to confirm what the true correct value is):

This is what it looks like towards the end of the log for good-vs-corrupted value_content_root_.index (isRoot):


Offset, bit from good, bit from corrupted

791816032    1 1 
791816512    1 1 
791816992    1 1 
791817472    1 1 
791817952    1 1 
791818432    1 1 
791818912    1 1 
791819392    1 1 
791819872    1 1 
791820352    1 1 
791820832    1 1 
791821312    1 1 
791821792    1 1 
791822272    1 1 
791822752    1 1 
791823232    1 1 
791823712    1 1 
791824192    1 1 
791824672    1 1 
791825152    1 1 
791825632    1 1 
791826112    1 1 
791826592    1 1 
791827072    1 1 
791827552    1 1 
791828036    1 1 
791828511    1 1 
791828995    1 1 
791829488    1 1 
791829981    1 1 
791830474    0 0 
791831333    1 1 
791831826    1 1 
791832319    1 0 MISMATCH
791832812    1 0 MISMATCH
791833296    1 0 MISMATCH
791833780    1 0 MISMATCH
791834273    1 0 MISMATCH
791834759    0 0 
791836609    1 0 MISMATCH
791839108    1 0 MISMATCH
791839583    0 0 
791840473    0 0 
791841298    1 0 MISMATCH
791841782    0 0 
791842845    1 0 MISMATCH
791843505    1 0 MISMATCH
791844007    1 0 MISMATCH
791844500    1 0 MISMATCH
791844986    1 0 MISMATCH
791845455    1 0 MISMATCH
791846418    1 0 MISMATCH
791846902    1 0 MISMATCH
791847395    0 0 
791849515    1 0 MISMATCH
791849999    0 0 
791850540    1 0 MISMATCH
791851033    0 0 
791852232    1 0 MISMATCH
791852707    1 0 MISMATCH
791853200    0 0 
791854177    0 0 
791856537    1 0 MISMATCH
791856957    1 0 MISMATCH
791857450    1 0 MISMATCH
791857936    1 0 MISMATCH
791858405    1 0 MISMATCH
791858898    0 0 
791859733    1 0 MISMATCH
791860226    1 0 MISMATCH
791860710    1 0 MISMATCH
791861179    1 0 MISMATCH
791861683    1 0 MISMATCH
791862152    1 0 MISMATCH
791862572    1 0 MISMATCH
791863065    1 0 MISMATCH
791863549    1 0 MISMATCH
791864042    1 0 MISMATCH
791864526    1 0 MISMATCH
791865010    1 0 MISMATCH
791866075    0 0 
791867257    1 0 MISMATCH
791868047    0 0 
791869126    1 0 MISMATCH
791869610    1 0 MISMATCH
791870094    0 0 
791870677    1 0 MISMATCH
791871488    1 0 MISMATCH
791871981    1 0 MISMATCH
791872467    1 0 MISMATCH
791872951    0 0 
791873619    1 0 MISMATCH
791874112    0 0 
791874964    1 0 MISMATCH
791875448    1 0 MISMATCH
791875932    1 0 MISMATCH
791876416    0 0 
791877351    1 0 MISMATCH
791877837    1 0 MISMATCH
791878330    1 0 MISMATCH
791878903    1 0 MISMATCH
791879389    1 0 MISMATCH
791879882    1 0 MISMATCH
791880531    1 0 MISMATCH
791881015    1 0 MISMATCH
791882077    1 0 MISMATCH
791882608    0 0 
791883337    1 0 MISMATCH
791883868    0 0 
791885206    0 0 
791885969    0 0 
791886942    0 0 
791887784    1 0 MISMATCH
791888268    0 0 
791889271    1 0 MISMATCH
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 

However, it seems that the corrupted index was simply shorter, because:

good offset: 791889271
bad  offset: 791831826

Notice that 791831826 is the last offset before the zeroed-out section begins.

The other thing I noticed was near the middle of the log, starting at seq 13505:

Offset, bit from good, bit from corrupted

9954055      0 0 
9956153      1 1 
9959193      1 1 
9961472      0 1 MISMATCH
9964632      0 1 MISMATCH
9966698      0 1 MISMATCH
9969830      0 1 MISMATCH
9971695      1 1 
9974117      1 1 
9974601      0 0 

In log.bipf, I noticed that those 4 messages are all encrypted.

@arj03
Member

arj03 commented Apr 12, 2021

Oh that is quite interesting. Let me check that.

@arj03
Member

arj03 commented Apr 12, 2021

There is a diff in canDecrypt:

missing in corrupt [
  2511068,  2512345,  2773002,
  2773795,  2776173,  2784821,
  5530979,  5537442,  5561325,
  5564875,  5569051,  5589248,
  5593390,  5607279,  8450162,
  8450955,  8452716,  9956153,
  9961472,  9964632,  9966698,
  9969830, 12084820, 12085613
]
missing in ok []

And even more diffs in encrypted:

missing in corrupt [
  2101293, 2112364, 2121201, 2122619, 2141595, 2142564, 2144665,
  2186948, 2189466, 2193294, 2196100, 2198524, 2200334, 2201539,
  2202940, 2207331, 2208472, 2211643, 2213197, 2214338, 2221175,
  2223016, 2223985, 2225206, 2226399, 2256770, 2286345, 2287294,
  2290086, 2291867, 2295980, 2304452, 2305401, 2341310, 2342763,
  2403704, 2405376, 2406149, 2406926, 2511068, 2512345, 2572596,
  2611872, 2612953, 2615132, 2621440, 2624816, 2633542, 2634915,
  2663236, 2723449, 2739636, 2743980, 2745369, 2748512, 2749901,
  2754176, 2773002, 2773795, 2776173, 2784821, 2808505, 2810042,
  2811566, 2813269, 2814546, 2826278, 2828557, 2829934, 2831327,
  2832504, 2833793, 2835190, 2840071, 2842678, 2844503, 2845944,
  2848323, 2849844, 2851856, 2853229, 2854554, 2855795, 2859050,
  2870953, 2872206, 2877092, 2881978, 2884054, 2894733, 2896354,
  2897799, 2908509, 2910318, 2913276, 2915013, 2923078, 2925315,
  2929769, 2930894,
  ... 1421 more items
]
missing in ok []

Was trying a few things, no dice sadly.

@arj03
Member

arj03 commented Apr 12, 2021

@Barbarrosa async log should have been built so that nothing that has not been written to disk is streamed to the indexes. I'd be really interested in a test case that shows that is not the case :)

@arj03
Member

arj03 commented Apr 12, 2021

I'm reminded of this old bug report in atomic-file. I wonder if we should start adding content hashing to detect non-atomic moves. It just seems like someone must already have a good solution for this problem on Android.
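
The hashing idea would be something along these lines (just a sketch, not how atomic-file currently works):

const crypto = require('crypto');
const fs = require('fs');

// Prepend a SHA-256 of the payload when writing; on load, reject any file whose
// hash no longer matches (e.g. a torn write or a non-atomic move).
function writeWithHash(filename, payload) {
  const hash = crypto.createHash('sha256').update(payload).digest();
  fs.writeFileSync(filename, Buffer.concat([hash, payload]));
}

function readWithHash(filename) {
  const data = fs.readFileSync(filename);
  const expected = crypto.createHash('sha256').update(data.slice(32)).digest();
  if (!data.slice(0, 32).equals(expected)) {
    throw new Error('corrupted file: ' + filename);
  }
  return data.slice(32);
}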

@staltz
Member Author

staltz commented Apr 12, 2021

I never thought of the hash idea. Sounds interesting!

@Barbarrosa
Contributor

Barbarrosa commented Apr 13, 2021

@staltz @arj03 The data loss via differently ordered writes seems a little difficult to reproduce reliably, but I think this Node.js issue roughly gets at what I'm thinking.

Side notes:
While I was looking into this, I noticed an issue about log write memory usage that you may find interesting. I also encountered an interesting paper from the 2020 USENIX conference discussing how database systems often fail to fully account for filesystem reliability issues.

@arj03
Member

arj03 commented Apr 13, 2021

That paper is very interesting, thanks for linking it @Barbarrosa. Especially the https://github.com/WiscADSL/cuttlefs tool, which can be used to test a solution for fs corruption.

@staltz
Member Author

staltz commented Apr 13, 2021

I wonder if and how flumedb accounted for FS failures, because I haven't bumped into these kinds of corruptions in years of using flumedb in ssb.

@arj03
Member

arj03 commented Apr 13, 2021

So it seems quite similar to what we do in db2 (level + atomically).

@arj03
Member

arj03 commented Apr 13, 2021

The lingering tmp files are a bad sign. I can see that atomically should clean up the files on exit, but that probably doesn't work if the process is killed. It needs a garbage collection step, and the files hint that the process was shut down during a write at least 4-5 times.
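
The garbage collection step could be as simple as this sketch on startup (the '.tmp' suffix is an assumption about how the temp files are named, not atomically's actual naming scheme):

const fs = require('fs');
const path = require('path');

// On startup, delete leftover temp files from writes that were interrupted
// by a killed process. The '.tmp' pattern is an assumed naming convention.
function cleanupTmpFiles(indexesDir) {
  for (const name of fs.readdirSync(indexesDir)) {
    if (name.includes('.tmp')) {
      fs.unlinkSync(path.join(indexesDir, name));
    }
  }
}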

@staltz
Member Author

staltz commented Apr 13, 2021

Yeah, I'm curious if we could bring back atomic-file and improve it with the hash idea and other things. It's also a library that we control and understand, even if it's a bit abandoned currently.

@arj03
Member

arj03 commented Apr 13, 2021

Would be fine with that. I would like to keep the browser part because, if I remember correctly, atomic-file does some things in the browser that are slower than https://github.com/ssb-ngi-pointer/atomically-universal/blob/master/index.js.

About hashing, here is a thread on speed.

@arj03
Member

arj03 commented Apr 13, 2021

Was reading up on things in this thread. Android uses ext4 as the filesystem, so rename should be atomic, BUT without the fsync() the metadata may be written before the data, which, if there was a crash in between, would cause the newly renamed file to have partial/empty data. This is from the nice thread here on is-rename-without-fsync-safe. I think the combination of adding hashes & adding fsync after write to atomic-file should get us a long way here.
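
So the write path would follow the classic pattern from that thread, roughly this sketch (the general POSIX recipe, not atomic-file's current code):

const fs = require('fs');
const path = require('path');

// Durable replace: write to a temp file, fsync it, rename over the target,
// then fsync the directory so the rename itself is persisted.
function atomicWriteSync(filename, data) {
  const tmp = filename + '.tmp';
  const fd = fs.openSync(tmp, 'w');
  fs.writeSync(fd, data);
  fs.fsyncSync(fd);             // data hits disk before the rename
  fs.closeSync(fd);
  fs.renameSync(tmp, filename); // atomic on ext4
  const dirFd = fs.openSync(path.dirname(filename), 'r');
  fs.fsyncSync(dirFd);          // persist the rename (the directory entry)
  fs.closeSync(dirFd);
}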

@staltz
Member Author

staltz commented Apr 13, 2021

Wow, good find!

@staltz
Member Author

staltz commented Apr 14, 2021

@arj03 I've been trying to reproduce this on Termux using https://github.com/staltz/jitdb-files-fs, but no luck so far.

arj03 mentioned this issue Apr 18, 2021 · Merged
@staltz
Member Author

staltz commented Apr 19, 2021

I'm very thankful you worked on this one, @arj03. I'll update and let it run for some weeks to see whether we bump into more corruptions. If I find something, I can open another issue, but let's hope there isn't anything.
