
Corrupted indexes #149

Closed
staltz opened this issue Apr 11, 2021 · 22 comments

Comments

@staltz
Member

staltz commented Apr 11, 2021

This is a placeholder for more details that I'll post later, because I'm on mobile.

We found a bug in Manyverse where opening a thread for msg root X would actually open msg Y, a completely unrelated msg. We also had other spooky issues, like some threads not appearing on the public feed despite being in the local log, or a msg I published myself sometimes not showing up on the public feed.

I confirmed that this was caused by jitdb or ssb-db2 by deleting the indexes folder and letting it reindex. After that, the bugs disappeared.

Thankfully, I kept a copy of the corrupted indexes folder. I also kept a copy of the newly built good indexes folder, and both of these should match the same log, because I turned airplane mode on and had no new replication. This way, we can do a raw diff on the corrupted-vs-good indexes folders and discover what's going on.
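
Something like this minimal sketch (Node.js, the folder paths are just placeholders) could do the raw diff:

const fs = require('fs');
const path = require('path');

// Byte-compare every index file that exists in both folders and report the
// first differing byte offset, plus any file missing from the corrupted copy.
function diffIndexFolders(goodDir, corruptDir) {
  for (const name of fs.readdirSync(goodDir)) {
    const corruptPath = path.join(corruptDir, name);
    if (!fs.existsSync(corruptPath)) {
      console.log(`${name}: missing in corrupted folder`);
      continue;
    }
    const good = fs.readFileSync(path.join(goodDir, name));
    const corrupt = fs.readFileSync(corruptPath);
    if (good.equals(corrupt)) continue;
    const len = Math.min(good.length, corrupt.length);
    let i = 0;
    while (i < len && good[i] === corrupt[i]) i++;
    console.log(`${name}: sizes ${good.length} vs ${corrupt.length}, first diff at byte ${i}`);
  }
}

diffIndexFolders('./indexes-good', './indexes-corrupted');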

If I had to guess, I would say there was a bug in atomically, and the core indexes like offset and timestamp began accumulating errors as the file got corrupted.

@arj03
Member

arj03 commented Apr 11, 2021

Good that you saved the indexes. I'd be very interested in taking a look.

@arj03
Member

arj03 commented Apr 11, 2021

The vote & author indexes are the same. Seq & timestamp are also ok after loading and checking. Post does have 2 differences, at indexes 421 and 422. The rest is the same. I'll do some more debugging and post my tool for looking at the files.

@arj03
Member

arj03 commented Apr 11, 2021

Some more debugging info mostly for myself:

value_content_root_.index (isRoot)

bitset length diff
missing in corrupt [
  1055962, 1055963, 1055964, 1055965, 1055966,
  1055968, 1055969, 1055972, 1055974, 1055975,
  1055976, 1055977, 1055978, 1055979, 1055980,
  1055981, 1055983, 1055985, 1055987, 1055988,
  1055991, 1055992, 1055993, 1055994, 1055995,
  1055997, 1055998, 1055999, 1056000, 1056001,
  1056002, 1056003, 1056004, 1056005, 1056006,
  1056007, 1056008, 1056010, 1056012, 1056013,
  1056015, 1056016, 1056017, 1056018, 1056020,
  1056022, 1056023, 1056024, 1056026, 1056027,
  1056028, 1056029, 1056030, 1056031, 1056032,
  1056033, 1056034, 1056036, 1056041, 1056043
]
missing in ok [ 13505, 13506, 13507, 13508 ]

value_content_root__map.32prefixmap (hasRoot)

missing keys in corrupt []
missing keys in ok []
missing values in corrupt [ 13505, 13506, 13507, 13508 ]
missing values in ok []

value_content_type_post.index

missing in corrupt [ 13503, 13505, 13506, 13507, 13508 ]

value_content_contact__map.32prefixmap

missing keys in corrupt []
missing keys in ok []
missing values in corrupt [
   8084,  4180,  4181,  3924,
   7985, 11822, 16399,  4200,
   8044,  3926, 11818, 16398,
  11819,  7996,  8119,  8038,
   8052,  4185,  8091
]
missing values in ok []

Fork, vote link (both prefix maps) ok. Pub (bitset) also ok.

@Barbarrosa
Contributor

Barbarrosa commented Apr 12, 2021

I wonder if this could happen due to the log losing data rather than the indexes. async-append-only-log relies on random-access-file, which writes the file using fs.write. Maybe the indexes could be made more reliable by calling fs.fdatasync to flush the log before the indexes are written. I'm not sure what would be involved in supporting that in a browser context; it looks like there are several storage types to check for read/write consistency. An easier solution might be to discard or ignore old indexes based on the starting size of the log.
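
Roughly what I mean, as a sketch (logFd and persistIndexes are hypothetical stand-ins, not the real code):

const fs = require('fs');

// Force the log's data to disk before persisting the indexes that reference it,
// so an index can never describe log bytes that were lost in a crash.
function flushLogThenSaveIndexes(logFd, persistIndexes, cb) {
  fs.fdatasync(logFd, (err) => {
    if (err) return cb(err); // the log is not durable yet, so leave the indexes alone
    persistIndexes(cb);      // safe: the indexes only describe flushed log data
  });
}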

The kind of bug I'm thinking of would follow these steps:

  1. Start writing the log
  2. Start writing the indexes
  3. The log and indexes get saved to the virtual filesystem, not synced to disk
  4. The OS syncs the indexes to disk
  5. The app or OS encounters problems before the log gets saved to disk
  6. The app boots up and starts writing the log again, making the indexes point to "valid" locations in the file that contain the wrong information

@staltz
Member Author

staltz commented Apr 12, 2021

@Barbarrosa Thanks for the suggestion, I wouldn't have thought of that myself; I didn't even know about fs.fdatasync.

I think the difficulty in fixing this issue is reproducibility. We have hard evidence that it happens, but no way of knowing for sure whether a candidate fix is effective. In my opinion we should look for a reproduction first, maybe by simulating abrupt closing of the ssb instance and the associated filesystem operations.
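
As a rough sketch of the kind of harness I mean (writer.js and verifyIndexes are hypothetical, just to illustrate):

const { spawn } = require('child_process');

// Repeatedly run a process that appends messages, SIGKILL it at a random moment,
// then reopen the database and check that the indexes still agree with the log.
function crashOnce(cb) {
  const child = spawn('node', ['writer.js'], { stdio: 'inherit' });
  setTimeout(() => child.kill('SIGKILL'), 50 + Math.random() * 500);
  child.on('exit', () => cb());
}

function run(iterations, verifyIndexes, done) {
  if (iterations === 0) return done();
  crashOnce(() => {
    verifyIndexes((err) => {
      if (err) return done(err); // corruption reproduced
      run(iterations - 1, verifyIndexes, done);
    });
  });
}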

@staltz
Member Author

staltz commented Apr 12, 2021

More discoveries as I debug the evidence on my end too (I can use log.bipf to confirm what the true correct value is):

This is what it looks like towards the end of the log for good-vs-corrupted value_content_root_.index (isRoot):


Offset, bit from good, bit from corrupted

791816032    1 1 
791816512    1 1 
791816992    1 1 
791817472    1 1 
791817952    1 1 
791818432    1 1 
791818912    1 1 
791819392    1 1 
791819872    1 1 
791820352    1 1 
791820832    1 1 
791821312    1 1 
791821792    1 1 
791822272    1 1 
791822752    1 1 
791823232    1 1 
791823712    1 1 
791824192    1 1 
791824672    1 1 
791825152    1 1 
791825632    1 1 
791826112    1 1 
791826592    1 1 
791827072    1 1 
791827552    1 1 
791828036    1 1 
791828511    1 1 
791828995    1 1 
791829488    1 1 
791829981    1 1 
791830474    0 0 
791831333    1 1 
791831826    1 1 
791832319    1 0 MISMATCH
791832812    1 0 MISMATCH
791833296    1 0 MISMATCH
791833780    1 0 MISMATCH
791834273    1 0 MISMATCH
791834759    0 0 
791836609    1 0 MISMATCH
791839108    1 0 MISMATCH
791839583    0 0 
791840473    0 0 
791841298    1 0 MISMATCH
791841782    0 0 
791842845    1 0 MISMATCH
791843505    1 0 MISMATCH
791844007    1 0 MISMATCH
791844500    1 0 MISMATCH
791844986    1 0 MISMATCH
791845455    1 0 MISMATCH
791846418    1 0 MISMATCH
791846902    1 0 MISMATCH
791847395    0 0 
791849515    1 0 MISMATCH
791849999    0 0 
791850540    1 0 MISMATCH
791851033    0 0 
791852232    1 0 MISMATCH
791852707    1 0 MISMATCH
791853200    0 0 
791854177    0 0 
791856537    1 0 MISMATCH
791856957    1 0 MISMATCH
791857450    1 0 MISMATCH
791857936    1 0 MISMATCH
791858405    1 0 MISMATCH
791858898    0 0 
791859733    1 0 MISMATCH
791860226    1 0 MISMATCH
791860710    1 0 MISMATCH
791861179    1 0 MISMATCH
791861683    1 0 MISMATCH
791862152    1 0 MISMATCH
791862572    1 0 MISMATCH
791863065    1 0 MISMATCH
791863549    1 0 MISMATCH
791864042    1 0 MISMATCH
791864526    1 0 MISMATCH
791865010    1 0 MISMATCH
791866075    0 0 
791867257    1 0 MISMATCH
791868047    0 0 
791869126    1 0 MISMATCH
791869610    1 0 MISMATCH
791870094    0 0 
791870677    1 0 MISMATCH
791871488    1 0 MISMATCH
791871981    1 0 MISMATCH
791872467    1 0 MISMATCH
791872951    0 0 
791873619    1 0 MISMATCH
791874112    0 0 
791874964    1 0 MISMATCH
791875448    1 0 MISMATCH
791875932    1 0 MISMATCH
791876416    0 0 
791877351    1 0 MISMATCH
791877837    1 0 MISMATCH
791878330    1 0 MISMATCH
791878903    1 0 MISMATCH
791879389    1 0 MISMATCH
791879882    1 0 MISMATCH
791880531    1 0 MISMATCH
791881015    1 0 MISMATCH
791882077    1 0 MISMATCH
791882608    0 0 
791883337    1 0 MISMATCH
791883868    0 0 
791885206    0 0 
791885969    0 0 
791886942    0 0 
791887784    1 0 MISMATCH
791888268    0 0 
791889271    1 0 MISMATCH
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 
0            0 0 

However, it seems that the corrupted index was simply shorter, because:

good offset: 791889271
bad  offset: 791831826

Notice that 791831826 is the last offset before the zeroed-out section begins.

The other thing I noticed was near the middle of the log, starting at seq 13505:

Offset, bit from good, bit from corrupted

9954055      0 0 
9956153      1 1 
9959193      1 1 
9961472      0 1 MISMATCH
9964632      0 1 MISMATCH
9966698      0 1 MISMATCH
9969830      0 1 MISMATCH
9971695      1 1 
9974117      1 1 
9974601      0 0 

In log.bipf, I noticed that those 4 messages are all encrypted.

@arj03
Member

arj03 commented Apr 12, 2021

Oh that is quite interesting. Let me check that.

@arj03
Member

arj03 commented Apr 12, 2021

There is a diff in canDecrypt:

missing in corrupt [
  2511068,  2512345,  2773002,
  2773795,  2776173,  2784821,
  5530979,  5537442,  5561325,
  5564875,  5569051,  5589248,
  5593390,  5607279,  8450162,
  8450955,  8452716,  9956153,
  9961472,  9964632,  9966698,
  9969830, 12084820, 12085613
]
missing in ok []

And even more diffs in encrypted:

missing in corrupt [
  2101293, 2112364, 2121201, 2122619, 2141595, 2142564, 2144665,
  2186948, 2189466, 2193294, 2196100, 2198524, 2200334, 2201539,
  2202940, 2207331, 2208472, 2211643, 2213197, 2214338, 2221175,
  2223016, 2223985, 2225206, 2226399, 2256770, 2286345, 2287294,
  2290086, 2291867, 2295980, 2304452, 2305401, 2341310, 2342763,
  2403704, 2405376, 2406149, 2406926, 2511068, 2512345, 2572596,
  2611872, 2612953, 2615132, 2621440, 2624816, 2633542, 2634915,
  2663236, 2723449, 2739636, 2743980, 2745369, 2748512, 2749901,
  2754176, 2773002, 2773795, 2776173, 2784821, 2808505, 2810042,
  2811566, 2813269, 2814546, 2826278, 2828557, 2829934, 2831327,
  2832504, 2833793, 2835190, 2840071, 2842678, 2844503, 2845944,
  2848323, 2849844, 2851856, 2853229, 2854554, 2855795, 2859050,
  2870953, 2872206, 2877092, 2881978, 2884054, 2894733, 2896354,
  2897799, 2908509, 2910318, 2913276, 2915013, 2923078, 2925315,
  2929769, 2930894,
  ... 1421 more items
]
missing in ok []

Was trying a few things, no dice sadly.

@arj03
Member

arj03 commented Apr 12, 2021

@Barbarrosa async log should have been built so that nothing that has not been written to disk is streamed to the indexes. I'd be really interested in a test case that shows that is not the case :)

@arj03
Member

arj03 commented Apr 12, 2021

I'm reminded of this old bug report in atomic-file. I wonder if we should start adding content hashing to detect non-atomic moves. It just seems like someone must already have a good solution for this problem on Android.
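
The hashing idea would be something along these lines (just a sketch, not how atomic-file currently works):

const crypto = require('crypto');
const fs = require('fs');

// Prepend a SHA-256 of the payload when writing; on load, reject any file whose
// hash no longer matches (e.g. a torn write or a non-atomic move).
function writeWithHash(filename, payload) {
  const hash = crypto.createHash('sha256').update(payload).digest();
  fs.writeFileSync(filename, Buffer.concat([hash, payload]));
}

function readWithHash(filename) {
  const data = fs.readFileSync(filename);
  const expected = crypto.createHash('sha256').update(data.slice(32)).digest();
  if (!data.slice(0, 32).equals(expected)) {
    throw new Error('corrupted file: ' + filename);
  }
  return data.slice(32);
}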

@staltz
Member Author

staltz commented Apr 12, 2021

I never thought of the hash idea. Sounds interesting!

@Barbarrosa
Contributor

Barbarrosa commented Apr 13, 2021

@staltz @arj03 The data loss via differently ordered writes seems a little difficult to reproduce reliably, but I think this Node.js issue roughly gets at what I'm thinking.

Side notes:
While I was looking into this, I noticed an issue about log write memory usage that you may find interesting. I also encountered an interesting paper from the 2020 USENIX conference discussing how database systems often fail to fully account for filesystem reliability issues.

@arj03
Member

arj03 commented Apr 13, 2021

That paper is very interesting, thanks for linking it @Barbarrosa. Especially the https://github.com/WiscADSL/cuttlefs tool, which can be used to test a solution for fs corruption.

@staltz
Member Author

staltz commented Apr 13, 2021

I wonder if and how flumedb accounted for FS failures, because I haven't bumped into these kinds of corruptions in years of using flumedb in ssb.

@arj03
Member

arj03 commented Apr 13, 2021

So it seems quite similar to what we do in db2 (level + atomically).

@arj03
Member

arj03 commented Apr 13, 2021

The lingering tmp files are a bad sign. I can see that atomically should clean up the files on exit, but that probably doesn't work if the process is killed. It needs a garbage collection step, and the files hint that the process was shut down during a write at least 4-5 times.
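
The garbage collection step could be as simple as this sketch on startup (the '.tmp' suffix is an assumption about how the temp files are named, not atomically's actual naming scheme):

const fs = require('fs');
const path = require('path');

// On startup, delete leftover temp files from writes that were interrupted
// by a killed process. The '.tmp' pattern is an assumed naming convention.
function cleanupTmpFiles(indexesDir) {
  for (const name of fs.readdirSync(indexesDir)) {
    if (name.includes('.tmp')) {
      fs.unlinkSync(path.join(indexesDir, name));
    }
  }
}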

@staltz
Member Author

staltz commented Apr 13, 2021

Yeah, I'm curious if we could bring back atomic-file and improve it with the hash idea and other things. It's also a library that we control and understand, even if it's a bit abandoned currently.

@arj03
Member

arj03 commented Apr 13, 2021

Would be fine with that. I would like to keep the browser part because, if I remember correctly, atomic-file does some things in the browser that are slower than https://github.com/ssb-ngi-pointer/atomically-universal/blob/master/index.js.

About hashing, here is a thread on speed.

@arj03
Member

arj03 commented Apr 13, 2021

Was reading up on things in this thread. Android uses ext4 as the filesystem, so rename should be atomic, BUT without the fsync() the metadata may be written before the data, which, if there was a crash in between, would cause the newly renamed file to have partial/empty data. This is from the nice thread here on is-rename-without-fsync-safe. I think the combination of adding hashes & adding fsync after write to atomic-file should get us a long way here.
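
So the write path would follow the classic pattern from that thread, roughly this sketch (the general POSIX recipe, not atomic-file's current code):

const fs = require('fs');
const path = require('path');

// Durable replace: write to a temp file, fsync it, rename over the target,
// then fsync the directory so the rename itself is persisted.
function atomicWriteSync(filename, data) {
  const tmp = filename + '.tmp';
  const fd = fs.openSync(tmp, 'w');
  fs.writeSync(fd, data);
  fs.fsyncSync(fd);             // data hits disk before the rename
  fs.closeSync(fd);
  fs.renameSync(tmp, filename); // atomic on ext4
  const dirFd = fs.openSync(path.dirname(filename), 'r');
  fs.fsyncSync(dirFd);          // persist the rename (the directory entry)
  fs.closeSync(dirFd);
}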

@staltz
Member Author

staltz commented Apr 13, 2021

Wow, good find!

@staltz
Member Author

staltz commented Apr 14, 2021

@arj03 I've been trying to reproduce this on Termux using https://github.com/staltz/jitdb-files-fs, but no luck so far.

arj03 mentioned this issue Apr 18, 2021 · Merged
@staltz
Member Author

staltz commented Apr 19, 2021

I'm very thankful you worked on this one, @arj03. I'll update and let it run for some weeks to see whether we bump into more corruptions. If I find something, I can open another issue, but let's hope there isn't anything.
