MDB_NOTFOUND: No matching key/data pair found after full disk #587

Open
datryn-ribdun opened this issue Jan 21, 2024 · 12 comments

@datryn-ribdun

datryn-ribdun commented Jan 21, 2024

Just found out that the VPS hosting my 3 urbit ships had run out of disk space, so those ships had crashed. I ran chop on 2 of them and they started working fine. My main (~datryn-ribdun) was giving me a "loom corrupt" error that mentioned "north" (the exact error is lost after some VPS reboots), but I remembered that deleting <pier>/.urb/chk triggers a replay of all events, which resolves snapshot corruption issues.
Now when I run datryn-ribdun/.run --loom 32 (my usual command) I get the typical start lines

urbit 2.12
boot: home is /home/urbit/urbit-ships/datryn-ribdun
loom: mapped 2048MB
lite: arvo formula 2a2274c9
lite: core 4bb376f0
lite: final state 4bb376f0
loom: mapped 4096MB
boot: protected loom
live: logical boot
boot: installed 661 jets
------------------playback starting ----------------------
pier: replaying events 1-2907645618
lmdb: read: initial cursor_get failed at 1: MDB_NOTFOUND: No matching key/data pair found
pier: disk read bail

I also tried <pier>/.run vile and got vile: unable to extract key file
I was pretty confident that deleting <pier>/.urb/chk was safe, but now I'm worried I somehow deleted some key file. <pier>/.urb/log is still 37G, so I believe I still have my event history. Any ideas how to proceed? I'd hate to have to breach my main.

Vere 2.12

@mrdomino
Contributor

mrdomino commented Jan 21, 2024

Issue is also present for roll on the develop branch. AFAICT it happens any time the checkpoint is deleted on a pier that has been rolled or (apparently) chopped. Nothing to do with the full disk.

@mrdomino
Contributor

May not be that cut-and-dried. I should say instead: I have experienced the MDB_NOTFOUND error as well on piers that have been rolled on 3.0 prerelease.

My testing so far (IIRC - this was yesterday or so) has revealed, all on 3.0 prerelease:

  • Delete chk, no roll: pier replays events successfully
  • Delete chk, roll: MDB_NOTFOUND
  • Don't delete chk, roll: no errors

@datryn-ribdun
Author

^^ Seems like a related issue, but I never ran roll and never even had a successful chop because it was complaining about loom being corrupted.

@mrdomino
Contributor

The roll issue seems easily resolvable; just a matter of the correct checkpoint not being copied in. Manually copying in the north.bin and south.bin from the checkpoint fixes it.

Are there any contents under .urb/chk on your pier? In the error state, I had a north.bin and south.bin that were both size 0.
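The zero-size check described above can be scripted. This is a minimal sketch that assumes only the layout mentioned in this thread (`<pier>/.urb/chk/north.bin` and `south.bin`); the function name `check_chk` is made up for illustration and is not part of vere.

```shell
#!/bin/sh
# Hypothetical helper: report whether the snapshot files under
# <pier>/.urb/chk are missing or empty. The file names come from this
# thread; nothing here inspects the files' contents.
check_chk() {
  pier="$1"
  rc=0
  for name in north.bin south.bin; do
    file="$pier/.urb/chk/$name"
    if [ ! -e "$file" ]; then
      echo "missing: $file"
      rc=1
    elif [ ! -s "$file" ]; then
      # -s is true only for files larger than 0 bytes
      echo "empty (0 bytes): $file"
      rc=1
    else
      echo "ok: $file"
    fi
  done
  return "$rc"
}

# Usage (path is a placeholder):
#   check_chk /path/to/pier || echo "checkpoint is not usable"
```

A non-zero return only says the files are absent or empty. It does not say anything about whether a non-empty snapshot is valid.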

@datryn-ribdun
Author

datryn-ribdun commented Jan 27, 2024

Yup I see,

-rw-rw-r-- 1 urbit urbit    0 Jan 15 22:38 north.bin
-rw-rw-r-- 1 urbit urbit    0 Jan 15 22:38 south.bin

urbit is my user on this vm.

@datryn-ribdun
Author

I just tried rm -r .urb/chk followed by ./.run play and get the following

loom: mapped 2048MB
boot: protected loom
live: logical boot
boot: installed 661 jets
lmdb: read: initial cursor_get failed at 1: MDB_NOTFOUND: No matching key/data pair found
boot: read failed
mars: boot fail

@mrdomino
Contributor

You don't have any other good checkpoints, e.g. under bhk?

@datryn-ribdun
Author

datryn-ribdun commented Jan 28, 2024

I had no idea bhk was a backup that could be swapped in for chk.
Tried a cp bhk/* chk/ and started the ship. It's been replaying for a few hours, so hopefully this will work.
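A slightly more defensive version of the `cp bhk/* chk/` step might look like the sketch below. It assumes, based only on this thread, that `.urb/bhk` holds `north.bin` and `south.bin` and that the ship is stopped. The function name `restore_chk_from_bhk` and the `chk.saved.<pid>` directory name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy snapshot files from <pier>/.urb/bhk into
# <pier>/.urb/chk, keeping a copy of whatever was in chk beforehand.
# Assumption: the ship is not running while this executes.
restore_chk_from_bhk() {
  pier="$1"
  chk="$pier/.urb/chk"
  bhk="$pier/.urb/bhk"

  # Refuse to proceed if the backup itself is missing or empty.
  for name in north.bin south.bin; do
    if [ ! -s "$bhk/$name" ]; then
      echo "refusing: $bhk/$name is missing or empty" >&2
      return 1
    fi
  done

  # Preserve the current chk directory instead of overwriting it blindly.
  if [ -d "$chk" ]; then
    saved="$pier/.urb/chk.saved.$$"
    if [ -e "$saved" ]; then
      echo "refusing: $saved already exists" >&2
      return 1
    fi
    cp -Rp "$chk" "$saved" || return 1
    echo "previous chk saved to $saved"
  fi

  mkdir -p "$chk" || return 1
  cp -p "$bhk/north.bin" "$bhk/south.bin" "$chk"/
}

# Usage (path is a placeholder):
#   restore_chk_from_bhk /path/to/pier
```

Note that on this pier the restore let replay start but did not end in a working ship, as the later comments show, so this is a recovery attempt rather than a guaranteed fix.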

Assuming this fixes things, there's probably 2 things that could be improved with vere:

  1. If there is no .urb/chk/ directory, why does vere make one, then create 0-byte north.bin and south.bin files, and then complain that "No matching key/data pair found"? It seems like it should fail before that point with something like "No bin files found, did you delete chk/? Try moving the .bin files from .urb/bhk into .urb/chk." IDK on the wording, but some way to not scare the user into thinking their ship is perma-broken.
  2. Not filling the disk to 0 bytes remaining. Once the disk is full it's a pain to have to find something to delete, then chop, then boot the ship to make sure things work, then delete the backup chop. I might be overfitting and thinking this is a more general problem than it actually is, but anyone who runs on a cheap VPS probably runs on <100GB of disk, and a well-used ship can easily pass that if you're not regularly chopping.
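Until something like the second point exists in vere, an operator-side guard is one option. The sketch below is not a vere feature: it uses POSIX `df -Pk` to check free space on the filesystem holding a path, and the function names and any threshold you pass in are illustrative assumptions.

```shell
#!/bin/sh
# Print the free space, in kilobytes, on the filesystem that holds "$1".
# With -P, df prints one header line and then one line per filesystem,
# where the fourth column is the available space.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least min_kb kilobytes are free for the given path.
has_headroom() {
  path="$1"
  min_kb="$2"
  avail=$(free_kb "$path")
  [ -n "$avail" ] && [ "$avail" -ge "$min_kb" ]
}

# Usage (path and threshold are placeholders; 5242880 KB is 5 GB):
#   has_headroom /path/to/pier 5242880 || echo "low disk: chop before it fills up"
```

Run from cron or before starting a ship, a check like this gives a warning while there is still room to chop.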

@datryn-ribdun
Author

datryn-ribdun commented Jan 28, 2024

After many hours of

pier: ($event_number): play: done
pier: ($event_number+1): play: done

my terminal was spammed with

recover: top: meme

recover: top: meme

recover: top: meme
.....
....
loom: external fault: 0x50

@datryn-ribdun
Author

Trying again with ./.run play --loom 32 and killing all other RAM heavy processes running on this VPS.

@datryn-ribdun
Author

Tried with above command and even ./.run --loom 33 thinking that maybe adding some loom headroom would help, but every time I hit the same issue of

recover: top: meme
loom: external fault: 0x50 (0x20000000 : 0x280000000)

Assertion '0' failed in pkg/noun/manage.c:1791
home:bailing out
Aborted
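For sizing the `--loom` values tried above: the logs in this thread show `loom: mapped 4096MB` when running with `--loom 32` and `loom: mapped 2048MB` for a run without the flag. That is consistent with the flag being a base-2 exponent of the loom size in bytes. Treat this as an inference from those two log lines, not a statement from the vere documentation. Under that assumption, the arithmetic is:

```shell
#!/bin/sh
# Hypothetical helper: convert a --loom value to megabytes, ASSUMING the
# flag is log2 of the loom size in bytes (inferred from this thread's logs).
loom_mb() {
  bits="$1"
  echo $(( (1 << bits) / 1048576 ))
}

# Usage:
#   loom_mb 32   # 4096, matching "loom: mapped 4096MB" in the first log
#   loom_mb 33   # 8192
```

If that reading is right, `--loom 33` asks for 8192MB, which is worth comparing against the RAM and swap actually available on the VPS.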

@Tenari

Tenari commented May 13, 2024

seems related urbit/urbit#6989
