[Forknet] Database trimmer that preserves only flat state. #11310

Closed
wants to merge 2 commits

Conversation

robin-near (Contributor):

Closes #11280

Includes two new commands:

  • neard database aggressive-trimming --obliterate-disk-trie performs the deletion of all State keys except those large values needed by flat storage. This is used to test that memtrie-only nodes can work.
  • neard fork-network finalize now rebuilds the database by copying out FlatState as well as the State keys that are needed for large values, then swapping in the new database to replace the old.

@marcelo-gonzalez Not sure how to test the fork-network changes.

@robin-near robin-near requested a review from a team as a code owner May 14, 2024 18:15

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 55.54286%, with 389 lines in your changes missing coverage. Please review.

Project coverage is 70.90%. Comparing base (1193219) to head (da63b16).
Report is 1 commit behind head on master.

Files Patch % Lines
tools/database/src/aggressive_trimming.rs 0.00% 160 Missing and 1 partial ⚠️
tools/fork-network/src/trim_database.rs 0.00% 135 Missing ⚠️
core/store/src/parallel_iter.rs 94.26% 23 Missing and 5 partials ⚠️
tools/fork-network/src/cli.rs 0.00% 22 Missing ⚠️
core/store/src/db/splitdb.rs 0.00% 15 Missing ⚠️
core/store/src/db/mixeddb.rs 0.00% 11 Missing ⚠️
core/store/src/db/colddb.rs 0.00% 9 Missing ⚠️
core/store/src/db/rocksdb.rs 0.00% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11310      +/-   ##
==========================================
- Coverage   70.98%   70.90%   -0.09%     
==========================================
  Files         782      785       +3     
  Lines      156205   157066     +861     
  Branches   156205   157066     +861     
==========================================
+ Hits       110889   111370     +481     
- Misses      40496    40875     +379     
- Partials     4820     4821       +1     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.39% <0.00%> (-0.01%) ⬇️
integration-tests 36.98% <0.00%> (-0.22%) ⬇️
linux 68.96% <55.54%> (-0.11%) ⬇️
linux-nightly 70.37% <55.54%> (-0.10%) ⬇️
macos 52.54% <55.54%> (+1.65%) ⬆️
pytests 1.61% <0.00%> (-0.02%) ⬇️
sanity-checks 1.40% <0.00%> (-0.01%) ⬇️
unittests 65.36% <55.54%> (-0.05%) ⬇️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


bowenwang1996 (Collaborator) left a comment:

@marcelo-gonzalez could you review this PR?

for kv in iter {
let (key, value) = kv.unwrap();
(options.callback)(&key, Some(&value));
if Instant::now() > min_time_to_split {
Contributor:

If min_time_to_split is far in the future, it feels a little sad to me to be putting an Instant::now() call and its associated syscall in this otherwise tight loop, no? I'm not sure exactly what we're trying to avoid/achieve here, but I wonder if there's something better, because my guess is that calling Instant::now() on every key/value returned by the iterator is actually a pretty non-trivial slowdown.

Contributor Author:

That's what I thought too, but given how fast the parallel iteration is, it doesn't seem to be a noticeable bottleneck.
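One standard way to amortize the clock read, if it ever did become a bottleneck, is to call Instant::now() only once every N items. This is a sketch with hypothetical names, not the PR's actual code:

```rust
use std::time::{Duration, Instant};

/// Processes items, consulting the wall clock only once per CHECK_INTERVAL
/// items, so the syscall cost is amortized across the tight loop.
/// Returns how many items were processed before the deadline check fired.
fn process_until_deadline(items: &[u64], deadline: Instant) -> usize {
    const CHECK_INTERVAL: usize = 1024;
    let mut processed = 0;
    for (i, _item) in items.iter().enumerate() {
        // ... per-item work would go here ...
        processed += 1;
        // Only every 1024th iteration pays for a clock read.
        if i % CHECK_INTERVAL == CHECK_INTERVAL - 1 && Instant::now() > deadline {
            break;
        }
    }
    processed
}
```

The trade-off is that the loop may overshoot the deadline by up to CHECK_INTERVAL items, which is usually acceptable for an advisory split signal like this one.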

if Instant::now() > min_time_to_split {
if shared_state
.num_threads_free
.load(std::sync::atomic::Ordering::Relaxed)
Contributor:

Since it's just Relaxed, if you read it and it's greater than 0, I think there's no guarantee that at least one thread is actually outside the loop not doing work, right? I bet it's not a big deal in practice, especially on x86, and the shared_state.thread_work.lock() right after setting it is probably enough of a barrier. But in any case, Relaxed feels a little fragile for what's being done here.

Contributor Author:

There's no guarantee about anything in num_threads_free. It doesn't matter anyway; it's just a signal suggesting to the current workers that it's probably wise to split. Split or not, it doesn't affect correctness — hence the use of Relaxed.
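The advisory-counter pattern being discussed can be sketched as follows (hypothetical, simplified code — in this pattern a stale Relaxed read only affects when workers decide to split, never correctness, and the final count is made visible by the join anyway):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Workers bump an advisory "free threads" counter with Relaxed ordering.
/// A reader may see a stale value mid-run, which only delays or triggers an
/// extra split; after join(), all increments are guaranteed visible.
fn count_free_after_join(num_workers: usize) -> usize {
    let free = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..num_workers)
        .map(|_| {
            let free = Arc::clone(&free);
            thread::spawn(move || {
                // Worker finishes its range and advertises itself as free.
                free.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    free.load(Ordering::Relaxed)
}
```

If the counter ever gated correctness (e.g. handing off ownership of data), Acquire/Release pairs would be needed instead; as a pure heuristic, Relaxed is the cheapest valid choice.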

break;
}
}
for byte in start[byte_not_ff] + 1..=0xff {
Contributor:

What if end is empty and start == [0xff]? Then, setting aside what we want to do about splitting, this start[byte_not_ff] + 1 will panic with an overflow.

This triggers the panic:

diff --git a/core/store/src/parallel_iter.rs b/core/store/src/parallel_iter.rs
index eb7c6d7b9..ce3998c2c 100644
--- a/core/store/src/parallel_iter.rs
+++ b/core/store/src/parallel_iter.rs
@@ -489,6 +489,9 @@ mod tests {
 
     #[test]
     fn test_iteration_range_divide() {
+        let ranges = IterationRange::from_str("ff..").unwrap().divide();
+        println!("{}", ranges[0].to_string());
+
         let ranges = IterationRange::from_str("..").unwrap().divide();
         assert_eq!(ranges.len(), 257);
         assert_eq!(ranges[0].to_string(), "..00");

Contributor Author:

Haha... nice catch. Well, I'm removing all this code, so — phew. This code is so messy; I wonder if there's a good way to make it cleaner.
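For reference, the bug boils down to incrementing a u8 that is already 0xff, which panics with an arithmetic overflow in debug builds. A guard with checked_add avoids it (the function name here is hypothetical, not from the PR):

```rust
/// Hypothetical sketch of the split-point step: advance the first
/// non-0xff byte of `start`. checked_add makes the start == [0xff] case
/// return None instead of panicking with "attempt to add with overflow".
fn next_split_byte(start: &[u8], byte_not_ff: usize) -> Option<u8> {
    start[byte_not_ff].checked_add(1)
}
```

The caller would then treat None as "this range cannot be split further" rather than iterating bytes past 0xff.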

let mut update = store.store_update();
let mut data = BTreeMap::new();
for _ in 0..1000000 {
let key_len = rand::random::<usize>() % 10;
Contributor:

Consider using a hardcoded seed in these rand calls, so that if the test ever fails it will be easy to reproduce.
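With the rand crate, this would mean constructing the generator via the SeedableRng trait (e.g. StdRng::seed_from_u64 with a fixed constant) instead of calling rand::random. A dependency-free sketch of the same reproducibility idea, using a tiny xorshift generator:

```rust
/// Minimal xorshift64 PRNG with an explicit seed, so a failing test run
/// can be replayed exactly — the same idea as seeding rand's StdRng.
/// Note: the seed must be nonzero (xorshift maps 0 to 0 forever).
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}
```

Two generators built from the same hardcoded seed produce identical sequences, so the random keys and values in the test become deterministic across runs.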

StoreParallelIterator::lookup_keys(
store.clone(),
DBCol::State,
keys,
Contributor:

If there are no non-inlined keys, we will hit the assert in WorkItem::new_lookup(). This can actually happen; I hit it on localnet.

// Now that we've read all the non-inlined keys into memory, delete the State column, and
// write back these values.
let mut update = store.store_update();
update.delete_all(DBCol::State);
Contributor:

So I wonder if it's worth adding a way to delete a column like this via drop_cf(), like we do for checkpoints, because otherwise the space won't be cleared up until a compaction happens at some point. Obviously not something to implement in this PR, but in the future it would probably be good.

Contributor Author:

Aggressive trimming is just a testing tool for this trimming; it isn't meant to shrink the DB size. But you're right: when it comes time to actually delete most of the State column, drop_cf would be awesome.

.join(near_config.config.store.path.clone().unwrap_or_else(|| PathBuf::from("data")));
let temp_store_home = store_path.join("temp-trimmed-db");
if std::fs::metadata(&temp_store_home).is_ok() {
panic!("Temp trimmed DB already exists; please delete temp-trimmed-db in the data directory first");
Contributor:

Why not return an anyhow::Error instead of panicking?

Contributor Author:

Oh, sure.
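A std-only sketch of the suggested change (the actual PR would presumably use anyhow::bail!; the function name here is illustrative):

```rust
use std::path::Path;

/// Returns an error instead of panicking when the temp trimmed DB
/// directory already exists, letting the CLI report it gracefully.
fn check_temp_db_absent(temp_store_home: &Path) -> Result<(), String> {
    if temp_store_home.exists() {
        return Err(format!(
            "Temp trimmed DB already exists; please delete {} first",
            temp_store_home.display()
        ));
    }
    Ok(())
}
```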

let (tx, rx) = std::sync::mpsc::sync_channel(1000000);
let iter_handle = {
let old = old.clone();
std::thread::spawn(move || {
Contributor:

Not a big deal, but why do we need a new thread for this? This new thread will just spawn other ones and then wait in join(), right? Again, not a big deal, but we could just do the joining later, right?

Contributor Author:

Well, both sides of the channel need to be operated on in parallel. So either this runs in a thread, or the consumer side does.
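The constraint being described: with a bounded sync_channel, the producer and consumer must run concurrently, or the producer blocks forever once the buffer fills. A minimal std-only illustration (not the PR's code):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// The producer runs on its own thread so the consumer can drain the
/// bounded channel concurrently. With this small buffer, running both
/// sides sequentially on one thread would deadlock after 16 sends.
fn produce_and_consume(n: u64) -> u64 {
    let (tx, rx) = sync_channel::<u64>(16);
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap(); // blocks whenever the buffer is full
        }
        // Dropping tx here closes the channel, ending the consumer loop.
    });
    let sum: u64 = rx.iter().sum();
    producer.join().unwrap();
    sum
}
```

Which side gets the thread is arbitrary; the requirement is only that the two ends make progress independently.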

StoreParallelIterator::lookup_keys(
old,
DBCol::State,
non_inlined_keys,
Contributor:

Again, I hit the same assert here because there were no non-inlined keys. To reproduce, I ran python3 tests/sanity/transactions.py with this change:

diff --git a/pytest/tests/sanity/transactions.py b/pytest/tests/sanity/transactions.py
index c628fbee0..a936e1e56 100755
--- a/pytest/tests/sanity/transactions.py
+++ b/pytest/tests/sanity/transactions.py
@@ -23,7 +23,7 @@ nodes = start_cluster(
     num_shards=4,
     config=None,
     genesis_config_changes=[["min_gas_price",
-                             0], ["max_inflation_rate", [0, 1]],
+                             0], ["use_production_config", True], ["max_inflation_rate", [0, 1]],
                             ["epoch_length", 10],
                             ["block_producer_kickout_threshold", 70]],
     client_config_changes={
@@ -33,7 +33,8 @@ nodes = start_cluster(
                     "secs": 2,
                     "nanos": 0
                 }
-            }
+            },
+            "tracked_shards": [0, 1, 2, 3]
         },
         1: {
             "consensus": {
@@ -41,7 +42,8 @@ nodes = start_cluster(
                     "secs": 2,
                     "nanos": 0
                 }
-            }
+            },
+            "tracked_shards": [0, 1, 2, 3]
         },
         2: {
             "consensus": {
@@ -49,7 +51,8 @@ nodes = start_cluster(
                     "secs": 2,
                     "nanos": 0
                 }
-            }
+            },
+            "tracked_shards": [0, 1, 2, 3]
         },
         3: {
             "consensus": {
@@ -57,7 +60,8 @@ nodes = start_cluster(
                     "secs": 2,
                     "nanos": 0
                 }
-            }
+            },
+            "tracked_shards": [0, 1, 2, 3]
         },
         4: {
             "consensus": {
@@ -72,7 +76,7 @@ nodes = start_cluster(
 
 started = time.time()
 
-act_to_val = [4, 4, 4, 4, 4]
+act_to_val = [0, 0, 0, 0, 0]
 
 ctx = utils.TxContext(act_to_val, nodes)
 
@@ -97,8 +101,12 @@ cur_balances = ctx.get_balances()
 assert cur_balances == ctx.expected_balances, "%s != %s" % (
     cur_balances, ctx.expected_balances)
 
+killed_it = False
 # we are done with the sanity test, now let's stress it
-for height, _ in utils.poll_blocks(nodes[4], timeout=TIMEOUT):
+for height, _ in utils.poll_blocks(nodes[0], timeout=TIMEOUT):
+    if height >= 60 and not killed_it:
+        nodes[4].kill()
+        killed_it=True
     if ctx.get_balances() == ctx.expected_balances:
         count = height - sent_height
         logger.info(f'Balances caught up, took {count} blocks, moving on')

and then, after that test finished, I ran:

$ python3 tests/mocknet/local_test_node.py local-test-setup --num-nodes 2 --neard-binary-path ~/nearcore/target/debug/neard --source-home-dir ~/.near/test0_finished --target-home-dir ~/.near/test0_finished --yes
$ python3 tests/mocknet/mirror.py --local-test new-test --epoch-length 200 --num-validators 2 --num-seats 100 --genesis-protocol-version 67 --yes

(If the new-test command fails with some timeout error, just try again; I don't know exactly why that happens.) But then, if you look at the logs in ~/.near/local-mocknet/node0/stderr, you'll see that the fork-network command panicked.

marcelo-gonzalez (Contributor):

So really, the only comments that kept me from clicking approve are the ones that actually cause panics; the others are just suggestions.

robin-near (Contributor Author):

Thanks so much for the really diligent review @marcelo-gonzalez !

So after implementing parallel memtrie load, I think this parallel iterator can also be scrapped and replaced with the parallel memtrie load logic. Reasons: (1) it just uses rayon, so there are far fewer of these corner cases, performance problems, and panics; (2) invoking part of that code is needed anyway, because supporting parallel memtrie load requires some additional State entries, and that code figures out exactly which State entries are needed.

I'll do that as soon as I get a chance.

robin-near (Contributor Author):

@marcelo-gonzalez Could you try #11420? This is the new implementation I talked about. I can't make a reviewable PR out of it yet, because the parallel loading PR is not merged and I need to clean up some of the changes that are no longer needed because of this. But in the meantime, if you have time, you could tell me whether it works. :)

robin-near (Contributor Author):

Closing this in favor of #11420

@robin-near robin-near closed this May 31, 2024

Successfully merging this pull request may close these issues.

[Forknet] Copy out State and FlatState instead of deleting all other columns
3 participants