[Forknet] Database trimmer that preserves only flat state. #11310
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

    @@            Coverage Diff             @@
    ##           master   #11310      +/-   ##
    ==========================================
    - Coverage   70.98%   70.90%   -0.09%
    ==========================================
      Files         782      785       +3
      Lines      156205   157066     +861
      Branches   156205   157066     +861
    ==========================================
    + Hits       110889   111370     +481
    - Misses      40496    40875     +379
    - Partials     4820     4821       +1
@marcelo-gonzalez could you review this PR?
    for kv in iter {
        let (key, value) = kv.unwrap();
        (options.callback)(&key, Some(&value));
        if Instant::now() > min_time_to_split {
If min_time_to_split is way far in the future, it feels a little sad to me to be putting an Instant::now() and its associated syscall in this otherwise tight loop, no? I'm not sure what exactly it is we're trying to avoid/achieve here, but I wonder if there's something better, because my guess is that calling Instant::now() on every key/value returned by the iterator is actually a pretty nontrivial slowdown.
That's what I thought too, but given how fast the parallel iteration is, it doesn't seem to be a noticeable bottleneck.
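For reference, a common mitigation for the syscall-per-iteration concern is to amortize the clock read by only checking the deadline every few thousand items. This is a minimal, self-contained sketch with hypothetical names (`process_until`, `CHECK_INTERVAL`), not the PR's actual code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical: check the wall clock only once per CHECK_INTERVAL items,
/// so the tight loop pays for at most one syscall per batch.
const CHECK_INTERVAL: usize = 4096;

fn process_until(deadline: Instant, items: impl Iterator<Item = u64>) -> u64 {
    let mut processed = 0u64;
    for (i, _item) in items.enumerate() {
        processed += 1;
        // ... the per-key/value callback would go here ...
        if i % CHECK_INTERVAL == 0 && Instant::now() > deadline {
            break; // time to consider splitting the range
        }
    }
    processed
}

fn main() {
    // A generous deadline: a small finite iterator runs to completion.
    let later = Instant::now() + Duration::from_secs(60);
    assert_eq!(process_until(later, 0..10_000), 10_000);

    // An already-expired deadline: the loop stops at the very first check.
    if let Some(past) = Instant::now().checked_sub(Duration::from_secs(1)) {
        assert_eq!(process_until(past, 0..1_000_000), 1);
    }
    println!("ok");
}
```

The trade-off is split latency: with an interval of 4096, a worker may overshoot the deadline by up to 4095 items before noticing.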
    if Instant::now() > min_time_to_split {
        if shared_state
            .num_threads_free
            .load(std::sync::atomic::Ordering::Relaxed)
Since it's just Relaxed, if you read it and it's greater than 0, I think there's no guarantee that there's at least 1 thread that is not currently in the loop doing some work, right? I bet it's not a big deal in practice, especially on x86, and there's also the shared_state.thread_work.lock() right after setting it, which is probably enough of a barrier. But in any case, Relaxed feels a little fragile to me for what's being done here.
There's no guarantee about anything in num_threads_free, and it doesn't matter: it's just a signal suggesting to the current workers that it's probably wise to split. Split or not, it doesn't affect correctness. Hence the use of Relaxed.
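To illustrate the point being debated: a Relaxed counter gives no cross-thread ordering on its own, but it is fine as an advisory hint, and any later synchronization point (a mutex lock, or join()) makes the increments visible. A minimal sketch under those assumptions (hypothetical `run_workers` helper, not the PR's code):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical sketch: workers bump an advisory counter with Relaxed
/// ordering. While the threads run, readers may see a stale value, which
/// is acceptable for a "maybe split now" hint; correctness would rest on
/// the mutex-protected work queue, not on this counter.
fn run_workers(n: usize) -> usize {
    let free = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let free = Arc::clone(&free);
            thread::spawn(move || {
                // ... worker finishes its range ...
                free.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // join() establishes happens-before, so all Relaxed increments are
    // visible here even though they raced while the threads ran.
    free.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(run_workers(4), 4);
    println!("ok");
}
```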
                break;
            }
        }
    for byte in start[byte_not_ff] + 1..=0xff {
What if end is empty and start == [0xff]? Then, forgetting about what we want to do about splitting, this start[byte_not_ff] + 1 will panic with overflow. This triggers the panic:
diff --git a/core/store/src/parallel_iter.rs b/core/store/src/parallel_iter.rs
index eb7c6d7b9..ce3998c2c 100644
--- a/core/store/src/parallel_iter.rs
+++ b/core/store/src/parallel_iter.rs
@@ -489,6 +489,9 @@ mod tests {

     #[test]
     fn test_iteration_range_divide() {
+        let ranges = IterationRange::from_str("ff..").unwrap().divide();
+        println!("{}", ranges[0].to_string());
+
         let ranges = IterationRange::from_str("..").unwrap().divide();
         assert_eq!(ranges.len(), 257);
         assert_eq!(ranges[0].to_string(), "..00");
Haha... nice catch. Well I'm removing all this code so, phew. This code is so messy. I wonder if there's a good way to make it cleaner.
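Even though that code is being removed, the panic is a good illustration of why byte-increment logic in range splitting wants checked arithmetic. A hypothetical sketch (the `next_bytes` helper is invented for illustration, not the PR's API): checked_add returns None instead of panicking when the byte is already 0xff.

```rust
/// Hypothetical: compute the next prefix after incrementing the byte at
/// `idx`. Returns None when that byte is 0xff, where `start[idx] + 1`
/// would overflow and panic in a debug build.
fn next_bytes(start: &[u8], idx: usize) -> Option<Vec<u8>> {
    let next = start[idx].checked_add(1)?;
    let mut out = start[..idx].to_vec();
    out.push(next);
    Some(out)
}

fn main() {
    assert_eq!(next_bytes(&[0x41], 0), Some(vec![0x42]));
    // The case from the review: start == [0xff] no longer panics.
    assert_eq!(next_bytes(&[0xff], 0), None);
    println!("ok");
}
```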
    let mut update = store.store_update();
    let mut data = BTreeMap::new();
    for _ in 0..1000000 {
        let key_len = rand::random::<usize>() % 10;
could consider using a hardcoded seed in these rand calls so if it ever fails it will be easy to reproduce
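The benefit of a hardcoded seed is that a failing run regenerates exactly the same test data. In the PR this would mean seeding rand (e.g. StdRng::seed_from_u64); the sketch below substitutes a tiny SplitMix64 generator so it stays self-contained, which is an assumption for illustration, not the rand API:

```rust
/// Stand-in for a seeded RNG (e.g. rand's StdRng::seed_from_u64):
/// SplitMix64 is deterministic, so the same seed always yields the
/// same sequence of key lengths.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

fn main() {
    // Same hardcoded seed -> identical key lengths on every run,
    // so a test failure is trivially reproducible.
    let mut a = SplitMix64(42);
    let mut b = SplitMix64(42);
    for _ in 0..1000 {
        assert_eq!(a.next() as usize % 10, b.next() as usize % 10);
    }
    println!("ok");
}
```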
    StoreParallelIterator::lookup_keys(
        store.clone(),
        DBCol::State,
        keys,
If there are no inlined keys, we will hit the assert in WorkItem::new_lookup(). This can actually happen; I hit it on localnet.
    // Now that we've read all the non-inlined keys into memory, delete the State column, and
    // write back these values.
    let mut update = store.store_update();
    update.delete_all(DBCol::State);
So I wonder if it's worth adding a way to delete a column like this via drop_cf(), like we do for checkpoints, because otherwise the space won't be cleared up until a compaction happens at some point. Obviously not a big deal to implement that in this PR, but in the future it would probably be good.
Aggressive trimming is just a testing tool; it isn't meant to shrink the db size. But you're right, when it comes time to maybe actually delete most of the State column, then yeah, drop_cf would be awesome.
        .join(near_config.config.store.path.clone().unwrap_or_else(|| PathBuf::from("data")));
    let temp_store_home = store_path.join("temp-trimmed-db");
    if std::fs::metadata(&temp_store_home).is_ok() {
        panic!("Temp trimmed DB already exists; please delete temp-trimmed-db in the data directory first");
why not return an anyhow::Error instead of panicking?
Oh, sure.
    let (tx, rx) = std::sync::mpsc::sync_channel(1000000);
    let iter_handle = {
        let old = old.clone();
        std::thread::spawn(move || {
Not a big deal, but why do we need a new thread for this? This new thread will just be spawning other ones and then waiting in join(), right? Couldn't we just do the joining later, whenever we feel like it?
Well, both sides of the channel need to be operated on in parallel. So either this is in a thread, or the consumer side is in a thread.
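The constraint being described: with a bounded sync_channel, the producer blocks once the buffer fills, so producer and consumer must run concurrently on different threads or the pipeline deadlocks. A minimal sketch (hypothetical `pipe_sum` helper, not the PR's code):

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical: produce n items through a bounded channel on one thread
/// while the current thread drains it. If both sides ran on one thread,
/// the producer would block forever once the 1000-slot buffer filled.
fn pipe_sum(n: u64) -> u64 {
    let (tx, rx) = mpsc::sync_channel::<u64>(1000);
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap(); // blocks when the buffer is full
        }
        // tx is dropped here, closing the channel and ending rx.iter().
    });
    let sum: u64 = rx.iter().sum(); // consumer drains on this thread
    producer.join().unwrap();
    sum
}

fn main() {
    assert_eq!(pipe_sum(100_000), (0..100_000u64).sum());
    println!("ok");
}
```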
    StoreParallelIterator::lookup_keys(
        old,
        DBCol::State,
        non_inlined_keys,
Again, I hit the same assert here because there were no non-inlined keys. If you want to reproduce: I ran python3 tests/sanity/transactions.py with this change:
diff --git a/pytest/tests/sanity/transactions.py b/pytest/tests/sanity/transactions.py
index c628fbee0..a936e1e56 100755
--- a/pytest/tests/sanity/transactions.py
+++ b/pytest/tests/sanity/transactions.py
@@ -23,7 +23,7 @@ nodes = start_cluster(
num_shards=4,
config=None,
genesis_config_changes=[["min_gas_price",
- 0], ["max_inflation_rate", [0, 1]],
+ 0], ["use_production_config", True], ["max_inflation_rate", [0, 1]],
["epoch_length", 10],
["block_producer_kickout_threshold", 70]],
client_config_changes={
@@ -33,7 +33,8 @@ nodes = start_cluster(
"secs": 2,
"nanos": 0
}
- }
+ },
+ "tracked_shards": [0, 1, 2, 3]
},
1: {
"consensus": {
@@ -41,7 +42,8 @@ nodes = start_cluster(
"secs": 2,
"nanos": 0
}
- }
+ },
+ "tracked_shards": [0, 1, 2, 3]
},
2: {
"consensus": {
@@ -49,7 +51,8 @@ nodes = start_cluster(
"secs": 2,
"nanos": 0
}
- }
+ },
+ "tracked_shards": [0, 1, 2, 3]
},
3: {
"consensus": {
@@ -57,7 +60,8 @@ nodes = start_cluster(
"secs": 2,
"nanos": 0
}
- }
+ },
+ "tracked_shards": [0, 1, 2, 3]
},
4: {
"consensus": {
@@ -72,7 +76,7 @@ nodes = start_cluster(
started = time.time()
-act_to_val = [4, 4, 4, 4, 4]
+act_to_val = [0, 0, 0, 0, 0]
ctx = utils.TxContext(act_to_val, nodes)
@@ -97,8 +101,12 @@ cur_balances = ctx.get_balances()
assert cur_balances == ctx.expected_balances, "%s != %s" % (
cur_balances, ctx.expected_balances)
+killed_it = False
# we are done with the sanity test, now let's stress it
-for height, _ in utils.poll_blocks(nodes[4], timeout=TIMEOUT):
+for height, _ in utils.poll_blocks(nodes[0], timeout=TIMEOUT):
+ if height >= 60 and not killed_it:
+ nodes[4].kill()
+ killed_it=True
if ctx.get_balances() == ctx.expected_balances:
count = height - sent_height
logger.info(f'Balances caught up, took {count} blocks, moving on')
and then after that test finished, I ran:
$ python3 tests/mocknet/local_test_node.py local-test-setup --num-nodes 2 --neard-binary-path ~/nearcore/target/debug/neard --source-home-dir ~/.near/test0_finished --target-home-dir ~/.near/test0_finished --yes
$ python3 tests/mocknet/mirror.py --local-test new-test --epoch-length 200 --num-validators 2 --num-seats 100 --genesis-protocol-version 67 --yes
(If the new-test command fails with some timeout error, just try again; I don't know exactly why that happens.) But then if you look at the logs in ~/.near/local-mocknet/node0/stderr, you'll see that the fork-network command panicked.
So really the only comments that made me not click approve are the ones that actually cause panics; the others are just suggestions.
Thanks so much for the really diligent review @marcelo-gonzalez! After implementing parallel memtrie load, I think this parallel iterator can also be scrapped and replaced with the parallel memtrie load logic, because (1) it just uses rayon, so there are far fewer of these corner cases, performance problems, and panics; and (2) invoking part of that code is needed anyway, since supporting parallel memtrie load requires some additional State entries, and that code figures out exactly which state entries are needed. I'll do that as soon as I get a chance.
@marcelo-gonzalez Could you try #11420? This is the new implementation I talked about. I can't make a reviewable PR out of it yet, because the parallel loading PR is not merged and I need to clean up some of the changes that are no longer needed because of this. But in the meantime, if you have time, you could tell me whether it works :)
Closing this in favor of #11420 |
Closes #11280
Includes two new commands:
- neard database aggressive-trimming --obliterate-disk-trie performs the deletion of all State keys except those large values needed by flat storage. This is used to test that memtrie-only nodes can work.
- neard fork-network finalize now rebuilds the database by copying out FlatState as well as the State keys that are needed for large values, and then swapping in the new database, replacing the old.

@marcelo-gonzalez Not sure how to test the fork-network changes.