feat: stabilize chunk producer assignments #11322
Conversation
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master   #11322      +/-   ##
==========================================
+ Coverage   71.09%   71.27%   +0.17%
==========================================
  Files         783      784       +1
  Lines      156826   157646     +820
  Branches   156826   157646     +820
==========================================
+ Hits       111492   112355     +863
+ Misses      40515    40450      -65
- Partials     4819     4841      +22

Flags with carried forward coverage won't be shown.
LGTM, thanks for the readability refactorings as well. But please wait for others' reviews.
@@ -734,6 +735,9 @@ impl EpochManager {
            )
        };
        let next_next_epoch_config = self.config.for_protocol_version(next_version);
        let next_shard_layout =
nit: is `next_shard_layout` the layout for the epoch just about to start, and `next_next_` the one after? How about renaming them: `next_shard_layout` -> `new_shard_layout` (to refer to the epoch just starting) and `next_next_shard_layout` -> `next_shard_layout` (to refer to the epoch after the new one)?
Personally I prefer `next` and `next_next`, it somehow makes sense to me :) Also I think it is a bit of a convention now.
Yeah, next next is a convention... Actually I think there is some issue, but I want to fix it in a follow-up PR as we are not going to reshard soon.
@@ -1223,6 +1262,7 @@ mod tests {
        }
    }

    #[allow(unused)]
why keep this one if unused?
    ///
    /// This function guarantees that, in order of priority:
    /// * every shard has at least `min_validators_per_shard` assigned to it;
    /// * chunk producer repeats are completely avoided is possible;
if possible?
LGTM, just a few nits.
        let next_shard_layout =
            self.config.for_protocol_version(epoch_protocol_version).shard_layout;
        let has_same_shard_layout = next_shard_layout == next_next_epoch_config.shard_layout;
nit: There is a method called `will_shard_layout_change` that would be a bit cleaner here. I'm not sure if you'll have access to it from here, so feel free to ignore.
It is accessible, but I can't use it for a different reason: it accepts the parent hash of the block at the end of the next epoch, which doesn't exist yet.
    for validators_for_role in [&chunk_producers, &block_producers, &chunk_validators].iter() {
        for validator in validators_for_role.iter() {
nit: `flatten` may be a bit shorter.
Ended up using `chain` because `flatten` was complaining that something is not convertible to an iterator.
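For illustration, a minimal self-contained sketch of the `chain` approach (toy data and names, not the PR's code):

```rust
fn main() {
    let chunk_producers = vec!["cp1", "cp2"];
    let block_producers = vec!["bp1"];
    let chunk_validators = vec!["cv1", "cv2"];
    // `chain` glues the three role iterators into one pass. With
    // `[&a, &b, &c].iter().flatten()` the items are `&&Vec<_>`, which does not
    // implement `IntoIterator`; presumably that was the type error mentioned.
    for validator in chunk_producers
        .iter()
        .chain(block_producers.iter())
        .chain(chunk_validators.iter())
    {
        println!("{validator}");
    }
}
```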
                prev_epoch_info
                    .chunk_producers_settlement()
                    .iter()
                    .map(|vs| {
                        vs.into_iter().map(|v| prev_epoch_info.get_validator(*v)).collect::<Vec<_>>()
                    })
                    .collect::<Vec<_>>(),
nit: I think this would be more readable as a for loop.
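For illustration, the for-loop version might look roughly like this (a sketch against the same `prev_epoch_info` calls; the variable names are made up):

```rust
// Collect, per shard, the stakes of the previous epoch's chunk producers.
let mut prev_chunk_producers = Vec::new();
for validator_ids in prev_epoch_info.chunk_producers_settlement() {
    let mut shard_chunk_producers = Vec::new();
    for validator_id in validator_ids {
        shard_chunk_producers.push(prev_epoch_info.get_validator(*validator_id));
    }
    prev_chunk_producers.push(shard_chunk_producers);
}
```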
    num_shards: NumShards,
    prev_chunk_producers_assignment: Option<Vec<Vec<ValidatorStake>>>,
) -> Vec<Vec<usize>> {
    if let Some(prev_assignment) = prev_chunk_producers_assignment {
nit: `let Some(prev_assignment) = prev_chunk_producers_assignment else { return vec![vec![]; num_shards as usize]; };`
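Applied to the function quoted here, the suggestion would read roughly as follows (a sketch; the trailing `todo!()` stands in for the rest of the body):

```rust
fn get_initial_chunk_producer_assignment(
    chunk_producers: &[ValidatorStake],
    num_shards: NumShards,
    prev_chunk_producers_assignment: Option<Vec<Vec<ValidatorStake>>>,
) -> Vec<Vec<usize>> {
    // With no previous assignment, start every shard with an empty producer list.
    let Some(prev_assignment) = prev_chunk_producers_assignment else {
        return vec![vec![]; num_shards as usize];
    };
    // ... continue with `prev_assignment`, one indentation level shallower
    // than the `if let` version.
    todo!()
}
```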
        prev_assignment
            .into_iter()
            .map(|vs| {
                vs.into_iter()
                    .map(|v| chunk_producer_indices.get(v.account_id()).copied())
                    .flatten()
                    .collect::<Vec<_>>()
            })
            .collect::<Vec<_>>()
nit: This would be more readable as a for loop and with full variable names (`vs`, `v`).
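A hypothetical rewrite along those lines, keeping the same filtering semantics as the `.map(...).flatten()` chain above:

```rust
let mut assignment = Vec::new();
for validator_stakes in prev_assignment {
    let mut shard_indices = Vec::new();
    for validator in validator_stakes {
        // Keep only chunk producers still present in the new validator set.
        if let Some(index) = chunk_producer_indices.get(validator.account_id()) {
            shard_indices.push(*index);
        }
    }
    assignment.push(shard_indices);
}
```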
/// Helper struct to maintain set of shards sorted by number of chunk producers.
struct ShardSetItem {
    shard_validators_num: usize,
    shard_index: usize,
}
nit: The comment says "number of chunk producers" but the field is called `shard_validators_num`. I'm not sure which one is correct, but can you make it consistent?
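For context, the set is presumably a `BTreeSet`, so the struct relies on derived lexicographic ordering; a consistently named version could look like this (the rename is hypothetical, not the PR's actual fix):

```rust
/// Helper struct to maintain a set of shards sorted by number of chunk producers.
/// Deriving `Ord` yields lexicographic ordering (producer count first, then
/// shard index), so in a `BTreeSet` the first item is the least-loaded shard
/// and the last item is the most-loaded one.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct ShardSetItem {
    chunk_producers_num: usize,
    shard_index: usize,
}
```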
/// Caller must guarantee that `min_validators_per_shard` is achievable and
/// `prev_chunk_producers_assignment` corresponds to the same number of shards.
It seems like this wouldn't work during resharding. Can you add a todo like so: `TODO(resharding) - implement shard assignment`?
fn get_initial_chunk_producer_assignment(
    chunk_producers: &[ValidatorStake],
    num_shards: NumShards,
    prev_chunk_producers_assignment: Option<Vec<Vec<ValidatorStake>>>,
) -> Vec<Vec<usize>> {
This method basically just copies the assignment from the previous epoch, right?
Yeah, leaving only the chunk producers that remained online in the current epoch.
        let minimal_shard = shard_set.pop_first().unwrap().shard_index;
        let maximal_shard = shard_set.pop_last().unwrap().shard_index;
        let validator_pos = rng.gen_range(0..chunk_producer_assignment[maximal_shard].len());
Perhaps here we could pick a validator to minimize the difference in stakes between the min_shard and max_shard. The security and balancing implications are not immediately clear to me though. No need to do it in this PR, I'm just sharing an idea to consider.
This may be a random walk. If that is the case, it will eventually get unbalanced. That being said, I think this PR is a great step in the right direction. Let's evaluate it and follow up based on the data.
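For context, here is a self-contained toy version of the balancing step being discussed; all names are illustrative, and it deterministically moves the last producer instead of picking a random position as the quoted code does:

```rust
use std::collections::BTreeSet;

/// Move producers from the most-loaded shard to the least-loaded shard until
/// the per-shard counts differ by at most one. Assumes at least two shards.
fn balance(mut assignment: Vec<Vec<usize>>) -> Vec<Vec<usize>> {
    loop {
        // The real code maintains the ordered set incrementally; rebuilding it
        // each iteration keeps this toy simple.
        let mut shard_set: BTreeSet<(usize, usize)> = assignment
            .iter()
            .enumerate()
            .map(|(shard_index, producers)| (producers.len(), shard_index))
            .collect();
        let (min_len, minimal_shard) = shard_set.pop_first().unwrap();
        let (max_len, maximal_shard) = shard_set.pop_last().unwrap();
        if max_len - min_len <= 1 {
            return assignment;
        }
        // Take the last producer of the most-loaded shard (the PR picks a
        // random one here) and reassign it to the least-loaded shard.
        let moved = assignment[maximal_shard].pop().unwrap();
        assignment[minimal_shard].push(moved);
    }
}

fn main() {
    // Producer ids 1..=5 start heavily skewed; counts end up as 2/2/1.
    let balanced = balance(vec![vec![1, 2, 3, 4], vec![5], vec![]]);
    println!("{balanced:?}");
}
```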
Evaluation results, comparing old and new algorithms: #11213 (comment)
On #11322 there was a fair suggestion to clarify what is "next" and "next next" epoch, and in fact there was some inaccuracy there, which I want to resolve. With the fixed definitions:

* As epoch T defines epoch T+2, the EpochSummary field for epoch T must be called `next_next_version`, not `next_version`. Self-check: in epoch T validators voted for a new protocol version; however, in epoch T+1 the upgrade doesn't happen, because validators need to already sync some state, which corresponds to the old protocol version; the protocol version of T+2 is defined by `epoch_summary(T).next_next_version`.
* `proposals_to_epoch_info` arguments are aligned with what is "prev" and "next": `prev_prev_protocol_version` is kept for legacy reasons, `prev_epoch_info` is indeed the info of the epoch before, and `protocol_version` is the one we use to generate the current info.

Moreover, the previous shard layout comparison was incorrect, and it is now fixed as well. (See the sketch of the versioning timeline below.)
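A small sketch of the timeline described above (nearcore's `ProtocolVersion` is an alias for `u32`; the struct here is a simplification, not the full `EpochSummary`):

```rust
// epoch T:   validators vote for a new protocol version
// epoch T+1: still runs the old version, since nodes must first sync state
//            that was produced under it
// epoch T+2: runs epoch_summary(T).next_next_version
type ProtocolVersion = u32;

struct EpochSummary {
    // Epoch T's summary defines the protocol version of epoch T+2, hence
    // `next_next_version` rather than `next_version`.
    next_next_version: ProtocolVersion,
}
```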
Using stable (sticky) chunk producer assignments since the stateless validation protocol version, applying the algorithm from near#11213 (comment).

New algorithm properties:

* tries to balance the numbers of chunk producers across shards, **ignoring the stakes**. Comparison of stake diffs for the old and new algorithms: near#11213 (comment)
* minimises the number of state syncs to be made, setting the limit caused by reassignments to 5 in the epoch config. Figuring out the exact number is a TODO as well, but for mainnet it shouldn't matter, because we have way more validator proposals than planned chunk producers (100) at every epoch.

The old assignment algorithm is moved behind `old_validator_selection`; the part for assigning validators with repeats is still used though. The main function is `assign_chunk_producers_to_shards`, which is comprehensively tested. +737 lines are scary, but the algorithm itself is +300 lines with comments, and tests are another +250. `get_chunk_producers_assignment` is separated into a function because it became too big.