
feat: stabilize chunk producer assignments #11322

Merged: 4 commits, May 22, 2024

Conversation

@Longarithm (Member) commented May 15, 2024

Use stable (sticky) chunk producer assignments starting from the stateless validation protocol version, applying the algorithm from #11213 (comment).

New algorithm properties:

  • tries to balance the number of chunk producers across shards, ignoring the stakes. Comparison of stake diffs for the old and new algorithms: Stabilize chunk producer assignments across every epoch #11213 (comment)
  • minimises the number of state syncs caused by reassignments by limiting them to 5 per epoch in the epoch config. Figuring out the exact number is still a TODO, but for mainnet it shouldn't matter because we have far more validator proposals than planned chunk producers (100) at every epoch. (A simplified sketch of this sticky balancing with a reassignment cap is shown below the description.)

The old assignment algorithm is moved behind `old_validator_selection`. The part that assigns validators with repeats is still used, though.

The main function is `assign_chunk_producers_to_shards`, which is comprehensively tested. The +737 lines look scary, but the algorithm itself is about +300 lines including comments, and tests are another +250. `get_chunk_producers_assignment` is extracted into a separate function because it became too big.
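For illustration, here is a minimal, self-contained sketch of the sticky balancing idea with a reassignment cap. It uses hypothetical names and simplified types (producer indices instead of `ValidatorStake`), so it is not the actual `assign_chunk_producers_to_shards` implementation; among other simplifications, the real code picks a random producer to move rather than the last one.

```rust
use std::collections::BTreeSet;

/// Sketch: sticky, count-balanced shard assignment with a cap on reassignments.
/// `prev` holds, per shard, producer indices kept from the previous epoch;
/// `new_producers` are indices that are not assigned to any shard yet.
fn assign_sticky(
    prev: Vec<Vec<usize>>,
    new_producers: Vec<usize>,
    max_reassignments: usize,
) -> Vec<Vec<usize>> {
    let mut assignment = prev;
    // Shards ordered by (current producer count, shard index), smallest first.
    let mut by_size: BTreeSet<(usize, usize)> = assignment
        .iter()
        .enumerate()
        .map(|(shard, producers)| (producers.len(), shard))
        .collect();

    // New producers go to the currently least-loaded shard.
    for producer in new_producers {
        let (size, shard) = by_size.pop_first().expect("at least one shard");
        assignment[shard].push(producer);
        by_size.insert((size + 1, shard));
    }

    // Balance producer counts by moving producers from the most-loaded to the
    // least-loaded shard, spending at most `max_reassignments` moves.
    for _ in 0..max_reassignments {
        let (min_size, min_shard) = by_size.pop_first().expect("at least one shard");
        let Some((max_size, max_shard)) = by_size.pop_last() else {
            by_size.insert((min_size, min_shard));
            break; // only one shard, nothing to balance
        };
        if max_size <= min_size + 1 {
            // Counts differ by at most one: already balanced.
            by_size.insert((min_size, min_shard));
            by_size.insert((max_size, max_shard));
            break;
        }
        // The real algorithm picks a random producer from the largest shard;
        // popping the last one keeps the sketch simple.
        let moved = assignment[max_shard].pop().expect("non-empty shard");
        assignment[min_shard].push(moved);
        by_size.insert((min_size + 1, min_shard));
        by_size.insert((max_size - 1, max_shard));
    }
    assignment
}
```

With a cap like 5, any imbalance that cannot be fixed within the budget simply carries over to a later epoch, which is the point of bounding the number of state syncs caused by reassignments.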


codecov bot commented May 16, 2024

Codecov Report

Attention: Patch coverage is 97.68271%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 71.27%. Comparing base (899cb30) to head (c364b1d).
Report is 8 commits behind head on master.

Files                                         Patch %   Lines
chain/epoch-manager/src/shard_assignment.rs   97.41%    11 Missing ⚠️
tools/fork-network/src/cli.rs                 0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11322      +/-   ##
==========================================
+ Coverage   71.09%   71.27%   +0.17%     
==========================================
  Files         783      784       +1     
  Lines      156826   157646     +820     
  Branches   156826   157646     +820     
==========================================
+ Hits       111492   112355     +863     
+ Misses      40515    40450      -65     
- Partials     4819     4841      +22     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.38% <0.00%> (-0.01%) ⬇️
integration-tests 37.23% <48.12%> (+0.08%) ⬆️
linux 68.76% <82.35%> (-0.05%) ⬇️
linux-nightly 70.71% <97.50%> (+0.18%) ⬆️
macos 52.36% <82.17%> (+0.13%) ⬆️
pytests 1.60% <0.00%> (-0.01%) ⬇️
sanity-checks 1.39% <0.00%> (-0.01%) ⬇️
unittests 65.67% <97.14%> (+0.16%) ⬆️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

@Longarithm Longarithm linked an issue May 16, 2024 that may be closed by this pull request
@Longarithm Longarithm marked this pull request as ready for review May 16, 2024 23:10
@Longarithm Longarithm requested a review from a team as a code owner May 16, 2024 23:10
@Longarithm Longarithm changed the title draft: stabilize chunk producer assignments feat: stabilize chunk producer assignments May 17, 2024
@tayfunelmas (Contributor) left a comment

LGTM, thanks for the refactorings for readability as well, but please wait for others' reviews.

@@ -734,6 +735,9 @@ impl EpochManager {
)
};
let next_next_epoch_config = self.config.for_protocol_version(next_version);
let next_shard_layout =
Contributor:

nit: Is next_shard_layout the layout for the epoch just about to start, and next_next_ the one after? How about renaming next_shard_layout -> new_shard_layout (to refer to the epoch just starting) and next_next_shard_layout -> next_shard_layout (to refer to the epoch after the new one)?

Contributor:

Personally I prefer next and next_next, it somehow makes sense to me :)
Also I think it is a bit of a convention now.

Member Author:

Yeah, next_next is a convention... Actually, I think there is some issue here, but I want to fix it in a follow-up PR, as we are not going to reshard soon.

@@ -1223,6 +1262,7 @@ mod tests {
}
}

#[allow(unused)]
Contributor:

why keep this one if unused?

///
/// This function guarantees that, in order of priority:
/// * every shard has at least `min_validators_per_shard` assigned to it;
/// * chunk producer repeats are completely avoided is possible;
Contributor:

if possible?

@wacban (Contributor) left a comment

LGTM, just a few nits.

Comment on lines +738 to +740
let next_shard_layout =
self.config.for_protocol_version(epoch_protocol_version).shard_layout;
let has_same_shard_layout = next_shard_layout == next_next_epoch_config.shard_layout;
Contributor:
nit: There is a method called will_shard_layout_change that would be a bit cleaner here. I'm not sure if you'll have access to it from here so feel free to ignore.

Member Author:

It is accessible, but I can't use it for a different reason: it accepts the parent hash of the block at the end of the next epoch, which doesn't exist yet.

Comment on lines 123 to 124
for validators_for_role in [&chunk_producers, &block_producers, &chunk_validators].iter() {
for validator in validators_for_role.iter() {
Contributor:
nit: flatten may be a bit shorter

Member Author:
Ended up using chain because flatten was complaining that something is not convertible to an iterator.
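For context, a small self-contained sketch (hypothetical data) of why `flatten` complains with this shape of code: iterating an array of `&Vec<_>` with `.iter()` yields `&&Vec<_>`, which does not implement `IntoIterator`, while chaining the per-role iterators works.

```rust
fn main() {
    let chunk_producers = vec!["cp1".to_string()];
    let block_producers = vec!["bp1".to_string()];
    let chunk_validators = vec!["cv1".to_string()];

    // Roughly the shape that fails: the array iterator yields `&&Vec<String>`,
    // which `flatten` cannot iterate over.
    // for v in [&chunk_producers, &block_producers, &chunk_validators].iter().flatten() {}

    // Chaining the per-role iterators sidesteps the problem.
    for validator in chunk_producers
        .iter()
        .chain(block_producers.iter())
        .chain(chunk_validators.iter())
    {
        println!("{validator}");
    }
}
```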

Comment on lines 141 to 147
prev_epoch_info
.chunk_producers_settlement()
.iter()
.map(|vs| {
vs.into_iter().map(|v| prev_epoch_info.get_validator(*v)).collect::<Vec<_>>()
})
.collect::<Vec<_>>(),
Contributor:
nit: I think this would be more readable as a for loop.
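A self-contained sketch of the for-loop form the reviewer suggests, using simplified stand-ins (a plain lookup slice and a stripped-down `ValidatorStake` instead of the real `EpochInfo` API):

```rust
#[derive(Clone, Debug)]
struct ValidatorStake {
    account_id: String,
}

/// Expand per-shard validator ids into per-shard validator stakes, written as
/// nested for loops instead of nested `map`/`collect` chains.
fn settlement_to_stakes(
    chunk_producers_settlement: &[Vec<usize>],
    validators: &[ValidatorStake], // stand-in for EpochInfo::get_validator
) -> Vec<Vec<ValidatorStake>> {
    let mut per_shard = Vec::with_capacity(chunk_producers_settlement.len());
    for shard_validator_ids in chunk_producers_settlement {
        let mut shard_validators = Vec::with_capacity(shard_validator_ids.len());
        for &validator_id in shard_validator_ids {
            shard_validators.push(validators[validator_id].clone());
        }
        per_shard.push(shard_validators);
    }
    per_shard
}
```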

num_shards: NumShards,
prev_chunk_producers_assignment: Option<Vec<Vec<ValidatorStake>>>,
) -> Vec<Vec<usize>> {
if let Some(prev_assignment) = prev_chunk_producers_assignment {
Contributor:

nit:

let Some(prev_assignment) = prev_chunk_producers_assignment else {
    return vec![vec![]; num_shards as usize];
};

Comment on lines 126 to 134
prev_assignment
.into_iter()
.map(|vs| {
vs.into_iter()
.map(|v| chunk_producer_indices.get(v.account_id()).copied())
.flatten()
.collect::<Vec<_>>()
})
.collect::<Vec<_>>()
Contributor:

nit: This would be more readable as a for loop and with full variable names (instead of vs and v).

Comment on lines 141 to 145
/// Helper struct to maintain set of shards sorted by number of chunk producers.
struct ShardSetItem {
shard_validators_num: usize,
shard_index: usize,
}
Contributor:

nit: The comment says "number of chunk producers" but the field is called shard_validators_num. I'm not sure which one is correct, but can you make it consistent?
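For illustration, a self-contained sketch of how such a set item can keep shards ordered by chunk producer count. The field name here is hypothetical (renamed for consistency with the doc comment; the actual field in the diff is shard_validators_num), and the field ordering matters because derived `Ord` compares fields in declaration order.

```rust
use std::collections::BTreeSet;

/// Ordered first by how many chunk producers a shard currently has, then by
/// shard index as a tie-breaker, so `pop_first` yields the least-loaded shard
/// and `pop_last` the most-loaded one.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct ShardSetItem {
    chunk_producers_num: usize,
    shard_index: usize,
}

fn main() {
    let mut shard_set: BTreeSet<ShardSetItem> = BTreeSet::new();
    for (shard_index, chunk_producers_num) in [(0, 3), (1, 1), (2, 2)] {
        shard_set.insert(ShardSetItem { chunk_producers_num, shard_index });
    }
    let least_loaded = shard_set.pop_first().unwrap();
    let most_loaded = shard_set.pop_last().unwrap();
    assert_eq!((least_loaded.shard_index, most_loaded.shard_index), (1, 0));
}
```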

Comment on lines +150 to +151
/// Caller must guarantee that `min_validators_per_shard` is achievable and
/// `prev_chunk_producers_assignment` corresponds to the same number of shards.
Contributor:

It seems like this wouldn't work during resharding. Can you add a TODO like so:
TODO(resharding) - implement shard assignment

Comment on lines +114 to +118
fn get_initial_chunk_producer_assignment(
chunk_producers: &[ValidatorStake],
num_shards: NumShards,
prev_chunk_producers_assignment: Option<Vec<Vec<ValidatorStake>>>,
) -> Vec<Vec<usize>> {
Contributor:

This method basically just copies the assignment from the previous epoch, right?

Member Author:

Yeah, keeping only the chunk producers that remained online in the current epoch.
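A self-contained sketch of that behaviour (hypothetical function name and simplified types: account-id strings instead of `ValidatorStake`): copy the previous per-shard assignment, drop accounts that are no longer chunk producers, and translate the rest into indices into the current chunk producer list.

```rust
use std::collections::HashMap;

/// Carry the previous epoch's per-shard assignment over to the new epoch,
/// keeping only accounts that are still chunk producers and mapping them to
/// their index in the current `chunk_producers` list.
fn initial_assignment(
    chunk_producers: &[String],
    num_shards: usize,
    prev_assignment: Option<Vec<Vec<String>>>,
) -> Vec<Vec<usize>> {
    let Some(prev_assignment) = prev_assignment else {
        // No previous assignment (e.g. first epoch with the new algorithm):
        // start with empty shards.
        return vec![vec![]; num_shards];
    };
    let index_of: HashMap<&str, usize> = chunk_producers
        .iter()
        .enumerate()
        .map(|(index, account_id)| (account_id.as_str(), index))
        .collect();
    prev_assignment
        .into_iter()
        .map(|shard_accounts| {
            shard_accounts
                .into_iter()
                // Drop producers that are no longer in the current set.
                .filter_map(|account_id| index_of.get(account_id.as_str()).copied())
                .collect()
        })
        .collect()
}
```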


let minimal_shard = shard_set.pop_first().unwrap().shard_index;
let maximal_shard = shard_set.pop_last().unwrap().shard_index;
let validator_pos = rng.gen_range(0..chunk_producer_assignment[maximal_shard].len());
Contributor:

Perhaps here we could pick a validator that minimizes the difference in stakes between the min shard and the max shard. The security and balancing implications are not immediately clear to me, though. No need to do it in this PR; I'm just sharing an idea to consider.
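For what it's worth, a small self-contained sketch of that idea (not what this PR implements; the PR picks a random position): moving stake `s` from the most-staked shard changes the stake gap `d` to `|d - 2s|`, so the best candidate is the producer whose stake is closest to `d / 2`.

```rust
/// Sketch of the reviewer's suggestion: pick the producer in the most-staked
/// shard whose move shrinks the stake gap to the least-staked shard the most.
fn best_producer_to_move(max_shard_stakes: &[u128], min_shard_total: u128) -> Option<usize> {
    let max_shard_total: u128 = max_shard_stakes.iter().sum();
    let gap = max_shard_total.saturating_sub(min_shard_total);
    max_shard_stakes
        .iter()
        .enumerate()
        // Moving stake `s` turns the gap `d` into |d - 2s|; minimize that.
        .min_by_key(|(_, &stake)| stake.saturating_mul(2).abs_diff(gap))
        .map(|(position, _)| position)
}
```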

@wacban (Contributor) commented May 20, 2024

> tries to balance numbers of chunk producers for shards, ignoring the stakes. Looking at historical data to see how would it impact stakes is a TODO.

This may be a random walk; if that is the case, it will eventually become unbalanced. That being said, I think this PR is a great step in the right direction. Let's evaluate it and follow up based on the data.

@Longarithm (Member Author):

Evaluation results, comparing old and new algorithms: #11213 (comment)

@Longarithm Longarithm enabled auto-merge May 21, 2024 19:49
@Longarithm Longarithm added this pull request to the merge queue May 22, 2024
Merged via the queue into near:master with commit 4091e3a May 22, 2024
28 of 29 checks passed
@Longarithm Longarithm deleted the epoch-4 branch May 22, 2024 04:58
github-merge-queue bot pushed a commit that referenced this pull request May 22, 2024
On #11322 there was a fair suggestion to clarify what the "next" and
"next next" epochs are, and in fact there was some inaccuracy there, which I
want to resolve.
With the fixed definitions:

* As epoch T defines epoch T+2, the EpochSummary field for epoch T must be
called `next_next_version`, not `next_version`. Self-check: in epoch T
validators vote for the new protocol version; however, the upgrade doesn't
happen in epoch T+1, because validators still need to sync some state, which
corresponds to the old protocol version; the protocol version of T+2 is
defined by epoch_summary(T).next_next_version.
* `proposals_to_epoch_info` arguments are aligned with what is "prev"
and "next": `prev_prev_protocol_version` is kept for legacy reasons,
`prev_epoch_info` is indeed the info of the epoch before, and
`protocol_version` is the one we use to generate the current info.

Moreover, the previous shard layout comparison was incorrect and is now
fixed as well.
marcelo-gonzalez pushed a commit to marcelo-gonzalez/nearcore that referenced this pull request May 23, 2024
marcelo-gonzalez pushed a commit to marcelo-gonzalez/nearcore that referenced this pull request May 23, 2024
Development

Successfully merging this pull request may close these issues.

Stabilize chunk producer assignments across every epoch
3 participants