fix: a slow chunk should not lead to chain getting stuck #11344
Conversation
```rust
Ok(validation_result) => {
    cache.put(prev_chunk_hash, validation_result);
```
Note: we could potentially save the validation result to disk and merge the code paths above into one. I decided not to do that for a few reasons:
- Saving the result to disk would increase the disk footprint for chunk validators, and we don't really need the results most of the time.
- Overloading "chunk extra" could make the code difficult to understand and debug. Keeping the code paths for chunk producer and chunk validator separate makes the code less elegant but easier to reason about.
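For illustration, a minimal sketch of what this bounded in-memory path could look like, assuming the `lru` crate; the type and method names here are placeholders, not the actual nearcore API:

```rust
use std::num::NonZeroUsize;
use lru::LruCache; // assumption: the `lru` crate is available

// Placeholder types standing in for nearcore's chunk hash and the result
// of applying the previous chunk's state transition.
type ChunkHash = [u8; 32];
#[derive(Clone)]
struct ValidationResult;

struct ChunkValidationCache {
    // Keyed by the previous *chunk* hash (not the previous block hash),
    // so every candidate chunk building on the same chunk shares one entry.
    // Bounded and in-memory only: no extra disk footprint for validators.
    cache: LruCache<ChunkHash, ValidationResult>,
}

impl ChunkValidationCache {
    fn new(capacity: NonZeroUsize) -> Self {
        Self { cache: LruCache::new(capacity) }
    }

    fn get_or_compute(
        &mut self,
        prev_chunk_hash: ChunkHash,
        apply_state_transition: impl FnOnce() -> ValidationResult,
    ) -> ValidationResult {
        if let Some(result) = self.cache.get(&prev_chunk_hash) {
            // Fast path: the expensive state transition was already applied
            // for another chunk with the same previous chunk.
            return result.clone();
        }
        let result = apply_state_transition(); // expensive slow path
        self.cache.put(prev_chunk_hash, result.clone());
        result
    }
}
```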
Codecov Report

```diff
@@            Coverage Diff             @@
##           master   #11344      +/-   ##
==========================================
+ Coverage   71.08%   71.21%   +0.12%
==========================================
  Files         783      784       +1
  Lines      156813   157250     +437
  Branches   156813   157250     +437
==========================================
+ Hits       111478   111982     +504
+ Misses      40508    40424      -84
- Partials     4827     4844      +17
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
That's a really cool idea! I discarded it initially because I thought it was impossible, since new outgoing receipts need to be taken into account. But you're totally right: the validation can be split into two parts, the first just checking the state transition of the previous chunk, the second validating the new chunk. The intermediate results of the first part can be cached, and it's only the second part that changes with new blocks and incoming receipts. It would be cool if we could structure the code to mirror this logical structure of the validation. Sorry if it's already done like that; no need to do it in this PR. Are there any scenarios where this solution would not work, e.g. across epoch boundaries?
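To make the proposed split concrete, a rough sketch of the two phases (all names are illustrative placeholders, not the actual nearcore code):

```rust
// Illustrative stand-in types; not the actual nearcore definitions.
struct Chunk;
struct Receipt;
struct StateTransitionResult;

// Phase 1: apply and check the previous chunk's state transition.
// This depends only on the previous chunk itself, so the (expensive)
// result can be cached and reused by every candidate chunk that builds
// on the same previous chunk, across blocks.
fn validate_prev_chunk_transition(_prev_chunk: &Chunk) -> StateTransitionResult {
    // ... expensive state transition application ...
    StateTransitionResult
}

// Phase 2: validate the new chunk against the phase-1 result.
// This depends on the new block and its incoming receipts, so it must
// run per chunk, but it is cheap compared to phase 1.
fn validate_new_chunk(
    _new_chunk: &Chunk,
    _prev_result: &StateTransitionResult,
    _incoming_receipts: &[Receipt],
) -> bool {
    // ... check the new chunk against _prev_result and the receipts ...
    true
}
```

Only phase 1's output would need caching; phase 2 stays on the per-block path.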
Do you think this fix is sufficient for the first SV release? Since this adds extra complexity, I would rather avoid it if we are going to implement another solution as well. Do you have any estimates of how quickly the chain can recover using this method? In the nayduck test I'm guessing it recovers quickly, as there are only 4 nodes. What would happen in a large network?
Not entirely sure what you mean, but it is part of the validation logic.
Epoch boundary is covered by a test.
Looks great now!
LGTM 👍
This is a suboptimal fix to #11339 that is easy to implement. The core of the problem is as follows: if a chunk is slow to apply for chunk validators, their chunk endorsements will not be included in the following block. However, the next block can still be produced, at which point it becomes impossible to include that chunk, because when a block producer produces a block, it only includes chunks whose prev block hash matches its current tip. Then, when a new block is produced, a chunk producer, after processing that block, will produce a new chunk, which will again take chunk validators a long time to apply, so they again cannot send endorsements before the next block is produced.
This fix works by caching the state validation result so that if there are multiple chunks with the same previous chunk (not previous block), then validators don't need to apply the same (expensive) state transition again. Therefore they can send an endorsement very quickly when the next chunk is produced.
`slow_chunk.py` passes with this change.

However, this fix is not optimal, because the set of validators assigned to a shard changes with each block, and it may take a while before there is an assignment in which 2/3 of the validators have already applied the expensive state transition and can quickly send an endorsement. It is also not clear what the optimal fix is. One idea is to remove the invariant that blocks can only include chunks with the same prev block hash, so that old chunks can be included once enough endorsements are received. That, however, is a nontrivial change and could potentially break a lot of things.