
fix: a slow chunk should not lead to chain getting stuck #11344

Merged 8 commits on May 22, 2024

Conversation

bowenwang1996 (Collaborator)

This is a suboptimal fix to #11339 that is easy to implement. The core of the problem is as follows: if a chunk is slow for chunk validators to apply, their chunk endorsements will not be included in the following block. However, the next block can still be produced, at which point it becomes impossible to include that chunk, because when a block producer produces a block, it only includes chunks whose prev block hash matches its tip. Then, when a new block is produced, a chunk producer, after processing that block, will produce a new chunk, which will again take chunk validators a long time to apply, so they again fail to send endorsements before the next block is produced.

This fix works by caching the state validation result so that if there are multiple chunks with the same previous chunk (not merely the same previous block), validators don't need to apply the same (expensive) state transition again. They can therefore send an endorsement very quickly when the next chunk is produced. slow_chunk.py passes with this change.
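The caching described above can be sketched as a small bounded map keyed by the previous chunk's hash. This is a minimal illustration only, not nearcore's actual implementation; the names (ChunkHash, ValidationResult, ValidationCache) are hypothetical stand-ins, and eviction here is simple insertion-order rather than a true LRU:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical stand-ins for nearcore's real types.
type ChunkHash = [u8; 32];

#[derive(Clone, PartialEq, Debug)]
struct ValidationResult {
    // e.g. the resulting state root after applying the previous chunk
    post_state_root: [u8; 32],
}

/// A small bounded cache: validation results keyed by the *previous chunk's*
/// hash, so forked chunks sharing the same previous chunk reuse one
/// (expensive) application of the state transition.
struct ValidationCache {
    capacity: usize,
    map: HashMap<ChunkHash, ValidationResult>,
    order: VecDeque<ChunkHash>, // insertion order, used for simple eviction
}

impl ValidationCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    fn put(&mut self, key: ChunkHash, value: ValidationResult) {
        if !self.map.contains_key(&key) {
            // Evict the oldest entry once the cache is full.
            if self.order.len() == self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.map.remove(&oldest);
                }
            }
            self.order.push_back(key);
        }
        self.map.insert(key, value);
    }

    fn get(&self, key: &ChunkHash) -> Option<&ValidationResult> {
        self.map.get(key)
    }
}
```

A validator would consult `get` with the previous chunk's hash before applying the state transition, and `put` the result after a cache miss.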

However, this fix is not optimal because for each block, the set of validators assigned to a shard changes and it may take a while before there is an assignment in which 2/3 of the validators have applied the expensive state transition and can quickly send an endorsement. It is also not clear what the optimal fix is. One idea is to remove the invariant that blocks can only include chunks with the same prev block hash, so that old chunks can be included if enough endorsements are received at some point. That, however, is a nontrivial change and can potentially break a lot of things.

Comment on lines 226 to 227
Ok(validation_result) => {
    cache.put(prev_chunk_hash, validation_result);
bowenwang1996 (Collaborator, Author)
Note: we could potentially save validation result to disk and merge the code paths above into one. I decided not to do that for a few reasons:

  • Saving the result to disk would increase the disk footprint for chunk validators, and we don't really need the persisted results most of the time.
  • Overloading "chunk extra" could make the code difficult to understand and debug. Keeping the code paths for chunk producers and chunk validators separate makes the code less elegant but easier to reason about.


codecov bot commented May 19, 2024

Codecov Report

Attention: Patch coverage is 86.88525%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 71.21%. Comparing base (e273c41) to head (e791d00).
Report is 10 commits behind head on master.

Files Patch % Lines
...nt/src/stateless_validation/chunk_validator/mod.rs 88.33% 0 Missing and 7 partials ⚠️
...client/src/stateless_validation/shadow_validate.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11344      +/-   ##
==========================================
+ Coverage   71.08%   71.21%   +0.12%     
==========================================
  Files         783      784       +1     
  Lines      156813   157250     +437     
  Branches   156813   157250     +437     
==========================================
+ Hits       111478   111982     +504     
+ Misses      40508    40424      -84     
- Partials     4827     4844      +17     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.38% <0.00%> (-0.01%) ⬇️
integration-tests 37.21% <86.88%> (+0.13%) ⬆️
linux 68.72% <1.63%> (-0.09%) ⬇️
linux-nightly 70.64% <86.88%> (+0.12%) ⬆️
macos 52.24% <1.63%> (+0.02%) ⬆️
pytests 1.60% <0.00%> (-0.01%) ⬇️
sanity-checks 1.40% <0.00%> (-0.01%) ⬇️
unittests 65.58% <1.63%> (+0.07%) ⬆️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


wacban (Contributor) commented May 20, 2024

That's a really cool idea! I discarded it initially because I thought it was impossible, since new outgoing receipts need to be taken into account. But you're totally right: the validation can be split into two parts, the first checking the state transition of the previous chunk and the second validating the new chunk. The intermediate results of the first part can be cached; only the second part changes with new blocks and incoming receipts.
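The two-part split described here can be sketched as follows. This is an illustrative sketch only, with hypothetical names (apply_prev_chunk, validate_new_chunk) and a toy state root standing in for nearcore's real state transition logic:

```rust
#[derive(Clone, Debug, PartialEq)]
struct StateTransitionResult {
    post_state_root: u64, // stand-in for a real state root hash
}

// Part 1: expensive, and depends only on the previous chunk, so its
// result can be cached and reused across forks sharing that chunk.
fn apply_prev_chunk(prev_chunk_id: u64) -> StateTransitionResult {
    // (expensive state transition elided; a toy deterministic
    // function stands in for it here)
    StateTransitionResult { post_state_root: prev_chunk_id.wrapping_mul(31) }
}

// Part 2: cheap, and depends on the new chunk (and its incoming
// receipts), so it must be recomputed for every candidate chunk.
fn validate_new_chunk(
    cached: &StateTransitionResult,
    claimed_post_state_root: u64,
) -> bool {
    cached.post_state_root == claimed_post_state_root
}
```

With this split, only `apply_prev_chunk` is slow, and its output is exactly what the cache from this PR stores; `validate_new_chunk` runs fresh against each new chunk.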

It would be cool if we could structure the code to mirror this logical structure of validation. Sorry if it's already done like that; no need to do it in this PR.

Are there any scenarios where this solution would not work? e.g.

  • epoch boundary
  • cache too small (unfavourable chunk validator assignment, attacks to empty it)

Do you think this fix is sufficient for the first SV release? Since this adds extra complexity I would rather avoid it if we are going to implement another solution as well.

Do you have any estimates on how quickly the chain can recover by using this method? In the nayduck test I'm guessing it recovers quickly as there are only 4 nodes. What would happen in a large network?

bowenwang1996 (Collaborator, Author)

> It would be cool if we can structure the code to mirror this logical structure of validation. Sorry if it's already done like that and no need to do it in this PR.

Not entirely sure what you mean, but it is part of the validation logic.

> Are there any scenarios where this solution would not work? e.g.
>
>   • epoch boundary
>   • cache too small (unfavourable chunk validator assignment, attacks to empty it)

Epoch boundary is covered by the test test_client_with_multi_test_loop. I think the only way to attack the cache is to control both block production and chunk production at the same time for an extended period and create a lot of forks. The probability of that happening is extremely low, but still, I can make the cache larger, given that it is sufficient to cache the receipt root since it won't work across epochs anyway.

robin-near (Contributor) left a comment
Looks great now!

@bowenwang1996 bowenwang1996 added this pull request to the merge queue May 22, 2024
Merged via the queue into near:master with commit 87da0e1 May 22, 2024
27 of 29 checks passed
@bowenwang1996 bowenwang1996 deleted the fix-slow-chunk branch May 22, 2024 05:18
pugachAG (Contributor) left a comment

LGTM 👍

marcelo-gonzalez pushed a commit to marcelo-gonzalez/nearcore that referenced this pull request May 23, 2024
5 participants