
fix: a slow chunk should not lead to chain getting stuck #11344

Merged 8 commits on May 22, 2024

Conversation

bowenwang1996 (Collaborator)

This is a suboptimal fix to #11339 that is easy to implement. The core of the problem is as follows: if a chunk is slow for chunk validators to apply, their chunk endorsements will not be included in the following block. However, the next block can still be produced, at which point it becomes impossible to include that chunk, because when a block producer produces a block, it only includes chunks whose prev block hash matches its tip. Then, when a new block is produced, a chunk producer, after processing that block, will produce a new chunk, which will again take chunk validators a long time to apply, so they again fail to send endorsements before the next block is produced.

This fix works by caching the state validation result so that if there are multiple chunks with the same previous chunk (not merely the same previous block), validators don't need to apply the same (expensive) state transition again. They can therefore send an endorsement very quickly when the next chunk is produced. slow_chunk.py passes with this change.
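The caching described above can be sketched as a small bounded map keyed by the previous chunk's hash. This is a minimal illustration only, not nearcore's actual implementation; the names (ChunkHash, ValidationResult, ValidationCache) are hypothetical stand-ins, and eviction here is simple insertion-order rather than a true LRU:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical stand-ins for nearcore's real types.
type ChunkHash = [u8; 32];

#[derive(Clone, PartialEq, Debug)]
struct ValidationResult {
    // e.g. the resulting state root after applying the previous chunk
    post_state_root: [u8; 32],
}

/// A small bounded cache: validation results keyed by the *previous chunk's*
/// hash, so forked chunks sharing the same previous chunk reuse one
/// (expensive) application of the state transition.
struct ValidationCache {
    capacity: usize,
    map: HashMap<ChunkHash, ValidationResult>,
    order: VecDeque<ChunkHash>, // insertion order, used for simple eviction
}

impl ValidationCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    fn put(&mut self, key: ChunkHash, value: ValidationResult) {
        if !self.map.contains_key(&key) {
            // Evict the oldest entry once the cache is full.
            if self.order.len() == self.capacity {
                if let Some(oldest) = self.order.pop_front() {
                    self.map.remove(&oldest);
                }
            }
            self.order.push_back(key);
        }
        self.map.insert(key, value);
    }

    fn get(&self, key: &ChunkHash) -> Option<&ValidationResult> {
        self.map.get(key)
    }
}
```

A validator would consult `get` with the previous chunk's hash before applying the state transition, and `put` the result after a cache miss.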

However, this fix is not optimal because for each block, the set of validators assigned to a shard changes and it may take a while before there is an assignment in which 2/3 of the validators have applied the expensive state transition and can quickly send an endorsement. It is also not clear what the optimal fix is. One idea is to remove the invariant that blocks can only include chunks with the same prev block hash, so that old chunks can be included if enough endorsements are received at some point. That, however, is a nontrivial change and can potentially break a lot of things.

Comment on lines 226 to 227
Ok(validation_result) => {
    cache.put(prev_chunk_hash, validation_result);
bowenwang1996 (Collaborator, Author)
Note: we could potentially save validation result to disk and merge the code paths above into one. I decided not to do that for a few reasons:

  • Saving the result to disk would increase the disk footprint for chunk validators, and we don't really need the persisted results most of the time.
  • Overloading "chunk extra" could make the code difficult to understand and debug. Keeping the code paths for chunk producers and chunk validators separate makes the code less elegant but easier to reason about.


codecov bot commented May 19, 2024

Codecov Report

Attention: Patch coverage is 86.88525%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 71.21%. Comparing base (e273c41) to head (e791d00).
Report is 10 commits behind head on master.

Files Patch % Lines
...nt/src/stateless_validation/chunk_validator/mod.rs 88.33% 0 Missing and 7 partials ⚠️
...client/src/stateless_validation/shadow_validate.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11344      +/-   ##
==========================================
+ Coverage   71.08%   71.21%   +0.12%     
==========================================
  Files         783      784       +1     
  Lines      156813   157250     +437     
  Branches   156813   157250     +437     
==========================================
+ Hits       111478   111982     +504     
+ Misses      40508    40424      -84     
- Partials     4827     4844      +17     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.38% <0.00%> (-0.01%) ⬇️
integration-tests 37.21% <86.88%> (+0.13%) ⬆️
linux 68.72% <1.63%> (-0.09%) ⬇️
linux-nightly 70.64% <86.88%> (+0.12%) ⬆️
macos 52.24% <1.63%> (+0.02%) ⬆️
pytests 1.60% <0.00%> (-0.01%) ⬇️
sanity-checks 1.40% <0.00%> (-0.01%) ⬇️
unittests 65.58% <1.63%> (+0.07%) ⬆️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


wacban (Contributor) commented May 20, 2024

That's a really cool idea! I discarded it initially because I thought it was impossible, since new outgoing receipts need to be taken into account. But you're totally right: the validation can be split into two parts, the first checking the state transition of the previous chunk and the second validating the new chunk. The intermediate results of the first part can be cached; only the second part changes with new blocks and incoming receipts.
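The two-part split described here can be sketched as follows. This is an illustrative sketch only, with hypothetical names (apply_prev_chunk, validate_new_chunk) and a toy state root standing in for nearcore's real state transition logic:

```rust
#[derive(Clone, Debug, PartialEq)]
struct StateTransitionResult {
    post_state_root: u64, // stand-in for a real state root hash
}

// Part 1: expensive, and depends only on the previous chunk, so its
// result can be cached and reused across forks sharing that chunk.
fn apply_prev_chunk(prev_chunk_id: u64) -> StateTransitionResult {
    // (expensive state transition elided; a toy deterministic
    // function stands in for it here)
    StateTransitionResult { post_state_root: prev_chunk_id.wrapping_mul(31) }
}

// Part 2: cheap, and depends on the new chunk (and its incoming
// receipts), so it must be recomputed for every candidate chunk.
fn validate_new_chunk(
    cached: &StateTransitionResult,
    claimed_post_state_root: u64,
) -> bool {
    cached.post_state_root == claimed_post_state_root
}
```

With this split, only `apply_prev_chunk` is slow, and its output is exactly what the cache from this PR stores; `validate_new_chunk` runs fresh against each new chunk.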

It would be cool if we could structure the code to mirror this logical structure of validation. Sorry if it's already done like that; no need to do it in this PR.

Are there any scenarios where this solution would not work? e.g.

  • epoch boundary
  • cache too small (unfavourable chunk validator assignment, attacks to empty it)

Do you think this fix is sufficient for the first SV release? Since this adds extra complexity I would rather avoid it if we are going to implement another solution as well.

Do you have any estimates on how quickly the chain can recover by using this method? In the nayduck test I'm guessing it recovers quickly as there are only 4 nodes. What would happen in a large network?

bowenwang1996 (Collaborator, Author)

> It would be cool if we can structure the code to mirror this logical structure of validation. Sorry if it's already done like that and no need to do it in this PR.

Not entirely sure what you mean, but it is part of the validation logic.

> Are there any scenarios where this solution would not work? e.g.
>
>   • epoch boundary
>   • cache too small (unfavourable chunk validator assignment, attacks to empty it)

Epoch boundary is covered by the test test_client_with_multi_test_loop. I think the only way to attack the cache is to control both block production and chunk production at the same time for an extended period and create a lot of forks. The probability of that happening is extremely low, but still, I can make the cache larger, given that it is sufficient to cache the receipt root since it won't work across epochs anyway.

robin-near (Contributor) left a comment
Looks great now!

@bowenwang1996 bowenwang1996 added this pull request to the merge queue May 22, 2024
Merged via the queue into near:master with commit 87da0e1 May 22, 2024
27 of 29 checks passed
@bowenwang1996 bowenwang1996 deleted the fix-slow-chunk branch May 22, 2024 05:18
pugachAG (Contributor) left a comment

LGTM 👍

marcelo-gonzalez pushed a commit to marcelo-gonzalez/nearcore that referenced this pull request May 23, 2024
5 participants