-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix] Unpause atomic writes on block insert error #2457
base: mainnet-staging
Are you sure you want to change the base?
Conversation
synthesizer/src/vm/mod.rs
Outdated
// Note: This call is guaranteed to succeed (without error), because `DISCARD_BATCH == true`. | ||
self.block_store().unpause_atomic_writes::<true>()?; | ||
// Rollback the Merkle tree. | ||
self.block_store().remove_last_n_from_tree_only(1).map_err(|removal_error| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there was an issue with adding the block to the block store, why would we need to revert the tree or rollback the block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh we might not, I thought there might be partial state that would need rolling back but I'm not yet super familiar with this codepath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a commit to remove this logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Is there a way to force this error to occur and validate that this fix correctly addresses the failure case? In addition, is there a way to determine why the rocksDB write might have failed? Could there be any issues with the |
@niklaslong correct me if I'm mistaken, but I believe the error was not rocksDB-related, but rather "couldn't insert block at height N", i.e. a logic error. |
I believe the only errors in |
|
Yes, the error I managed to trigger on the devnet was the "incorrect height" ☝️
I didn't have any further luck last week on the devnets but we could try to trigger an error on insert manually to check if the issue can be replicated that way. It would at least validate that it's possible to trigger it on that codepath. |
Is this ready for review or is there still changes/testing to be done? |
Should be good to go in theory. Fabiano has encountered the error again on a client, I've sent him a patched branch to try, hopefully we can validate the fix in practice that way 🤞 |
Tested this on my 3-node setup and this works. Node would still corrupt the database or stop syncing for unknown reasons, but there's no |
@HarukaMa thank you for testing the branch! Did you spot any other atomic related panics in your logs? |
hmm, after grepping the logs I did see a few panics about 16 hours ago, but as the stderr is not logged to log file I'm not sure what they are about now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM post-rebase 👌
self.block_store().abort_atomic(); | ||
// Disable the atomic batch override. | ||
// Note: This call is guaranteed to succeed (without error), because `DISCARD_BATCH == true`. | ||
self.block_store().unpause_atomic_writes::<true>()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the correct approach, but can we be sure this doesn't have any deleterious side effects? How are we sure that it doesn't?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This always should have been there; block insertion is the only operation where pausing is used (because block insertion and finalization are technically separate atomic operations, but they need to be treated atomically here), and once the pause is in force, it MUST be undone by the end of that operation or if that operation fails (which is covered by this PR).
Ran e2e.yml integration test on this, passed succesfully. |
@@ -398,7 +398,19 @@ impl<N: Network, C: ConsensusStorage<N>> VM<N, C> { | |||
self.block_store().pause_atomic_writes()?; | |||
|
|||
// First, insert the block. | |||
self.block_store().insert(block)?; | |||
if let Err(insert_error) = self.block_store().insert(block) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to force this error to occur and validate that this fix correctly addresses the failure case?
I would advise strongly on an explicit test case prior to proceeding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in theory... though I don't think it will be trivial (it needs to be triggered once during sync specifically, not during ledger load or other block inserts and this code has no caller context). I think @ljedrz might be able to take a look while I'm away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicit test case added in 3696b0d; it panics (with a pause-related error) without this PR, and passes with it
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This PR contains a potential fix to one of the atomic panics we've seen on the devnet:
I initially thought the error might suggest the occurrence of simultaneous block inserts from different threads (there are no
await
points so it can't be from the same task) but the block lock ruled this out. This leaves us with improper handling of unpausing which this changeset aims to fix.Drafting this PR as I haven't yet been able to confirm this fix is sufficient to resolve the issue entirely.