Enable local snapshot restore #9541
base: main
Conversation
* the working attackers isolation spec, needs more comments
* the TLA+ spec of the attackers isolation
Bumps [gaurav-nelson/github-action-markdown-link-check](https://github.com/gaurav-nelson/github-action-markdown-link-check) from 1.0.8 to 1.0.11. - [Release notes](https://github.com/gaurav-nelson/github-action-markdown-link-check/releases) - [Commits](gaurav-nelson/github-action-markdown-link-check@1.0.8...2a60e0f) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add light attack evidence handling
Reflect the change made in tendermint#5805 The MTU (Maximum Transmission Unit) for Ethernet is 1500 bytes. The IP header and the TCP header take up 20 bytes each at least (unless optional header fields are used) and thus the max for (non-Jumbo frame) Ethernet is 1500 - 20 -20 = 1460 Source: https://stackoverflow.com/a/3074427/820520
Bumps [gaurav-nelson/github-action-markdown-link-check](https://github.com/gaurav-nelson/github-action-markdown-link-check) from 1.0.11 to 1.0.12. - [Release notes](https://github.com/gaurav-nelson/github-action-markdown-link-check/releases) - [Commits](gaurav-nelson/github-action-markdown-link-check@1.0.11...0fe4911) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>
Co-authored-by: Erik Grinaker <erik@interchain.berlin>
Co-authored-by: Marko <marbar3778@yahoo.com>
* modify readme
* add rfc and proto
* add rust-spec back to avoid breakage
* lint readme
* describe the genesis
* Update spec/core/genesis.md
* Apply suggestions from code review
* add wording on app_state
* Update spec/core/genesis.md

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* fix links
* fix more links
* added proposer-based timestamp spec
* Update spec/consensus/proposer-based-timestamp/pbts_001_draft.md
* Update spec/consensus/proposer-based-timestamp/pbts_001_draft.md
* Update spec/consensus/proposer-based-timestamp/pbts-algorithm_001_draft.md
* Update spec/consensus/proposer-based-timestamp/pbts-algorithm_001_draft.md
* Update spec/consensus/proposer-based-timestamp/pbts-sysmodel_001_draft.md
* fixes from PR

Co-authored-by: Josef Widder <44643235+josef-widder@users.noreply.github.com>
Co-authored-by: Aleksandr Bezobchuk <alexanderbez@users.noreply.github.com>
Co-authored-by: Marko <marbar3778@yahoo.com>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* ABCI++ RFC

  This commit adds an RFC for ABCI++, which is a collection of three new phases of communication between the consensus engine and the application.

* Fix bugs pointed out by @liamsi
* Update rfc/004-abci++.md
* Fix markdown lints
* Update rfc/004-abci++.md
* Update rfc/004-abci++.md
* Update rfc/004-abci++.md
* Add information about the rename in the context section
* Bold RFC
* Add example for self-authenticating vote data
* More exposition of the term IPC
* Update pros / negatives
* Fix sentence fragment
* Add desc for no-ops

Co-authored-by: Sunny Aggarwal <sunnya97@protonmail.ch>
Co-authored-by: Federico Kunze <31522760+fedekunze@users.noreply.github.com>
Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>
Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>
* add rpc spec and support outline
* add json
* add more routes, remove unneeded ones
* add rest of rpc endpoints
* add jsonrpc calls
* add more jsonrpc calls
* fix blockchain
* cleanup unused links and add links to repos
* Update spec/rpc/README.md
* add missing param from consensus param
* Update spec/rpc/README.md
* Update spec/rpc/README.md
* fix cast and add doc to readme

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Co-authored-by: Marko Baricevic <markobaricevic@Fergalicious.local>
* Avoid quantifier alternation cycle

  The problematic quantifier alternation cycle arose because the definition of accountability_violation was unfolded. This commit also restructures the induction proof for clarity.

* add count_lines.sh
* fix typo and add forgotten complete=fo in comment

Co-authored-by: Giuliano <giuliano@eic-61-11.galois.com>
Think I got it to work. I'll need to clean up the code and isolate a few locations for you to look into. There is a stray Error.
Probably the validator set for the wrong height is being used. That's probably easy to fix, but you are welcome to comment.
Fixed the last obvious bug by doing a fastsync after local state sync.
if state.LastBlockHeight != store.Height() {
    panic(fmt.Sprintf("state (%v) and store (%v) height mismatch", state.LastBlockHeight,
// TODO: use logger before SetLogger?
This is no longer a fatal condition — should it panic, log, or print?
If the statesync RPC is not set, is a panic better? Would we need to pass more config.toml variables into this NewReactor() call?
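A minimal sketch of the decision being discussed. The function name `handleHeightMismatch` and the `rpcConfigured` flag are hypothetical, not part of the PR; the idea is simply that the mismatch stays fatal when no RPC source exists to repair the stores, and becomes a logged, recoverable condition otherwise:

```go
package main

import (
	"fmt"
	"log"
)

// handleHeightMismatch sketches non-fatal handling of a state/store height
// mismatch: if a statesync RPC source is configured, a local sync may repair
// the stores, so log and continue; otherwise fail loudly as before.
func handleHeightMismatch(stateHeight, storeHeight int64, rpcConfigured bool) error {
	if stateHeight == storeHeight {
		return nil // heights agree, nothing to do
	}
	if !rpcConfigured {
		// no way to repair the stores; preserve the original fatal behavior
		return fmt.Errorf("state (%v) and store (%v) height mismatch", stateHeight, storeHeight)
	}
	log.Printf("state (%v) and store (%v) height mismatch; will attempt local sync", stateHeight, storeHeight)
	return nil
}

func main() {
	if err := handleHeightMismatch(10, 12, false); err != nil {
		fmt.Println("fatal:", err)
	}
}
```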
eventBus      types.BlockEventPublisher
genDoc        *types.GenesisDoc
logger        log.Logger
stateProvider statesync.StateProvider
stateProvider is used by localSync() to sync over RPC when the application height is greater than the store height
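The trigger condition described above can be sketched as a tiny predicate. `needsLocalSync` and its parameter names are illustrative only, assuming the application's height comes from the ABCI Info handshake and the store height from the block store:

```go
package main

import "fmt"

// needsLocalSync sketches the trigger described above: the application has
// restored a snapshot (its height is ahead of the block store), so the
// missing state/block data must be fetched over RPC via the stateProvider.
func needsLocalSync(appBlockHeight, storeBlockHeight int64) bool {
	return appBlockHeight > storeBlockHeight
}

func main() {
	fmt.Println(needsLocalSync(5000, 0)) // snapshot restored, empty block store: true
	fmt.Println(needsLocalSync(10, 10))  // heights agree: false
}
```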
@@ -299,6 +355,19 @@ func (h *Handshaker) ReplayBlocks(
		"stateHeight",
		stateBlockHeight)
Decide whether localSync() should be called before the switch/case below, in case edge cases remain
@@ -322,7 +322,7 @@ func newConsensusStateForReplay(config cfg.BaseConfig, csConfig *cfg.ConsensusCo
	tmos.Exit(fmt.Sprintf("Failed to start event bus: %v", err))
}

-	handshaker := NewHandshaker(stateStore, state, blockStore, gdoc)
+	handshaker := NewHandshaker(stateStore, state, blockStore, gdoc, nil)
Is it safe to assume no stateProvider is required here?
err = bcR.SwitchToBlockSync(state)
if err != nil {
	// If already fastsync, we expect an error
	n.Logger.Info("Failed to switch to fast sync", "err", err)
Is there a better way to detect when SwitchToBlockSync() is not required?
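One possible shape for such a check, sketched with stdlib-only code. `needsBlockSync` and `maxPeerHeight` are hypothetical names — the assumption is that block sync is only needed when peers report blocks beyond what the local store already holds, rather than attempting the switch and treating the error as expected:

```go
package main

import "fmt"

// needsBlockSync is a hypothetical predicate for the question above: after a
// local state restore, switching to block sync is only required if the
// network has blocks beyond what the store already has.
func needsBlockSync(storeHeight, maxPeerHeight int64) bool {
	return maxPeerHeight > storeHeight
}

func main() {
	fmt.Println(needsBlockSync(100, 250)) // behind the network: true, switch
	fmt.Println(needsBlockSync(250, 250)) // caught up: false, skip the switch
}
```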
}
n.consensusReactor.Metrics.StateSyncing.Set(0)
n.consensusReactor.Metrics.BlockSyncing.Set(1)
Is it safe to assume BlockSyncing here?
) error {
-	handshaker := cs.NewHandshaker(stateStore, state, blockStore, genDoc)
	consensusLogger log.Logger) error {
Prepare the stateProvider before NewHandshaker() is called
Implements RPC sync during handshake, as recommended. This code was previously tested against the uni-5 testnet, then cleaned up and rebased on main. Comments were added in areas of the code where guidance could be used.
As an aside, what is the normal way you do a sanity check on chain + cosmos-sdk + tendermint? It seems that simply running

So when rebasing these changes on tendermint/main, it makes testing against juno/v11.0.0 + cosmos-sdk/v0.45.9 very challenging, with extensive manual patches to bridge any API/type gaps.
If you want to test with a real-life example, there isn't really any other way. What we could do is incorporate local sync into the e2e tests (i.e. use our custom e2e application to test that the state restore worked as expected). I don't think this should be breaking though, so you should only need to change the go.mod.
No problem. I think the risk of some very complicated breakage is very low given how close the

Then as time permits, we can work together on any other, more involved tests/checks as we build further. When I get responses to the comments I put above, we'll be able to wrap up at least the main design/implementation according to your guidance.
@cmwaters A few notes:
As recently seen on the evmos chain, a badly behaving node is serving bad snapshots:
There should be a total of 103 chunks, but someone is only serving 22, forcing all p2p statesync attempts to fail. This highlights one reason a local statesync restore is an important option to have: it allows a validator to copy a known-good snapshot from another server (self-owned, or from another validator/API server).
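The failure mode above (22 of 103 chunks served) is easy to detect up front when chunks are on local disk. A stdlib-only sketch — `missingChunks` and the zero-indexed chunk numbering are assumptions for illustration, not the PR's actual code:

```go
package main

import "fmt"

// missingChunks compares the chunk count advertised in snapshot metadata
// against the chunks actually on hand, returning the missing indices so a
// restore can fail fast instead of stalling on an incomplete snapshot.
func missingChunks(total int, have map[int]bool) []int {
	var missing []int
	for i := 0; i < total; i++ {
		if !have[i] {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	// Mirror the evmos case: 103 chunks advertised, only 22 available.
	have := map[int]bool{}
	for i := 0; i < 22; i++ {
		have[i] = true
	}
	fmt.Println(len(missingChunks(103, have))) // 81 chunks unavailable
}
```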
Just a heads up - this is on my backlog of things to do
We understand the challenge. Your feedback has been great and is worth waiting for!
Hello, checking in again. :)
Throwing in a comment because some repos need it to keep a PR alive, but tell me if it doesn't help :)
@hoangdv2429 - It may be better to ask for help in the Cosmos Hub Discord. This particular page focuses on a different part of statesync than your question.
By adding the

The main reason we haven't dedicated a lot of time to this yet is that we have a whole bunch of other super-high-priority work currently underway and limited capacity for other work, but we're doing our 2023-Q1 planning next week and will see how we can prioritize this.
Thank you @thanethomson - Good luck with the high-priority items, and give a signal when it's time to resume this PR.
Seeking guidance.
This PR enables cosmos-sdk to initiate local snapshot restore with tendermint finishing up the stateStore and blockStore data restoration.
This development is based on juno's uni testnet (tendermint v0.34.20) but seems to merge into main without conflict.
This PR pairs with: cosmos/cosmos-sdk#13521
This PR enables a validator to restore quickly without needing to search for remote snapshot servers. In essence, if a validator already takes periodic snapshots, it is wasteful in time and bandwidth to search over P2P in hopes of finding a snapshot.
However, an RPC server is still required to fetch the last, current, and next blocks to fully restore the stateStore and blockStore.
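The "last, current, and next" blocks above can be expressed as a small helper. `blocksToFetch` is a hypothetical name, and the exact height set is an assumption drawn from the sentence above (the previous block for the last commit, the snapshot block itself, and the following block whose LastCommit finalizes it), not confirmed against the PR's code:

```go
package main

import "fmt"

// blocksToFetch sketches which heights the RPC provider must supply to
// rebuild stateStore/blockStore around a snapshot restored at height h:
// the previous block, the block at h, and the next block.
func blocksToFetch(h int64) []int64 {
	return []int64{h - 1, h, h + 1}
}

func main() {
	fmt.Println(blocksToFetch(5000)) // [4999 5000 5001]
}
```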
statesync/stateprovider.go
The use case is to quickly reduce disk usage, which normally grows over time to an arguably large size, without the need to slowly prune IAVL trees.
Before further development, we'd like to ask:
Do you prefer such work to be done on the main branch, and if so how shall it be tested against a cosmwasm restore?
If testing is required, how shall it be done with cosmos-sdk's additional data stores such as cosmwasm or a sifchain style clp module?
Please take a look at the structure of code and indicate whether any rules regarding ABCI or API/Responsibilities have been broken.
The current code design is meant for ease of review/comment before adding multiple additional layers of abstraction to put each bit of code in the right place.
This PR also allows for much smaller zip/tar.gz files for chain snapshots and doesn't require a validator to prune just to provide small restore files. This means archive nodes can now create snapshots for manual download too.
PR checklist
- [ ] `CHANGELOG_PENDING.md` updated, or no changelog entry needed
- [ ] Updated relevant documentation (`docs/`) and code comments, or no documentation updates needed