Enable local snapshot restore #9541

Open
wants to merge 385 commits into main

Conversation

chillyvee

@chillyvee chillyvee commented Oct 12, 2022

Seeking guidance.

This PR enables cosmos-sdk to initiate a local snapshot restore, with tendermint completing the stateStore and blockStore data restoration.

This development is based on Juno's uni testnet (tendermint v0.34.20) but appears to merge into main without conflicts.

This PR pairs with: cosmos/cosmos-sdk#13521

This PR enables a validator to restore quickly without needing to search for remote snapshot servers. The rationale: if a validator already takes periodic snapshots, it is a waste of time and bandwidth to search over P2P in the hope of finding a snapshot.

However, an RPC server is still required to fetch the last, current, and next blocks to fully restore stateStore and blockStore.

statesync/stateprovider.go

146:    lastLightBlock, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height), time.Now())
150:    currentLightBlock, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height+1), time.Now())
154:    nextLightBlock, err := s.lc.VerifyLightBlockAtHeight(ctx, int64(height+2), time.Now())
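
For reference, a minimal sketch of how those three verified light blocks could be fetched around the snapshot height (the free-standing helper and its name are illustrative, not the exact code in this PR):

```go
package statesync

import (
	"context"
	"time"

	"github.com/tendermint/tendermint/light"
	"github.com/tendermint/tendermint/types"
)

// lightBlocksAroundSnapshot verifies the light blocks at the snapshot height
// and the two following heights; these are what the local restore needs to
// rebuild stateStore and blockStore.
func lightBlocksAroundSnapshot(
	ctx context.Context, lc *light.Client, height uint64,
) (last, current, next *types.LightBlock, err error) {
	now := time.Now()
	if last, err = lc.VerifyLightBlockAtHeight(ctx, int64(height), now); err != nil {
		return nil, nil, nil, err
	}
	if current, err = lc.VerifyLightBlockAtHeight(ctx, int64(height+1), now); err != nil {
		return nil, nil, nil, err
	}
	if next, err = lc.VerifyLightBlockAtHeight(ctx, int64(height+2), now); err != nil {
		return nil, nil, nil, err
	}
	return last, current, next, nil
}
```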

The use case is to quickly reduce disk usage, which normally grows to a large size over time, without having to slowly prune IAVL trees.

Before further development, we'd like to ask:

Do you prefer such work to be done on the main branch, and if so, how should it be tested against a cosmwasm restore?

If testing is required, how should it be done with cosmos-sdk's additional data stores, such as cosmwasm or a sifchain-style clp module?

Please take a look at the structure of the code and indicate whether any rules regarding ABCI or API responsibilities have been broken.

The current code design is meant for ease of review/comment before adding multiple additional layers of abstraction to put each bit of code in the right place.

This PR also allows for much smaller zip/tar.gz files for chain snapshots and doesn't require a validator to prune just to provide small restore files. This means archive nodes can now create snapshots for manual download too.

PR checklist

  • Tests written/updated, or no tests needed
  • CHANGELOG_PENDING.md updated, or no changelog entry needed
  • Updated relevant documentation (docs/) and code comments, or no
    documentation updates needed

konnov and others added 30 commits December 11, 2020 15:03
* the working attackers isolation spec, needs more comments

* the TLA+ spec of the attackers isolation
Bumps [gaurav-nelson/github-action-markdown-link-check](https://github.com/gaurav-nelson/github-action-markdown-link-check) from 1.0.8 to 1.0.11.
- [Release notes](https://github.com/gaurav-nelson/github-action-markdown-link-check/releases)
- [Commits](gaurav-nelson/github-action-markdown-link-check@1.0.8...2a60e0f)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Add light attack evidence handling
Reflect the change made in tendermint#5805

The MTU (Maximum Transmission Unit) for Ethernet is 1500 bytes.
The IP header and the TCP header take up 20 bytes each at least (unless
optional header fields are used) and thus the max for (non-Jumbo frame)
Ethernet is 1500 - 20 - 20 = 1460
Source: https://stackoverflow.com/a/3074427/820520
Bumps [gaurav-nelson/github-action-markdown-link-check](https://github.com/gaurav-nelson/github-action-markdown-link-check) from 1.0.11 to 1.0.12.
- [Release notes](https://github.com/gaurav-nelson/github-action-markdown-link-check/releases)
- [Commits](gaurav-nelson/github-action-markdown-link-check@1.0.11...0fe4911)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>
Co-authored-by: Erik Grinaker <erik@interchain.berlin>
Co-authored-by: Marko <marbar3778@yahoo.com>
* modify readme

* add rfc and proto

* add rust=spec back to avoid breakage

* lint readme
* describe the genesis

* Update spec/core/genesis.md

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* add wording on app_state

* Update spec/core/genesis.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* fix links

* fix more links
* added proposer-based timestamp spec

* Update spec/consensus/proposer-based-timestamp/pbts_001_draft.md

Co-authored-by: Aleksandr Bezobchuk <alexanderbez@users.noreply.github.com>

* Update spec/consensus/proposer-based-timestamp/pbts_001_draft.md

Co-authored-by: Aleksandr Bezobchuk <alexanderbez@users.noreply.github.com>

* Update spec/consensus/proposer-based-timestamp/pbts-algorithm_001_draft.md

Co-authored-by: Marko <marbar3778@yahoo.com>

* Update spec/consensus/proposer-based-timestamp/pbts-algorithm_001_draft.md

* Update spec/consensus/proposer-based-timestamp/pbts-sysmodel_001_draft.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* fixes from PR

Co-authored-by: Josef Widder <44643235+josef-widder@users.noreply.github.com>
Co-authored-by: Aleksandr Bezobchuk <alexanderbez@users.noreply.github.com>
Co-authored-by: Marko <marbar3778@yahoo.com>
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
* ABCI++ RFC

This commit adds an RFC for ABCI++, which is a collection of three new phases of communication between the consensus engine and the application.

Co-authored-by: Sunny Aggarwal <sunnya97@protonmail.ch>

* Fix bugs pointed out by @liamsi

* Update rfc/004-abci++.md

Co-authored-by: Federico Kunze <31522760+fedekunze@users.noreply.github.com>

* Fix markdown lints

* Update rfc/004-abci++.md

Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>

* Update rfc/004-abci++.md

Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>

* Update rfc/004-abci++.md

Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>

* Add information about the rename in the context section

* Bold RFC

* Add example for self-authenticating vote data

* More exposition of the term IPC

* Update pros / negatives

* Fix sentence fragment

* Add desc for no-ops

Co-authored-by: Sunny Aggarwal <sunnya97@protonmail.ch>
Co-authored-by: Federico Kunze <31522760+fedekunze@users.noreply.github.com>
Co-authored-by: Ismail Khoffi <Ismail.Khoffi@gmail.com>
Co-authored-by: Tess Rinearson <tess.rinearson@gmail.com>
* add rpc spec and support outline

* add json

* add more routes remove unneeded ones

* add rest of rpc endpoints

* add jsonrpc calls

* add more jsonrpc calls

* fix blockchain

* cleanup unused links and add links to repos

* Update spec/rpc/README.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* add missing param from consensus param

* Update spec/rpc/README.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* Update spec/rpc/README.md

Co-authored-by: Callum Waters <cmwaters19@gmail.com>

* fix cast and add doc to readme

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
Co-authored-by: Marko Baricevic <markobaricevic@Fergalicious.local>
* Avoid quantifier alternation cycle

The problematic quantifier alternation cycle arose because the
definition of accountability_violation was unfolded.

This commit also restructures the induction proof for clarity.

* add count_lines.sh

* fix typo and add forgotten complete=fo in comment

Co-authored-by: Giuliano <giuliano@eic-61-11.galois.com>
@chillyvee
Author

chillyvee commented Oct 28, 2022

Think I got it to work. I'll need to clean up the code and isolate a few locations for you to look into.

There is a stray error:

ERR Error in validation err="Invalid commit -- wrong set size: 56 vs 9" module=blockchain

Probably the validator set for the wrong height is being used.

That's probably easy to fix, but you are welcome to comment.

@chillyvee
Author

Fixed the last obvious bug by doing a fastsync after the local state sync.

@thanethomson thanethomson added the S:wip Work in progress (prevents stalebot from automatically closing) label Oct 28, 2022
if state.LastBlockHeight != store.Height() {
panic(fmt.Sprintf("state (%v) and store (%v) height mismatch", state.LastBlockHeight,
// TODO: use logger before SetLogger?
Author

This is no longer a fatal condition; should we panic, log, or print?

If the statesync RPC is not set, is a panic better? Would we need to pass more config.toml variables into this NewReactor() call?
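
One option, sketched below, is to validate the statesync RPC settings up front and fail there rather than deep inside the reactor (the helper name and the two-server minimum mirror the existing statesync requirements; treat this as an assumption, not the PR's final behavior):

```go
package node

import (
	"errors"

	"github.com/tendermint/tendermint/config"
	"github.com/tendermint/tendermint/libs/log"
)

// validateLocalRestoreConfig checks the statesync section of config.toml
// before a local snapshot restore is attempted. The light-client state
// provider needs RPC servers (primary + witness) to verify the blocks
// around the snapshot height.
func validateLocalRestoreConfig(cfg *config.StateSyncConfig, logger log.Logger) error {
	if cfg == nil || len(cfg.RPCServers) < 2 {
		return errors.New("local snapshot restore requires at least two statesync rpc_servers entries in config.toml")
	}
	logger.Info("local snapshot restore enabled", "rpc_servers", cfg.RPCServers)
	return nil
}
```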

eventBus types.BlockEventPublisher
genDoc *types.GenesisDoc
logger log.Logger
stateProvider statesync.StateProvider
Author

stateProvider is used by localSync() to sync over RPC when the application height is greater than the store height.
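
For illustration, a rough sketch of what localSync() does with the stateProvider (the helper name is from this draft; the StateProvider methods and store calls assume the v0.34-style interfaces):

```go
package consensus

import (
	"context"

	sm "github.com/tendermint/tendermint/state"
	"github.com/tendermint/tendermint/statesync"
	"github.com/tendermint/tendermint/store"
)

// localSyncSketch fetches the verified state and commit for the application's
// height over RPC and persists them, so the node can continue from the
// locally restored snapshot instead of replaying from genesis.
func localSyncSketch(
	ctx context.Context,
	sp statesync.StateProvider,
	stateStore sm.Store,
	blockStore *store.BlockStore,
	appHeight uint64,
) error {
	state, err := sp.State(ctx, appHeight)
	if err != nil {
		return err
	}
	commit, err := sp.Commit(ctx, appHeight)
	if err != nil {
		return err
	}
	if err := stateStore.Bootstrap(state); err != nil {
		return err
	}
	return blockStore.SaveSeenCommit(state.LastBlockHeight, commit)
}
```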

@@ -299,6 +355,19 @@ func (h *Handshaker) ReplayBlocks(
"stateHeight",
stateBlockHeight)

Author

Decision needed: should localSync() be called before the switch/case below, in case edge cases remain?

@@ -322,7 +322,7 @@ func newConsensusStateForReplay(config cfg.BaseConfig, csConfig *cfg.ConsensusCo
tmos.Exit(fmt.Sprintf("Failed to start event bus: %v", err))
}

handshaker := NewHandshaker(stateStore, state, blockStore, gdoc)
handshaker := NewHandshaker(stateStore, state, blockStore, gdoc, nil)
Author

Is it safe to assume no stateProvider is required here?

err = bcR.SwitchToBlockSync(state)
if err != nil {
// If already fastsync, we expect an error
n.Logger.Info("Failed to switch to fast sync", "err", err)
Author

Is there a better way to detect when SwitchToBlockSync() is not required?
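
One possible pattern, purely hypothetical (the sentinel error below does not exist in the current reactor): have SwitchToBlockSync return a dedicated error for the already-syncing case and check it explicitly, instead of treating every error as expected:

```go
package node

import (
	"errors"

	"github.com/tendermint/tendermint/libs/log"
)

// ErrAlreadyBlockSyncing is a hypothetical sentinel the block-sync reactor
// could return when a switch is requested but it is already syncing.
var ErrAlreadyBlockSyncing = errors.New("already block syncing")

// handleSwitchToBlockSync separates the expected "already syncing" case from
// real failures, instead of logging every error at Info level.
func handleSwitchToBlockSync(err error, logger log.Logger) error {
	if err == nil {
		return nil
	}
	if errors.Is(err, ErrAlreadyBlockSyncing) {
		logger.Debug("block sync already running; no switch needed")
		return nil
	}
	return err
}
```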

}
n.consensusReactor.Metrics.StateSyncing.Set(0)
n.consensusReactor.Metrics.BlockSyncing.Set(1)

Author

Is it safe to assume BlockSyncing here?

) error {
handshaker := cs.NewHandshaker(stateStore, state, blockStore, genDoc)
consensusLogger log.Logger) error {

Author

Prepare the stateProvider before NewHandshaker() is called.
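
Sketch of the intended wiring in doHandshake (the extra argument follows this PR's modified NewHandshaker signature; how the stateProvider is built from config.StateSync is elided and assumed to happen in the caller):

```go
package node

import (
	"fmt"

	cs "github.com/tendermint/tendermint/consensus"
	"github.com/tendermint/tendermint/libs/log"
	"github.com/tendermint/tendermint/proxy"
	sm "github.com/tendermint/tendermint/state"
	"github.com/tendermint/tendermint/statesync"
	"github.com/tendermint/tendermint/types"
)

// doHandshake now receives a stateProvider that was prepared before this
// call, so ReplayBlocks can fall back to RPC when the application height is
// ahead of the block store.
func doHandshake(
	stateStore sm.Store,
	state sm.State,
	blockStore sm.BlockStore,
	genDoc *types.GenesisDoc,
	stateProvider statesync.StateProvider, // prepared from config.StateSync by the caller
	eventBus types.BlockEventPublisher,
	proxyApp proxy.AppConns,
	consensusLogger log.Logger,
) error {
	handshaker := cs.NewHandshaker(stateStore, state, blockStore, genDoc, stateProvider)
	handshaker.SetLogger(consensusLogger)
	handshaker.SetEventBus(eventBus)
	if err := handshaker.Handshake(proxyApp); err != nil {
		return fmt.Errorf("error during handshake: %v", err)
	}
	return nil
}
```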

@chillyvee
Author

Implements RPC sync during the handshake, as recommended.

This code was previously tested against uni-5 testnet, then cleaned up and rebased on main.

Comments have been added in areas of the code where guidance would be helpful.

@chillyvee
Author

As an aside, what is the normal way you do a sanity check on chain+cosmos-sdk+tendermint?

It seems that simply targeting the main branch of all three git repositories makes almost everything break.

So when rebasing these changes on tendermint/main, testing against juno/v11.0.0 + cosmos-sdk/v0.45.9 becomes very challenging, requiring extensive manual patches to bridge any API/type gaps.

@cmwaters
Contributor

cmwaters commented Oct 28, 2022

As an aside, what is the normal way you do a sanity check on chain+cosmos-sdk+tendermint?

It seems that simply targeting the main branch of all three git repositories makes almost everything break.

So when rebasing these changes on tendermint/main, testing against juno/v11.0.0 + cosmos-sdk/v0.45.9 becomes very challenging, requiring extensive manual patches to bridge any API/type gaps.

If you want to test with a real-life example, there isn't really any other way. What we could do is incorporate local sync into the e2e tests (i.e. use our custom e2e application to test that the state restore worked as expected).

I don't think this should be breaking, though, so you should only need to change the go.mod.

@chillyvee
Author

chillyvee commented Oct 28, 2022

No problem. I think the risk of some very complicated breakage is very low given how close the main branch is to 0.34, so I'll just keep a slightly forked version for v0.34 to verify functionality. It's not as if anyone is going to launch on the main branch when there are perfectly good release tags.

Then, as time permits, we can work together on any more involved tests/checks as we build further.

Once I get responses to the comments above, we'll be able to wrap up at least the main design/implementation according to your guidance.

@chillyvee
Author

chillyvee commented Nov 3, 2022

@cmwaters A few notes:

  1. If you have a chance to comment on the above, I'll see if I can implement your feedback over the weekend.
  2. It would be great to get this into some future point release of 0.34.x so that existing chains can use this feature. Should I create a separate PR, and which branch should it target?
  3. You mentioned that you were moving the Handshaker from NewNode to OnStart on this main branch, but the current code does not require that change. Are you still planning to update the Handshaker location before merging this PR?

@chillyvee
Author

chillyvee commented Nov 6, 2022

As recently seen on the evmos chain, a badly behaving node is serving bad snapshots:

Applied snapshot chunk to ABCI app chunk=21 format=1 height=6781500 module=statesync server=node total=22

There should be a total of 103 chunks, but someone is serving only 22, forcing all P2P statesync attempts to fail.

This highlights one reason a local statesync restore is an important option to have: it allows a validator to copy a known-good snapshot from another server (self-owned, or from another validator/API server).

@cmwaters cmwaters self-assigned this Nov 10, 2022
@cmwaters
Contributor

Just a heads up - this is on my backlog of things to do

@chillyvee
Author

Just a heads up - this is on my backlog of things to do

We understand the challenge. Your feedback has been great and is worth waiting for!

@chillyvee
Author

chillyvee commented Nov 21, 2022

Hello, checking in again. :)

@chillyvee
Author

Throwing in a comment because some repos need it to keep a PR alive, but tell me if it doesn't help :)

@hoangdv2429

This comment was marked as resolved.

@chillyvee
Author

chillyvee commented Dec 3, 2022

Error: error during handshake: error on replay: app block height (12985526) is higher than core (12985480).

@hoangdv2429 - It may be better to ask for help in the Cosmos Hub Discord. This page focuses on a different part of statesync than your question.

@thanethomson
Contributor

Throwing in a comment because some repos need it to keep a PR alive, but tell me if it doesn't help :)

By adding the S:wip label we've prevented stalebot from closing this PR automatically, so don't worry about it being closed.

The main reason we haven't dedicated a lot of time to this yet is that we have a whole bunch of other super high priority work that's currently underway and limited capacity for other work, but we're doing our 2023-Q1 planning next week and will see how we can prioritize this.

@chillyvee
Author

Thank you @thanethomson - good luck with the high-priority items, and give a signal when it's time to resume this PR.

Labels
backlog-priority (a priority issue in the backlog), S:wip (work in progress; prevents stalebot from automatically closing)
Projects
Status: Ready for Review
Development

Successfully merging this pull request may close these issues: none yet.