State sync from local snapshot: Implementation #29

sergio-mena · 2022-12-23T15:50:41Z

Tasks:

Implementation (existing implementation of ADR083: Enable local snapshot restore tendermint/tendermint#9541)
Introduce necessary changes to documentation and/or spec to keep them up to date
Check if statesync: bootstrap node with state obtained out-of-band tendermint/tendermint#4642 can be closed once implementation merged

DoD:

Implementation code complete and all PRs merged
Any changes needed to the documentation/spec are merged

Original issue: tendermint/tendermint#9948

chillyvee · 2023-03-13T21:53:43Z

Happy to collaborate when you are ready. We just updated the code for the latest Juno v13 and have some intermediate feedback based on previous discussions.

adizere · 2023-03-31T11:45:41Z

Thank @chillyvee, appreciate it ! We'll get back to you in a few weeks time after we have explored all the existing material in tendermint/tendermint and have clarity over the solution. This work is part of our Q2 priorities.

adizere · 2023-04-18T07:37:24Z

Hi @chillyvee would you kindly reconstruct your original PR and submit to this repo as a draft PR preferably, with a base branch v0.34? We've now in the "understand problem & solution" space (#28) so it would be good to see this implementation.

Thanks in advance!

yihuang · 2023-04-18T08:30:43Z

Hi @chillyvee would you kindly reconstruct your original PR and submit to this repo as a draft PR preferably, with a base branch v0.34? We've now in the "understand problem & solution" space (#28) so it would be good to see this implementation.

Thanks in advance!

Love this feature, would like to help if needed, BTW, where can I find the ADR083?

From my understanding, state-sync can be decoupled into several components:

snapshot discovery and chunks downloading
snapshot chunks verification with light client protocol
snapshot restoration
node startup with the new state.

Ideally we should have offline commands for each steps to provide maximum flexibility.

cason · 2023-04-18T09:39:00Z

I think Adi referred to the open PR in the old repository: tendermint/tendermint#9651

This has become a huge PR due to the changes made in the upstream repository. We are only interested on this file: https://github.com/tendermint/tendermint/pull/9651/files#diff-393baba5ea65c4316e47d1a3715a1c4a8e934c1310071f982d2cf5418ef84d80

yihuang · 2023-04-18T09:53:14Z

I think Adi referred to the open PR in the old repository: tendermint/tendermint#9651

This has become a huge PR due to the changes made in the upstream repository. We are only interested on this file: https://github.com/tendermint/tendermint/pull/9651/files#diff-393baba5ea65c4316e47d1a3715a1c4a8e934c1310071f982d2cf5418ef84d80

Thanks.

Allow applications to start with a bootstrapped state alongside an empty Tendermint instance using a subsystem of the state sync protocol.

This is exactly what I want.

yihuang · 2023-04-19T03:43:32Z

Just did a squash rebase and a few tweaks to fix building, and opened the PR targeting v0.34.x branch: #728

adizere · 2023-04-19T07:54:31Z

Thank you @yihuang. I'll be opening an ADR based on the original one soon.

adizere · 2023-04-19T08:17:15Z

Ported the ADR here: #729. It's not yet clear to me if it's reflective of the solution, please have a look.

chillyvee · 2023-04-19T09:41:53Z

Seeing the activity here. Need to review notes and will be back soon.

yihuang · 2023-04-19T12:15:10Z

Seeing the activity here. Need to review notes and will be back soon.

Please feel free to take over the PR when you are back 😄

cason · 2023-04-19T15:49:19Z

Any reason for targeting this change to branch v0.34.x? I would say it would be better and safer to target the main branch, then after due checks, backport it to release branches.

chillyvee · 2023-04-19T17:40:41Z

Is there a preference for v0.34 or main?

Can probably make a working example for Juno v14 if anyone thinks that is helpful for the "understand" part of the work.

yihuang · 2023-04-20T01:44:44Z

Any reason for targeting this change to branch v0.34.x? I would say it would be better and safer to target the main branch, then after due checks, backport it to release branches.

target main branch: #733

adizere · 2023-04-20T07:52:54Z

Is there a preference for v0.34 or main?

Can probably make a working example for Juno v14 if anyone thinks that is helpful for the "understand" part of the work.

Excellent idea!

yihuang · 2023-04-26T08:33:30Z

FYI, after we identified the root cause of the state sync timeout issue and use the workaround: #742, the state sync UX become good again for us.

adizere · 2023-05-04T14:49:45Z

Can probably make a working example for Juno v14 if anyone thinks that is helpful for the "understand" part of the work.

Any progress on this?

chillyvee · 2023-05-04T20:55:00Z

@adizere Will try to make this a nice weekend project :)

adizere · 2023-05-05T13:44:43Z

@adizere Will try to make this a nice weekend project :)

Hey, it's not urgent. I was asking because I was not able to test it. We spoke among the whole team also about the solution, and there's some concerns that the application state might be easily corruptible, or unsafe to transfer across machines. We are leaning towards a solution that reuses the snapshotting mechanism (instead of app.db).

chillyvee · 2023-05-05T14:03:36Z

@adizere Assuming statesync works (which it does), then calling statesync with local files rather than the exact same files downloaded over p2p between machines doesn't seem like a way to make the appstate more corruptible.

All the code does is reuse the existing snapshot restore mechanism and reference local application.db. Then finishes up by triggering comet-bft to restore the remaining non-application state.

Essentially the difference is replacing the statesync file downloads via FTP with a USB thumb drive. It's the same code executing.

Do you know what part of using the existing snapshot restore code/other code is the additional concern coming from? Could take your feedback to make extra checks or confirm the risk.

yihuang · 2023-05-05T17:14:04Z

In my test, the current implementation do has some issues with handshake process, if it's too tricky to hook into the handshake process, I'd like to have some low level tools instead, for example let user to download and store the consensus state at an arbitrary height using the light client protocol. So power user can just do the thing in multiple steps manually:

download snapshot and restore app state
restore cometbft state to the same height (feat: add bootstrap-state cmd #803)
start the node and it'll sync from there.

cason · 2023-05-08T03:26:58Z

@chillyvee Can you take a look on #801?

I think the proposal I am writing there, as ADR 104, is essentially the same as the one you are proposing.

cason · 2023-05-08T03:34:59Z

@yihuang State sync currently does what you are proposing, with the advantage that it makes sure that the application state matches CometBFT's state (i.e., block store and state store). It does not make much sense to me to bootstrap CometBFT's state to a given height, as you proposed in #803, without having an application at the same height.

In #801 we propose a mechanism to allow what you are proposing as the first bullet, i.e., to restore an application snapshot from the local file system, while bootstrapping CometBFT's state accordingly.

yihuang · 2023-05-08T04:09:53Z

It does not make much sense to me to bootstrap CometBFT's state to a given height, as you proposed in #803, without having an application at the same height.

the point is split up the state-sync process into multiple commands to maximize flexibility, there'll be a separate command to restore the app state from a local snapshot, it could be implemented in cosmos-sdk.

with the advantage that it makes sure that the application state matches CometBFT's state

after offline restoration, node start up normally, the normal start up procedures will check the consistencies anyway.

I think i'm just trying to do the same thing without intrusive changes in the normal node start up logic, just a few offline commands, but I'm open for more integrated solutions as well, as long as it solve the problem.

yihuang · 2023-05-08T04:31:14Z

BTW, the second bullet in your proposal can also be implemented in app side, as demonstrated here

cason · 2023-05-08T05:16:00Z

the point is split up the state-sync process into multiple commands to maximize flexibility

Yes, I agree with this. Maybe not really CLI commands, but to make the steps more clearly separated on the code, although they are already kind of.

More specifically:

This node startup method configures the Light client provider and calls State Sync
State sync main routine receives a candidate snapshot, uses its height to start the light client, then use the light-client obtained state to verify the installed snapshot
Back to the node startup, the state and commit obtained by the light client are used to bootstrap the state store and the block store accordingly

cason · 2023-05-08T05:19:01Z

I think i'm just trying to do the same thing without intrusive changes in the normal node start up logic

The ADR 104 proposal (#801) is not intrusive as well. It has the advantage of also considering the server side, namely, providing a command to allow "exporting" an application snapshot to a "file", that will be use to bootstrap another node.

cason · 2023-05-08T05:23:16Z

BTW, the second bullet in your proposal can also be implemented in app side, as demonstrated here

Yes, this is an alternative for the server side to "export" snapshots.

I am not sure that is the best one, as it requires every snapshot to be saved to disk, even if the snapshots are not shipped out-of-band to fresh clients. Notice that snapshots are already saved somewhere by the application, and this approach will create some data duplication, with the disadvantage that the application knows how and when to prune or delete old snapshots, feature that would not be trivially transferred to this file system "snapshot store".

yihuang · 2023-05-08T05:28:41Z

BTW, the second bullet in your proposal can also be implemented in app side, as demonstrated here

Yes, this is an alternative for the server side to "export" snapshots.

I am not sure that is the best one, as it requires every snapshot to be saved to disk, even if the snapshots are not shipped out-of-band to fresh clients. Notice that snapshots are already saved somewhere by the application, and this approach will create some data duplication, with the disadvantage that the application knows how and when to prune or delete old snapshots, feature that would not be trivially transferred to this file system "snapshot store".

in cosmos-sdk, it'll directly restore the state from snapshot without saving them, it only saves the snapshots created by itself, the new flag is to make it save the snapshot for the first restoration. cometbft do save the chunk to file in a temporary directory though, it's immediately removed after restoration finished.
the saved snapshot is to be used by another cli command to restore app state from it, then user use the bootstrap-state cli command to restore cometbft state, then a normal node start up.

cason · 2023-05-08T05:35:06Z

in cosmos-sdk

We have to also consider users that do not adopt the SDK. Of course, once we have a more detailed proposal, we would be happy to discuss specifics with the SDK team.

cason · 2023-05-08T05:37:37Z

Moreover, I would really appreciate your feedback on #801. It is in draft mode, but you can comment or propose changes there.

yihuang · 2023-05-08T06:22:07Z

Moreover, I would really appreciate your feedback on #801. It is in draft mode, but you can comment or propose changes there.

have left review comments there, I think we need more descriptions of the new design to decide if it's a better one.

yihuang · 2023-05-08T10:58:49Z

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

yihuang · 2023-05-10T09:01:03Z

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

#803 is ported to sdk directly, it turns out we can hack tendermint state directly, so the whole local snapshot restoration procedure can be done in app side alone.

adizere · 2023-05-16T08:07:17Z

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

#803 is ported to sdk directly, it turns out we can hack tendermint state directly, so the whole local snapshot restoration procedure can be done in app side alone.

Excellent. Thanks @yihuang for the update!

I don't fully understand the SDK-side solution, but it seems to handle snapshot restoration plus also saving the snapshot (cosmos/cosmos-sdk#16060). What exactly is left on Comet side to do?

yihuang · 2023-05-16T08:17:14Z

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

#803 is ported to sdk directly, it turns out we can hack tendermint state directly, so the whole local snapshot restoration procedure can be done in app side alone.

Excellent. Thanks @yihuang for the update!

I don't fully understand the SDK-side solution, but it seems to handle snapshot restoration plus also saving the snapshot (cosmos/cosmos-sdk#16060). What exactly is left on Comet side to do?

maybe we can do this command more elegantly in cometbft, after all, it use many internal APIs to do the job, those are maybe prone to change in the future.

adizere · 2023-05-22T09:44:57Z

maybe we can do this command more elegantly in cometbft, after all, it use many internal APIs to do the job, those are maybe prone to change in the future.

Got it. This is a nice to have, right? Correct me if I'm wrong, it seems like on Comet side there's nothing left to do (except for "nice to have" in the future).

yihuang · 2023-05-22T11:04:31Z

maybe we can do this command more elegantly in cometbft, after all, it use many internal APIs to do the job, those are maybe prone to change in the future.

Got it. This is a nice to have, right? Correct me if I'm wrong, it seems like on Comet side there's nothing left to do (except for "nice to have" in the future).

yeah, nice to have.

adizere · 2023-05-24T15:42:05Z

yeah, nice to have.

Thanks @yihuang. We'll track that work here: #884.

I'm closing the present PR: The original scope and problem we wanted to address here shifted. In the future, we might implement ADR 104 (#801) as a solution for non-SDK users that want support for state sync out of a local snapshot. For the case of SDK users, their needs seem to be addressed.

…Change Change the default value of maxInboundPeer and maxOutBoundPeer to 100

…bft#2846… (cometbft#29) (cometbft#33) * perf(libs/json): Lower heap overhead of JSON encoding (backport cometbft#2846) (cometbft#2876) --- Many RPC methods require JSON marshalled responses. We saw this taking a notable amount of heap allocation in query serving full nodes. This PR removes some extra heap allocations that were being done. We avoided using the more efficient encoder.Encode before, because it added a newline. This PR changes the function signature for these private methods to be using *bytes.Buffer, and then uses the in-buffer methods (rather than a second copy). We then just truncate the final byte after each such call, which does not waste any allocations. I added a benchmark for the most complex test case. OLD: ``` BenchmarkJsonMarshalStruct-12 78992 15542 ns/op 4487 B/op 191 allocs/op ``` New: ``` BenchmarkJsonMarshalStruct-12 93346 11132 ns/op 3245 B/op 58 allocs/op ``` Roughly a 3-4x reduction in the number of allocations, and 20% speedup. - [x] Tests written/updated - Existing tests cover this - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#2846 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Andy Nogueira <me@andynogueira.dev> * changelog --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Andy Nogueira <me@andynogueira.dev> (cherry picked from commit ee66963) Co-authored-by: Adam Tucker <adam@osmosis.team>

…bft#2846… (cometbft#29) * perf(libs/json): Lower heap overhead of JSON encoding (backport cometbft#2846) (cometbft#2876) --- Many RPC methods require JSON marshalled responses. We saw this taking a notable amount of heap allocation in query serving full nodes. This PR removes some extra heap allocations that were being done. We avoided using the more efficient encoder.Encode before, because it added a newline. This PR changes the function signature for these private methods to be using *bytes.Buffer, and then uses the in-buffer methods (rather than a second copy). We then just truncate the final byte after each such call, which does not waste any allocations. I added a benchmark for the most complex test case. OLD: ``` BenchmarkJsonMarshalStruct-12 78992 15542 ns/op 4487 B/op 191 allocs/op ``` New: ``` BenchmarkJsonMarshalStruct-12 93346 11132 ns/op 3245 B/op 58 allocs/op ``` Roughly a 3-4x reduction in the number of allocations, and 20% speedup. - [x] Tests written/updated - Existing tests cover this - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [x] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#2846 done by [Mergify](https://mergify.com). --------- Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Andy Nogueira <me@andynogueira.dev> * changelog --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com> Co-authored-by: Andy Nogueira <me@andynogueira.dev>

sergio-mena mentioned this issue Dec 23, 2022

State sync from local snapshot #27

Closed

2 tasks

sergio-mena added the backlog A prioritized task in the team's backlog label Dec 23, 2022

yihuang mentioned this issue Apr 19, 2023

Enable local snapshot restore (backport: #733) #728

Closed

3 tasks

yihuang mentioned this issue Apr 20, 2023

Enable local snapshot restore #733

Closed

3 tasks

adizere added this to the 2023-Q2 milestone Apr 21, 2023

yihuang mentioned this issue May 8, 2023

feat: add bootstrap-state cmd #803

Closed

3 tasks

thanethomson removed the backlog A prioritized task in the team's backlog label May 16, 2023

cason added enhancement New feature or request state-sync labels May 17, 2023

adizere closed this as completed May 24, 2023

adizere closed this as not planned Won't fix, can't repro, duplicate, stale May 24, 2023

Raneet10 pushed a commit to 0xPolygon/cometbft that referenced this issue Nov 20, 2023

Merge pull request cometbft#29 from maticnetwork/vaibhav/DefaultValue…

da9dfa1

…Change Change the default value of maxInboundPeer and maxOutBoundPeer to 100

State sync from local snapshot: Implementation #29

State sync from local snapshot: Implementation #29

Comments

sergio-mena commented Dec 23, 2022 • edited

chillyvee commented Mar 13, 2023

adizere commented Mar 31, 2023

adizere commented Apr 18, 2023 • edited

yihuang commented Apr 18, 2023 • edited

cason commented Apr 18, 2023 • edited

yihuang commented Apr 18, 2023

yihuang commented Apr 19, 2023 • edited

adizere commented Apr 19, 2023

adizere commented Apr 19, 2023

chillyvee commented Apr 19, 2023

yihuang commented Apr 19, 2023

cason commented Apr 19, 2023

chillyvee commented Apr 19, 2023

yihuang commented Apr 20, 2023

adizere commented Apr 20, 2023

yihuang commented Apr 26, 2023 • edited

adizere commented May 4, 2023

chillyvee commented May 4, 2023

adizere commented May 5, 2023

chillyvee commented May 5, 2023 • edited

yihuang commented May 5, 2023 • edited

cason commented May 8, 2023 • edited

cason commented May 8, 2023 • edited

yihuang commented May 8, 2023 • edited

yihuang commented May 8, 2023 • edited

cason commented May 8, 2023 • edited

cason commented May 8, 2023

cason commented May 8, 2023 • edited

yihuang commented May 8, 2023 • edited

cason commented May 8, 2023

cason commented May 8, 2023

yihuang commented May 8, 2023 • edited

yihuang commented May 8, 2023

yihuang commented May 10, 2023

adizere commented May 16, 2023

yihuang commented May 16, 2023

adizere commented May 22, 2023 • edited

yihuang commented May 22, 2023

adizere commented May 24, 2023

sergio-mena commented Dec 23, 2022 •

edited

adizere commented Apr 18, 2023 •

edited

yihuang commented Apr 18, 2023 •

edited

cason commented Apr 18, 2023 •

edited

yihuang commented Apr 19, 2023 •

edited

yihuang commented Apr 26, 2023 •

edited

chillyvee commented May 5, 2023 •

edited

yihuang commented May 5, 2023 •

edited

cason commented May 8, 2023 •

edited

cason commented May 8, 2023 •

edited

yihuang commented May 8, 2023 •

edited

yihuang commented May 8, 2023 •

edited

cason commented May 8, 2023 •

edited

cason commented May 8, 2023 •

edited

yihuang commented May 8, 2023 •

edited

yihuang commented May 8, 2023 •

edited

adizere commented May 22, 2023 •

edited