Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State sync from local snapshot: Implementation #29

Closed
3 tasks
Tracked by #27
sergio-mena opened this issue Dec 23, 2022 · 39 comments
Closed
3 tasks
Tracked by #27

State sync from local snapshot: Implementation #29

sergio-mena opened this issue Dec 23, 2022 · 39 comments
Labels
enhancement New feature or request state-sync
Milestone

Comments

@sergio-mena
Copy link
Contributor

sergio-mena commented Dec 23, 2022

Tasks:

DoD:

  • Implementation code complete and all PRs merged
  • Any changes needed to the documentation/spec are merged

Original issue: tendermint/tendermint#9948

@sergio-mena sergio-mena added the backlog A prioritized task in the team's backlog label Dec 23, 2022
@chillyvee
Copy link
Contributor

Happy to collaborate when you are ready. We just updated the code for the latest Juno v13 and have some intermediate feedback based on previous discussions.

@adizere
Copy link
Member

adizere commented Mar 31, 2023

Thank @chillyvee, appreciate it ! We'll get back to you in a few weeks time after we have explored all the existing material in tendermint/tendermint and have clarity over the solution. This work is part of our Q2 priorities.

@adizere
Copy link
Member

adizere commented Apr 18, 2023

Hi @chillyvee would you kindly reconstruct your original PR and submit to this repo as a draft PR preferably, with a base branch v0.34? We've now in the "understand problem & solution" space (#28) so it would be good to see this implementation.

Thanks in advance!

@yihuang
Copy link
Contributor

yihuang commented Apr 18, 2023

Hi @chillyvee would you kindly reconstruct your original PR and submit to this repo as a draft PR preferably, with a base branch v0.34? We've now in the "understand problem & solution" space (#28) so it would be good to see this implementation.

Thanks in advance!

Love this feature, would like to help if needed, BTW, where can I find the ADR083?

From my understanding, state-sync can be decoupled into several components:

  • snapshot discovery and chunks downloading
  • snapshot chunks verification with light client protocol
  • snapshot restoration
  • node startup with the new state.

Ideally we should have offline commands for each steps to provide maximum flexibility.

@cason
Copy link
Contributor

cason commented Apr 18, 2023

I think Adi referred to the open PR in the old repository: tendermint/tendermint#9651

This has become a huge PR due to the changes made in the upstream repository. We are only interested on this file: https://github.com/tendermint/tendermint/pull/9651/files#diff-393baba5ea65c4316e47d1a3715a1c4a8e934c1310071f982d2cf5418ef84d80

@yihuang
Copy link
Contributor

yihuang commented Apr 18, 2023

I think Adi referred to the open PR in the old repository: tendermint/tendermint#9651

This has become a huge PR due to the changes made in the upstream repository. We are only interested on this file: https://github.com/tendermint/tendermint/pull/9651/files#diff-393baba5ea65c4316e47d1a3715a1c4a8e934c1310071f982d2cf5418ef84d80

Thanks.

Allow applications to start with a bootstrapped state alongside an empty Tendermint instance using a subsystem of the state sync protocol.

This is exactly what I want.

@yihuang
Copy link
Contributor

yihuang commented Apr 19, 2023

Just did a squash rebase and a few tweaks to fix building, and opened the PR targeting v0.34.x branch: #728

@adizere
Copy link
Member

adizere commented Apr 19, 2023

Thank you @yihuang. I'll be opening an ADR based on the original one soon.

@adizere
Copy link
Member

adizere commented Apr 19, 2023

Ported the ADR here: #729. It's not yet clear to me if it's reflective of the solution, please have a look.

@chillyvee
Copy link
Contributor

Seeing the activity here. Need to review notes and will be back soon.

@yihuang
Copy link
Contributor

yihuang commented Apr 19, 2023

Seeing the activity here. Need to review notes and will be back soon.

Please feel free to take over the PR when you are back 😄

@cason
Copy link
Contributor

cason commented Apr 19, 2023

Any reason for targeting this change to branch v0.34.x? I would say it would be better and safer to target the main branch, then after due checks, backport it to release branches.

@chillyvee
Copy link
Contributor

Is there a preference for v0.34 or main?

Can probably make a working example for Juno v14 if anyone thinks that is helpful for the "understand" part of the work.

@yihuang
Copy link
Contributor

yihuang commented Apr 20, 2023

Any reason for targeting this change to branch v0.34.x? I would say it would be better and safer to target the main branch, then after due checks, backport it to release branches.

target main branch: #733

@adizere
Copy link
Member

adizere commented Apr 20, 2023

Is there a preference for v0.34 or main?

Can probably make a working example for Juno v14 if anyone thinks that is helpful for the "understand" part of the work.

Excellent idea!

@adizere adizere added this to the 2023-Q2 milestone Apr 21, 2023
@yihuang
Copy link
Contributor

yihuang commented Apr 26, 2023

FYI, after we identified the root cause of the state sync timeout issue and use the workaround: #742, the state sync UX become good again for us.

@adizere
Copy link
Member

adizere commented May 4, 2023

Can probably make a working example for Juno v14 if anyone thinks that is helpful for the "understand" part of the work.

Any progress on this?

@chillyvee
Copy link
Contributor

@adizere Will try to make this a nice weekend project :)

@adizere
Copy link
Member

adizere commented May 5, 2023

@adizere Will try to make this a nice weekend project :)

Hey, it's not urgent. I was asking because I was not able to test it. We spoke among the whole team also about the solution, and there's some concerns that the application state might be easily corruptible, or unsafe to transfer across machines. We are leaning towards a solution that reuses the snapshotting mechanism (instead of app.db).

@chillyvee
Copy link
Contributor

chillyvee commented May 5, 2023

@adizere Assuming statesync works (which it does), then calling statesync with local files rather than the exact same files downloaded over p2p between machines doesn't seem like a way to make the appstate more corruptible.

All the code does is reuse the existing snapshot restore mechanism and reference local application.db. Then finishes up by triggering comet-bft to restore the remaining non-application state.

Essentially the difference is replacing the statesync file downloads via FTP with a USB thumb drive. It's the same code executing.

Do you know what part of using the existing snapshot restore code/other code is the additional concern coming from? Could take your feedback to make extra checks or confirm the risk.

@yihuang
Copy link
Contributor

yihuang commented May 5, 2023

In my test, the current implementation do has some issues with handshake process, if it's too tricky to hook into the handshake process, I'd like to have some low level tools instead, for example let user to download and store the consensus state at an arbitrary height using the light client protocol. So power user can just do the thing in multiple steps manually:

@cason
Copy link
Contributor

cason commented May 8, 2023

@chillyvee Can you take a look on #801?

I think the proposal I am writing there, as ADR 104, is essentially the same as the one you are proposing.

@cason
Copy link
Contributor

cason commented May 8, 2023

@yihuang State sync currently does what you are proposing, with the advantage that it makes sure that the application state matches CometBFT's state (i.e., block store and state store). It does not make much sense to me to bootstrap CometBFT's state to a given height, as you proposed in #803, without having an application at the same height.

In #801 we propose a mechanism to allow what you are proposing as the first bullet, i.e., to restore an application snapshot from the local file system, while bootstrapping CometBFT's state accordingly.

@yihuang
Copy link
Contributor

yihuang commented May 8, 2023

It does not make much sense to me to bootstrap CometBFT's state to a given height, as you proposed in #803, without having an application at the same height.

the point is split up the state-sync process into multiple commands to maximize flexibility, there'll be a separate command to restore the app state from a local snapshot, it could be implemented in cosmos-sdk.

with the advantage that it makes sure that the application state matches CometBFT's state

after offline restoration, node start up normally, the normal start up procedures will check the consistencies anyway.

I think i'm just trying to do the same thing without intrusive changes in the normal node start up logic, just a few offline commands, but I'm open for more integrated solutions as well, as long as it solve the problem.

@yihuang
Copy link
Contributor

yihuang commented May 8, 2023

BTW, the second bullet in your proposal can also be implemented in app side, as demonstrated here

@cason
Copy link
Contributor

cason commented May 8, 2023

the point is split up the state-sync process into multiple commands to maximize flexibility

Yes, I agree with this. Maybe not really CLI commands, but to make the steps more clearly separated on the code, although they are already kind of.

More specifically:

  1. This node startup method configures the Light client provider and calls State Sync
  2. State sync main routine receives a candidate snapshot, uses its height to start the light client, then use the light-client obtained state to verify the installed snapshot
  3. Back to the node startup, the state and commit obtained by the light client are used to bootstrap the state store and the block store accordingly

@cason
Copy link
Contributor

cason commented May 8, 2023

I think i'm just trying to do the same thing without intrusive changes in the normal node start up logic

The ADR 104 proposal (#801) is not intrusive as well. It has the advantage of also considering the server side, namely, providing a command to allow "exporting" an application snapshot to a "file", that will be use to bootstrap another node.

@cason
Copy link
Contributor

cason commented May 8, 2023

BTW, the second bullet in your proposal can also be implemented in app side, as demonstrated here

Yes, this is an alternative for the server side to "export" snapshots.

I am not sure that is the best one, as it requires every snapshot to be saved to disk, even if the snapshots are not shipped out-of-band to fresh clients. Notice that snapshots are already saved somewhere by the application, and this approach will create some data duplication, with the disadvantage that the application knows how and when to prune or delete old snapshots, feature that would not be trivially transferred to this file system "snapshot store".

@yihuang
Copy link
Contributor

yihuang commented May 8, 2023

BTW, the second bullet in your proposal can also be implemented in app side, as demonstrated here

Yes, this is an alternative for the server side to "export" snapshots.

I am not sure that is the best one, as it requires every snapshot to be saved to disk, even if the snapshots are not shipped out-of-band to fresh clients. Notice that snapshots are already saved somewhere by the application, and this approach will create some data duplication, with the disadvantage that the application knows how and when to prune or delete old snapshots, feature that would not be trivially transferred to this file system "snapshot store".

in cosmos-sdk, it'll directly restore the state from snapshot without saving them, it only saves the snapshots created by itself, the new flag is to make it save the snapshot for the first restoration. cometbft do save the chunk to file in a temporary directory though, it's immediately removed after restoration finished.
the saved snapshot is to be used by another cli command to restore app state from it, then user use the bootstrap-state cli command to restore cometbft state, then a normal node start up.

@cason
Copy link
Contributor

cason commented May 8, 2023

in cosmos-sdk

We have to also consider users that do not adopt the SDK. Of course, once we have a more detailed proposal, we would be happy to discuss specifics with the SDK team.

@cason
Copy link
Contributor

cason commented May 8, 2023

Moreover, I would really appreciate your feedback on #801. It is in draft mode, but you can comment or propose changes there.

@yihuang
Copy link
Contributor

yihuang commented May 8, 2023

Moreover, I would really appreciate your feedback on #801. It is in draft mode, but you can comment or propose changes there.

have left review comments there, I think we need more descriptions of the new design to decide if it's a better one.

@yihuang
Copy link
Contributor

yihuang commented May 8, 2023

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

@yihuang
Copy link
Contributor

yihuang commented May 10, 2023

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

#803 is ported to sdk directly, it turns out we can hack tendermint state directly, so the whole local snapshot restoration procedure can be done in app side alone.

@adizere
Copy link
Member

adizere commented May 16, 2023

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

#803 is ported to sdk directly, it turns out we can hack tendermint state directly, so the whole local snapshot restoration procedure can be done in app side alone.

Excellent. Thanks @yihuang for the update!

I don't fully understand the SDK-side solution, but it seems to handle snapshot restoration plus also saving the snapshot (cosmos/cosmos-sdk#16060). What exactly is left on Comet side to do?

@yihuang
Copy link
Contributor

yihuang commented May 16, 2023

But I hope we can merge #803 to unblock app side solutions immediately, then we can take our time to find more ideal design, wdyt?

#803 is ported to sdk directly, it turns out we can hack tendermint state directly, so the whole local snapshot restoration procedure can be done in app side alone.

Excellent. Thanks @yihuang for the update!

I don't fully understand the SDK-side solution, but it seems to handle snapshot restoration plus also saving the snapshot (cosmos/cosmos-sdk#16060). What exactly is left on Comet side to do?

maybe we can do this command more elegantly in cometbft, after all, it use many internal APIs to do the job, those are maybe prone to change in the future.

@thanethomson thanethomson removed the backlog A prioritized task in the team's backlog label May 16, 2023
@cason cason added enhancement New feature or request state-sync labels May 17, 2023
@adizere
Copy link
Member

adizere commented May 22, 2023

maybe we can do this command more elegantly in cometbft, after all, it use many internal APIs to do the job, those are maybe prone to change in the future.

Got it. This is a nice to have, right? Correct me if I'm wrong, it seems like on Comet side there's nothing left to do (except for "nice to have" in the future).

@yihuang
Copy link
Contributor

yihuang commented May 22, 2023

maybe we can do this command more elegantly in cometbft, after all, it use many internal APIs to do the job, those are maybe prone to change in the future.

Got it. This is a nice to have, right? Correct me if I'm wrong, it seems like on Comet side there's nothing left to do (except for "nice to have" in the future).

yeah, nice to have.

@adizere
Copy link
Member

adizere commented May 24, 2023

yeah, nice to have.

Thanks @yihuang. We'll track that work here: #884.

I'm closing the present PR: The original scope and problem we wanted to address here shifted. In the future, we might implement ADR 104 (#801) as a solution for non-SDK users that want support for state sync out of a local snapshot. For the case of SDK users, their needs seem to be addressed.

@adizere adizere closed this as completed May 24, 2023
@adizere adizere closed this as not planned Won't fix, can't repro, duplicate, stale May 24, 2023
Raneet10 pushed a commit to 0xPolygon/cometbft that referenced this issue Nov 20, 2023
…Change

Change the default value of maxInboundPeer and maxOutBoundPeer to 100
cometcrafter pushed a commit to graphprotocol/cometbft that referenced this issue May 1, 2024
…bft#2846… (cometbft#29) (cometbft#33)

* perf(libs/json): Lower heap overhead of JSON encoding (backport cometbft#2846) (cometbft#2876)

---

Many RPC methods require JSON marshalled responses. We saw this taking a
notable amount of heap allocation in query serving full nodes. This PR
removes some extra heap allocations that were being done. We avoided
using the more efficient encoder.Encode before, because it added a
newline. This PR changes the function signature for these private
methods to be using *bytes.Buffer, and then uses the in-buffer methods
(rather than a second copy). We then just truncate the final byte after
each such call, which does not waste any allocations.

I added a benchmark for the most complex test case.

OLD:
```
BenchmarkJsonMarshalStruct-12              78992             15542 ns/op            4487 B/op        191 allocs/op
```
New:
```
BenchmarkJsonMarshalStruct-12              93346             11132 ns/op            3245 B/op         58 allocs/op
```

Roughly a 3-4x reduction in the number of allocations, and 20% speedup.

- [x] Tests written/updated - Existing tests cover this
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#2846 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Andy Nogueira <me@andynogueira.dev>

* changelog

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Andy Nogueira <me@andynogueira.dev>
(cherry picked from commit ee66963)

Co-authored-by: Adam Tucker <adam@osmosis.team>
cometcrafter pushed a commit to graphprotocol/cometbft that referenced this issue May 1, 2024
…bft#2846… (cometbft#29)

* perf(libs/json): Lower heap overhead of JSON encoding (backport cometbft#2846) (cometbft#2876)

---

Many RPC methods require JSON marshalled responses. We saw this taking a
notable amount of heap allocation in query serving full nodes. This PR
removes some extra heap allocations that were being done. We avoided
using the more efficient encoder.Encode before, because it added a
newline. This PR changes the function signature for these private
methods to be using *bytes.Buffer, and then uses the in-buffer methods
(rather than a second copy). We then just truncate the final byte after
each such call, which does not waste any allocations.

I added a benchmark for the most complex test case.

OLD:
```
BenchmarkJsonMarshalStruct-12              78992             15542 ns/op            4487 B/op        191 allocs/op
```
New:
```
BenchmarkJsonMarshalStruct-12              93346             11132 ns/op            3245 B/op         58 allocs/op
```

Roughly a 3-4x reduction in the number of allocations, and 20% speedup.

- [x] Tests written/updated - Existing tests cover this
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#2846 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Andy Nogueira <me@andynogueira.dev>

* changelog

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request state-sync
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

6 participants