State sync from local snapshot: Implementation #29
Happy to collaborate when you are ready. We just updated the code for the latest Juno v13 and have some intermediate feedback based on previous discussions.
Thanks @chillyvee, appreciate it! We'll get back to you in a few weeks' time, after we have explored all the existing material in tendermint/tendermint and have clarity over the solution. This work is part of our Q2 priorities.
Hi @chillyvee, would you kindly reconstruct your original PR and submit it to this repo, preferably as a draft PR with base branch v0.34? We're now in the "understand problem & solution" space (#28), so it would be good to see this implementation. Thanks in advance!
Love this feature and would like to help if needed. BTW, where can I find ADR 083? From my understanding, state sync can be decoupled into several components:
Ideally we should have offline commands for each step to provide maximum flexibility.
I think Adi referred to the open PR in the old repository: tendermint/tendermint#9651. This has become a huge PR due to the changes made in the upstream repository. We are only interested in this file: https://github.com/tendermint/tendermint/pull/9651/files#diff-393baba5ea65c4316e47d1a3715a1c4a8e934c1310071f982d2cf5418ef84d80
Thanks. This is exactly what I want.
Just did a squash rebase and a few tweaks to fix building, and opened the PR targeting
Thank you @yihuang. I'll be opening an ADR based on the original one soon.
Ported the ADR here: #729. It's not yet clear to me if it's reflective of the solution; please have a look.
Seeing the activity here. Need to review notes and will be back soon.
Please feel free to take over the PR when you are back 😄
Any reason for targeting this change to branch
Is there a preference for v0.34 or main? We can probably make a working example for Juno v14 if anyone thinks that would be helpful for the "understand" part of the work.
Targeting the main branch: #733
Excellent idea!
FYI, after we identified the root cause of the state sync timeout issue and applied the workaround (#742), the state sync UX became good again for us.
Any progress on this?
@adizere Will try to make this a nice weekend project :)
Hey, it's not urgent. I was asking because I was not able to test it. We also spoke about the solution among the whole team, and there are some concerns that the application state might be easily corruptible, or unsafe to transfer across machines. We are leaning towards a solution that reuses the snapshotting mechanism (instead of app.db).
@adizere Assuming state sync works (which it does), then calling state sync with local files, rather than the exact same files downloaded over p2p between machines, doesn't seem like a way to make the app state more corruptible. All the code does is reuse the existing snapshot restore mechanism and reference the local application.db. It then finishes up by triggering CometBFT to restore the remaining non-application state. Essentially, the difference is replacing the state sync file downloads via FTP with a USB thumb drive. It's the same code executing.

Do you know which part of reusing the existing snapshot restore code (or other code) the additional concern is coming from? We could take your feedback and add extra checks or confirm the risk.
In my test, the current implementation does have some issues with the handshake process. If it's too tricky to hook into the handshake process, I'd like to have some low-level tools instead, for example letting the user download and store the consensus state at an arbitrary height using the light client protocol. Then power users can just do the whole thing manually, in multiple steps:
@chillyvee Can you take a look at #801? I think the proposal I am writing there, as ADR 104, is essentially the same as the one you are proposing.
@yihuang State sync currently does what you are proposing, with the advantage that it makes sure that the application state matches CometBFT's state (i.e., block store and state store). It does not make much sense to me to bootstrap CometBFT's state to a given height, as you proposed in #803, without having an application at the same height. In #801 we propose a mechanism to allow what you are proposing as the first bullet, i.e., to restore an application snapshot from the local file system, while bootstrapping CometBFT's state accordingly.
The point is to split up the state-sync process into multiple commands to maximize flexibility. There would be a separate command to restore the app state from a local snapshot, which could be implemented in cosmos-sdk.
After offline restoration, the node starts up normally, and the normal start-up procedures will check the consistency anyway. I think I'm just trying to do the same thing without intrusive changes to the normal node start-up logic, just a few offline commands, but I'm open to more integrated solutions as well, as long as they solve the problem.
BTW, the second bullet in your proposal can also be implemented on the app side, as demonstrated here.
Yes, I agree with this. Maybe not really CLI commands, but making the steps more clearly separated in the code, although they already kind of are. More specifically:
The ADR 104 proposal (#801) is not intrusive either. It has the advantage of also considering the server side, namely, providing a command to allow "exporting" an application snapshot to a "file" that will be used to bootstrap another node.
Yes, this is an alternative for the server side to "export" snapshots. I am not sure that it is the best one, as it requires every snapshot to be saved to disk, even if the snapshots are never shipped out-of-band to fresh clients. Notice that snapshots are already saved somewhere by the application, so this approach will create some data duplication, with the additional complication that the application knows how and when to prune or delete old snapshots, a feature that would not be trivially transferred to this file-system "snapshot store".
In cosmos-sdk, it'll directly restore the state from the snapshot without saving it; it only saves the snapshots created by itself. The new flag makes it save the snapshot from the first restoration. CometBFT does save the chunks to files in a temporary directory, though it is immediately removed after the restoration finishes.
We have to also consider users that do not adopt the SDK. Of course, once we have a more detailed proposal, we would be happy to discuss specifics with the SDK team.
Moreover, I would really appreciate your feedback on #801. It is in draft mode, but you can comment or propose changes there.
I have left review comments there. I think we need more description of the new design to decide whether it's a better one.
But I hope we can merge #803 to unblock app-side solutions immediately; then we can take our time to find a more ideal design, wdyt?
Excellent. Thanks @yihuang for the update! I don't fully understand the SDK-side solution, but it seems to handle snapshot restoration as well as saving the snapshot (cosmos/cosmos-sdk#16060). What exactly is left to do on the Comet side?
Maybe we can implement this command more elegantly in CometBFT; after all, it uses many internal APIs to do the job, and those may be prone to change in the future.
Got it. This is a nice-to-have, right? Correct me if I'm wrong, but it seems like there's nothing left to do on the Comet side (except for the "nice-to-have" in the future).
Yeah, nice to have.
Thanks @yihuang. We'll track that work here: #884. I'm closing the present PR: the original scope and problem we wanted to address here have shifted. In the future, we might implement ADR 104 (#801) as a solution for non-SDK users who want support for state sync from a local snapshot. For the case of SDK users, their needs seem to be addressed.
perf(libs/json): Lower heap overhead of JSON encoding (backport cometbft#2846) (cometbft#2876) (cometbft#29) (cometbft#33)

Many RPC methods require JSON-marshalled responses. We saw this taking a notable amount of heap allocation in query-serving full nodes. This PR removes some extra heap allocations that were being done. We avoided using the more efficient `encoder.Encode` before, because it added a newline. This PR changes the function signature for these private methods to use `*bytes.Buffer`, and then uses the in-buffer methods (rather than a second copy). We then just truncate the final byte after each such call, which does not waste any allocations.

I added a benchmark for the most complex test case.

OLD:

```
BenchmarkJsonMarshalStruct-12    78992    15542 ns/op    4487 B/op    191 allocs/op
```

New:

```
BenchmarkJsonMarshalStruct-12    93346    11132 ns/op    3245 B/op    58 allocs/op
```

Roughly a 3-4x reduction in the number of allocations, and a 20% speedup.

- [x] Tests written/updated - existing tests cover this
- [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

This is an automatic backport of pull request cometbft#2846 done by [Mergify](https://mergify.com).

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Andy Nogueira <me@andynogueira.dev>
(cherry picked from commit ee66963)
Co-authored-by: Adam Tucker <adam@osmosis.team>
Tasks:
DoD:
Original issue: tendermint/tendermint#9948