Fast PG catchup #1629
Retrying a failed/canceled catchup from the point where it was interrupted is very important; we could avoid retransmitting data.
x4m pushed a commit to x4m/wal-g that referenced this issue on Mar 4, 2024:
Per design docs in wal-g#1629, but with significant changes (2nd approach). This PR allows using catchup without pushing to storage.
reshke pushed a commit that referenced this issue on Mar 31, 2024:
* Catchup-send and catchup-receive commands. Per design docs in #1629, but with significant changes (2nd approach). This PR allows using catchup without pushing to storage.
* Remove debug stuff
* Some error handling
* Fix review issues
* Minor refactoring
* Remove unnecessary files
* Refactor sending file
* Typo
* Compression implementation
* Calm golint
* Fix unit test
* Fix one more test
* Refactor
* Encryption
* Enable diff back
* Refactor a bit
* Formatting
* Minimal docs

Co-authored-by: Andrey M. Borodin <x4mmm@night.local>
Co-authored-by: Andrey M. Borodin <x4mmm@172.25.72.30-ekb.dhcp.yndx.net>
In this issue, I want to suggest some enhancements to the existing PG catchup feature.
Problem Statement
If you run a streaming replication cluster, standby servers may lag behind the primary. When that lag is on the order of days, a reasonable solution is to recreate the standby from a fresh backup.
However, retrieving a whole backup can take a long time, especially in heavily loaded clusters. For this situation, WAL-G provides PG catchup: a special type of backup that can be performed on top of an existing stopped standby server.

WAL-G bridges two distinct worlds: the currently running Postgres instance and the storage. The catchup backup must be uploaded to storage and then fetched by the other instance, so catchup via storage doubles the number of bytes that need to be transferred, making the process roughly twice as slow. Speed is the whole point of catchup, so we need a faster variant that does not push the backup to storage at all but instead streams it directly to the standby node.
Here are my proposed solutions:
Proposed Solution 1: Creating a Special Storage for Catchup
To avoid pushing the backup to regular storage, we can introduce a special storage type for catchup called CatchupStorage (CS). CS has several settings, including the number of concurrent connections (concurrency) and the standby hostname and port.

When wal-g catchup-fetch is run against CS, it knows the concurrency setting, so it can return a list of expected tars to CS. When configured, CS opens a port and accepts incoming connections from wal-g catchup-push. It implements ObjectPut() as the outgoing connection, transferring the name and contents of the object. When catchup wants to download a file from CS, it accepts all incoming connections and reads the object names from them; ObjectGet() then picks the connection corresponding to the requested object name. Data transfer between the CS source and destination is encrypted using the existing tar encryption method; tar names, however, are hardcoded and do not need encryption.

The implementation of this solution is relatively simple, but its overall design is complex.
Proposed Solution 2: Implementing New Commands for Catchup Transfer
We can implement new commands that perform catchup without using storage at all: wal-g catchup-send and wal-g catchup-receive. These commands perform the same function as the existing catchup, but they bypass storage entirely.

Open questions
What if catchup was not applied successfully? Can we retry this operation?
Can we have only one TCP connection between Primary and Standby? I think we need many connections to utilize the network efficiently, but one of the objectives might be avoiding Primary starvation, so a single connection may be reasonable too.
If we use many TCP connections, we must ensure that the data from all streams was actually read.
@usernamedt What do you think? Which path to take?