Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] Proof-of-concept fix for missing .partial WALs on recovery #629

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

mikewallace1979
Copy link
Contributor

@mikewallace1979 mikewallace1979 commented Jul 27, 2022

Updated with a different approach where .partial files are handled after copying the WALs in the archive.

There are a few upsides to doing this:

  • We no longer need special cases in the code branch for compressed WALs.
  • We can use Rsync.from_file_list to copy the .partial files when WALs are uncompressed, since all .partial files now get copied and renamed in a temporary staging directory.

Including the .partial WAL files in the get_required_xlog_files response seems a bit odd since one of the first things we do in _xlog_copy is put them in a separate list so they can be handled after the WALs in the archive are copied, however it feels like the right thing to do given that the .partial file is a required xlog file (unless the requested recovery timeline or time dictate otherwise).

@edb-sonar-app
Copy link

edb-sonar-app bot commented Jul 27, 2022

@mikewallace1979
Copy link
Contributor Author

Having talked this through with @amenonsen the assumption about the .partial file being uncompressed is reasonable.

Compression in the .partial file can happen in two places:

  1. PostgreSQL wal_compression - if this is set then pages within the WAL will be compressed, however the file itself is not a compressed file in any sense that external tools need to care.
  2. pg_receivewal compression - this compresses the .partial file but applies a .gz.partial suffix:
	/* File looks like a partial gzip-compressed WAL file */
	if (fname_len == XLOG_FNAME_LEN + strlen(".gz.partial") &&
		strcmp(filename + XLOG_FNAME_LEN, ".gz.partial") == 0)
	{
		*ispartial = true;
		*wal_compression_method = COMPRESSION_GZIP;
		return true;
	}

Given:

  1. Barman runs pg_receivewal itself and we do not support it being run out-of-band writing files into the streaming directory.
  2. Barman does not use pg_receivewal's compression feature.
  3. Barman does not consider a .gz.partial file to be a WAL.

Any .partial file we encounter in Barman at this point is therefore going to be uncompressed (at the file level).

This is a first pass at including .partial files in the streaming
directory when copying WALs on recovery.

The implementation works by appending any .partial files in the
streaming directory during `Server.get_required_xlog_files` and then
handling `.partial` files as a special case during
`RecoveryExecutor._xlog_copy`.

It works, but I don't like it for a number of reasons:

1. The special cases to handle .partial files during _xlog_copy are
   awkward and does not make for readable code.
2. The assumption that `.partial` files will always have no compression
   is questionable.

Alternatives:

1. Make WalFileInfo handle `.partial` details transparently.
2. Don't use `get_required_xlog_files` at all and instead just have a
   "and check for .partial files" somewhere in `_xlog_copy`.
3. Flip things around so instead of finding `.partial` files in the
   `streaming` directory and seeing if they are needed, we figure out
   from the most recent required xlog what the next xlog will be given
   the recovery target and use some of the `get-wal` code to look for
   that specific file.

A couple of other things to consider:

1. Should we also check `incoming` for WAL files? If we want parity
   with `barman get-wal` then we should also check `incoming`.
2. Should we also include non-partial WALs in `streaming` (and
   `incoming`) in the copy?
Instead of attempting to copy .partial files during the same
`Rsync.from_file_list` call, we instead treat it as a separate copy.

Offers a number of improvements over the previous attempt:

1. `.partial` files are now also copied when WALs are not compressed.
2. The `.partial` file is not copied if it is not required due to
   the requested recovery time being met by existing WALs in the archive.

On the downside, the sonarqube cognitive complexity metrics are going
to hate this. There are a few opportunities for extracting bit of code
but both `get_required_xlog_files` and `_xlog_copy` are already quite
long and complex functions so it's unlikely we'll be able to make
sonarqube completely happy.
@mikewallace1979 mikewallace1979 force-pushed the 393-copy-partial-wal-files-during-recovery branch from 957eb20 to 6df3178 Compare October 6, 2022 15:24
… one

This "should never happen" however if `get_required_xlog_files` ever gets run
outside of the `barman recover` context it is possible that there are older
`.partial` files which haven't been dealt with by the archiver process yet.

In such cases we should make the same decision the archiver would and use the
latest file.
Updates _xlog_copy so that special handling is only used for `.partial`
files in the streaming directory. Any `.partial` files found in the
WAL archive are handled the same way Barman would previously handle
them, i.e. they are copied across to the recovery WAL destination
(decompressing if necessary) and *not* renamed. Because this is a
situation which requires some manual intervention a warning message is
now logged to tell the user about each `.partial` WAL found in the
archive.

The user must then determine whether the `.partial` WAL in the archive
is the result of a recovery from backup or a standby promotion and, if
so, can rename the file to omit its `.partial` extension.

There is a clear opportunity for Barman to handle this itself however
this is deliberately not tackled by this patch and will instead be
added at a later date when we improve the behaviour of Barman in a
clustered environment.

Relates to #393.
@mikewallace1979 mikewallace1979 force-pushed the 393-copy-partial-wal-files-during-recovery branch from dc650f1 to 79472a7 Compare October 10, 2022 20:20
@edb-sonar-app
Copy link

edb-sonar-app bot commented Oct 10, 2022

@martinmarques
Copy link
Contributor

@mikewallace1979 do you think this is something we would like to keep pursuing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants