[Bug]: Recovery from WAL fails when tablespace was added #4407

Open · wvengen opened this issue Apr 30, 2024 · 1 comment
Labels: documentation 📖 Improvements or additions to documentation · enhancement 🪄 New feature or request

wvengen commented Apr 30, 2024

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG and believe this is a new bug.

Contact Details

(I'll watch this issue)

Version

1.23.0

What version of Kubernetes are you using?

1.26

What is your Kubernetes environment?

Cloud: Other (provider using Gardener on OpenStack)

How did you install the operator?

Helm

What happened?

  1. Start with a cluster that has backups configured (base + WAL) and make sure a base backup is present.
  2. Add a (temporary) tablespace by updating the Cluster resource, and make sure the subsequent WALs are backed up (a sketch of this change is shown after this list). I also deleted the cluster afterwards.
  3. Create a new cluster by bootstrapping from recovery, using the backup.
  4. The recovery process fails with a fatal error on the redo of the Tablespace/CREATE record.
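
For reference, a minimal sketch of the change from step 2, i.e. the tablespace block added to the existing Cluster spec (the name and size match the commented-out section in the full resource below):

spec:
  tablespaces:
    - name: temporary
      storage:
        size: 40Gi
      temporary: true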

I was able to recover to a point just before the affected WAL record without data loss, but I can imagine that this could cause unexpected downtime during recovery in certain situations, hence this report.
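
A minimal sketch of how such a recovery could be targeted, assuming a point-in-time recovery target in the bootstrap section (the timestamp is illustrative; targetLSN would be an alternative, using the LSN shown in the log output below):

bootstrap:
  recovery:
    source: clusterBackup
    recoveryTarget:
      # illustrative value: any point before the Tablespace/CREATE redo record
      targetTime: "2024-04-29 20:00:00+00"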

At the very least, I would recommend mentioning something in the tablespaces limitations, e.g. advising to create a base backup right after changing tablespaces (which I assume would work fine) and/or noting that changing the tablespaces of an existing cluster can introduce recovery issues.
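
For example, an on-demand base backup taken right after the tablespace change could look like the sketch below (the resource name is illustrative):

apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: db2-after-tablespace-change
spec:
  cluster:
    name: db2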

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: db2
  labels:
    cnpg.io/reload: 'true'
spec:
  instances: 1
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2

  storage:
    size: 80Gi

  # tried restoring both with and without this tablespace
  # in both cases, restore failed because the tablespace was not set up
  # ideally, I would be able to restore with a different temporary tablespace setup
  #tablespaces:
  #  - name: temporary
  #    storage:
  #      size: 40Gi
  #    temporary: true

  env:
    - name: PSQL_HISTORY
      value: /run/psql_history

  bootstrap:
    recovery:
      source: clusterBackup
      database: db
      owner: db
      secret:
        name: db-secret

  resources:
    limits:
      memory: 6000Mi
    requests:
      memory: 3000Mi

  postgresql:
    parameters:
      checkpoint_completion_target: "0.9"
      default_statistics_target: "100"
      effective_cache_size: 4608MB
      effective_io_concurrency: "300"
      maintenance_work_mem: 384MB
      max_connections: "20"
      max_standby_streaming_delay: "300000"
      max_wal_size: 4GB
      min_wal_size: 1GB
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      random_page_cost: "1.1"
      shared_buffers: 1536MB
      timezone: Europe/Amsterdam
      wal_buffers: 16MB
      work_mem: 19660kB

  externalClusters:
    - name: clusterBackup
      barmanObjectStore:
        destinationPath: "s3://dbbackup/pg16"
        endpointURL: "https://s3.example.com"
        serverName: db
        s3Credentials:
          accessKeyId:
            name: db-s3-secret
            key: key
          secretAccessKey:
            name: db-s3-secret
            key: secret
        wal:
          maxParallel: 4

Relevant log output

This is the log output from when the tablespace was not present in the cluster definition. When the tablespace was present, restore instead failed with the error "PGData already exists, can't overwrite" (rather than the missing-directory error shown below).

...
{"level":"info","ts":"2024-04-29T21:28:43Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:43.892 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18117","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000AE\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:44Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:44.368 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18118","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000AF\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:44Z","msg":"Checking if the server is still in recovery","logging_pod":"db2-1-full-recovery","recovery":true}
{"level":"info","ts":"2024-04-29T21:28:44Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:44.824 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18119","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000B0\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.296 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18120","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000B1\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.871 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18121","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000B2\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.895 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18122","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"FATAL","sql_state_code":"58P01","message":"directory \"/var/lib/postgresql/tablespaces/temporary/data\" does not exist","hint":"Create this directory for the tablespace before restarting the server.","context":"WAL redo at 277/AE0031F0 for Tablespace/CREATE: 3383535 \"/var/lib/postgresql/tablespaces/temporary/data\"","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.953 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"6","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"startup process (PID 37) exited with exit code 1","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.953 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"7","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"terminating any other active server processes","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:46Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:46.141 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"8","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"shutting down due to startup process failure","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:46Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:46.310 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"9","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system is shut down","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:49Z","logger":"pg_ctl","msg":"pg_ctl: no server running\n","pipe":"stdout","logging_pod":"db2-1-full-recovery"}
{"level":"info","ts":"2024-04-29T21:28:49Z","msg":"Error while deactivating instance","logging_pod":"db2-1-full-recovery","err":"instance is not running"}
{"level":"info","ts":"2024-04-29T21:28:49Z","msg":"Exited log pipe","fileName":"/controller/log/postgres.csv","logging_pod":"db2-1-full-recovery"}
{"level":"error","ts":"2024-04-29T21:28:49Z","msg":"Error while restoring a backup","logging_pod":"db2-1-full-recovery","error":"while waiting for PostgreSQL to stop recovery mode: error while reading results of pg_is_in_recovery: failed to connect to `host=/controller/run user=postgres database=postgres`: dial error (dial unix /controller/run/.s.PGSQL.5432: connect: no such file or directory)","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/log.Error\n\tpkg/management/log/log.go:163\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/instance/restore.restoreSubCommand\n\tinternal/cmd/manager/instance/restore/cmd.go:92\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/instance/restore.NewCmd.func2\n\tinternal/cmd/manager/instance/restore/cmd.go:63\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}

Code of Conduct

  • I agree to follow this project's Code of Conduct
wvengen added the triage (Pending triage) label on Apr 30, 2024
gbartolini (Contributor) commented:

Hi @wvengen,

You are right. This is something that needs to be documented, at the very least. IMO, if you set the 'future' state (i.e. with the tablespace declared), we should be able to support it.
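
(A sketch of what declaring the 'future' state in the recovery Cluster would mean, i.e. the tablespace from the original cluster declared alongside the recovery bootstrap; as reported above, this currently fails with "PGData already exists, can't overwrite":)

bootstrap:
  recovery:
    source: clusterBackup
tablespaces:
  - name: temporary
    storage:
      size: 40Gi
    temporary: true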

@gbartolini gbartolini added documentation 📖 Improvements or additions to documentation bug 🐛 Something isn't working enhancement 🪄 New feature or request and removed triage Pending triage bug 🐛 Something isn't working labels May 1, 2024