[Bug]: Recovery from WAL fails when tablespace was added #4407

Open · wvengen opened this issue Apr 30, 2024 · 1 comment
Labels: documentation 📖 Improvements or additions to documentation · enhancement 🪄 New feature or request

wvengen commented Apr 30, 2024

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG and believe this is a new bug.

Contact Details

(I'll watch this issue)

Version

1.23.0

What version of Kubernetes are you using?

1.26

What is your Kubernetes environment?

Cloud: Other (provider using Gardener on OpenStack)

How did you install the operator?

Helm

What happened?

  1. Start with a cluster that has backups configured (base + WAL) and make sure a base backup is present.
  2. Add a (temporary) tablespace by updating the Cluster resource, and make sure the subsequent WALs are backed up (a sketch of this change is shown after this list). I also deleted the cluster afterwards.
  3. Create a new cluster by bootstrapping from recovery, using the backup.
  4. The recovery process fails with a fatal error on the redo of the Tablespace/CREATE record.
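
For reference, a minimal sketch of the change from step 2, i.e. the tablespace block added to the existing Cluster spec (the name and size match the commented-out section in the full resource below):

spec:
  tablespaces:
    - name: temporary
      storage:
        size: 40Gi
      temporary: true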

I was able to recover to a point just before the affected WAL record without data loss, but I can imagine that this could cause unexpected downtime during recovery in certain situations, hence this report.
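
A minimal sketch of how such a recovery could be targeted, assuming a point-in-time recovery target in the bootstrap section (the timestamp is illustrative; targetLSN would be an alternative, using the LSN shown in the log output below):

bootstrap:
  recovery:
    source: clusterBackup
    recoveryTarget:
      # illustrative value: any point before the Tablespace/CREATE redo record
      targetTime: "2024-04-29 20:00:00+00"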

At the very least, I would recommend mentioning something in the tablespaces limitations, e.g. advising to create a base backup right after changing tablespaces (which I assume would work fine) and/or noting that changing the tablespaces of an existing cluster can introduce recovery issues.
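
For example, an on-demand base backup taken right after the tablespace change could look like the sketch below (the resource name is illustrative):

apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: db2-after-tablespace-change
spec:
  cluster:
    name: db2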

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: db2
  labels:
    cnpg.io/reload: 'true'
spec:
  instances: 1
  imageName: ghcr.io/cloudnative-pg/postgresql:16.2

  storage:
    size: 80Gi

  # tried restoring both with and without this tablespace
  # in both cases, restore failed because the tablespace was not set up
  # ideally, I would be able to restore with a different temporary tablespace setup
  #tablespaces:
  #  - name: temporary
  #    storage:
  #      size: 40Gi
  #    temporary: true

  env:
    - name: PSQL_HISTORY
      value: /run/psql_history

  bootstrap:
    recovery:
      source: clusterBackup
      database: db
      owner: db
      secret:
        name: db-secret

  resources:
    limits:
      memory: 6000Mi
    requests:
      memory: 3000Mi

  postgresql:
    parameters:
      checkpoint_completion_target: "0.9"
      default_statistics_target: "100"
      effective_cache_size: 4608MB
      effective_io_concurrency: "300"
      maintenance_work_mem: 384MB
      max_connections: "20"
      max_standby_streaming_delay: "300000"
      max_wal_size: 4GB
      min_wal_size: 1GB
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      random_page_cost: "1.1"
      shared_buffers: 1536MB
      timezone: Europe/Amsterdam
      wal_buffers: 16MB
      work_mem: 19660kB

  externalClusters:
    - name: clusterBackup
      barmanObjectStore:
        destinationPath: "s3://dbbackup/pg16"
        endpointURL: "https://s3.example.com"
        serverName: db
        s3Credentials:
          accessKeyId:
            name: db-s3-secret
            key: key
          secretAccessKey:
            name: db-s3-secret
            key: secret
        wal:
          maxParallel: 4

Relevant log output

This is the log output from when the tablespace was not present in the cluster definition. When the tablespace was present, restore instead failed with the error "PGData already exists, can't overwrite" (rather than the missing-directory error shown below).

...
{"level":"info","ts":"2024-04-29T21:28:43Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:43.892 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18117","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000AE\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:44Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:44.368 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18118","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000AF\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:44Z","msg":"Checking if the server is still in recovery","logging_pod":"db2-1-full-recovery","recovery":true}
{"level":"info","ts":"2024-04-29T21:28:44Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:44.824 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18119","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000B0\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.296 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18120","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000B1\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.871 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18121","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"restored log file \"0000000C00000277000000B2\" from archive","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.895 UTC","process_id":"37","session_id":"662fe84e.25","session_line_num":"18122","session_start_time":"2024-04-29 18:34:54 UTC","virtual_transaction_id":"1/0","transaction_id":"0","error_severity":"FATAL","sql_state_code":"58P01","message":"directory \"/var/lib/postgresql/tablespaces/temporary/data\" does not exist","hint":"Create this directory for the tablespace before restarting the server.","context":"WAL redo at 277/AE0031F0 for Tablespace/CREATE: 3383535 \"/var/lib/postgresql/tablespaces/temporary/data\"","backend_type":"startup","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.953 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"6","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"startup process (PID 37) exited with exit code 1","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:45Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:45.953 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"7","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"terminating any other active server processes","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:46Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:46.141 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"8","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"shutting down due to startup process failure","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:46Z","logger":"postgres","msg":"record","logging_pod":"db2-1-full-recovery","record":{"log_time":"2024-04-29 21:28:46.310 UTC","process_id":"33","session_id":"662fe84e.21","session_line_num":"9","session_start_time":"2024-04-29 18:34:54 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"database system is shut down","backend_type":"postmaster","query_id":"0"}}
{"level":"info","ts":"2024-04-29T21:28:49Z","logger":"pg_ctl","msg":"pg_ctl: no server running\n","pipe":"stdout","logging_pod":"db2-1-full-recovery"}
{"level":"info","ts":"2024-04-29T21:28:49Z","msg":"Error while deactivating instance","logging_pod":"db2-1-full-recovery","err":"instance is not running"}
{"level":"info","ts":"2024-04-29T21:28:49Z","msg":"Exited log pipe","fileName":"/controller/log/postgres.csv","logging_pod":"db2-1-full-recovery"}
{"level":"error","ts":"2024-04-29T21:28:49Z","msg":"Error while restoring a backup","logging_pod":"db2-1-full-recovery","error":"while waiting for PostgreSQL to stop recovery mode: error while reading results of pg_is_in_recovery: failed to connect to `host=/controller/run user=postgres database=postgres`: dial error (dial unix /controller/run/.s.PGSQL.5432: connect: no such file or directory)","stacktrace":"github.com/cloudnative-pg/cloudnative-pg/pkg/management/log.(*logger).Error\n\tpkg/management/log/log.go:125\ngithub.com/cloudnative-pg/cloudnative-pg/pkg/management/log.Error\n\tpkg/management/log/log.go:163\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/instance/restore.restoreSubCommand\n\tinternal/cmd/manager/instance/restore/cmd.go:92\ngithub.com/cloudnative-pg/cloudnative-pg/internal/cmd/manager/instance/restore.NewCmd.func2\n\tinternal/cmd/manager/instance/restore/cmd.go:63\ngithub.com/spf13/cobra.(*Command).execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\tpkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tcmd/manager/main.go:66\nruntime.main\n\t/opt/hostedtoolcache/go/1.22.2/x64/src/runtime/proc.go:271"}

Code of Conduct

  • I agree to follow this project's Code of Conduct
wvengen added the triage (Pending triage) label on Apr 30, 2024
gbartolini (Contributor) commented:

Hi @wvengen,

You are right. This is something that needs to be documented, at the very least. IMO, if you set the 'future' state (i.e. with the tablespace declared), we should be able to support it.
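
(A sketch of what declaring the 'future' state in the recovery Cluster would mean, i.e. the tablespace from the original cluster declared alongside the recovery bootstrap; as reported above, this currently fails with "PGData already exists, can't overwrite":)

bootstrap:
  recovery:
    source: clusterBackup
tablespaces:
  - name: temporary
    storage:
      size: 40Gi
    temporary: true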

@gbartolini gbartolini added documentation 📖 Improvements or additions to documentation bug 🐛 Something isn't working enhancement 🪄 New feature or request and removed triage Pending triage bug 🐛 Something isn't working labels May 1, 2024