feat: prevent failovers when disk space is exhausted #4404

leonardoce · 2024-04-29T15:46:43Z

PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files.

The operator did not recognize this condition and, because the primary had failed, performed a failover to the most advanced replica. A failover does not fix the underlying issue.

Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.

This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator does not trigger a failover and instead sets a phase describing the situation.

After the PVCs are resized, the cluster will resume working correctly.
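
The operator-side behavior described above can be sketched as a small decision function. This is a hedged illustration only: the type names, the phase strings, and the exit code value are assumptions made for the example and are not taken from the operator's code.

```go
package main

import "fmt"

// exitCodeNoWALSpace is a hypothetical exit code reported by the
// instance manager when the WAL volume is full (assumed value).
const exitCodeNoWALSpace = 4

// Illustrative phase names; the real phase strings may differ.
const (
	phaseHealthy            = "healthy"
	phaseFailingOver        = "failing over"
	phaseNotEnoughDiskSpace = "not enough disk space"
)

// primaryReport is a hypothetical summary of what the instance
// manager reports about the primary.
type primaryReport struct {
	Running      bool
	LastExitCode int
}

// action is what the operator decides to do for this reconciliation.
type action struct {
	Failover bool
	Phase    string
}

// reconcilePrimary skips the failover when the primary stopped because
// the disk is full: promoting a replica would not free any space.
func reconcilePrimary(r primaryReport) action {
	switch {
	case r.Running:
		return action{Phase: phaseHealthy}
	case r.LastExitCode == exitCodeNoWALSpace:
		return action{Phase: phaseNotEnoughDiskSpace}
	default:
		return action{Failover: true, Phase: phaseFailingOver}
	}
}

func main() {
	// Disk full: no failover, a descriptive phase is set instead.
	fmt.Printf("%+v\n", reconcilePrimary(primaryReport{LastExitCode: exitCodeNoWALSpace}))
	// After the PVC resize the primary starts again and the phase clears.
	fmt.Printf("%+v\n", reconcilePrimary(primaryReport{Running: true}))
}
```

In this sketch the recovery path needs no special handling: once the volume is resized and the primary starts, the next report shows it running and the regular phase is restored.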

Closes: #4521

github-actions · 2024-04-29T15:46:58Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

leonardoce · 2024-05-13T10:10:05Z

I tested this using Longhorn in a Fedora VM, but any storage enforcing the PV capacity will do the trick.

To test the patch, you need to exhaust your WAL storage. To keep things easy, I used:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 1

  storage:
    size: 256Mi

And then:

CREATE TABLE storage_area (t text);

-- repeat the following query 20-30 times (you need to be fast!)
INSERT INTO storage_area (t) (select repeat('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do', 5*1024*1024));

With the default WAL settings, you'll run out of WAL disk space before you run out of space for PGDATA.

armru · 2024-05-16T10:13:57Z

/test limit=local

github-actions · 2024-05-16T10:14:12Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9110497781

jsilvela

About to start adding documentation and going over the E2E, but left a few comments on the implementation bits.
IMO the "WALDisk" nomenclature could get confusing as it seems to imply there is a separate WAL volume, which may or may not be the case.

internal/cmd/manager/instance/run/lifecycle/run.go

pkg/fileutils/directory.go

pkg/management/postgres/instance.go

pkg/utils/fencing.go

controllers/cluster_controller.go

jsilvela

I still think it's worth renaming the ensureSufficientDiskSpace method, but otherwise give this an enthusiastic 👍

docs/src/instance_manager.md

docs/src/troubleshooting.md

leonardoce · 2024-05-29T10:23:22Z

https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/9284417523

leonardoce · 2024-05-29T11:43:58Z

E2e tests are green!!!!

leonardoce · 2024-05-29T13:01:09Z

I started an entire E2e run, both local and on cloud: https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/9286574473

PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files.

The operator was not recognizing this condition and, since the primary failed, was performing a failover to the most advanced replica. This action will not fix the underlying issue.

Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.

This patch makes the instance manager recognize this condition and report it back to the operator. Upon detecting it, the operator will fence the primary instance and set a phase describing the situation. Since the primary instance is fenced, no failovers will be done.

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>

github-actions bot added the backport-requested ◀️, release-1.21, release-1.22, and release-1.23 labels Apr 29, 2024

leonardoce force-pushed the dev/space branch 3 times, most recently from 4012706 to 126566d Compare May 13, 2024 09:54

leonardoce marked this pull request as ready for review May 13, 2024 10:02

leonardoce requested a review from a team as a code owner May 13, 2024 10:02

gbartolini reviewed May 13, 2024

controllers/cluster_controller.go (outdated, resolved)

gbartolini reviewed May 13, 2024

pkg/management/postgres/instance.go (outdated, resolved)

armru reviewed May 15, 2024

controllers/cluster_controller.go (outdated, resolved)

armru force-pushed the dev/space branch 2 times, most recently from 9b2cea6 to e625d8f Compare May 15, 2024 15:25

armru requested review from jsilvela, NiccoloFei and litaocdl as code owners May 16, 2024 10:04

github-actions bot added the ok to merge 👌 label May 16, 2024

armru approved these changes May 20, 2024

jsilvela reviewed May 20, 2024

controllers/cluster_controller.go (outdated, resolved)

leonardoce force-pushed the dev/space branch 2 times, most recently from 172cfe7 to 3cdc43a Compare May 21, 2024 09:47

jsilvela approved these changes May 21, 2024

jsilvela reviewed May 21, 2024

docs/src/instance_manager.md (outdated, resolved)

jsilvela reviewed May 21, 2024

docs/src/troubleshooting.md (outdated, resolved)

leonardoce force-pushed the dev/space branch from ac97f35 to 347c98a Compare May 29, 2024 10:12

leonardoce changed the title ~~feat: automatic fencing for instances with exhausted disk space~~ feat: prevent failovers when disk space is exhausted May 29, 2024

leonardoce force-pushed the dev/space branch 2 times, most recently from b10e33f to 8b3ab1f Compare May 29, 2024 12:59

Leonardo Cecchi and others added 20 commits June 3, 2024 18:05

test: out of disk space recovery scenario

ad27d50

Add an e2e to test the recovery in case a primary runs out of disk space. Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com> Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>

review: bulk fencing and noWalDiskSpace status

582cb0e

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

chore: more structured approach to size probing

b779df3

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

chore: rename size_probe -> directory

39e1156

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

docs: add top-level documentation

2357052

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>

docs: commas

2773708

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>

chore: fix grammar in pkg/fileutils/directory.go

841d807

Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com> Signed-off-by: Leonardo Cecchi <leonardo.cecchi@gmail.com>

chore: fix grammar in pkg/fileutils/directory.go

7c32ed6

Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com> Signed-off-by: Leonardo Cecchi <leonardo.cecchi@gmail.com>

chore: fix pkg/utils/fencing.go

0c321a8

Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com> Signed-off-by: Leonardo Cecchi <leonardo.cecchi@gmail.com>

chore: address Gabriele's comments

b92399e

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

chore: address Jaime's comments

37be53b

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

chore: improve naming

a269716

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

review: clarify documentation

b26326d

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>

Update docs/src/instance_manager.md

26e0226

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>

Update docs/src/troubleshooting.md

fabd652

Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>

chore: directory vs diskprobe

4eb1462

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

chore: rename ensureSufficientDiskSpace to ensureNoFailoverOnFullDisk to

1ea4895

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

docs: cosmetic changes

dd60750

Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>

feat: implementation using exit codes and no fencing

8b809a8

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>

fcanovai force-pushed the dev/space branch from 8b3ab1f to 8b809a8 Compare June 3, 2024 16:05

fcanovai added 2 commits June 3, 2024 18:05

fix: reduce required space to a single wal

a9760c7

docs: improve documentation

20868d7
