feat: prevent failovers when disk space is exhausted #4404
❗ By default, the pull request is configured to backport to all release branches.
I tested this using Longhorn in a Fedora VM, but any storage that enforces the PV capacity will do the trick. To test the patch, you need to exhaust your WAL storage. To keep things easy, I used:

    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: cluster-example
    spec:
      instances: 1
      storage:
        size: 256Mi

And then:

    CREATE TABLE storage_area (t text);
    -- repeat the following query 20-30 times (you need to be fast!)
    INSERT INTO storage_area (t) (select repeat('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do', 5*1024*1024));

With the predefined WAL settings, you'll run out of WAL disk space before you run out of space for PGDATA.
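To confirm that the volume is actually close to exhaustion while running the test, a small free-space check can help. The following is a minimal sketch, not the code from this PR: `hasRoomForWAL` and `availableBytes` are hypothetical helper names, the check relies on `syscall.Statfs` (so it assumes a Linux-like system), and it only assumes PostgreSQL's default WAL segment size of 16 MiB.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// walSegmentSize is PostgreSQL's default WAL segment size (16 MiB).
// Clusters initialized with a different segment size need another value.
const walSegmentSize uint64 = 16 * 1024 * 1024

// hasRoomForWAL reports whether free bytes can hold the requested number
// of WAL segments. Hypothetical helper, used only for illustration.
func hasRoomForWAL(free uint64, segments uint64) bool {
	return free >= segments*walSegmentSize
}

// availableBytes returns the space available to unprivileged users on the
// filesystem that contains path.
func availableBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, fmt.Errorf("statfs %s: %w", path, err)
	}
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	// Pass the directory to inspect as the first argument,
	// otherwise the current directory is used.
	dir := "."
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	free, err := availableBytes(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d bytes free, room for one more WAL segment: %v\n",
		dir, free, hasRoomForWAL(free, 1))
}
```

Running it inside the instance pod against the directory that holds the WAL files should show the answer flip to `false` shortly before PostgreSQL shuts down.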
/test limit=local
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9110497781
About to start adding documentation and going over the E2E, but left a few comments on the implementation bits.
IMO the "WALDisk" nomenclature could get confusing as it seems to imply there is a separate WAL volume, which may or may not be the case.
I still think it's worth renaming the ensureSufficientDiskSpace method, but otherwise give this an enthusiastic 👍
E2E tests are green!!!!
I started an entire E2E run, both local and on cloud: https://github.com/EnterpriseDB/cloudnative-pg/actions/runs/9286574473
PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files. The operator was not recognizing this condition and, since the primary had failed, was performing a failover to the most advanced replica. That action does not fix the underlying issue: only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster. This patch makes the instance manager recognize this condition and report it back to the operator. Upon detecting it, the operator will fence the primary instance and set a phase describing the situation. Since the primary instance is fenced, no failovers will be performed. Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Add an E2E test to verify recovery when a primary runs out of disk space. Signed-off-by: Francesco Canovai <francesco.canovai@enterprisedb.com> Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Jaime Silvela <jaime.silvela@enterprisedb.com>
Co-authored-by: Jaime Silvela <jaime.silvela@enterprisedb.com> Signed-off-by: Leonardo Cecchi <leonardo.cecchi@gmail.com>
Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
PostgreSQL will shut down cleanly when there is not enough disk space to store WAL files.
The operator did not recognize this condition and, since the primary failed, was performing a failover to the most advanced replica. This action will not fix the underlying issue.
Only a manual disk resize, initiated by the user, can ultimately lead to a fully working PostgreSQL cluster.
This patch makes the instance manager recognize this condition and report it to the operator. Upon detecting it, the operator will not trigger a switchover and will instead set a phase describing the situation.
After the PVCs are resized, the cluster will resume working correctly.
Closes: #4521
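
The decision described above can be illustrated with a short sketch. This is not the operator's actual code: the `InstanceStatus` type, the `InsufficientDiskSpace` field, the `decide` function, the action names, and the phase strings are all hypothetical and exist only to show the control flow, namely that a primary reporting exhausted disk space results in a descriptive phase rather than a promotion of a replica.

```go
package main

import "fmt"

// InstanceStatus is a hypothetical, simplified view of what an instance
// manager might report back to the operator.
type InstanceStatus struct {
	Pod                   string
	IsPrimary             bool
	IsReady               bool
	InsufficientDiskSpace bool // assumed flag set when WAL space is exhausted
}

// Action is the hypothetical outcome of one reconciliation step.
type Action string

const (
	ActionNone          Action = "none"
	ActionFailover      Action = "failover"
	ActionWaitForResize Action = "wait-for-resize"
)

// phaseNotEnoughDiskSpace is an illustrative phase string, not necessarily
// the one used by the operator.
const phaseNotEnoughDiskSpace = "Not enough disk space"

// decide returns the action to take and the phase to set on the cluster.
func decide(instances []InstanceStatus) (Action, string) {
	for _, inst := range instances {
		if !inst.IsPrimary {
			continue
		}
		switch {
		case inst.InsufficientDiskSpace:
			// Promoting a replica would not help: only a volume
			// resize by the user can fix this, so just report it.
			return ActionWaitForResize, phaseNotEnoughDiskSpace
		case !inst.IsReady:
			// Any other primary failure follows the usual path.
			return ActionFailover, "Failing over"
		default:
			return ActionNone, "Healthy"
		}
	}
	return ActionNone, "No primary reported"
}

func main() {
	cluster := []InstanceStatus{
		{Pod: "cluster-example-1", IsPrimary: true, InsufficientDiskSpace: true},
		{Pod: "cluster-example-2", IsReady: true},
	}
	action, phase := decide(cluster)
	fmt.Printf("action=%s phase=%q\n", action, phase)
}
```

Checking the disk-space flag before the generic "primary not ready" case is the important design choice in this sketch: otherwise the unhealthy primary would be treated like any other failure and a replica would be promoted.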