No disk space crashloop but pod healthy #3788

Open
mausch opened this issue Nov 27, 2023 · 2 comments

mausch commented Nov 27, 2023

Overview

I'm running a trivial CrunchyData cluster with 1 primary.
It ran out of disk space, possibly due to #2531, but that is not relevant to this issue.
Because of this, the postgres pod is stuck in a loop, repeatedly logging:

2023-11-27 16:43:12,175 INFO: Lock owner: ; I am postgres-instance1-mqhs-0
2023-11-27 16:43:12,175 INFO: not healthy enough for leader race
2023-11-27 16:43:12,176 INFO: doing crash recovery in a single user mode in progress

i.e. Postgres isn't running at all and I can't connect to it.
The problems are:

  • The pod still shows up as healthy. Being marked unhealthy and restarting wouldn't fix anything in this case, but it could be used to trigger monitors/alerts highlighting that things aren't right (see the rough sketch below).
  • The operator logs show no issues at all.

In short, Postgres is broken but the control plane, or whatever you want to call it, is not aware of it.
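
As a stopgap, a monitor that checks Postgres directly, bypassing the pod status entirely, would at least catch this state. A rough sketch of such a check (the service name and namespace below are placeholders, not something PGO necessarily creates under those names):

# Rough sketch of an external monitor: alert when Postgres does not accept
# TCP connections even though Kubernetes still reports the pod as Ready.
# HOST is a placeholder for the cluster's primary service; adjust as needed.
import socket
import sys

HOST = "postgres-primary.postgres-operator.svc"  # placeholder service/namespace
PORT = 5432

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print("postgres is accepting connections")
except OSError as err:
    print(f"ALERT: postgres unreachable: {err}", file=sys.stderr)
    sys.exit(1)

A plain TCP check is crude, but in this state Postgres isn't listening at all, so even that would fire.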

Environment

  • Platform: k3s
  • Platform Version: 1.28
  • PGO Image Tag: ubi8-16.0-3.4-0
  • Postgres Version: 16
  • Storage: EBS
jmckulk added the v5 label Nov 29, 2023

jmckulk (Collaborator) commented Nov 30, 2023

Hey @mausch, I was able to replicate the issue you are seeing with a full disk. After the disk filled up, Postgres stopped working, but the pod was still ready.

We use the Patroni GET /readiness endpoint to determine whether the pod is ready. If you take a look at the Patroni API docs for that endpoint, it returns 200 when "the Patroni node is running as the leader OR when PostgreSQL is up and running." In a single-instance cluster, when the database goes down, there is no other instance to take over leadership from the unhealthy leader. As you have found, this leads to a state where the database is inaccessible but the pod is still ready.

As it stands now, if Patroni thinks the cluster is ready, then the pods will be ready. There is likely some work we could do to augment the Patroni readiness endpoint, and I'll be happy to get a story into our backlog.
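
In the meantime, you can see the divergence by querying the Patroni REST API directly from the instance pod (or through a port-forward). A minimal sketch, assuming Patroni's default API port 8008:

# Sketch: compare the status codes Patroni returns for its probe endpoints.
# Assumes the Patroni REST API is reachable on localhost:8008 (the default),
# e.g. after `kubectl port-forward pod/postgres-instance1-mqhs-0 8008`.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

PATRONI = "http://localhost:8008"

def status(path):
    """Return the HTTP status Patroni gives for `path`."""
    try:
        with urlopen(f"{PATRONI}{path}", timeout=5) as resp:
            return str(resp.status)
    except HTTPError as err:  # Patroni signals "not ok" with a non-200 code
        return str(err.code)
    except URLError as err:
        return f"unreachable ({err.reason})"

for endpoint in ("/readiness", "/liveness", "/health"):
    print(f"GET {endpoint}: {status(endpoint)}")

Per the Patroni docs, /health only returns 200 when PostgreSQL is actually up and running, so in the state you describe it should return a non-200 code while the readiness check keeps passing.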


mausch commented Nov 30, 2023

Hi, thanks for looking into this.

Shouldn't the liveness probe (rather than readiness) apply here?

The Patroni docs say:

GET /health: returns HTTP status code 200 only when PostgreSQL is up and running.

GET /liveness: returns HTTP status code 200 if Patroni heartbeat loop is properly running and 503 if the last run was more than ttl seconds ago on the primary or 2*ttl on the replica. Could be used for livenessProbe.

GET /readiness: returns HTTP status code 200 when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for readinessProbe when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).

I don't have a PGO cluster at hand to check, but presumably Patroni's /liveness is already mapped to the pod's liveness probe?
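
For anyone with a cluster handy, one way to check would be to read the probe definitions off a running instance pod. A rough sketch using the kubernetes Python client (the namespace is a placeholder; adjust for your install):

# Sketch: print which HTTP endpoints the instance pod's probes point at.
# Requires the `kubernetes` Python client; the pod name is taken from the
# logs above, the namespace is a placeholder.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="postgres-instance1-mqhs-0",
                             namespace="postgres-operator")

for container in pod.spec.containers:
    for kind, probe in (("liveness", container.liveness_probe),
                        ("readiness", container.readiness_probe)):
        if probe and probe.http_get:
            print(f"{container.name} {kind}: GET {probe.http_get.path} "
                  f"on port {probe.http_get.port}")

That would at least confirm which Patroni endpoints the liveness and readiness probes are actually pointed at.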
