I'm running a trivial CrunchyData instance with 1 primary.
It ran out of disk space, possibly due to #2531, but that is not relevant to this issue.
Because of this, the postgres pod is stuck in a loop displaying this:
2023-11-27 16:43:12,175 INFO: Lock owner: ; I am postgres-instance1-mqhs-0
2023-11-27 16:43:12,175 INFO: not healthy enough for leader race
2023-11-27 16:43:12,176 INFO: doing crash recovery in a single user mode in progress
i.e. Postgres isn't running at all; I can't connect to it.
The problems are:
The pod still shows up as healthy. Being unhealthy and restarting wouldn't fix anything in this case, but an unhealthy status could be used to trigger monitors/alerts highlighting that things aren't right.
The operator logs show no issues at all.
In short, Postgres is broken, but the control plane (or whatever you want to call it) is not aware of it.
Environment
Platform: k3s
Platform Version: 1.28
PGO Image Tag: ubi8-16.0-3.4-0
Postgres Version: 16
Storage: EBS
Hey @mausch, I was able to replicate the issue you are seeing with a full disk. After the disk filled up, Postgres stopped working, but the pod was still ready.
We use the Patroni GET /readiness endpoint to determine whether the pod is ready or not. If you take a look at the Patroni API docs for that endpoint, it will return 200 when "the Patroni node is running as the leader OR when PostgreSQL is up and running." With a single instance, when the database goes down there isn't another instance to take leadership from the unhealthy leader. As you have found, this gets into a case where the database is inaccessible but the pods are still ready.
As it stands now, if Patroni thinks that the cluster is ready then the pods will be ready. There is likely some work we could do to augment the Patroni readiness endpoint, and I will be happy to get a story into our backlog.
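To make the gap concrete, here is a minimal Python sketch of the documented readiness rule ("leader OR PostgreSQL up"). The function name is illustrative, not PGO or Patroni code: it just shows why a single-instance cluster whose node still holds the leader role keeps passing the readiness probe even with Postgres down.

```python
# Hypothetical sketch of the documented /readiness rule:
# HTTP 200 when the node is the leader OR PostgreSQL is up, else 503.
def readiness_status(is_leader: bool, postgres_up: bool) -> int:
    return 200 if (is_leader or postgres_up) else 503

# Single-instance cluster: the lone node is still treated as leader,
# so the probe passes even though Postgres itself is down.
assert readiness_status(is_leader=True, postgres_up=False) == 200

# A non-leader replica with Postgres down would correctly report 503.
assert readiness_status(is_leader=False, postgres_up=False) == 503
```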
GET /health: returns HTTP status code 200 only when PostgreSQL is up and running.
GET /liveness: returns HTTP status code 200 if Patroni heartbeat loop is properly running and 503 if the last run was more than ttl seconds ago on the primary or 2*ttl on the replica. Could be used for livenessProbe.
GET /readiness: returns HTTP status code 200 when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for readinessProbe when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
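The /liveness rule quoted above (503 if the heartbeat loop last ran more than ttl seconds ago on the primary, or 2*ttl on a replica) can be sketched like this. This is a rough illustration of the documented behavior, not Patroni's actual implementation, and the helper name is made up:

```python
# Illustrative sketch of the documented /liveness rule.
def liveness_status(last_heartbeat: float, ttl: float,
                    is_primary: bool, now: float) -> int:
    # Primaries must have run the heartbeat loop within ttl seconds;
    # replicas get twice that allowance.
    limit = ttl if is_primary else 2 * ttl
    return 200 if (now - last_heartbeat) <= limit else 503

# Primary whose loop ran 10s ago with ttl=30 is live:
assert liveness_status(last_heartbeat=0, ttl=30, is_primary=True, now=10) == 200
# Primary stalled for 40s (> ttl) is not:
assert liveness_status(last_heartbeat=0, ttl=30, is_primary=True, now=40) == 503
# A replica stalled for 40s is still within 2*ttl:
assert liveness_status(last_heartbeat=0, ttl=30, is_primary=False, now=40) == 200
```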
I don't have a pgo cluster at hand to check, but presumably Patroni's /liveness is already mapped to the pod's livenessProbe?