No disk space crashloop but pod healthy #3788

Open
mausch opened this issue Nov 27, 2023 · 2 comments

mausch commented Nov 27, 2023

Overview

I'm running a trivial CrunchyData cluster with 1 primary.
It ran out of disk space, possibly due to #2531, but that is not relevant to this issue.
Because of this, the postgres pod is stuck in a loop, repeatedly logging:

2023-11-27 16:43:12,175 INFO: Lock owner: ; I am postgres-instance1-mqhs-0
2023-11-27 16:43:12,175 INFO: not healthy enough for leader race
2023-11-27 16:43:12,176 INFO: doing crash recovery in a single user mode in progress

i.e. Postgres isn't running at all and I can't connect to it.
The problems are:

  • The pod still shows up as healthy. Being marked unhealthy and restarting wouldn't fix anything in this case, but it could be used to trigger monitors/alerts highlighting that things aren't right (see the rough sketch below).
  • The operator logs show no issues at all.

In short, Postgres is broken but the control plane, or whatever you want to call it, is not aware of it.
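
As a stopgap, a monitor that checks Postgres directly, bypassing the pod status entirely, would at least catch this state. A rough sketch of such a check (the service name and namespace below are placeholders, not something PGO necessarily creates under those names):

# Rough sketch of an external monitor: alert when Postgres does not accept
# TCP connections even though Kubernetes still reports the pod as Ready.
# HOST is a placeholder for the cluster's primary service; adjust as needed.
import socket
import sys

HOST = "postgres-primary.postgres-operator.svc"  # placeholder service/namespace
PORT = 5432

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print("postgres is accepting connections")
except OSError as err:
    print(f"ALERT: postgres unreachable: {err}", file=sys.stderr)
    sys.exit(1)

A plain TCP check is crude, but in this state Postgres isn't listening at all, so even that would fire.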

Environment

  • Platform: k3s
  • Platform Version: 1.28
  • PGO Image Tag: ubi8-16.0-3.4-0
  • Postgres Version: 16
  • Storage: EBS
jmckulk added the v5 label Nov 29, 2023

jmckulk (Collaborator) commented Nov 30, 2023

Hey @mausch, I was able to replicate the issue you are seeing with a full disk. After the disk filled up, Postgres stopped working, but the pod was still ready.

We use the Patroni GET /readiness endpoint to determine whether the pod is ready. If you take a look at the Patroni API docs for that endpoint, it returns 200 when "the Patroni node is running as the leader OR when PostgreSQL is up and running." In a single-instance cluster, when the database goes down, there is no other instance to take over leadership from the unhealthy leader. As you have found, this leads to a state where the database is inaccessible but the pod is still ready.

As it stands now, if Patroni thinks the cluster is ready, then the pods will be ready. There is likely some work we could do to augment the Patroni readiness endpoint, and I'll be happy to get a story into our backlog.
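
In the meantime, you can see the divergence by querying the Patroni REST API directly from the instance pod (or through a port-forward). A minimal sketch, assuming Patroni's default API port 8008:

# Sketch: compare the status codes Patroni returns for its probe endpoints.
# Assumes the Patroni REST API is reachable on localhost:8008 (the default),
# e.g. after `kubectl port-forward pod/postgres-instance1-mqhs-0 8008`.
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

PATRONI = "http://localhost:8008"

def status(path):
    """Return the HTTP status Patroni gives for `path`."""
    try:
        with urlopen(f"{PATRONI}{path}", timeout=5) as resp:
            return str(resp.status)
    except HTTPError as err:  # Patroni signals "not ok" with a non-200 code
        return str(err.code)
    except URLError as err:
        return f"unreachable ({err.reason})"

for endpoint in ("/readiness", "/liveness", "/health"):
    print(f"GET {endpoint}: {status(endpoint)}")

Per the Patroni docs, /health only returns 200 when PostgreSQL is actually up and running, so in the state you describe it should return a non-200 code while the readiness check keeps passing.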


mausch commented Nov 30, 2023

Hi, thanks for looking into this.

Shouldn't the liveness probe (rather than readiness) apply here?

The Patroni docs say:

GET /health: returns HTTP status code 200 only when PostgreSQL is up and running.

GET /liveness: returns HTTP status code 200 if Patroni heartbeat loop is properly running and 503 if the last run was more than ttl seconds ago on the primary or 2*ttl on the replica. Could be used for livenessProbe.

GET /readiness: returns HTTP status code 200 when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for readinessProbe when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).

I don't have a PGO cluster at hand to check, but presumably Patroni's /liveness is already mapped to the pod's liveness probe?
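
For anyone with a cluster handy, one way to check would be to read the probe definitions off a running instance pod. A rough sketch using the kubernetes Python client (the namespace is a placeholder; adjust for your install):

# Sketch: print which HTTP endpoints the instance pod's probes point at.
# Requires the `kubernetes` Python client; the pod name is taken from the
# logs above, the namespace is a placeholder.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="postgres-instance1-mqhs-0",
                             namespace="postgres-operator")

for container in pod.spec.containers:
    for kind, probe in (("liveness", container.liveness_probe),
                        ("readiness", container.readiness_probe)):
        if probe and probe.http_get:
            print(f"{container.name} {kind}: GET {probe.http_get.path} "
                  f"on port {probe.http_get.port}")

That would at least confirm which Patroni endpoints the liveness and readiness probes are actually pointed at.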
