[Bug]: Instance manager reconciler loop stuck if PostgreSQL restart fails #4501

leonardoce · 2024-05-09T13:46:39Z

Is there an existing issue already for this bug?

I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

I have read the troubleshooting guide and I think this is a new bug.

Contact Details

leonardo.cecchi@enterprisedb.com

Version

1.23.0

What version of Kubernetes are you using?

1.30 (unsuppprted)

What is your Kubernetes environment?

Self-managed: kind (evaluation)

How did you install the operator?

YAML manifest

What happened?

The instance manager reconciler loop can be blocked by a PostgreSQL that fails to start up after being shut down.

Cluster resource

No response

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The instance manager starts PostgreSQL: 1. when it starts up 2. when configuration changes are being applied (after stopping it) 3. when fencing is lifted. In the second and third example, the operator is requested by the embedded cluster reconciler loop, and performed without any timeout. If PostgreSQL won't start up again because of a wrong configuration or missing disk space, the reconciler loop will be stuck waiting for a dead postmaster to be up. This patch handles this condition by using a combination of the timeout parameters that are already set in the cluster. Fixes: cloudnative-pg#4501 Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>

) The instance manager starts PostgreSQL: 1. when it starts up 2. when configuration changes are being applied (after stopping it) 3. when fencing is lifted. In the second and third examples, the operator is requested by the embedded cluster reconciler loop, and performed without any timeout. If PostgreSQL won't start up again because of a wrong configuration or missing disk space, the reconciler loop will be stuck waiting for a dead postmaster to be up. This patch handles this condition by using a combination of the timeout parameters that are already set in the cluster. Fixes: #4501 Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com> Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com> Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com> Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

) The instance manager starts PostgreSQL: 1. when it starts up 2. when configuration changes are being applied (after stopping it) 3. when fencing is lifted. In the second and third examples, the operator is requested by the embedded cluster reconciler loop, and performed without any timeout. If PostgreSQL won't start up again because of a wrong configuration or missing disk space, the reconciler loop will be stuck waiting for a dead postmaster to be up. This patch handles this condition by using a combination of the timeout parameters that are already set in the cluster. Fixes: #4501 Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com> Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com> Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com> Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com> (cherry picked from commit 09a4d80)

leonardoce added the triage Pending triage label May 9, 2024

leonardoce assigned gbartolini May 9, 2024

leonardoce mentioned this issue May 9, 2024

fix: timeout when restarting PostgreSQL and while lifting fencing #4504

Merged

mnencia closed this as completed in #4504 May 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: Instance manager reconciler loop stuck if PostgreSQL restart fails #4501

[Bug]: Instance manager reconciler loop stuck if PostgreSQL restart fails #4501

leonardoce commented May 9, 2024

[Bug]: Instance manager reconciler loop stuck if PostgreSQL restart fails #4501

[Bug]: Instance manager reconciler loop stuck if PostgreSQL restart fails #4501

Comments

leonardoce commented May 9, 2024

Is there an existing issue already for this bug?

I have read the troubleshooting guide

I am running a supported version of CloudNativePG

Contact Details

Version

What version of Kubernetes are you using?

What is your Kubernetes environment?

How did you install the operator?

What happened?

Cluster resource

Relevant log output

Code of Conduct