
[Bug]: Instance manager reconciler loop stuck if PostgreSQL restart fails #4501

Closed
4 tasks done
leonardoce opened this issue May 9, 2024 · 0 comments · Fixed by #4504
Labels
triage Pending triage

Comments

@leonardoce
Contributor

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I am running a supported version of CloudNativePG.

Contact Details

leonardo.cecchi@enterprisedb.com

Version

1.23.0

What version of Kubernetes are you using?

1.30 (unsupported)

What is your Kubernetes environment?

Self-managed: kind (evaluation)

How did you install the operator?

YAML manifest

What happened?

The instance manager reconciler loop can be blocked by a PostgreSQL instance that fails to start up again after being shut down.

Cluster resource

No response

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@leonardoce leonardoce added the triage Pending triage label May 9, 2024
leonardoce pushed a commit to leonardoce/cloudnative-pg that referenced this issue May 9, 2024
The instance manager starts PostgreSQL:

1. when it starts up
2. when configuration changes are being applied (after stopping it)
3. when fencing is lifted.

In the second and third cases, the operation is requested by the
embedded cluster reconciler loop and performed without any timeout.

If PostgreSQL does not start up again because of a wrong configuration or
missing disk space, the reconciler loop will be stuck waiting for a dead
postmaster to come up.

This patch handles this condition by using a combination of the timeout
parameters that are already set in the cluster.

Fixes: cloudnative-pg#4501

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
leonardoce pushed a commit to leonardoce/cloudnative-pg that referenced this issue May 9, 2024
armru pushed a commit to leonardoce/cloudnative-pg that referenced this issue May 13, 2024
leonardoce pushed a commit to leonardoce/cloudnative-pg that referenced this issue May 15, 2024
mnencia pushed a commit to leonardoce/cloudnative-pg that referenced this issue May 17, 2024
mnencia pushed a commit to leonardoce/cloudnative-pg that referenced this issue May 20, 2024
mnencia pushed a commit that referenced this issue May 20, 2024

The instance manager starts PostgreSQL:

1. when it starts up
2. when configuration changes are being applied (after stopping it)
3. when fencing is lifted.

In the second and third cases, the operation is requested by the
embedded cluster reconciler loop and performed without any timeout.

If PostgreSQL won't start up again because of a wrong configuration or
missing disk space, the reconciler loop will be stuck waiting for a dead
postmaster to come up.

This patch handles this condition by using a combination of the timeout
parameters that are already set in the cluster.

Fixes: #4501

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
cnpg-bot pushed a commit that referenced this issue May 20, 2024
(cherry picked from commit 09a4d80)
mnencia pushed a commit that referenced this issue May 20, 2024
(cherry picked from commit 09a4d80)
mnencia pushed a commit that referenced this issue May 20, 2024
(cherry picked from commit 09a4d80)