Namespace deletion stuck if contains CRs that are watched by the operator #1876

Open
gyfora opened this issue Apr 27, 2023 · 10 comments · May be fixed by #1890

@gyfora

gyfora commented Apr 27, 2023

Bug Report

This is likely not a JOSDK bug, but based on an offline discussion with @csviri I am opening it here to track it.

In our current setup the operator is deployed in namespace x and is watching namespace y. The access to namespace y is controlled by roles and rolebindings (created in namespace y).

If there are CRs present in y and the namespace is deleted before the CRs are individually deleted we get the following exception during cleanup:

ERROR][flink/basic-example] Error during event processing ExecutionScope{ resource id: ResourceID{name='basic-example', namespace='flink'}, version: 1791281} failed.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
    at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:546)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:566)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleUpdate(OperationSupport.java:369)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleUpdate(BaseOperation.java:712)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$replace$0(HasMetadataOperation.java:172)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:177)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:88)
    at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.replace(HasMetadataOperation.java:39)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher$CustomResourceFacade.updateResource(ReconciliationDispatcher.java:387)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.conflictRetryingUpdate(ReconciliationDispatcher.java:343)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleCleanup(ReconciliationDispatcher.java:297)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:87)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:62)
    at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:414)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.96.0.1:443/apis/flink.apache.org/v1beta1/namespaces/flink/flinkdeployments/basic-example. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. flinkdeployments.flink.apache.org "basic-example" is forbidden: User "system:serviceaccount:default:flink-operator" cannot update resource "flinkdeployments" in API group "flink.apache.org" in the namespace "flink".
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:701)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:681)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:628)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:591)
    at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$5(StandardHttpClient.java:120)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
    at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:135)
    ... 3 more

Furthermore, the namespace deletion gets stuck because the finalizer on the CR is never removed. The root problem seems to be that when the namespace deletion is initiated, the role and rolebinding are deleted immediately, so the operator can no longer remove the finalizer from the resource.
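
To make the ordering problem concrete, here is a minimal toy model in plain Java. It is only a sketch: it does not use the Kubernetes API or JOSDK, and every name in it (NamespaceDeletionSketch, tryRemoveFinalizer, the finalizer string passed in) is hypothetical. It assumes, as described above, that removing a finalizer is an update that needs the rolebinding to still exist.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the reported ordering problem; not real Kubernetes or JOSDK code. */
final class NamespaceDeletionSketch {

    /** A custom resource that can only go away once its finalizers are gone. */
    static final class CustomResource {
        final Set<String> finalizers = new LinkedHashSet<>();

        CustomResource(String finalizer) {
            finalizers.add(finalizer);
        }
    }

    /** A namespace with a flag for the operator's rolebinding and a list of CRs. */
    static final class Namespace {
        boolean operatorRoleBindingPresent = true;
        final List<CustomResource> resources = new ArrayList<>();
    }

    /** Operator cleanup step: removing the finalizer is an update and needs access. */
    static boolean tryRemoveFinalizer(Namespace ns, CustomResource cr, String finalizer) {
        if (!ns.operatorRoleBindingPresent) {
            return false; // stands in for the "Forbidden" error in the stack trace
        }
        cr.finalizers.remove(finalizer);
        return true;
    }

    /**
     * Simulates namespace deletion. When rbacDeletedFirst is true, the rolebinding
     * disappears before the operator gets a chance to run its cleanup.
     * Returns true if every CR could be removed, false if the namespace is stuck.
     */
    static boolean deleteNamespace(Namespace ns, String finalizer, boolean rbacDeletedFirst) {
        if (rbacDeletedFirst) {
            ns.operatorRoleBindingPresent = false;
        }
        for (CustomResource cr : ns.resources) {
            tryRemoveFinalizer(ns, cr, finalizer);
        }
        // Only resources without finalizers can actually be deleted.
        ns.resources.removeIf(resource -> resource.finalizers.isEmpty());
        ns.operatorRoleBindingPresent = false;
        return ns.resources.isEmpty();
    }
}
```

In this model, calling deleteNamespace with rbacDeletedFirst set to true on a namespace that holds one CR returns false and leaves the CR in place, which mirrors the stuck namespace; with false the cleanup runs first and the namespace empties.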

Environment

Kubernetes cluster type:

kind

JOSDK version: 4.3.0

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:38:33Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.0", GitCommit:"b46a3f887ca979b1a5d14fd39cb1af43e7e5d12d", GitTreeState:"clean", BuildDate:"2022-12-20T03:36:50Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/arm64"}
@csviri
Collaborator

csviri commented Apr 27, 2023

Yep, this is more of a generic Kubernetes issue, but we will clarify how to handle it here, since we have a closely related feature (dynamically changing the watched namespaces).

@rmetzger

rmetzger commented May 5, 2023

when the namespace deletion is initiated the role and rolebinding is immediately deleted

Wouldn't setting a finalizer for the role and rolebinding solve the problem of immediate deletion?

@csviri
Collaborator

csviri commented May 5, 2023

when the namespace deletion is initiated the role and rolebinding is immediately deleted

Wouldn't setting a finalizer for the role and rolebinding solve the problem of immediate deletion?

Yes, this sounds like a good idea.

This was also suggested here:
https://kubernetes.slack.com/archives/CAW0GV7A5/p1682603213236239

But I will create an issue in Kubernetes to see if it can eventually be solved at the GC controller level.

@csviri
Collaborator

csviri commented May 5, 2023

What JOSDK could do is provide reconcilers (one for role and one for rolebinding) that handle adding and removing the finalizers; it would be up to the developer to register them, since this also has implications for the operator's permissions (update permission on role).
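
As a rough illustration of what such a reconciler would have to decide, here is a library-free Java sketch. It is not the JOSDK API and not the implementation from the linked pull request; the class RbacGuardSketch, the finalizer name example.com/rbac-guard, and the remainingCustomResources count are all assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Decision logic a role/rolebinding guard could follow; names are hypothetical. */
final class RbacGuardSketch {

    /** Hypothetical finalizer placed on the role or rolebinding. */
    static final String GUARD_FINALIZER = "example.com/rbac-guard";

    enum Action { ADD_FINALIZER, REMOVE_FINALIZER, NONE }

    /**
     * Decides what to do with the guard finalizer.
     *
     * @param finalizers               finalizers currently on the role or rolebinding
     * @param markedForDeletion        whether a deletion timestamp is set on it
     * @param remainingCustomResources CRs in the namespace that still carry the operator's finalizer
     */
    static Action decide(Set<String> finalizers, boolean markedForDeletion,
                         int remainingCustomResources) {
        boolean guarded = finalizers.contains(GUARD_FINALIZER);
        if (!markedForDeletion) {
            // Normal operation: make sure the guard is in place.
            return guarded ? Action.NONE : Action.ADD_FINALIZER;
        }
        // Deletion requested: release the guard only after all CRs were cleaned up.
        if (guarded && remainingCustomResources == 0) {
            return Action.REMOVE_FINALIZER;
        }
        return Action.NONE;
    }

    /** Returns a copy of the finalizer set with the given action applied. */
    static Set<String> apply(Set<String> finalizers, Action action) {
        Set<String> result = new LinkedHashSet<>(finalizers);
        switch (action) {
            case ADD_FINALIZER:
                result.add(GUARD_FINALIZER);
                break;
            case REMOVE_FINALIZER:
                result.remove(GUARD_FINALIZER);
                break;
            default:
                break;
        }
        return result;
    }
}
```

The sketch keeps the rolebinding alive until the count of remaining CRs reaches zero, so the operator still has update access while it removes its own finalizers. Applying the resulting action to a real role or rolebinding is exactly the step that would need the extra update permission mentioned above.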

@csviri csviri linked a pull request May 5, 2023 that will close this issue
@csviri csviri added this to the 4.4 milestone May 9, 2023
@csviri csviri modified the milestones: 4.4, 5.0 Jun 27, 2023
@moayad-alyaghshi

Hi @csviri

We are facing the same issue of namespace deletion getting stuck, but even when the operator is deployed in the same namespace as the CRs, which is not expected according to what I understood from the Slack thread. I would appreciate any explanation.

Note: The operator has a ClusterRole and ClusterRoleBinding to work with the CRs. We're using Quarkus with quarkus-operator-sdk.

@csviri csviri modified the milestones: 5.0, 4.5 Aug 17, 2023
@csviri
Collaborator

csviri commented Aug 17, 2023

Hi @moayad-alyaghshi ,

I briefly checked the namespace controller and the garbage collector controller when @gyfora reported this, and it seems (as far as I was able to see) that there is nothing special in K8S to prevent this from happening even in the same namespace.

So this is not an issue with JOSDK, but rather an issue with K8S. What we can offer is a reconciler that solves this; it just was not a priority until now. I scheduled this for 4.5.

Maybe it is worth asking again around this on k8s slack: https://kubernetes.slack.com/archives/CAW0GV7A5

@csviri csviri modified the milestones: 4.5, 4.6 Oct 3, 2023
@csviri csviri modified the milestones: 4.6, 4.7 Nov 15, 2023
@csviri csviri modified the milestones: 4.7, 5.0 Dec 5, 2023
@csviri
Collaborator

csviri commented Dec 12, 2023

issue in k8s: kubernetes/kubernetes#115070

github-actions bot commented Mar 9, 2024

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Mar 9, 2024
@csviri csviri removed the stale label Mar 9, 2024
@jessebye

The Kubernetes issue was closed. What are the next steps for this? Is there a way to solve this in java-operator-sdk?

This issue causes all our namespaces to hang on termination because the CR is never finalized by Java Operator SDK...

@csviri
Collaborator

csviri commented Apr 12, 2024

@jessebye yes, there is a way to solve this with a custom reconciler. My plan is to implement that and also make it available as a standalone controller for non-Java/JOSDK projects. Will move this to 5.0, since more people are asking for it.

cc @metacosm

@csviri csviri modified the milestones: 5.1, 5.0 Apr 12, 2024
@csviri csviri modified the milestones: 5.0, 5.1 May 14, 2024

5 participants