
Add ownerReference to managed entities #498

Open
aermakov-zalando opened this issue Feb 22, 2019 · 35 comments · May be fixed by #2199
Comments

@aermakov-zalando

If I delete a database, resources created by it (services, statefulsets, etc.) should be deleted automatically. This can be easily implemented by adding the Postgres resource as an ownerReference (with controller: true). Kubernetes will then automatically delete the resources once the master database resource is gone.
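For illustration, a minimal Go sketch of what attaching such an owner reference could look like when the operator builds a child object (the helper name and the use of a Service here are assumptions for the example, not operator code):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// setControllerReference marks the postgresql CR as the controlling owner of a
// child Service. With this in place, the API server's garbage collector deletes
// the Service once the CR is gone. Owner references only work within a single
// namespace, which holds here since the operator creates children next to the CR.
func setControllerReference(child *corev1.Service, pgName string, pgUID types.UID) {
	controller := true
	blockOwnerDeletion := true
	child.OwnerReferences = []metav1.OwnerReference{{
		APIVersion:         "acid.zalan.do/v1",
		Kind:               "postgresql",
		Name:               pgName,
		UID:                pgUID,
		Controller:         &controller,
		BlockOwnerDeletion: &blockOwnerDeletion,
	}}
}
```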

@sdudoladov
Member

The operator normally deletes the k8s resources it creates. Can you please provide more details on the problems with deletion you encountered?

@aermakov-zalando
Author

The linked function doesn't delete services, for one. Additionally, the operator can always be killed in the middle of Delete(), leaving dangling resources. It should also be easier to maintain since you won't have to write the cleanup code or test it; Kubernetes will always clean up the resources for you no matter what.

@sdudoladov
Member

The linked function doesn't delete services, for one.

Services are deleted, but the overall point is valid.

@aermakov-zalando
Author

Isn't there also a -config service that should be deleted?

@Jan-M
Member

Jan-M commented Feb 25, 2019

Yeah, those come from the headless services via Patroni.

Guess we need some thinking here: who cleans up what, do we have any expectations about what deletes what, and how does one approach prevent mistakes? Let's think about this cascade and how to make it safe and clean.

@dmayle

dmayle commented Apr 1, 2020

The reason why I created the linked #886 was mostly because of external tooling that we have for cluster monitoring and maintenance. I know that the operator will clean up after itself, but having the reference allows us to better manage the state of our clusters.

@bchrobot

bchrobot commented Apr 1, 2020

... because of external tooling that we have for cluster monitoring and maintenance. ... having the reference allows us to better manage the state of our clusters.

This is true for us as well

@dave08

dave08 commented Apr 1, 2020

Apart from internal management, this would be very important for those using projects like ArgoCD for GitOps, since that's how they know which CRD caused resources to be deployed. Otherwise it looks pretty ugly in the UI to track "orphaned" resources (which ArgoCD thinks were deployed manually, but which are actually managed in the Git repo with the custom CRD).

@frittentheke
Contributor

frittentheke commented Apr 1, 2020

While I support this idea of creating some sort of relationship/ownership among the resources the operator creates and manages:
@sdudoladov regarding the dangling resources @aermakov-zalando referred to, which the operator might leave behind if it is interrupted (or not running at all, another source of inconsistencies), the only solution is a finalizer blocking the removal/garbage collection of the postgresql custom resource until the operator is done with it and gives its green light to remove it - see #450
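As a rough sketch of the finalizer idea (the finalizer name and the cleanup callback are assumptions, and persisting the updated metadata back to the API server is omitted):

```go
package sketch

import (
	"slices"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pgFinalizer is a hypothetical finalizer name, not an existing one.
const pgFinalizer = "postgres-operator.acid.zalan.do/cleanup"

// reconcileDeletion keeps the finalizer while the CR is alive and only drops it
// after cleanup succeeded once deletion was requested, so the CR cannot be
// garbage-collected half-way through. cleanup stands in for the operator's
// Delete() logic.
func reconcileDeletion(meta *metav1.ObjectMeta, cleanup func() error) error {
	if meta.DeletionTimestamp == nil {
		if !slices.Contains(meta.Finalizers, pgFinalizer) {
			meta.Finalizers = append(meta.Finalizers, pgFinalizer)
		}
		return nil
	}
	if err := cleanup(); err != nil {
		return err // retried on the next sync; the CR stays until cleanup succeeds
	}
	meta.Finalizers = slices.DeleteFunc(meta.Finalizers, func(f string) bool {
		return f == pgFinalizer
	})
	return nil
}
```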

@FxKu
Member

FxKu commented Apr 2, 2020

So far we've been reluctant to add ownerReferences and finalizers as we fear it makes deleting DBs by accident too easy. Of course, I see that you want things to be cleaned up in case something goes wrong during (intended) deletions or creations of clusters.

A good compromise might be to make the assignment of ownerReferences configurable. Maybe someone wants to come up with a PR for this. We are already using ownerReferences between the statefulSet and the new connectionPooler deployment, so it shouldn't be too hard to implement.
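A minimal sketch of such a toggle, assuming a hypothetical EnableOwnerReferences option in the operator configuration (not an existing setting):

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// operatorConfig stands in for the operator configuration; EnableOwnerReferences
// is an assumed option name.
type operatorConfig struct {
	EnableOwnerReferences bool
}

// ownerReferences returns either no owner reference (today's behaviour) or a
// controller reference to the postgresql CR, depending on the toggle.
func ownerReferences(cfg operatorConfig, pgName string, pgUID types.UID) []metav1.OwnerReference {
	if !cfg.EnableOwnerReferences {
		return nil
	}
	controller := true
	return []metav1.OwnerReference{{
		APIVersion: "acid.zalan.do/v1",
		Kind:       "postgresql",
		Name:       pgName,
		UID:        pgUID,
		Controller: &controller,
	}}
}
```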

@FxKu FxKu added the discussion label Apr 2, 2020
@dave08

dave08 commented Apr 2, 2020

Somehow KubeDB manages the deletion concern with a few states: https://kubedb.com/docs/v0.13.0-rc.0/guides/postgres/quickstart/quickstart/#wipeout-dormantdatabase. Is it really a concern if someone sets the state consciously?

@frittentheke
Contributor

frittentheke commented Apr 2, 2020

So far we've been reluctant to add ownerReferences and finalizers as we fear it makes deleting DBs by accident too easy. Of course, I see that you want things to be cleaned up in case something goes wrong during (intended) deletions or creations of clusters.

A good compromise might be to make the assignment of ownerReferences configurable. Maybe someone wants to come up with a PR for this. We are already using ownerReferences between the statefulSet and the new connectionPooler deployment, so it shouldn't be too hard to implement.

@FxKu How are references making it "easier" to accidentally delete a database cluster? Deleting the "postgresql" resource is all it takes anyway.

I may refer to yet another aspect of using Finalizers in issue #387 which @sdudoladov brought up.

Certainly you can't have it both ways, but there should be a clear strategy about whether the operator is the master of its resources, meaning that only changes that go via the CR (be it reconfigurations or deletions of a cluster) are actually supported.

Certainly a human with sufficient access to the Kubernetes API has to "understand" that an operator and therefore a CR is in the loop when dealing with such resources. Trying to reconfigure or delete those manually is always a bad idea. There is just no easy way to protect those resources from being fiddled with. But what you can and should ensure is that the operator always keeps the managed resources in line with what the CR says. Otherwise the operator would just be a template engine doing the initial conversion from CR to the set of resources required for the PostgreSQL cluster, and then lets everybody and their dog mess things up manually. But an operator usually manages the whole life-cycle of such an abstract resource as "postgresql" and the resulting "native" building blocks of Kubernetes.
To me the best analogy is configuration management. Certainly you can ssh into a box and change configuration files manually and stop the puppet agent - but then you are going to have a bad time.
The job of an Operator as well as configuration management in general should be to keep variance and complexity in the various states things can be in low.

That being said, it really should not make much of a difference whether the cascading delete is done by the Kubernetes API server via the discussed owner references or by the operator itself. When the operator does things explicitly it might be nicer to log and "run final steps" (as with finalizers, e.g. taking a last base backup like AWS RDS does). The end result - no CR = no resources - should be the same either way.

@synchris

synchris commented Nov 4, 2021

At the very least, if someone does not want this relationship, it could be made optional and enabled through the Helm chart.

@sarahhenkens

Chiming in here as well: this is a hard blocker for our team to adopt this operator, since we use ArgoCD.

@sagikazarmark

@FxKu I'd love to revive the conversation and potentially provide a patch for this.

Although I disagree with making owner references optional, I came up with two solutions supporting that:

  • Make the StatefulSet the owner of everything and then make the owner reference between the CRD and the StatefulSet optional. This would make the operator's job easier in terms of cleanup.
  • Make the CRD the owner of all things (optionally). This would mean that if the flag is enabled, the connection pooler would be updated and the CRD would become the owner.

Any ideas/comments?

@ErikLundJensen

The StatefulSet should not be the owner. Sometimes a statefulset has to be recreated due to immutable fields. A delete of the statefulset should therefore not delete other resources.

@sagikazarmark

A delete of the statefulset should therefore not delete other resources.

Well, I guess it already does according to @FxKu 's earlier comment.

But I agree with you, the other option seems way better (except making owner references optional).

@cod-r

cod-r commented Aug 10, 2022

This was the winning operator for us until we created a POC with argocd and found out about this issue. This is definitely a deal breaker for people doing gitops with argocd or flux

@Silvenga

This is definitely a deal breaker for people doing gitops with argocd or flux

Can definitely confirm for us.

@JuanRamino

A must-have for us as well.

Looking at how the RabbitMQ operator handles this, the postgresql resource should be the owner of the other resources:

(Argo CD screenshot)

@sagikazarmark

I'd be willing to work on this given there is a consensus about the solution. We should have a conversation about the potential solutions.

@Jan-M
Member

Jan-M commented Oct 4, 2022

Let's have a conversation about this; we will try to catch up on this topic quickly.

It would be good to understand the complexity of a feature toggle around the owner reference. Potentially the right thing is to make it non-optional, but this needs some elaboration on what happens during sync & migration.

@Jan-M
Member

Jan-M commented Oct 4, 2022

https://kubernetes.io/docs/concepts/overview/working-with-objects/owners-dependents/

Maybe more informed users can confirm or correct that this would be a very good first step towards redoing future work on cleanup and/or delete protection. So far it seems that owner references should be implemented, but do not immediately mean cascading delete. Depending on the answer here, we could skip the "optional".

Earlier objections to adding ownership to the statefulset are ok, so proceeding with ownership pointing to the CRD seems good to me.

Edit:
https://kubernetes.io/docs/concepts/architecture/garbage-collection/#owners-dependents

Reading more sheds some light on the apparent default of cascading delete.
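For reference, a small client-go sketch showing that the cascading vs. orphaning decision is made per delete request via the propagation policy, independently of the owner references on the dependents (the function is illustrative, not operator code):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// deletePostgresql deletes a postgresql CR. Background propagation is the usual
// default for dependents that carry owner references; Orphan leaves them in place.
func deletePostgresql(ctx context.Context, dc dynamic.Interface, namespace, name string, orphan bool) error {
	gvr := schema.GroupVersionResource{Group: "acid.zalan.do", Version: "v1", Resource: "postgresqls"}
	policy := metav1.DeletePropagationBackground
	if orphan {
		policy = metav1.DeletePropagationOrphan
	}
	return dc.Resource(gvr).Namespace(namespace).Delete(ctx, name, metav1.DeleteOptions{
		PropagationPolicy: &policy,
	})
}
```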

@sagikazarmark

To be completely honest: I never really understood the fear from "accidental deletion". I mean, what if I "accidentally" delete the StatefulSet? How is that any different?

Perhaps we should address that question first, so that we can come up with an optional solution.

One potential argument I could make is that accidentally deleting the CRD would also lead to deleting all CRs and databases. If that's the concern, the operator could apply a finalizer to the CRD as long as there are CRs in the cluster. That would prevent the CRD from being deleted as long as it has any CRs.
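A hedged sketch of that idea, keeping a finalizer on the postgresqls CRD while any CRs exist (the finalizer name and helper are assumptions, not existing operator code):

```go
package sketch

import (
	"context"
	"slices"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// crdFinalizer is a hypothetical name, not an existing finalizer.
const crdFinalizer = "acid.zalan.do/protect-while-in-use"

// protectCRD keeps a finalizer on the postgresqls CRD while any postgresql CRs
// exist, so the CRD (and with it every CR and database) cannot be removed out
// from under running clusters; it drops the finalizer once the last CR is gone.
func protectCRD(ctx context.Context, client apiextensionsclient.Interface, crsExist bool) error {
	crd, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, "postgresqls.acid.zalan.do", metav1.GetOptions{})
	if err != nil {
		return err
	}
	has := slices.Contains(crd.Finalizers, crdFinalizer)
	switch {
	case crsExist && !has:
		crd.Finalizers = append(crd.Finalizers, crdFinalizer)
	case !crsExist && has:
		crd.Finalizers = slices.DeleteFunc(crd.Finalizers, func(f string) bool { return f == crdFinalizer })
	default:
		return nil // already in the desired state
	}
	_, err = client.ApiextensionsV1().CustomResourceDefinitions().Update(ctx, crd, metav1.UpdateOptions{})
	return err
}
```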

Other than the above, I simply don't see how "accidentally" deleting the CR is different from deleting the StatefulSet.

Now, adding an owner reference will result in a cascading delete by default, unless an orphan deletion policy is applied to the delete request. However, that has to be explicit, so it doesn't protect you from accidental deletion either.

Applying finalizers to CRs could also prevent deletions, but these would have to be applied "manually" and directly to the CRs, thus would have nothing to do with the operator itself.

Again, I'm trying to understand the accidental deletion argument and, as others before me, I don't see the difference between the CR and the StatefulSet from this perspective; as a result I can't see why owner references shouldn't be the default behavior.

I've seen on occasion that a Custom Resource allowed configuring whether child objects should be orphaned: https://external-secrets.io/v0.5.7/guides-ownership-deletion-policy/#orphan

The problem with a controller-level configuration value is exactly what makes this issue hard: how do you handle config changes?

When the CR explicitly changes the behavior, you can react to the change. When you have a controller that starts with a new config value, you have no idea what the initial state was. We can guess based on whether ownerReferences are present or not, but migrating between the two becomes significantly harder (especially since we are talking about owner references; it wouldn't be this difficult IMO if we were talking about any other field).

Admittedly, I haven't looked into how the current lifecycle is managed, so it may not be that complicated to move between the two states. We might also just add the creationPolicy flag mentioned above with a default value first and add support for owner references in a next step. We have to figure out a step-by-step plan here.

But the first step, I believe, is deciding the default behavior and how we want to control deviations from it. If, for some reason, adding owner references by default (+ finalizers for whoever wants to prevent deleting CRs) is not enough, I would suggest adding a creationPolicy flag to the spec.

@Jan-M
Member

Jan-M commented Oct 4, 2022

You are right, anyone can delete things we do not want them to delete.

We originally had two safeguards here: the operator ignoring CRD deletes, and, better, we now have admission-controller-level delete protection. So we could consider this solved. The difference is mostly "risk" related: people will get into the habit of cleaning up and are more likely to delete CRDs than statefulsets. We also see that it takes time until migrations or deletes happen, but we can consider this a bit void now, just as it makes the implemented "ignore CRD delete" void once we have proper owner references.

Seems many here favor proper ownership vs accidental deletes and that can be the right approach.

You raise a concern around going back and forth between owner references on and off, something we might also downvote. The migration should be forward, towards owner references. And owner references can also become the default.

@sagikazarmark

Not sure I fully understand your last comment. From what I can tell, you propose adding owner references and it should be enabled by default.

Do you still want to be able to "turn it off"? If so, which approach: controller option or CRD flag?

We can limit migration between the two in some cases (e.g. make the CRD flag immutable), but not handling those situations could potentially mean that if you change the config, you may accidentally destroy your Postgres cluster. (Yeah, I see the irony :) )

@Wikiwix
Contributor

Wikiwix commented Oct 4, 2022

In my opinion, if some delete protection should be available, it should be on the data, i.e. the PersistentVolume(Claim).

This is possible for example for Elasticsearch managed by the ECK operator.

And, in my opinion, the upside to this sort of "delete protection" is that on CR (postgresql instance) deletion, the undesired state becomes apparent immediately (because the database becomes unavailable). That means such a mistake cannot remain unnoticed for a prolonged time (which unfortunately happened to some of our users). Yet a restore only requires recreating the postgresql object (for which the PVC template will be fulfilled by the still existing old PVCs).
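A sketch of protecting the data rather than the objects, by flipping the backing PersistentVolumes to a Retain reclaim policy (the helper and the cluster-name label selector are assumptions for illustration):

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// retainClusterVolumes sets the reclaim policy of every PV backing the cluster's
// PVCs to Retain, so even a cascading delete of CR -> StatefulSet -> PVCs leaves
// the volumes and their data behind for a later restore.
func retainClusterVolumes(ctx context.Context, kc kubernetes.Interface, namespace, clusterName string) error {
	pvcs, err := kc.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "cluster-name=" + clusterName, // label selector is an assumption
	})
	if err != nil {
		return err
	}
	for _, pvc := range pvcs.Items {
		if pvc.Spec.VolumeName == "" {
			continue // not bound yet
		}
		pv, err := kc.CoreV1().PersistentVolumes().Get(ctx, pvc.Spec.VolumeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pv.Spec.PersistentVolumeReclaimPolicy == corev1.PersistentVolumeReclaimRetain {
			continue
		}
		pv.Spec.PersistentVolumeReclaimPolicy = corev1.PersistentVolumeReclaimRetain
		if _, err := kc.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```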

@Jan-M
Member

Jan-M commented Oct 6, 2022

Yes, I propose to work on owner references, given the strong need and a concept employed by others. This will hopefully also reduce our own cleanup code at some future point.

I do think we need an off state and a migration path to "on". The new default can be "on". I do not think we need to invest into a path from "on" to "off".

@sagikazarmark

I do not think we need to invest into a path from "on" to "off".

That's only possible if on->off is not even possible (i.e. an immutable field in the spec). If there is a way to move from one to the other (the user can change the config) but we don't handle it, the result is unknown, most probably a disaster.

So I'd suggest that we either find a way to make sure a user can't change that config (the only way I can imagine right now is a field in the CR, not a controller level config) or add some sort of support for moving between the two states.

If for nothing else, to make sure the issue tracker is not flooded with bug reports after the feature is released.

@sagikazarmark

Some high-level steps to implement this feature, off the top of my head (a rough type sketch follows the list):

  • Add a field to the CRD spec (eg. ownershipPolicy: Owned/Orphaned)
  • Add a webhook that converts existing CRs to ownershipPolicy: Orphaned
  • Maybe merge to master/release
  • Implement ownership, switch between "old" and "new" strategy based on the CR ownershipPolicy value
  • Cleanup
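A rough Go type sketch of the proposed spec field (names and defaulting behaviour are illustrative only, not part of the actual CRD):

```go
package sketch

// OwnershipPolicy is the hypothetical per-cluster switch proposed above.
type OwnershipPolicy string

const (
	OwnershipOwned    OwnershipPolicy = "Owned"    // children get an ownerReference to the CR
	OwnershipOrphaned OwnershipPolicy = "Orphaned" // current behaviour: operator-side cleanup only
)

// PostgresSpecFragment shows only the proposed addition; the real spec carries
// many more fields.
type PostgresSpecFragment struct {
	// OwnershipPolicy selects how child resources are bound to this CR. Existing
	// CRs would be defaulted to "Orphaned" by the webhook from the step list above.
	OwnershipPolicy OwnershipPolicy `json:"ownershipPolicy,omitempty"`
}
```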

@cod-r

cod-r commented Nov 9, 2022

From the Argo CD UI you can open a terminal and exec into pods. Some developers need terminal access to run psql commands in the Zalando-created pods. Because the ownerReference is missing, we can't see the postgres pod and thus can't open a terminal from Argo CD.
Because of this we need to create another form of access to the k8s cluster instead of simply relying on the Argo CD UI.
This is a big issue because we need to change our access processes just for a single use case - accessing the postgres pod.

@mbegenau mbegenau linked a pull request Jan 31, 2023 that will close this issue
@dcharbonnier

This pull request aligns this amazing operator with the standards of Kubernetes, making it a very important improvement, especially with the rise of gitops.
How can we make progress on this matter? That's a big barrier to adoption.

@amitde69

I can agree that this matter is a big barrier to adoption.

@itayvolo

+1

@Alaith

Alaith commented Apr 3, 2024

Many other operators use ownerReference very successfully, like Vitess (a CNCF graduated project), the ECK operator, etc.
Can't wait to see this adopted.
