core: Treat cluster as not existing if the cleanup policy is set #9041
Conversation
// If the cluster has a cleanup policy to destroy the cluster, treat it as if it does not exist
if cephCluster.Spec.CleanupPolicy.HasDataDirCleanPolicy() {
	cephClusterExists = false
	return cephCluster, false, cephClusterExists, WaitForRequeueIfCephClusterNotReady
}
Can we change this to if it has the cleanup policy set and it has a deletion timestamp? Otherwise, a user could conceivably set the cleanup policy accidentally and intend to unset it without deleting the CephCluster.
For this to take place accidentally, they would have to set the cleanup policy and delete the child CR. Since it already requires two actions, and we interpret the yes-really-destroy-data cleanup policy as truly destructive, it seems sufficient.
I don't see the harm in looking for both, and it is safer for user data if we only make the assumption when the CephCluster is also being deleted. That is what we talked about with Jose, and what I thought we all agreed to.
[Update] This would also mean that a user has more chances to opt out of deletion along their journey. Maybe they set yes-really-destroy-data and suddenly remember that they forgot to back up one of their resources before hitting delete on the CephCluster. They would still have the opportunity to back out if we also look for the deletion timestamp, whereas without it, they are committed as soon as they set yes-really-destroy-data. Also, for deleting the CephCluster, they aren't committed until they actually hit delete on the CephCluster resource. Let's keep the same level of commitment for all of the resources.
If they have set yes-really-destroy-data, they are only committed to destroying the cluster at the moment they actually start deleting CRs. This seems consistent to me that we would go ahead and delete if the policy is to destroy the cluster. I don't see why we need yet another hoop to jump through to force delete resources. This allows them to start cleaning up resources in steps if desired.
Interestingly, it isn't even really destructive if they delete the child CRs like this since no Ceph resources will be deleted by the finalizers in this case. If they mistakenly deleted the CRs, they could just create the CRs again and everything would still remain in the same state.
what I thought we all agreed to.
It's never agreed until the PRs are merged. :)
The policy is for deleting the cluster, and I am against extending cluster-policy to sub-resource-policy. I think that is counterintuitive for users, and I don't want to sign-off on something that I think will be detrimental to their experience or understanding of how Rook operates.
Force-pushed from 8445246 to 38bc5a3
Requesting changes because this does not meet the proposal I agreed to. If we are missing the check that the user is actually deleting the CephCluster, my said-in-chat arguments about the unintuitive user experience remain.
The proposal I will accept:
If: we are deleting a CephObjectStore (could be any dependent resource, but let's keep this for the example)
and: the CephCluster has yes-really-destroy-data set
and: the CephCluster has a nonzero deletion timestamp
then: we can treat deletion of the CephObjectStore as though the CephCluster doesn't exist, because we can be pretty sure it won't exist very soon (i.e., just delete the CephObjectStore resource)
Why I am rejecting the changes as they are now:
The cleanupPolicy exists in the CephCluster CR because we intentionally, with effort, and by design, decided that users should have a protection from their Ceph cluster being accidentally deleted. Either by themselves due to manual error, or by an automation tool around it (helm, ocs-operator). This is an intentionally designed default-safe state. With this design, we would be forcing users to set their CephCluster into a default-unsafe configuration in order to allow them to cleanly delete CephObjectStores (as an example) in a way that force-removes that single resource. Then they would have to reset their configuration to the default-safe state again. If we truly treat user data protections as sacred, we should not be implementing the feature in this way.
As a further note, we already have a way for users to force-remove single resources, being removal of the finalizer on those resources, and that doesn't require setting default-unsafe configs for the underlying CephCluster, which this design could encourage users to do. Instead, I think we should merely treat this as a full-uninstallation optimization that allows everything to be deleted more quickly when the CephCluster is being deleted at the same time as other resources and with yes-delete-all-data set.
To be clear, if the CephCluster has the
I do not understand where this scenario comes from. Why would they set it, then unset it just to delete the object store? The purpose of allowing deletion with the cleanup policy is just for the full cleanup case. I don't see it for another scenario.
Agreed, this is a full-uninstallation optimization. And since Rook expects that the CephCluster CR is imminently going to be deleted after the cleanup policy is set, if not already deleted, I don't see how it improves the scenario. The cluster is expected to be fully destroyed at this point anyway. Adding the check for deletion is simple; I just am not understanding how it improves the scenario.
The important thing from my perspective is that the user can still back out of destroying the whole CephCluster even after they have set yes-really-destroy-data. From the flow of the CephObjectStore controller, for example, the CephCluster is probably going to be deleted forcefully if yes-really-destroy-data is set. I can imagine a real-life scenario where the admin intends to forcefully delete the CephCluster, thinking that they have backed up their CephObjectStore as needed. They set yes-really-destroy-data, and only afterward catch a mistake in their backups.
While this is a pretty edgy corner case, I think it explains the logical discrepancy I am stuck on. I think it's best to forcibly delete the CephObjectStore only when we are certain that the CephCluster is going to be imminently deleted. This way, the whole cluster and all resources (not just the CephCluster) have the time between when the cleanupPolicy is set and when the CephCluster delete call is made as a safety margin to catch any last mistakes in backing up before removal. The margin exists for the CephCluster itself, and so I think we should extend the same amount of margin to the subresources also.
In practice, it seems convenient, nice, and simple. However, the problem with this approach is that the cleanup policy can always be reverted. So now, when the cluster has a cleanup policy, we can remove any of the child CRs. I don't think we should treat the CleanupPolicy as if there was no CephCluster at all. It looks dangerous and can lead to unintended behaviors.
EDIT: I missed reading the current comments, at a quick glance it looks like @BlaineEXE has a similar concern.
Force-pushed from 38bc5a3 to 6b2501b
Are there any doc changes we should make about this new behavior, despite the cluster.yaml comment below?
# After the cleanup policy is set, if any other Ceph CRs (e.g. CephFilesystem, CephObjectStore, etc)
# are deleted, Rook will immediately remove their finalizers and allow them to be deleted.
IMO, we could make this change without changing the docs. At any rate, this statement could be clearer if it clarifies this is only if the CephCluster is being deleted. Otherwise, looks ready to go.
Sounds good just to leave it out for simplicity.
The cluster CR can be forcefully deleted, cleaning up the cluster resources, if the yes-really-destroy-data policy is set on the CR. In this case, the other controllers should treat the cluster CR as not existing and allow the finalizers to be removed on those resources if they are requested for deletion. Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
Force-pushed from 6b2501b to fd10d98
core: Treat cluster as not existing if the cleanup policy is set (backport #9041)
Description of your changes:
The cluster CR can be forcefully deleted, cleaning up the cluster resources, if the yes-really-destroy-data policy is set on the CR. In this case, the other controllers should treat the cluster CR as not existing and allow the finalizers to be removed on those resources if they are requested for deletion. This helps avoid the chicken-and-egg problem of trying to destroy the whole cluster and being blocked on resources such as buckets existing when their existence is really irrelevant. Setting yes-really-destroy-data on the cluster CR is expected to be a destructive action, and there is no further attempt to protect data.
Checklist:
make codegen has been run to update object specifications, if necessary.