pool: file: object: clean up health checkers for both types of deletion #9094

Conversation


@BlaineEXE (Member) commented Nov 3, 2021

Clean up the code used to stop health checkers for all controllers
(pool, file, object). Health checkers should now be stopped when
removing the finalizer for a forced deletion when the CephCluster does
not exist. This prevents leaking a running health checker for a resource
that is going to be imminently removed.

Also tidy the health checker stopping code so that it is similar for all
3 controllers. Of note, the object controller now uses namespace and
name for the object health checker key; previously it used only the
name, which would create a problem for users who create CephObjectStores
with the same name in different namespaces.

Signed-off-by: Blaine Gardner blaine.gardner@redhat.com

In practice, any leaks would most likely show up in test clusters where a user installs clusters with various names/namespaces and deletes them to test features. For existing Rook releases, a leak can always be resolved by restarting the operator.

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from 8f7f925 to 674132f on November 3, 2021 19:09
Comment on lines -239 to -241
if _, ok := r.objectStoreContexts[cephObjectStore.Name]; ok {
    if r.objectStoreContexts[cephObjectStore.Name].internalCtx.Err() == nil {
        // get the latest version of the object now that the health checker is stopped
@BlaineEXE (Member, Author) commented:

I don't believe it was ever actually necessary to stop the health checker before looking for dependents. At worst, an error during the check could cause the reconcile to re-run, but I don't think that is possible unless we have multiple simultaneous reconciles in the future. Therefore, I removed the nested if check and un-indented the code below it.

// Start monitoring object store
if r.objectStoreContexts[objectstore.Name].started {
@BlaineEXE (Member, Author) commented:

The object store used to just use the name of the store for the key, which would cause conflicts for stores of the same name in different namespaces.
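
For illustration only, a minimal sketch of what a namespace-aware key helper for the object store could look like (the helper name objectStoreChannelKeyName is an assumption, mirroring the fsChannelKeyName pattern shown in the next comment; it is not necessarily the PR's exact code):

package objectsketch

import (
    "fmt"

    cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// objectStoreChannelKeyName keys the in-memory health-checker map by namespace
// and name, so two CephObjectStores with the same name in different namespaces
// no longer collide on a single entry.
func objectStoreChannelKeyName(store *cephv1.CephObjectStore) string {
    return fmt.Sprintf("%s-%s", store.Namespace, store.Name)
}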

Comment on lines 472 to 479
func fsChannelKeyName(cephFilesystem *cephv1.CephFilesystem) string {
return fmt.Sprintf("%s-%s", cephFilesystem.Namespace, cephFilesystem.Name)
}
@BlaineEXE (Member, Author) commented:

I extended this pattern to the block and object controllers.

Comment on lines -406 to -408
func (r *ReconcileCephBlockPool) cancelMirrorMonitoring(cephBlockPoolName string) {
// Cancel the context to stop the go routine
r.blockPoolContexts[cephBlockPoolName].internalCancel()
@BlaineEXE (Member, Author) commented:

Although the diff makes it look like I deleted this, I actually extended this cancelMonitoring pattern to the file and object controllers.
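
A rough sketch of the shared shape of that pattern, using stand-in types rather than the real controller structs (resourceContext, monitorRegistry, and the field names here are assumptions for illustration):

package monitorsketch

import "context"

// resourceContext is a stand-in for the per-resource context each controller keeps.
type resourceContext struct {
    internalCtx    context.Context
    internalCancel context.CancelFunc
}

// monitorRegistry is a stand-in for the controller's map of running health checkers,
// keyed by "namespace-name".
type monitorRegistry struct {
    contexts map[string]*resourceContext
}

// cancelMonitoring cancels the per-resource context, which stops the health-check
// goroutine, and removes the entry so a re-created resource starts fresh.
func (r *monitorRegistry) cancelMonitoring(key string) {
    if c, ok := r.contexts[key]; ok {
        c.internalCancel()
        delete(r.contexts, key)
    }
}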

@travisn (Member) left a comment

just a small suggestion

    // Cancel the context to stop the go routine
    r.blockPoolContexts[cephBlockPoolName].internalCancel()
func blockPoolChannelKeyName(cephBlockPool *cephv1.CephBlockPool) string {
    return fmt.Sprintf("%s-%s", cephBlockPool.Namespace, cephBlockPool.Name)
Member commented:

Shall we use a NamespacedName.String() similar to the object keyname?
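
For reference, types.NamespacedName.String() renders as "namespace/name", which is also unique across namespaces; a small standalone example (illustrative, not the PR's code):

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
)

func main() {
    // NamespacedName.String() joins namespace and name with a slash.
    key := types.NamespacedName{Namespace: "rook-ceph", Name: "my-pool"}
    fmt.Println(key.String()) // prints "rook-ceph/my-pool"
}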

@BlaineEXE (Member, Author) commented:

I was considering that. This merely keeps exactly what existed before. I had an initial concern about the update scenario, but since the operator is restarted for the update, I don't think there will be any pre-existing channels to conflict with. Does that sound correct to you also?

Member commented:

Agreed, the key is just used in-memory so the upgrade should be fine

@leseb (Member) left a comment

Alternatively, we could cancel the orchestration when a DELETE event is received (from the predicate) if we want to be 100% sure to avoid leaks (since the context is terminated). However, this approach is a bit more flexible and does not trigger a full reconcile for all the controllers so LGTM.

@BlaineEXE (Member, Author) commented:

I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

@travisn (Member) commented Nov 4, 2021

> I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

Good question on how complete this solution is. My inclination is that we only support when the finalizer is properly handled by the operator. If the finalizer is removed manually, we just won't clean up. The operator will likely be deleted at the same time as they are cleaning up the finalizers anyway. Or worst case, they could restart the operator if they hit unexpected behavior, but that's fine if we claim it's unsupported.

@BlaineEXE (Member, Author) commented:

> I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

> Good question on how complete this solution is. My inclination is that we only support when the finalizer is properly handled by the operator. If the finalizer is removed manually, we just won't clean up. The operator will likely be deleted at the same time as they are cleaning up the finalizers anyway. Or worst case, they could restart the operator if they hit unexpected behavior, but that's fine if we claim it's unsupported.

I think it's actually pretty simple to support. I added the call to cancel the health checker in one more place during reconcile for all the controllers.
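
As a hedged sketch of that idea (the helper name stopMonitoring and the exact wiring are assumptions, not the PR's code): when Reconcile finds the resource missing, for example because a user removed the finalizer by hand, it can still stop any health checker left running for it.

package reconcilesketch

import (
    "context"

    kerrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// ensureCheckerStoppedIfDeleted fetches the CephBlockPool for a reconcile request.
// If it no longer exists, it calls stopMonitoring (a stand-in for the controller's
// cancelMonitoring helper) so the health-check goroutine is not leaked.
func ensureCheckerStoppedIfDeleted(ctx context.Context, c client.Client,
    req reconcile.Request, stopMonitoring func(key string)) (deleted bool, err error) {

    pool := &cephv1.CephBlockPool{}
    if getErr := c.Get(ctx, req.NamespacedName, pool); getErr != nil {
        if kerrors.IsNotFound(getErr) {
            // Resource is gone (normal deletion or manually removed finalizer):
            // make sure its health checker is stopped before returning.
            stopMonitoring(req.NamespacedName.String())
            return true, nil
        }
        return false, getErr
    }
    return false, nil
}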

@leseb (Member) commented Nov 4, 2021

> I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

> Good question on how complete this solution is. My inclination is that we only support when the finalizer is properly handled by the operator. If the finalizer is removed manually, we just won't clean up. The operator will likely be deleted at the same time as they are cleaning up the finalizers anyway. Or worst case, they could restart the operator if they hit unexpected behavior, but that's fine if we claim it's unsupported.

> I think it's actually pretty simple to support. I added the call to cancel the health checker in one more place during reconcile for all the controllers.

If we are after simplicity, why not go with what I described here #9094 (review)? Also, you cannot remove the healthchecker if you are in the predicate.

@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from 674132f to 10c74bb on November 4, 2021 17:29
@BlaineEXE (Member, Author) commented Nov 4, 2021

> Alternatively, we could cancel the orchestration when a DELETE event is received (from the predicate) if we want to be 100% sure to avoid leaks (since the context is terminated). However, this approach is a bit more flexible and does not trigger a full reconcile for all the controllers so LGTM.

I'm not seeing how this would stop the health checker. To my understanding, it would merely cancel any ongoing reconciles, but it could still leave the health checker goroutines running.

> If we are after simplicity, why not go with what I described here #9094 (review)? Also, you cannot remove the healthchecker if you are in the predicate.

Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.
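
To illustrate (a minimal sketch, not the controllers' actual predicates): a predicate whose DeleteFunc returns true means a delete event still enqueues a Reconcile, and the cleanup described above then runs when the object turns out to be missing.

package predicatesketch

import (
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// deletionTriggersReconcile returns a predicate that lets delete events through,
// so the controller reconciles the now-missing object and can clean up in-memory
// state such as health-check goroutines and their contexts.
func deletionTriggersReconcile() predicate.Funcs {
    return predicate.Funcs{
        DeleteFunc: func(e event.DeleteEvent) bool {
            return true
        },
    }
}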

@leseb (Member) commented Nov 5, 2021

> Alternatively, we could cancel the orchestration when a DELETE event is received (from the predicate) if we want to be 100% sure to avoid leaks (since the context is terminated). However, this approach is a bit more flexible and does not trigger a full reconcile for all the controllers so LGTM.

> I'm not seeing how this would stop the health checker. To my understanding, it would merely cancel any ongoing reconciles, but it could still leave the health checker goroutines running.

> If we are after simplicity, why not go with what I described here #9094 (review)? Also, you cannot remove the healthchecker if you are in the predicate.

When the orch is canceled so is the context and the goroutines use the context, so they will terminate also.

> Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.
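
To illustrate the point above about the goroutines using the context (a generic sketch, not Rook's exact health-checker code): a check loop that selects on ctx.Done() exits as soon as its context is canceled.

package healthsketch

import (
    "context"
    "time"
)

// runHealthChecker runs check on a fixed interval until ctx is canceled, then
// returns, letting the goroutine that called it exit cleanly.
func runHealthChecker(ctx context.Context, interval time.Duration, check func()) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            // Context canceled (resource deleted, orchestration canceled, or
            // operator shutting down): stop checking.
            return
        case <-ticker.C:
            check()
        }
    }
}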

@BlaineEXE (Member, Author) commented:

> When the orch is canceled so is the context and the goroutines use the context, so they will terminate also.

Ah, I see now.

> Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.

I still think that since the predicate returns true for delete events, we will see a reconcile whenever an object is deleted, and when we find it doesn't exist, we can verify that the health checker is stopped, if any errors happened or a user force-deleted. Then we don't need to re-run all reconciles, which could be expensive if there are lots of resources.

@leseb (Member) commented Nov 8, 2021

> When the orch is canceled so is the context and the goroutines use the context, so they will terminate also.

> Ah, I see now.

> Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.

> I still think that since the predicate returns true for delete events, we will see a reconcile whenever an object is deleted, and when we find it doesn't exist, we can verify that the health checker is stopped, if any errors happened or a user force-deleted. Then we don't need to re-run all reconciles, which could be expensive if there are lots of resources.

I'm still fine with that.

@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from 10c74bb to cea0b63 on November 19, 2021 17:04
@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from cea0b63 to 03ba7de on November 19, 2021 17:29
@BlaineEXE merged commit fcd0d90 into rook:master on Nov 19, 2021
@BlaineEXE deleted the clean-up-resource-health-checkers-when-removing-finalizers branch on November 19, 2021 18:57
leseb added a commit that referenced this pull request Dec 15, 2021
pool: file: object: clean up health checkers for both types of deletion (backport #9094)