pool: file: object: clean up health checkers for both types of deletion #9094

Conversation


@BlaineEXE (Member) commented Nov 3, 2021

Clean up the code used to stop health checkers for all controllers
(pool, file, object). Health checkers should now be stopped when
removing the finalizer for a forced deletion when the CephCluster does
not exist. This prevents leaking a running health checker for a resource
that is going to be imminently removed.

Also tidy the health checker stopping code so that it is similar for all
3 controllers. Of note, the object controller now uses namespace and
name for the object health checker key; previously it used only the
name, which would create a problem for users who create CephObjectStores
with the same name in different namespaces.

Signed-off-by: Blaine Gardner blaine.gardner@redhat.com

In practice, any leaks would most likely show up in test clusters where a user installs clusters with various names/namespaces and deletes them to test features. For existing Rook releases, a leak can always be resolved by restarting the operator.

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from 8f7f925 to 674132f on November 3, 2021 19:09
Comment on lines -239 to -241
if _, ok := r.objectStoreContexts[cephObjectStore.Name]; ok {
    if r.objectStoreContexts[cephObjectStore.Name].internalCtx.Err() == nil {
        // get the latest version of the object now that the health checker is stopped
@BlaineEXE (Member, Author) commented:

I don't believe it was ever actually necessary to stop the health checker before looking for dependents. At worst, an error during the check could cause the reconcile to re-run, but I don't think that is possible unless we have multiple simultaneous reconciles in the future. Therefore, I removed the nested if check and un-indented the code below it.

// Start monitoring object store
if r.objectStoreContexts[objectstore.Name].started {
@BlaineEXE (Member, Author) commented:

The object store used to just use the name of the store for the key, which would cause conflicts for stores of the same name in different namespaces.
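
For illustration only, a minimal sketch of what a namespace-aware key helper for the object store could look like (the helper name objectStoreChannelKeyName is an assumption, mirroring the fsChannelKeyName pattern shown in the next comment; it is not necessarily the PR's exact code):

package objectsketch

import (
    "fmt"

    cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// objectStoreChannelKeyName keys the in-memory health-checker map by namespace
// and name, so two CephObjectStores with the same name in different namespaces
// no longer collide on a single entry.
func objectStoreChannelKeyName(store *cephv1.CephObjectStore) string {
    return fmt.Sprintf("%s-%s", store.Namespace, store.Name)
}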

Comment on lines 472 to 479
func fsChannelKeyName(cephFilesystem *cephv1.CephFilesystem) string {
return fmt.Sprintf("%s-%s", cephFilesystem.Namespace, cephFilesystem.Name)
}
@BlaineEXE (Member, Author) commented:

I extended this pattern to the block and object controllers.

Comment on lines -406 to -408
func (r *ReconcileCephBlockPool) cancelMirrorMonitoring(cephBlockPoolName string) {
// Cancel the context to stop the go routine
r.blockPoolContexts[cephBlockPoolName].internalCancel()
@BlaineEXE (Member, Author) commented:

Although the diff makes it look like I deleted this, I actually extended this cancelMonitoring pattern to the file and object controllers.
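
A rough sketch of the shared shape of that pattern, using stand-in types rather than the real controller structs (resourceContext, monitorRegistry, and the field names here are assumptions for illustration):

package monitorsketch

import "context"

// resourceContext is a stand-in for the per-resource context each controller keeps.
type resourceContext struct {
    internalCtx    context.Context
    internalCancel context.CancelFunc
}

// monitorRegistry is a stand-in for the controller's map of running health checkers,
// keyed by "namespace-name".
type monitorRegistry struct {
    contexts map[string]*resourceContext
}

// cancelMonitoring cancels the per-resource context, which stops the health-check
// goroutine, and removes the entry so a re-created resource starts fresh.
func (r *monitorRegistry) cancelMonitoring(key string) {
    if c, ok := r.contexts[key]; ok {
        c.internalCancel()
        delete(r.contexts, key)
    }
}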

@travisn (Member) left a comment

just a small suggestion

    // Cancel the context to stop the go routine
    r.blockPoolContexts[cephBlockPoolName].internalCancel()
func blockPoolChannelKeyName(cephBlockPool *cephv1.CephBlockPool) string {
    return fmt.Sprintf("%s-%s", cephBlockPool.Namespace, cephBlockPool.Name)
Member commented:

Shall we use a NamespacedName.String() similar to the object keyname?
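
For reference, types.NamespacedName.String() renders as "namespace/name", which is also unique across namespaces; a small standalone example (illustrative, not the PR's code):

package main

import (
    "fmt"

    "k8s.io/apimachinery/pkg/types"
)

func main() {
    // NamespacedName.String() joins namespace and name with a slash.
    key := types.NamespacedName{Namespace: "rook-ceph", Name: "my-pool"}
    fmt.Println(key.String()) // prints "rook-ceph/my-pool"
}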

@BlaineEXE (Member, Author) commented:

I was considering that. This merely keeps exactly what existed before. I had an initial concern about the update scenario, but since the operator is restarted for the update, I don't think there will be any pre-existing channels to conflict with. Does that sound correct to you also?

Member commented:

Agreed, the key is just used in-memory so the upgrade should be fine

@leseb (Member) left a comment

Alternatively, we could cancel the orchestration when a DELETE event is received (from the predicate) if we want to be 100% sure to avoid leaks (since the context is terminated). However, this approach is a bit more flexible and does not trigger a full reconcile for all the controllers so LGTM.

@BlaineEXE (Member, Author) commented:

I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

@travisn (Member) commented Nov 4, 2021

> I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

Good question on how complete this solution is. My inclination is that we only support when the finalizer is properly handled by the operator. If the finalizer is removed manually, we just won't clean up. The operator will likely be deleted at the same time as they are cleaning up the finalizers anyway. Or worst case, they could restart the operator if they hit unexpected behavior, but that's fine if we claim it's unsupported.

@BlaineEXE (Member, Author) commented:

> I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

> Good question on how complete this solution is. My inclination is that we only support when the finalizer is properly handled by the operator. If the finalizer is removed manually, we just won't clean up. The operator will likely be deleted at the same time as they are cleaning up the finalizers anyway. Or worst case, they could restart the operator if they hit unexpected behavior, but that's fine if we claim it's unsupported.

I think it's actually pretty simple to support. I added the call to cancel the health checker in one more place during reconcile for all the controllers.
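
As a hedged sketch of that idea (the helper name stopMonitoring and the exact wiring are assumptions, not the PR's code): when Reconcile finds the resource missing, for example because a user removed the finalizer by hand, it can still stop any health checker left running for it.

package reconcilesketch

import (
    "context"

    kerrors "k8s.io/apimachinery/pkg/api/errors"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// ensureCheckerStoppedIfDeleted fetches the CephBlockPool for a reconcile request.
// If it no longer exists, it calls stopMonitoring (a stand-in for the controller's
// cancelMonitoring helper) so the health-check goroutine is not leaked.
func ensureCheckerStoppedIfDeleted(ctx context.Context, c client.Client,
    req reconcile.Request, stopMonitoring func(key string)) (deleted bool, err error) {

    pool := &cephv1.CephBlockPool{}
    if getErr := c.Get(ctx, req.NamespacedName, pool); getErr != nil {
        if kerrors.IsNotFound(getErr) {
            // Resource is gone (normal deletion or manually removed finalizer):
            // make sure its health checker is stopped before returning.
            stopMonitoring(req.NamespacedName.String())
            return true, nil
        }
        return false, getErr
    }
    return false, nil
}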

@leseb (Member) commented Nov 4, 2021

> I wonder if this is really the best way to go about this. Or perhaps, this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch Delete events on resources also and remove the health checker when the object is deleted also.

> Good question on how complete this solution is. My inclination is that we only support when the finalizer is properly handled by the operator. If the finalizer is removed manually, we just won't clean up. The operator will likely be deleted at the same time as they are cleaning up the finalizers anyway. Or worst case, they could restart the operator if they hit unexpected behavior, but that's fine if we claim it's unsupported.

> I think it's actually pretty simple to support. I added the call to cancel the health checker in one more place during reconcile for all the controllers.

If we are after simplicity, why not go with what I described here #9094 (review)? Also, you cannot remove the healthchecker if you are in the predicate.

@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from 674132f to 10c74bb on November 4, 2021 17:29
@BlaineEXE (Member, Author) commented Nov 4, 2021

> Alternatively, we could cancel the orchestration when a DELETE event is received (from the predicate) if we want to be 100% sure to avoid leaks (since the context is terminated). However, this approach is a bit more flexible and does not trigger a full reconcile for all the controllers so LGTM.

I'm not seeing how this would stop the health checker. To my understanding, it would merely cancel any ongoing reconciles, but it could still leave the health checker goroutines running.

> If we are after simplicity, why not go with what I described here #9094 (review)? Also, you cannot remove the healthchecker if you are in the predicate.

Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.
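
To illustrate (a minimal sketch, not the controllers' actual predicates): a predicate whose DeleteFunc returns true means a delete event still enqueues a Reconcile, and the cleanup described above then runs when the object turns out to be missing.

package predicatesketch

import (
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// deletionTriggersReconcile returns a predicate that lets delete events through,
// so the controller reconciles the now-missing object and can clean up in-memory
// state such as health-check goroutines and their contexts.
func deletionTriggersReconcile() predicate.Funcs {
    return predicate.Funcs{
        DeleteFunc: func(e event.DeleteEvent) bool {
            return true
        },
    }
}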

@leseb (Member) commented Nov 5, 2021

> Alternatively, we could cancel the orchestration when a DELETE event is received (from the predicate) if we want to be 100% sure to avoid leaks (since the context is terminated). However, this approach is a bit more flexible and does not trigger a full reconcile for all the controllers so LGTM.

> I'm not seeing how this would stop the health checker. To my understanding, it would merely cancel any ongoing reconciles, but it could still leave the health checker goroutines running.

> If we are after simplicity, why not go with what I described here #9094 (review)? Also, you cannot remove the healthchecker if you are in the predicate.

When the orch is canceled so is the context and the goroutines use the context, so they will terminate also.

> Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.
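
To illustrate the point above about the goroutines using the context (a generic sketch, not Rook's exact health-checker code): a check loop that selects on ctx.Done() exits as soon as its context is canceled.

package healthsketch

import (
    "context"
    "time"
)

// runHealthChecker runs check on a fixed interval until ctx is canceled, then
// returns, letting the goroutine that called it exit cleanly.
func runHealthChecker(ctx context.Context, interval time.Duration, check func()) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            // Context canceled (resource deleted, orchestration canceled, or
            // operator shutting down): stop checking.
            return
        case <-ticker.C:
            check()
        }
    }
}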

@BlaineEXE (Member, Author) commented:

> When the orch is canceled so is the context and the goroutines use the context, so they will terminate also.

Ah, I see now.

> Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.

I still think that since the predicate returns true for delete events, we will see a reconcile whenever an object is deleted, and when we find it doesn't exist, we can verify that the health checker is stopped, if any errors happened or a user force-deleted. Then we don't need to re-run all reconciles, which could be expensive if there are lots of resources.

@leseb (Member) commented Nov 8, 2021

> When the orch is canceled so is the context and the goroutines use the context, so they will terminate also.

> Ah, I see now.

> Also, the predicates return true for delete events which should mean we run a "Reconcile" with the object missing, activating my recent addition, so I don't see why we need to do anything special inside predicates or consider them specially.

> I still think that since the predicate returns true for delete events, we will see a reconcile whenever an object is deleted, and when we find it doesn't exist, we can verify that the health checker is stopped, if any errors happened or a user force-deleted. Then we don't need to re-run all reconciles, which could be expensive if there are lots of resources.

I'm still fine with that.

@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from 10c74bb to cea0b63 on November 19, 2021 17:04
@BlaineEXE force-pushed the clean-up-resource-health-checkers-when-removing-finalizers branch from cea0b63 to 03ba7de on November 19, 2021 17:29
@BlaineEXE merged commit fcd0d90 into rook:master on Nov 19, 2021
@BlaineEXE deleted the clean-up-resource-health-checkers-when-removing-finalizers branch on November 19, 2021 18:57
leseb added a commit that referenced this pull request Dec 15, 2021
pool: file: object: clean up health checkers for both types of deletion (backport #9094)