pool: file: object: clean up health checkers for both types of deletion #9094
Conversation
Force-pushed from 8f7f925 to 674132f
```go
if _, ok := r.objectStoreContexts[cephObjectStore.Name]; ok {
	if r.objectStoreContexts[cephObjectStore.Name].internalCtx.Err() == nil {
// get the latest version of the object now that the health checker is stopped
```
I don't believe it was ever actually necessary to stop the health checker before looking for dependents. At worst, an error during the check could cause the reconcile to re-run, but I don't think that is even possible unless we allow multiple simultaneous reconciles in the future. Therefore, I removed the if/if check and un-indented the code below.
```go
// Start monitoring object store
if r.objectStoreContexts[objectstore.Name].started {
```
The object store controller used to key this map by just the name of the store, which would cause conflicts for stores with the same name in different namespaces.
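For context, a minimal sketch of the kind of bookkeeping this map holds; the struct name `objectStoreHealth` and the exact key format are assumptions for illustration, only the `internalCtx`, `internalCancel`, and `started` field names come from the diff above:

```go
// Sketch only: the struct name and key format are assumptions, not Rook's exact code.
package main

import (
	"context"
	"fmt"
)

type objectStoreHealth struct {
	internalCtx    context.Context
	internalCancel context.CancelFunc
	started        bool
}

func main() {
	objectStoreContexts := map[string]*objectStoreHealth{}

	// Keyed by namespace and name rather than just the store name, so two
	// CephObjectStores named "my-store" in different namespaces cannot collide.
	key := fmt.Sprintf("%s-%s", "rook-ceph", "my-store")

	ctx, cancel := context.WithCancel(context.Background())
	objectStoreContexts[key] = &objectStoreHealth{internalCtx: ctx, internalCancel: cancel, started: true}

	if objectStoreContexts[key].internalCtx.Err() == nil {
		fmt.Println("health checker context for", key, "is still live")
	}
	cancel()
}
```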
pkg/operator/ceph/file/controller.go
```go
func fsChannelKeyName(cephFilesystem *cephv1.CephFilesystem) string {
	return fmt.Sprintf("%s-%s", cephFilesystem.Namespace, cephFilesystem.Name)
}
```
I extended this pattern to block and object controllers.
```go
func (r *ReconcileCephBlockPool) cancelMirrorMonitoring(cephBlockPoolName string) {
	// Cancel the context to stop the go routine
	r.blockPoolContexts[cephBlockPoolName].internalCancel()
```
Although the diff makes it look like I deleted this, I actually extended this cancelMonitoring pattern to the file and object controllers.
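Roughly, the shared shape of that pattern as extended to the other controllers; the `channelHealth` and `reconciler` types below are stand-ins for the per-controller equivalents (blockPoolContexts, fsContexts, objectStoreContexts), not Rook's actual types:

```go
// Sketch only: cancel the resource's context to stop its monitoring goroutine,
// then forget the map entry so a re-created resource starts with fresh state.
package main

import "context"

type channelHealth struct {
	internalCtx    context.Context
	internalCancel context.CancelFunc
}

type reconciler struct {
	contexts map[string]*channelHealth
}

func (r *reconciler) cancelMonitoring(key string) {
	if h, ok := r.contexts[key]; ok {
		h.internalCancel()
		delete(r.contexts, key)
	}
}

func main() {
	r := &reconciler{contexts: map[string]*channelHealth{}}
	ctx, cancel := context.WithCancel(context.Background())
	r.contexts["rook-ceph-replicapool"] = &channelHealth{internalCtx: ctx, internalCancel: cancel}
	r.cancelMonitoring("rook-ceph-replicapool")
}
```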
just a small suggestion
pkg/operator/ceph/pool/controller.go
```go
// Cancel the context to stop the go routine
r.blockPoolContexts[cephBlockPoolName].internalCancel()

func blockPoolChannelKeyName(cephBlockPool *cephv1.CephBlockPool) string {
	return fmt.Sprintf("%s-%s", cephBlockPool.Namespace, cephBlockPool.Name)
```
Shall we use NamespacedName.String(), similar to the object store key name?
I was considering that. This merely keeps exactly what existed before. I had an initial concern about the update scenario, but since the operator is restarted for the update, I don't think there will be any pre-existing channels to conflict with. Does that sound correct to you as well?
Agreed, the key is only used in-memory, so the upgrade should be fine.
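For reference, a small sketch of what the two key styles under discussion produce; types.NamespacedName is the real apimachinery type, the namespace and pool name are made up:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// The "%s-%s" style this PR keeps for block pools:
	fmt.Printf("%s-%s\n", "rook-ceph", "replicapool") // rook-ceph-replicapool

	// The NamespacedName style suggested here:
	nn := types.NamespacedName{Namespace: "rook-ceph", Name: "replicapool"}
	fmt.Println(nn.String()) // rook-ceph/replicapool
}
```

Either form is unique per namespace/name pair, and since the key only lives in the operator's memory, the choice does not affect upgrades.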
Alternatively, we could cancel the orchestration when a DELETE event is received (from the predicate) if we want to be 100% sure to avoid leaks (since the context is terminated). However, this approach is a bit more flexible and does not trigger a full reconcile for all the controllers so LGTM.
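A rough sketch of the DELETE-event alternative being floated here (and debated below); the predicate wiring uses the real controller-runtime types, but opCancel is an assumed stand-in for the operator's orchestration context, not an actual Rook symbol:

```go
package main

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// buildDeletePredicate sketches cancelling an orchestration context when a
// DELETE event arrives; opCancel is an assumed handle for illustration only.
func buildDeletePredicate(opCancel context.CancelFunc) predicate.Funcs {
	return predicate.Funcs{
		DeleteFunc: func(e event.DeleteEvent) bool {
			// Cancelling here stops everything derived from that context, but
			// as noted below it does not by itself forget per-resource health
			// checkers tracked by the reconciler.
			opCancel()
			// Still queue a reconcile so the finalizer path runs.
			return true
		},
	}
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	_ = buildDeletePredicate(cancel)
}
```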
I wonder if this is really the best way to go about this, or perhaps this doesn't fix the issue fully. What happens if a user manually deletes the finalizer, for example? The routine could still be running. Maybe it would be good to watch
Good question on how complete this solution is. My inclination is that we only support the case where the finalizer is properly handled by the operator. If the finalizer is removed manually, we just won't clean up. The operator will likely be deleted at the same time they are cleaning up the finalizers anyway. Or, worst case, they could restart the operator if they hit unexpected behavior, which is fine if we claim that scenario is unsupported.
I think it's actually pretty simple to support. I added the call to cancel the health checker in one more place during reconcile for all the controllers.
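As a hedged illustration of where that extra call fits, with made-up names rather than Rook's actual reconcile code:

```go
// Sketch only: the names here are assumptions to show where the extra cancel
// call fits during reconcile, not Rook's exact code.
package main

import (
	"context"
	"fmt"
)

// reconcileDelete sketches the deletion path: whether or not the CephCluster
// still exists, the health checker's context is cancelled before the finalizer
// is removed, so the goroutine is not leaked for a resource about to disappear.
func reconcileDelete(cancelHealthChecker context.CancelFunc, cephClusterExists bool) {
	if !cephClusterExists {
		// Forced deletion: nothing to clean up in Ceph, but still stop the
		// health checker (this is the extra call) before removing the finalizer.
		cancelHealthChecker()
		fmt.Println("finalizer removed after forced deletion")
		return
	}
	// Normal deletion: clean up Ceph resources first, then stop the health
	// checker and remove the finalizer as before.
	cancelHealthChecker()
	fmt.Println("finalizer removed after normal deletion")
}

func main() {
	_, cancel := context.WithCancel(context.Background())
	reconcileDelete(cancel, false)
}
```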
If we are after simplicity, why not go with what I described in #9094 (review)? Also, you cannot remove the health checker if you are in the predicate.
Force-pushed from 674132f to 10c74bb
I'm not seeing how this would stop the health checker. To my understanding, it would merely cancel any ongoing reconciles, but it could still leave the health checker goroutines running.
Also, the predicates return
When the orchestration is canceled, so is the context, and since the goroutines use that context they will terminate as well.
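To make that concrete, a minimal sketch of why cancelling the parent context is enough to stop such a goroutine; the ticker interval and function name are arbitrary:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// checkHealth stands in for a health-checker goroutine: it loops on a ticker
// and exits as soon as the context it was started with is cancelled.
func checkHealth(ctx context.Context) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			fmt.Println("health checker stopped:", ctx.Err())
			return
		case <-ticker.C:
			// query status here
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go checkHealth(ctx)

	// Cancelling the orchestration context cancels everything derived from it,
	// so the goroutine above terminates as well.
	cancel()
	time.Sleep(200 * time.Millisecond)
}
```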
Ah, I see now.
I still think that since the predicate returns
I'm still fine with that.
Force-pushed from 10c74bb to cea0b63
Force-pushed from cea0b63 to 03ba7de
pool: file: object: clean up health checkers for both types of deletion (backport #9094)
Clean up the code used to stop health checkers for all controllers
(pool, file, object). Health checkers should now be stopped when
removing the finalizer for a forced deletion when the CephCluster does
not exist. This prevents leaking a running health checker for a resource
that is going to be imminently removed.
Also tidy the health checker stopping code so that it is similar for all
3 controllers. Of note, the object controller now uses namespace and
name for the object health checker, which would create a problem for
users who create a CephObjectStore with the same name in different
namespaces.
Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
In practice, any leaks would most likely show up in test clusters where a user installs clusters with various names/namespaces and deletes them to test features. For any existing Rook release, a leak can always be resolved by restarting the operator.
Description of your changes:
Which issue is resolved by this Pull Request:
Resolves #
Checklist:
Code generation (`make codegen`) has been run to update object specifications, if necessary.