Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: add events to CephBucketNotification CR #9400

Merged
merged 1 commit into from Jan 4, 2022

Conversation

yuvalif
Copy link
Contributor

@yuvalif yuvalif commented Dec 12, 2021

the events are used to mark the start/finish of the reconciliation
and also to log the bucket names on which the CephbucketNotification is
created

Signed-off-by: Yuval Lifshitz ylifshit@redhat.com

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@yuvalif yuvalif added test unit or integration testing ceph-rgw labels Dec 12, 2021
Copy link
Contributor

@thotz thotz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the codes LGTM. But I don't much internal about events, so another pair of eyes will be helpful

Copy link
Member

@BlaineEXE BlaineEXE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this is good. I have some clarifying questions/suggestions:

  • Do you also think it's useful to create events about when notifications are removed? If so, maybe we should we until after object: fix handling for notifications for OBC #9365 merges to add those at the same time
  • We have a function ReportIfNotPresent for events, which can be used to prevent spamming events to the API server. IMO, it is better to use the raw Eventf() function as you have here (I don't think the ReportIfNotPresent code is necessary and really only fixes a symptom and not a problem but I digress.). All this to say, if in your testing you are seeing events get repeatedly added, we might consider using that to spare the K8s apiserver.
  • I see that this focuses only on the "notification" controller and not OBC label controller. I think that's great, and I assume that there will be a follow-up PR for OBC label controller after?

@@ -159,6 +164,7 @@ func (r *ReconcileNotifications) reconcile(request reconcile.Request) (reconcile
}
if len(obcList.Items) == 0 {
logger.Debugf("no ObjectbucketClaim associated with CephBucketNotification %q", bnName)
r.recorder.Eventf(notification, "Normal", "Reconcile", "Finished reconciling %q", bnName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that using recorder.Eventf() is easy to read at a glance, but in the interest of creating more of a "contract" about the events that can be used in integration tests, I think it might be useful to use constants where possible. For integration tests, the things I think will be most important are verifying that a reconcile started as a result of a change (to know Rook has started work after an update) and verifying that a reconcile finished (to know when Rook has finished applying changes).

For example, "Normal" is a k8s core v1 api v1.EventTypeNormal.

In other places, we have used constants here: https://github.com/BlaineEXE/rook/blob/3f766406cb4a0e780a3a02e0e03a5ebb277ed08a/pkg/apis/ceph.rook.io/v1/types.go#L413-L416

IMO, it is worth considering a new event type "ReconcileStarted" given that we already have the pattern of ReconcileSucceeded/ReconcileFailed. It isn't strictly an API, but I think being consistent is good for users who might script around such things similarly to how we might in our integration tests.

Ideally, I'd like to reuse ReportReconcileResult. This will automatically add events ReconcileSucceeded or ReconcileFailed after a reconcile. I don't think we need to add another "Finished reconciling" event unless you see a particular need. Example: https://github.com/BlaineEXE/rook/blob/1af302fe683faaf7a1ed3faac32e28ee2921e858/pkg/operator/ceph/object/controller.go#L176-L177
However, I just realized that ReportReconcileResult uses the ReportIfNotPresent function, which means it won't always update when a reconcile finishes twice within 60 minutes. I think I will work on a PR to avoid using ReportIfNotPresent so we can rely on pre-built behavior here.

Copy link
Member

@BlaineEXE BlaineEXE Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#9427 <-- always report events PR

@@ -197,8 +203,10 @@ func (r *ReconcileNotifications) reconcile(request reconcile.Request) (reconcile
if err != nil {
return reconcile.Result{}, errors.Wrapf(err, "failed to provision CephBucketNotification %q for ObjectBucketClaims %q", bnName, bucketName)
}
r.recorder.Eventf(notification, "Normal", "Reconcile", "Notification %q provisioned on bucket %q", bnName, bucketName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we might consider a "Provisioned" event type here instead of reusing "Reconcile". Also should probably make this part of the API here: https://github.com/BlaineEXE/rook/blob/3f766406cb4a0e780a3a02e0e03a5ebb277ed08a/pkg/apis/ceph.rook.io/v1/types.go#L401-L426

Here are some events for comparison...

Pod:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  24s   default-scheduler  Successfully assigned rook-ceph/rook-ceph-rgw-my-store-a-85cf66b968-44hqh to minikube
  Normal  Pulled     24s   kubelet            Container image "quay.io/ceph/ceph:v16.2.6" already present on machine
  Normal  Created    24s   kubelet            Created container chown-container-data-dir
  Normal  Started    24s   kubelet            Started container chown-container-data-dir
  Normal  Pulled     23s   kubelet            Container image "quay.io/ceph/ceph:v16.2.6" already present on machine
  Normal  Created    23s   kubelet            Created container rgw
  Normal  Started    23s   kubelet            Started container rgw

replication controller :

Events:
Type    Reason            Age   From                   Message
----    ------            ----  ----                   -------
Normal  SuccessfulCreate  11m   replicaset-controller  Created pod: rook-ceph-rgw-my-store-a-85cf66b968-44hqh

Job:

Events:
Type    Reason            Age   From            Message
----    ------            ----  ----            -------
Normal  SuccessfulCreate  22m   job-controller  Created pod: rook-ceph-osd-prepare-minikube--1-s57vq
Normal  Completed         22m   job-controller  Job completed

@yuvalif
Copy link
Contributor Author

yuvalif commented Dec 15, 2021

Overall I think this is good. I have some clarifying questions/suggestions:

  • Do you also think it's useful to create events about when notifications are removed?

yes

If so, maybe we should we until after #9365 merges to add those at the same time

agree

  • We have a function ReportIfNotPresent for events, which can be used to prevent spamming events to the API server. IMO, it is better to use the raw Eventf() function as you have here (I don't think the ReportIfNotPresent code is necessary and really only fixes a symptom and not a problem but I digress.). All this to say, if in your testing you are seeing events get repeatedly added, we might consider using that to spare the K8s apiserver.

currently, events don't report errors/failures, so we probably not going to spam the system

  • I see that this focuses only on the "notification" controller and not OBC label controller. I think that's great, and I assume that there will be a follow-up PR for OBC label controller after?

i send events from the OBC label controller as well. see: https://github.com/rook/rook/pull/9400/files#diff-9dd4db96c329ea9c209cd43523aa04e1edc3df69304322939b34dba13598df51R166-R187
i do not send events on the OBC itself. please let me know if you think this would be useful?

@BlaineEXE
Copy link
Member

I was hoping we might be able to have a chat in the Rook huddle this morning to go over this with a little more bandwidth. @travisn and I were discussing using events vs status.conditions. We both believe conditions may be worth investigating long term, but those also come with a need to design and maintain the API for the conditions.

I think we finally agreed that it makes sense to use events for now since it can get us to a working state more quickly, and we can revisit the topic of status.conditions later. We also want to be careful not to spam the user/resource with events that make it hard to examine the state of the resource at a glance, and we think it's worth planning our strategy beforehand so we don't have too much churn in this PR, dragging it out and adding frustration.

For now, let's make sure we can converge on a plan for the events before requesting changes of @yuvalif that might have us going back and forth. My suggestion is to focus first on the events that will help us do the integration tests and consider other events later. For integration tests, I think we only need to be able to know that (1) a reconcile started after changes to a Notification/OBC resource, and (2) that the reconcile then finishes.

My suggestion is to use the ReportReconcileResult function to set ReconcileSucceeded/ReconcileFailed events on the CephObjectBucketNotification resource and on OBC resources to accomplish item (2). For item (1), we could add a ReconcileStarted event. Once we have these event types, we can use the k8s API in integration tests to wait for a new reconcile to start after changes, and then for it to finish.

For integration tests, this would enable the following specific behavior:

  • if a test changes labels on an OBC, we know that we will be looking for a new (present timestamp) ReconcileStarted event from the notification controller on the OBC, and then a ReconcileSucceeded event that follows it chronologically. Afterwards, it is valid to check that the RGW's bucket notifications are configured as we expect.
  • if a test changes labels on a CephObjectBucketNotification, we know that we will be looking for a new ReconcileStarted event followed by a ReconcileSucceeded event, just as before. And just as before, it would then be valid to check that the RGW's bucket notifications are configured as necessary.

@travisn
Copy link
Member

travisn commented Dec 15, 2021

While the events would give the tests deeper insight into what the operator is doing, why does this test need it when we haven't used it in the past? Sure, it's a more predictable pattern and would help avoid the need for timeouts, let's just be clear why the events are really necessary for these tests.

The pattern in the CI is generally: 1) Set the desired state, 2) Wait for the desired state, 3) If the desired state doesn't get applied after a timeout, fail the test.

The events are really for user awareness of what is happening. If we are adding events just for the testing, we will likely add some events that are either verbose or uninteresting to the user.

@BlaineEXE
Copy link
Member

While the events would give the tests deeper insight into what the operator is doing, why does this test need it when we haven't used it in the past? Sure, it's a more predictable pattern and would help avoid the need for timeouts, let's just be clear why the events are really necessary for these tests.

The pattern in the CI is generally: 1) Set the desired state, 2) Wait for the desired state, 3) If the desired state doesn't get applied after a timeout, fail the test.

There are tests for notifications that are a bit different from what we are currently doing. For our current tests, you are right that we currently apply a desired state and wait for the state to change to what we want. We are verifying a positive change. However, for notifications, we also have tests that verify that things do not change when they shouldn't. Like when a label is added to an OBC but the label should not lead to a new notification being created. In this test, we verify a negative, which requires new methods for determining when Rook has finished reconciling where we don't have a positive change to look for. We need a way to verify that Rook is done with changes recently made in order to verify that the notification was unchanged.

The events are really for user awareness of what is happening. If we are adding events just for the testing, we will likely add some events that are either verbose or uninteresting to the user.

Currently, we already have a pattern of sending ReconcileSucceeded/ReconcileFailed events. The only event we are really talking about adding is ReconcileStarted.

@yuvalif
Copy link
Contributor Author

yuvalif commented Dec 16, 2021

The events are really for user awareness of what is happening. If we are adding events just for the testing, we will likely add some events that are either verbose or uninteresting to the user.

yes, the events added here are, IMO, useful for user awareness

Currently, we already have a pattern of sending ReconcileSucceeded/ReconcileFailed events. The only event we are really talking about adding is ReconcileStarted.

i didn't add the ReconcileFailed event. mainly not to spam the API server in case of retries.
we can add them later if found useful

when a label is added to an OBC but the label should not lead to a new notification being created.

i did not add a ReconcileStarted event to the OBC, so I'm not sure we use the new events to tests this scenario.
in case of label change, i set events (start/provisioned/sucsee) only on the notification that needs to be provisioned

@BlaineEXE
Copy link
Member

Let's discuss in more detail in Monday's huddle.

@yuvalif
Copy link
Contributor Author

yuvalif commented Dec 21, 2021

added reconcile failure events here: f5acb87
events are added explicitly for better control of the content (in case of retry vs. error)

@mergify
Copy link

mergify bot commented Dec 21, 2021

This pull request has merge conflicts that must be resolved before it can be merged. @yuvalif please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork

}

r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileSucceeded), "Successfully reconciled %q", bnName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ReconcileSucceeded and ReconcileFailed events are added automatically by the reusable ReportReconcileResult. See here: https://github.com/BlaineEXE/rook/blob/cab3ce34fd96dd196c7ab1ff4de0c8182b97b1ce/pkg/operator/ceph/object/controller.go#L177

Please use this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about the case that I return a retry indication without an error? (for the case that the topic is not provisioned yet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be an error. Similarly to how a Pod gets a failure event if a Volume source doesn't exist. It helps the user understand what is wrong and correct it if it is their mistake (not creating the Topic).

Copy link
Contributor Author

@yuvalif yuvalif Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was specifically requested not to be an error in the original PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, what about the case that i failed to fetch the notification object itself?
how can i report an event on it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last thing. if we use the EventReporter i cannot use the FakeRecorder in the unit test

should I add a FakeReporter to k8sutil?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it was specifically requested not to be an error in the original PR

I thought I had asked for it to be an error, but I may have not been clear at the time.

last thing. if we use the EventReporter i cannot use the FakeRecorder in the unit test

should I add a FakeReporter to k8sutil?

Have you rebased on the latest master? I removed the deduplicating stuff from ReportReconcileResult so it should work with the recorder.EventRecorder now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works fine after rebase!

return reconcile.Result{}, errors.Wrapf(err, "failed to provision CephBucketNotification %q for ObjectBucketClaims %q", bnName, bucketName)
}
r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.Provisioned), "Notification %q provisioned on bucket %q", bnName, bucketName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had agreed that adding "Provisioned" events was going to be too much and that notifications should be later put into the status.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. will remove

@@ -128,56 +134,69 @@ func (r *ReconcileNotifications) reconcile(request reconcile.Request) (reconcile
return reconcile.Result{}, nil
}

namespace := notification.Namespace
bnName := types.NamespacedName{Namespace: namespace, Name: notification.Name}
r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileStarted), "Start reconciling %q", bnName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileStarted), "Start reconciling %q", bnName)
r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileStarted), "Started reconciling CephObjectBucketNotification %q", bnName)

logger.Infof("CephBucketTopic %q not provisioned yet", topicName)
failureMessage := fmt.Sprintf("Reconcile failed for %s, topic %q not provisioned yet", bnName, topicName)
logger.Info(failureMessage)
r.recorder.Event(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileFailed), failureMessage)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure events are handled automatically by reading the error message when ReportReconcileResult is used.

@@ -161,24 +165,32 @@ func (r *ReconcileOBCLabels) reconcile(request reconcile.Request) (reconcile.Res
return reconcile.Result{}, errors.Wrapf(err, "failed to retrieve CephBucketNotification %q", bnName)
}

r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileStarted), "Start reconciling %q", bnName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For integration tests, it will be important to know when reconcile begins due to OBC changes and when due to Notification changes.

Suggested change
r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileStarted), "Start reconciling %q", bnName)
r.recorder.Eventf(notification, kapiv1.EventTypeNormal, string(cephv1.ReconcileStarted), "Started reconciling CephObjectBucketNotification %q for ObjectBucketClaim %q", bnName, obcNamespacedName)

the events are used to mark the start/success/failure of the reconciliation
and also to log the bucket names on which the CephbucketNotification is
created

Signed-off-by: Yuval Lifshitz <ylifshit@redhat.com>
Copy link
Member

@BlaineEXE BlaineEXE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good to me. If there are any nits I missed, they'll be easy to patch later.

@BlaineEXE BlaineEXE merged commit 5b26cb1 into rook:master Jan 4, 2022
mergify bot added a commit that referenced this pull request Jan 4, 2022
rgw: add events to CephBucketNotification CR (backport #9400)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ceph-rgw test unit or integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants