monitoring: customize prometheus rule alerts #9503
Conversation
Force-pushed from 73bf466 to 035be1b
With this PR we can easily add a "labels" section to give the option to override the Prometheus rule labels, as described in the resolved issue #8502.
Force-pushed from 5c8ed6c to 5ae4fcb
@@ -0,0 +1,118 @@
alerts:
What do we need this for?
We need this section if we would like to add a new section for overriding the default values of the Prometheus rule in the future.
For example, adding a "labels" section as described in the comment. Do you want to add this section in this PR?
If we are adding the "labels" section, we will add the default values in this file:

labels:
  prometheus: rook-prometheus
  role: alert-rules
alerts:
  ...
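For illustration, a sketch of how such a defaults file might look once a "labels" section is added; the alert entry under `alerts` is a placeholder borrowed from the documentation example later in this thread, not the actual default:

```yaml
# hypothetical defaults file; the alert entry is illustrative only
labels:
  prometheus: rook-prometheus
  role: alert-rules
alerts:
  CephMgrIsAbsent:
    for: 1m
    severityLevel: warning
```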
@@ -265,6 +265,10 @@ spec:
        - name: CSI_CEPHFS_PLUGIN_RESOURCE
          value: {{ .Values.csi.csiCephFSPluginResource | quote }}
        {{- end }}
        {{- if .Values.monitoring.prometheusRule }}
        - name: ROOK_CEPH_MONITORING_PROMETHEUS_RULE
Where is this set when not using Helm? operator.yaml needs some changes too.
Also, these rules will apply to all the CephClusters deployed, right?
I added changes to operator.yaml. Should we also add changes to operator-openshift.yaml?
Yes, these rules will apply to all the CephClusters deployed. If we would like to apply different rules for different CephClusters, we should add a new field in the CephCluster CRD instead of using an env var in the operator (as suggested in #9082). Do you think that's a better solution for our case?
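For reference, a sketch of what the operator.yaml change discussed here might look like; the value shown is an assumption, not the actual default from this PR:

```yaml
# hypothetical excerpt from the rook-ceph-operator Deployment in operator.yaml
env:
  - name: ROOK_CEPH_MONITORING_PROMETHEUS_RULE
    value: "true" # assumed toggle; the real value format is defined by the PR
```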
Just a couple more minor points, then I think it's looking good to merge soon.
@leseb other feedback?
Documentation/ceph-cluster-crd.md (Outdated)
# disabled: true
# CephMgrIsAbsent:
#   for: 1m
#   severityLevel: <custom-severityLevel> #<custom-severityLevel> must be one of the next levels: warning, critical, error
How about we just use one of the example values here?
Suggested change:
- # severityLevel: <custom-severityLevel> #<custom-severityLevel> must be one of the next levels: warning, critical, error
+ # severityLevel: warning # must be warning, critical, or error
Sure, makes sense.
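With the suggestion applied, the documented override in Documentation/ceph-cluster-crd.md might read roughly as follows; the surrounding `monitoring`/`alerts` nesting is assumed from context:

```yaml
monitoring:
  alerts:
    CephMgrIsAbsent:
      # disabled: true
      for: 1m
      severityLevel: warning # must be warning, critical, or error
```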
Force-pushed from 4c8b793 to f2baaa8
Force-pushed from f2baaa8 to d7b688e
Force-pushed from d8e2216 to 1a8e84f
I'm really grateful that yuvalman is getting involved with the community here, and I think the quality of work and dedication has been spectacular. I don't want my disagreement about the Rook operator's role in deploying Prometheus rules to be a reflection on the good quality I see here.

Fundamentally, I don't think Rook should have ever been deploying Prometheus rules. It creates a tight integration with Prometheus and a precedent that means users may want Rook to be tightly integrated with other logging systems like Graylog, etc. This was also born of a Red Hat downstream need, as evidenced by the hard-coded "openshift-storage" namespace on the rules, which I think was poor upstream citizenship at the time.

I looked at some high-profile Kubernetes projects, and I didn't find documented evidence that they deploy Prometheus rules like Rook does. I link my findings below. I think it would be better for Rook to stop deploying Prometheus rules.
If we want the Rook project to be more helpful to users with respect to Prometheus and Ceph integration (which I think is a good benefit for users), then I think adding the benefit at a more business-logic level is more appropriate. I don't think that adding complexity to override/modify Prometheus rules in the Rook operator's application logic (Go code) via our main CRD is the best approach. The CRD-override pattern forces the Rook operator to handle the case where users wish to modify the resources, adding complexity to the application logic (Go code), and it also introduces a domain-specific method of overriding values, which creates an additional burden on users to understand how to configure the overrides.

I think it would be a better user experience to either (1) include recommended Prometheus rules in the rook-ceph-cluster Helm chart or (2) create a new Helm chart with Rook-Ceph recommended Prometheus rules. If users wish to override or modify any of the rules, they can do so much more easily by using kustomize as a post-renderer, or they can render the Helm charts into files to modify themselves (which they could also version with GitOps). This still gives users the benefit of recommended Rook-Ceph Prometheus rules, but it avoids the downsides to developers and users I mentioned above.

Helm (with the option for a kustomize post-renderer) uses pre-existing methods that are well understood in the Kubernetes community, and the feature is provided in pure business logic. I believe the Helm chart is the best place to provide this benefit to users if we wish to provide it. It creates the least burden on developers by implementing the feature without having to maintain application logic, and it provides the most transparent means of customization for users by utilizing well-understood Kubernetes community patterns.
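As a concrete sketch of the kustomize post-renderer flow described above (the wrapper script name, file layout, and chart reference are illustrative assumptions, not part of this PR):

```sh
#!/bin/sh
# kustomize-wrapper.sh: helm pipes the fully rendered manifests to stdin
# and expects the patched manifests back on stdout.
cat > all.yaml      # capture the chart output for kustomize to reference
kustomize build .   # kustomization.yaml lists all.yaml and applies the patches

# invoked as, e.g.:
#   helm template rook-ceph-cluster rook-release/rook-ceph-cluster \
#     --post-renderer ./kustomize-wrapper.sh
```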
I second what @BlaineEXE said about @yuvalman's dedication to this PR. However, before jumping right away into the code/implementation, it would have been great to start by discussing this in a design doc instead, like we almost always do (as explained here). Anyway, what is done is done, I guess. Also, I have a part of the responsibility too, since I didn't mention this when the PR was opened, so sorry about that.

That said, on to the main subject. I believe PrometheusRule creation shouldn't have been in Rook from the get-go. It's not because we have this precedent that we have to keep building on top of it (especially if we agree it's not the right approach). Every company running on Kubernetes probably has a set of internal tools that do rule injection, so having the injection native to an operator is not much needed. One of the reasons we believe it's useful is that we might have too narrow a vision and a bias. People running Rook on Kubernetes most likely do not only run Rook but also other software that might have its own rules. So I think users have their own automation, and if they don't, this should be bundled via Helm.

As an action item, I'd like us to try removing the existing rule injection we have and manage this via Helm.
@BlaineEXE @leseb Is it fair to say that your perspective is that Prometheus integration should not be the Rook operator's responsibility and that Prometheus should not be treated as a first-class integration point? If we go with that philosophy, removing all Prometheus integration from the CephCluster CR has the following implications:
Alternatively, perhaps you are saying that only the PrometheusRule CRs should not be configurable through the cluster CR? In that case, this seems like a slippery slope... why would we treat some Prometheus artifacts such as the ServiceMonitor differently from the rules?

It is certainly an attractive idea for the helm chart to instead create the rules, since it naturally has templating capabilities that would make it simple to insert custom values. But at the same time, it means that any non-helm deployments (e.g. downstream) now require a separate implementation. While helm would be very nice for this customization feature, requiring two separate implementations is generally a red flag.

Overall, I'd say my design perspectives are:
Thanks for all the perspective and discussion, it's a lively topic! @yuvalman This has grown into a philosophical debate, but I certainly look forward to your perspective as well on the impact these different approaches would have on you.
I tend to agree with @BlaineEXE's approach. It is more practical for customers to interact directly with the metrics exposed by Ceph rather than going through an opaque layer dictated by Rook CRs. Rook as a solution is excellent for provisioning Ceph clusters on Kubernetes, and it is far more useful and flexible to expose Ceph's capabilities directly and make them accessible. The alerts themselves are usually very tailored to the specific business use case: thresholds, receivers, severity, on/off toggles... Such highly fluid declarations are better suited to being templated via charts than to the provisioning-focused CRs of Rook. I do think that providing a ServiceMonitor via the CRs is somewhat useful, as it is the de facto common method for scraping metrics and can be thought of as part of provisioning the Ceph Manager and exposing its capabilities. Not really a critical point IMHO.
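To make the templating point concrete, a hypothetical rule fragment in a chart template; the alert name, metrics, and the `osdNearFullThreshold` value are illustrative assumptions, not taken from this PR:

```yaml
- alert: CephOSDNearFull
  expr: ceph_osd_stat_bytes_used / ceph_osd_stat_bytes > {{ .Values.alerts.osdNearFullThreshold | default 0.75 }}
  for: 5m
  labels:
    severity: warning
```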
I believe that the approach suggested by @BlaineEXE might improve the reconcile loop and make the templating easier for the upstream user. It is also true that it is not common for high-profile Kubernetes projects to deploy their monitoring solution via their operator. In the Rook-Ceph use case, I can see the benefit of using the operator to deploy all the monitoring resources related to the CephCluster; it makes for a stronger integration with Ceph. I think that if only the upstream user is taken into account, we should go with @BlaineEXE's approach. When it comes to the downstream user, as @travisn mentioned, I believe this PR can solve both cases.
This pull request has merge conflicts that must be resolved before it can be merged. @yuvalman please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork
Force-pushed from 6421eeb to f6aa14b
@BlaineEXE, @leseb, and I discussed the different pros and cons of the approaches. The proposal is the following:

This is proposed for v1.9.
In case we decide to keep the monitoring solution as part of the reconcile loop, I want to make sure that you considered the following points:

If we still decide to keep the monitoring solution in the Rook operator's responsibility (in which case, I believe there is great value for users in the solution provided in this PR) and not to use the customization feature proposed in this PR, my design perspectives are:

Please let me know what you think about the points I mentioned.
This pull request has merge conflicts that must be resolved before it can be merged. @yuvalman please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork
Force-pushed from f6aa14b to 0efabf1
Signed-off-by: Yuval Manor <yuvalman958@gmail.com>
Force-pushed from 0efabf1 to ca1ace4
This pull request has merge conflicts that must be resolved before it can be merged. @yuvalman please rebase it. https://rook.io/docs/rook/latest/development-flow.html#updating-your-fork
@yuvalman I've opened #9837 as a draft of the proposal that the Prometheus rules could be installed by the helm chart and customized with tools such as kustomize. The really nice outcome is that the rules were picked up directly from the ceph repo, and no changes were needed. I haven't reviewed the rules to see if that's completely true, but at least the rules install in a test cluster with these changes. This way, with all the customization done at post-processing time with kustomize or similar tools, we really don't have to maintain any special processing in Rook or even in the helm chart.
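For example, a user could patch a rendered rule with a small kustomization like the following sketch; the resource name and rule paths are assumptions about the rendered output, not taken from #9837:

```yaml
# kustomization.yaml
resources:
  - rendered-manifests.yaml # output of `helm template`
patches:
  - target:
      kind: PrometheusRule
      name: prometheus-ceph-rules # assumed name of the rendered rule object
    patch: |-
      - op: replace
        path: /spec/groups/0/rules/0/labels/severity
        value: warning
```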
@travisn I agree. Regarding the rules from the ceph repo, I can see that they changed the expression that checks whether an OSD is full. That means that if a user would like to change the limit for this alert, they will have to customize the whole expression back to the old one to be able to control the limit from Rook. I believe that "OSD_NEARFULL" can now be configured from the ceph tools pod (but I didn't test it).
Shall we close this PR then? Thanks for all the discussion and work on this PR, it really is appreciated!
Yes, I know the nearfull ratio is customizable from the toolbox (or with the ceph.conf overrides), so it could be customized either way. I do need to review the rule changes in more depth before getting this merged. I may end up separating the rule updates into a follow-up PR to keep these changes focused on the redesign...
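For reference, adjusting the nearfull threshold from the toolbox uses a standard Ceph command, e.g.:

```console
# run inside the rook-ceph-tools pod; raises the OSD nearfull ratio to 80%
ceph osd set-nearfull-ratio 0.8
```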
Description of your changes:
Make Prometheus rule alerts customizable by user preference.
Signed-off-by: Yuval Manor yuvalman958@gmail.com
Which issue is resolved by this Pull Request:
Resolves #9082, #9005
Checklist:
Code generation (`make codegen`) has been run to update object specifications, if necessary.