Merge pull request #9837 from travisn/helm-prometheus-rules
monitoring: Create prometheus rules with helm chart
travisn committed Mar 21, 2022
2 parents 4f9a72c + 8dd4b77 commit 8e3350a
Showing 28 changed files with 1,953 additions and 527 deletions.
61 changes: 56 additions & 5 deletions Documentation/ceph-monitoring.md
@@ -95,12 +95,20 @@ A guide to how you can write your own Prometheus consoles can be found on the of

## Prometheus Alerts

To enable the Ceph Prometheus alerts follow these steps:
To enable the Ceph Prometheus alerts via the helm charts, set the following properties in `values.yaml`:
- rook-ceph chart:
  `monitoring.enabled: true`
- rook-ceph-cluster chart:
  `monitoring.enabled: true`
  `monitoring.createPrometheusRules: true`
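
For example, a minimal `values.yaml` for the rook-ceph-cluster chart could look like the following sketch (the release name and namespace are assumptions based on a typical install):

```yaml
# values.yaml (sketch): enable monitoring and the Ceph alert rules
monitoring:
  # Create the service monitors and related monitoring resources
  enabled: true
  # Also create the PrometheusRule resources containing the Ceph alerts
  createPrometheusRules: true
```

```console
helm upgrade --install -n rook-ceph rook-ceph-cluster rook-release/rook-ceph-cluster -f values.yaml
```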

1. Create the RBAC rules to enable monitoring.
Alternatively, to enable the Ceph Prometheus alerts with the example manifests, follow these steps:

1. Create the RBAC and Prometheus rules:

```console
kubectl create -f deploy/examples/monitoring/rbac.yaml
kubectl create -f deploy/examples/monitoring/localrules.yaml
```
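
   To verify that the rules were created, list the PrometheusRule resources (a quick check, assuming the default `rook-ceph` namespace):

```console
kubectl -n rook-ceph get prometheusrule
```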

2. Make the following changes to your CephCluster object (e.g., `cluster.yaml`).
@@ -116,12 +124,9 @@ spec:
[...]
  monitoring:
    enabled: true
    rulesNamespace: "rook-ceph"
[...]
```

(Where `rook-ceph` is the CephCluster name / namespace)

3. Deploy or update the CephCluster object.

```console
kubectl apply -f cluster.yaml
```

@@ -130,6 +135,52 @@

> **NOTE**: This expects the Prometheus Operator and a Prometheus instance to be pre-installed by the admin.
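
If the Prometheus Operator is not yet running, one common way to install it is the community kube-prometheus-stack chart (an example only; any Prometheus Operator deployment works):

```console
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
```
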
### Customize Alerts

The Prometheus alerts can be customized with a post-processor using tools such as [Kustomize](https://kustomize.io/).
For example, first render the chart's manifests locally:

```console
helm template -f values.yaml rook-release/rook-ceph-cluster > cluster-chart.yaml
```
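
Before writing any patches, it can help to confirm that the rendered output contains the rule resource to be targeted (a quick check; the name below matches the `target` in the kustomization that follows):

```console
grep -n "prometheus-ceph-rules" cluster-chart.yaml
```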

Now create the desired customization configuration files. This simple example will show how to
update the severity of a rule, add a label to a rule, and change the `for` time value.

Create a file named `kustomization.yaml`:

```yaml
patches:
- path: modifications.yaml
  target:
    group: monitoring.coreos.com
    kind: PrometheusRule
    name: prometheus-ceph-rules
    version: v1
resources:
- cluster-chart.yaml
```

Create a file named `modifications.yaml`:

```yaml
- op: add
  path: /spec/groups/0/rules/0/labels
  value:
    my-label: foo
    severity: none
- op: add
  path: /spec/groups/0/rules/0/for
  value: 15m
```
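
The entries in this file are standard JSON patch operations. To change a value that already exists instead of adding a new one, a `replace` op can be used (a sketch; the group and rule indices depend on the rendered chart):

```yaml
- op: replace
  path: /spec/groups/0/rules/0/labels/severity
  value: warning
```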

Finally, run kustomize to build the updated chart and create the customized Prometheus rules:

```console
kustomize build . > updated-chart.yaml
kubectl create -f updated-chart.yaml
```
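
The customized rules should then be visible in the cluster, including the label added above (assuming the rules are deployed to the `rook-ceph` namespace):

```console
kubectl -n rook-ceph get prometheusrule prometheus-ceph-rules -o yaml | grep my-label
```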

## Grafana Dashboards

The dashboards have been created by [@galexrt](https://github.com/galexrt). For feedback on the dashboards please reach out to him on the [Rook.io Slack](https://slack.rook.io).
1 change: 1 addition & 0 deletions Documentation/helm-ceph-cluster.md
@@ -64,6 +64,7 @@ The following tables lists the configurable parameters of the rook-operator char
| `toolbox.affinity` | Toolbox affinity | `{}` |
| `toolbox.resources` | Toolbox resources | see values.yaml |
| `monitoring.enabled` | Enable Prometheus integration, will also create necessary RBAC rules | `false` |
| `monitoring.createPrometheusRules` | Whether to create the Prometheus rules for Ceph alerts | `false` |
| `cephClusterSpec.*` | Cluster configuration, see below | See below |
| `ingress.dashboard` | Enable an ingress for the ceph-dashboard | `{}` |
| `cephBlockPools.[*]` | A list of CephBlockPool configurations to deploy | See below |
3 changes: 3 additions & 0 deletions PendingReleaseNotes.md
@@ -4,6 +4,8 @@

* The mds liveness and startup probes are now configured by the filesystem CR instead of the cluster CR. To apply the mds probes, they need to be specified in the filesystem CR. See the [filesystem CR doc](Documentation/ceph-filesystem-crd.md#metadata-server-settings) for more details. See #9550
* In the helm charts, all Ceph components now have default values for the pod resources. The values can be modified or removed in values.yaml depending on cluster requirements.
* Prometheus rules are now installed by the helm chart. If you were relying on the CephCluster setting `monitoring.enabled` to create the Prometheus rules, they instead need to be enabled by setting `monitoring.createPrometheusRules` in the helm chart values.

## Features

* The number of mgr daemons for example clusters is increased from 1 to 2, resulting in a standby mgr daemon.
@@ -12,3 +14,4 @@
* Network encryption is configurable with settings in the CephCluster CR. Requires the 5.11 kernel or newer.
* Network compression is configurable with settings in the CephCluster CR. Requires Ceph Quincy (v17) or newer.
* Add support for custom ceph.conf for csi pods. See #9567
* Added and updated many Ceph Prometheus rules, picked up from the Ceph repo
25 changes: 25 additions & 0 deletions deploy/charts/rook-ceph-cluster/prometheus/externalrules.yaml
@@ -0,0 +1,25 @@
groups:
  - name: persistent-volume-alert.rules
    rules:
      - alert: PersistentVolumeUsageNearFull
        annotations:
          description: PVC {{ $labels.persistentvolumeclaim }} utilization has crossed 75%. Free up some space or expand the PVC.
          message: PVC {{ $labels.persistentvolumeclaim }} is nearing full. Data deletion or PVC expansion is required.
          severity_level: warning
          storage_type: ceph
        expr: |
          (kubelet_volume_stats_used_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) / (kubelet_volume_stats_capacity_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) > 0.75
        for: 5s
        labels:
          severity: warning
      - alert: PersistentVolumeUsageCritical
        annotations:
          description: PVC {{ $labels.persistentvolumeclaim }} utilization has crossed 85%. Free up some space or expand the PVC immediately.
          message: PVC {{ $labels.persistentvolumeclaim }} is critically full. Data deletion or PVC expansion is required.
          severity_level: error
          storage_type: ceph
        expr: |
          (kubelet_volume_stats_used_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) / (kubelet_volume_stats_capacity_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) > 0.85
        for: 5s
        labels:
          severity: critical
