Merge pull request #9837 from travisn/helm-prometheus-rules
monitoring: Create prometheus rules with helm chart
travisn committed Mar 21, 2022
2 parents 4f9a72c + 8dd4b77 commit 8e3350a
Showing 28 changed files with 1,953 additions and 527 deletions.
61 changes: 56 additions & 5 deletions Documentation/ceph-monitoring.md
@@ -95,12 +95,20 @@ A guide to how you can write your own Prometheus consoles can be found on the of

## Prometheus Alerts

To enable the Ceph Prometheus alerts follow these steps:
To enable the Ceph Prometheus alerts via the helm charts, set the following properties in `values.yaml`:
- rook-ceph chart:
  `monitoring.enabled: true`
- rook-ceph-cluster chart:
  `monitoring.enabled: true`
  `monitoring.createPrometheusRules: true`
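
For example, a minimal `values.yaml` for the rook-ceph-cluster chart could look like the following sketch (the release name and namespace are assumptions based on a typical install):

```yaml
# values.yaml (sketch): enable monitoring and the Ceph alert rules
monitoring:
  # Create the service monitors and related monitoring resources
  enabled: true
  # Also create the PrometheusRule resources containing the Ceph alerts
  createPrometheusRules: true
```

```console
helm upgrade --install -n rook-ceph rook-ceph-cluster rook-release/rook-ceph-cluster -f values.yaml
```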

1. Create the RBAC rules to enable monitoring.
Alternatively, to enable the Ceph Prometheus alerts with the example manifests, follow these steps:

1. Create the RBAC and Prometheus rules:

```console
kubectl create -f deploy/examples/monitoring/rbac.yaml
kubectl create -f deploy/examples/monitoring/localrules.yaml
```
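
   To verify that the rules were created, list the PrometheusRule resources (a quick check, assuming the default `rook-ceph` namespace):

```console
kubectl -n rook-ceph get prometheusrule
```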

2. Make the following changes to your CephCluster object (e.g., `cluster.yaml`).
@@ -116,12 +124,9 @@ spec:
[...]
  monitoring:
    enabled: true
    rulesNamespace: "rook-ceph"
[...]
```

(Where `rook-ceph` is the CephCluster name / namespace)

3. Deploy or update the CephCluster object.

```console
kubectl apply -f cluster.yaml
```

@@ -130,6 +135,52 @@

> **NOTE**: This expects the Prometheus Operator and a Prometheus instance to be pre-installed by the admin.
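
If the Prometheus Operator is not yet running, one common way to install it is the community kube-prometheus-stack chart (an example only; any Prometheus Operator deployment works):

```console
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
```
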
### Customize Alerts

The Prometheus alerts can be customized with a post-processor using tools such as [Kustomize](https://kustomize.io/).
For example, first render the chart's manifests locally:

```console
helm template -f values.yaml rook-release/rook-ceph-cluster > cluster-chart.yaml
```
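
Before writing any patches, it can help to confirm that the rendered output contains the rule resource to be targeted (a quick check; the name below matches the `target` in the kustomization that follows):

```console
grep -n "prometheus-ceph-rules" cluster-chart.yaml
```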

Now create the desired customization configuration files. This simple example will show how to
update the severity of a rule, add a label to a rule, and change the `for` time value.

Create a file named `kustomization.yaml`:

```yaml
patches:
- path: modifications.yaml
  target:
    group: monitoring.coreos.com
    kind: PrometheusRule
    name: prometheus-ceph-rules
    version: v1
resources:
- cluster-chart.yaml
```

Create a file named `modifications.yaml`:

```yaml
- op: add
  path: /spec/groups/0/rules/0/labels
  value:
    my-label: foo
    severity: none
- op: add
  path: /spec/groups/0/rules/0/for
  value: 15m
```
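
The entries in this file are standard JSON patch operations. To change a value that already exists instead of adding a new one, a `replace` op can be used (a sketch; the group and rule indices depend on the rendered chart):

```yaml
- op: replace
  path: /spec/groups/0/rules/0/labels/severity
  value: warning
```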

Finally, run kustomize to build the updated chart and create the customized Prometheus rules:

```console
kustomize build . > updated-chart.yaml
kubectl create -f updated-chart.yaml
```
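
The customized rules should then be visible in the cluster, including the label added above (assuming the rules are deployed to the `rook-ceph` namespace):

```console
kubectl -n rook-ceph get prometheusrule prometheus-ceph-rules -o yaml | grep my-label
```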

## Grafana Dashboards

The dashboards have been created by [@galexrt](https://github.com/galexrt). For feedback on the dashboards please reach out to him on the [Rook.io Slack](https://slack.rook.io).
1 change: 1 addition & 0 deletions Documentation/helm-ceph-cluster.md
@@ -64,6 +64,7 @@ The following tables lists the configurable parameters of the rook-operator char
| `toolbox.affinity` | Toolbox affinity | `{}` |
| `toolbox.resources` | Toolbox resources | see values.yaml |
| `monitoring.enabled` | Enable Prometheus integration, will also create necessary RBAC rules | `false` |
| `monitoring.createPrometheusRules` | Whether to create the Prometheus rules for Ceph alerts | `false` |
| `cephClusterSpec.*` | Cluster configuration, see below | See below |
| `ingress.dashboard` | Enable an ingress for the ceph-dashboard | `{}` |
| `cephBlockPools.[*]` | A list of CephBlockPool configurations to deploy | See below |
3 changes: 3 additions & 0 deletions PendingReleaseNotes.md
@@ -4,6 +4,8 @@

* The mds liveness and startup probes are now configured by the filesystem CR instead of the cluster CR. To apply the mds probes, they need to be specified in the filesystem CR. See the [filesystem CR doc](Documentation/ceph-filesystem-crd.md#metadata-server-settings) for more details. See #9550
* In the helm charts, all Ceph components now have default values for the pod resources. The values can be modified or removed in values.yaml depending on cluster requirements.
* Prometheus rules are now installed by the helm chart. If you were relying on the CephCluster setting `monitoring.enabled` to create the Prometheus rules, they instead need to be enabled by setting `monitoring.createPrometheusRules` in the helm chart values.

## Features

* The number of mgr daemons for example clusters is increased from 1 to 2, resulting in a standby mgr daemon.
@@ -12,3 +14,4 @@
* Network encryption is configurable with settings in the CephCluster CR. Requires the 5.11 kernel or newer.
* Network compression is configurable with settings in the CephCluster CR. Requires Ceph Quincy (v17) or newer.
* Add support for custom ceph.conf for csi pods. See #9567
* Added and updated many Ceph Prometheus rules, picked up from the Ceph repo
25 changes: 25 additions & 0 deletions deploy/charts/rook-ceph-cluster/prometheus/externalrules.yaml
@@ -0,0 +1,25 @@
groups:
  - name: persistent-volume-alert.rules
    rules:
      - alert: PersistentVolumeUsageNearFull
        annotations:
          description: PVC {{ $labels.persistentvolumeclaim }} utilization has crossed 75%. Free up some space or expand the PVC.
          message: PVC {{ $labels.persistentvolumeclaim }} is nearing full. Data deletion or PVC expansion is required.
          severity_level: warning
          storage_type: ceph
        expr: |
          (kubelet_volume_stats_used_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) / (kubelet_volume_stats_capacity_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) > 0.75
        for: 5s
        labels:
          severity: warning
      - alert: PersistentVolumeUsageCritical
        annotations:
          description: PVC {{ $labels.persistentvolumeclaim }} utilization has crossed 85%. Free up some space or expand the PVC immediately.
          message: PVC {{ $labels.persistentvolumeclaim }} is critically full. Data deletion or PVC expansion is required.
          severity_level: error
          storage_type: ceph
        expr: |
          (kubelet_volume_stats_used_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) / (kubelet_volume_stats_capacity_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) > 0.85
        for: 5s
        labels:
          severity: critical
