monitoring: Create prometheus rules with helm chart #9837

Merged: 2 commits, Mar 21, 2022
61 changes: 56 additions & 5 deletions Documentation/ceph-monitoring.md
@@ -95,12 +95,20 @@ A guide to how you can write your own Prometheus consoles can be found on the of

## Prometheus Alerts

To enable the Ceph Prometheus alerts via the helm charts, set the following properties in values.yaml (a minimal sketch of each file follows the list):
- rook-ceph chart:
`monitoring.enabled: true`
- rook-ceph-cluster chart:
`monitoring.enabled: true`
`monitoring.createPrometheusRules: true`
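
A minimal values.yaml sketch for each chart, assuming every other setting is left at its default:

```yaml
# values.yaml for the rook-ceph (operator) chart
monitoring:
  enabled: true
```

```yaml
# values.yaml for the rook-ceph-cluster chart
monitoring:
  enabled: true
  createPrometheusRules: true
```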

Alternatively, to enable the Ceph Prometheus alerts with the example manifests, follow these steps:

1. Create the RBAC and prometheus rules:

```console
kubectl create -f deploy/examples/monitoring/rbac.yaml
kubectl create -f deploy/examples/monitoring/localrules.yaml
```

2. Make the following changes to your CephCluster object (e.g., `cluster.yaml`).
@@ -116,12 +124,9 @@

```yaml
spec:
[...]
  monitoring:
    enabled: true
[...]
```

3. Deploy or update the CephCluster object.

@@ -130,6 +135,52 @@

```console
kubectl apply -f cluster.yaml
```

> **NOTE**: This expects the Prometheus Operator and a Prometheus instance to be pre-installed by the admin.
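
One quick way to check for the operator is to look for the PrometheusRule CRD (a sketch; this confirms only that the CRD is registered, not that a Prometheus instance is running):

```console
kubectl get crd prometheusrules.monitoring.coreos.com
```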

### Customize Alerts

The Prometheus alerts can be customized with a post-processor using tools such as [Kustomize](https://kustomize.io/).
For example, first render the helm chart to a local file:

```console
helm template -f values.yaml rook-release/rook-ceph-cluster > cluster-chart.yaml
```
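
If the `rook-release` repo is not configured locally yet, it may need to be added first (repo URL as published by the Rook project):

```console
helm repo add rook-release https://charts.rook.io/release
```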

Now create the desired customization configuration files. This simple example will show how to
update the severity of a rule, add a label to a rule, and change the `for` time value.

Create a file named kustomization.yaml:

```yaml
patches:
- path: modifications.yaml
target:
group: monitoring.coreos.com
kind: PrometheusRule
name: prometheus-ceph-rules
version: v1
resources:
- cluster-chart.yaml
```

Create a file named modifications.yaml:

```yaml
- op: add
path: /spec/groups/0/rules/0/labels
value:
my-label: foo
severity: none
- op: add
path: /spec/groups/0/rules/0/for
value: 15m
```
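
These are JSON patch (RFC 6902) operations applied to the first rule of the first group in the targeted PrometheusRule. Note that `add` on an existing path replaces the current value, which is how the rule's original severity label gets overwritten. After the patch, the first rule would look roughly like this (the alert name is hypothetical; the actual first rule depends on the chart version):

```yaml
- alert: SomeCephAlert   # hypothetical name; whichever rule is first in the chart
  expr: "..."            # unchanged by the patch
  for: 15m               # set by the second patch op
  labels:                # map replaced by the first patch op
    my-label: foo
    severity: none
```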

Finally, run kustomize to render the customized prometheus rules, then create them in the cluster:

```console
kustomize build . > updated-chart.yaml
kubectl create -f updated-chart.yaml
```
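
To verify that the customized rules were created (the namespace depends on where the chart resources are deployed):

```console
kubectl get prometheusrules --all-namespaces
```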

## Grafana Dashboards

The dashboards have been created by [@galexrt](https://github.com/galexrt). For feedback on the dashboards please reach out to him on the [Rook.io Slack](https://slack.rook.io).
1 change: 1 addition & 0 deletions Documentation/helm-ceph-cluster.md
@@ -64,6 +64,7 @@ The following table lists the configurable parameters of the rook-operator chart
| `toolbox.affinity` | Toolbox affinity | `{}` |
| `toolbox.resources` | Toolbox resources | see values.yaml |
| `monitoring.enabled` | Enable Prometheus integration, will also create necessary RBAC rules | `false` |
| `monitoring.createPrometheusRules` | Whether to create the Prometheus rules for Ceph alerts | `false` |
| `cephClusterSpec.*` | Cluster configuration, see below | See below |
| `ingress.dashboard` | Enable an ingress for the ceph-dashboard | `{}` |
| `cephBlockPools.[*]` | A list of CephBlockPool configurations to deploy | See below |
3 changes: 3 additions & 0 deletions PendingReleaseNotes.md
@@ -4,6 +4,8 @@

* The mds liveness and startup probes are now configured by the filesystem CR instead of the cluster CR. To apply the mds probes, they need to be specified in the filesystem CR. See the [filesystem CR doc](Documentation/ceph-filesystem-crd.md#metadata-server-settings) for more details. See #9550
* In the helm charts, all Ceph components now have default values for the pod resources. The values can be modified or removed in values.yaml depending on cluster requirements.
* Prometheus rules are installed by the helm chart. If you were relying on the cephcluster setting `monitoring.enabled` to create the prometheus rules, they instead need to be enabled by setting `monitoring.createPrometheusRules` in the helm chart values.

## Features

* The number of mgr daemons for example clusters is increased from 1 to 2, resulting in a standby mgr daemon.
@@ -12,3 +14,4 @@
* Network encryption is configurable with settings in the CephCluster CR. Requires the 5.11 kernel or newer.
* Network compression is configurable with settings in the CephCluster CR. Requires Ceph Quincy (v17) or newer.
* Add support for custom ceph.conf for csi pods. See #9567
* Added and updated many Ceph prometheus rules, picked up from the ceph repo
25 changes: 25 additions & 0 deletions deploy/charts/rook-ceph-cluster/prometheus/externalrules.yaml
@@ -0,0 +1,25 @@
groups:
  - name: persistent-volume-alert.rules
    rules:
      - alert: PersistentVolumeUsageNearFull
        annotations:
          description: PVC {{ $labels.persistentvolumeclaim }} utilization has crossed 75%. Free up some space or expand the PVC.
          message: PVC {{ $labels.persistentvolumeclaim }} is nearing full. Data deletion or PVC expansion is required.
          severity_level: warning
          storage_type: ceph
        expr: |
          (kubelet_volume_stats_used_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) / (kubelet_volume_stats_capacity_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) > 0.75
        for: 5s
        labels:
          severity: warning
      - alert: PersistentVolumeUsageCritical
        annotations:
          description: PVC {{ $labels.persistentvolumeclaim }} utilization has crossed 85%. Free up some space or expand the PVC immediately.
          message: PVC {{ $labels.persistentvolumeclaim }} is critically full. Data deletion or PVC expansion is required.
          severity_level: error
          storage_type: ceph
        expr: |
          (kubelet_volume_stats_used_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) / (kubelet_volume_stats_capacity_bytes * on (namespace,persistentvolumeclaim) group_left(storageclass, provisioner) (kube_persistentvolumeclaim_info * on (storageclass) group_left(provisioner) kube_storageclass_info {provisioner=~"(.*rbd.csi.ceph.com)|(.*cephfs.csi.ceph.com)"})) > 0.85
        for: 5s
        labels:
          severity: critical
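
Both alert expressions share one shape: used bytes divided by capacity bytes, with each side joined against `kube_persistentvolumeclaim_info` and `kube_storageclass_info` so that only PVCs provisioned by the Ceph CSI drivers match. Stripped of those joins, the core ratio is simply (a simplified sketch, not the expression the rules actually use):

```
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.75
```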