Commit

monitoring: customize prometheus rule alerts
yuvalman committed Feb 27, 2022
1 parent ba165ec commit 0efabf1
Showing 24 changed files with 1,001 additions and 68 deletions.
20 changes: 20 additions & 0 deletions Documentation/ceph-cluster-crd.md
@@ -186,6 +186,7 @@ If this value is empty, each pod will get an ephemeral directory to store their
  * `ssl`: Whether to serve the dashboard via SSL, ignored on Ceph versions older than `13.2.2`
* `monitoring`: Settings for monitoring Ceph using Prometheus. To enable monitoring on your cluster see the [monitoring guide](ceph-monitoring.md#prometheus-alerts).
  * `enabled`: Whether to enable prometheus based monitoring for this cluster
  * `alertRuleOverrides`: Custom prometheus rule alerts values to override default values
  * `externalMgrEndpoints`: external cluster manager endpoints
  * `externalMgrPrometheusPort`: external prometheus manager module port. See [external cluster configuration](#external-cluster) for more details.
  * `rulesNamespace`: Namespace to deploy prometheusRule. If empty, namespace of the cluster will be used.
@@ -1402,6 +1403,25 @@ spec:
#externalMgrEndpoints:
#- ip: 192.168.39.182
#externalMgrPrometheusPort: 9283
# prometheus rule alerts values for overriding default prometheus rules values
# Notes:
# 1. Specific alerts can be disabled by setting disabled field to true
# 2. The difference between severityLevel and severity fields is:
#    severityLevel - is an annotation for marking warning/critical/error in ceph dashboard UI.
#    severity - is a label that can be used by Prometheus AlertManager for sending alerts based on this label.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2
```
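
The override semantics in the example above (only the fields a user sets replace the defaults) can be sketched in Go as follows. This is a minimal illustration, not the commit's implementation: `applyOverride` is a hypothetical helper, and the commit's `go.mod` change suggests the real merge may rely on the `mergo` library instead. The default values for `CephOSDNearFull` are taken from `prometheusrule-default-values.yaml`.

```go
package main

import "fmt"

// CephAlert mirrors the fields of the CephAlert type added in this commit.
type CephAlert struct {
	Disabled      bool
	For           string
	SeverityLevel string
	Severity      string
	Namespace     string
	Limit         int
	OsdUpRate     string
}

// applyOverride is a hypothetical helper: it returns a copy of def in which
// every non-zero field of ov replaces the default. A nil override keeps the
// defaults unchanged.
func applyOverride(def CephAlert, ov *CephAlert) CephAlert {
	if ov == nil {
		return def
	}
	out := def
	if ov.Disabled {
		out.Disabled = true
	}
	if ov.For != "" {
		out.For = ov.For
	}
	if ov.SeverityLevel != "" {
		out.SeverityLevel = ov.SeverityLevel
	}
	if ov.Severity != "" {
		out.Severity = ov.Severity
	}
	if ov.Namespace != "" {
		out.Namespace = ov.Namespace
	}
	if ov.Limit != 0 {
		out.Limit = ov.Limit
	}
	if ov.OsdUpRate != "" {
		out.OsdUpRate = ov.OsdUpRate
	}
	return out
}

func main() {
	// Defaults for CephOSDNearFull from prometheusrule-default-values.yaml.
	def := CephAlert{Limit: 75, For: "40s", SeverityLevel: "warning", Severity: "warning"}
	// Override from the commented example: only limit and for are set.
	ov := &CephAlert{Limit: 80, For: "2m"}
	fmt.Printf("%+v\n", applyOverride(def, ov))
}
```

Fields left out of the override (`severityLevel` and `severity` here) keep their default values.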

Choose the namespace carefully, if you have an existing cluster managed by Rook, you have likely already injected `common.yaml`.
19 changes: 19 additions & 0 deletions Documentation/ceph-monitoring.md
@@ -117,6 +117,25 @@ spec:
monitoring:
  enabled: true
  rulesNamespace: "rook-ceph"
  # prometheus rule alerts values for overriding default prometheus rules values
  # Notes:
  # 1. Specific alerts can be disabled by setting disabled field to true
  # 2. The difference between severityLevel and severity fields is:
  #    severityLevel - is an annotation for marking warning/critical/error in ceph dashboard UI.
  #    severity - is a label that can be used by Prometheus AlertManager for sending alerts based on this label.
  #alertRuleOverrides:
  #  CephNodeDown:
  #    disabled: true
  #  CephMgrIsAbsent:
  #    for: 1m
  #    severityLevel: warning # must be warning, critical, or error. Used for marking error, warning etc in UI
  #    severity: custom-severity # can be any custom value, used for sending alerts via Prometheus AlertManager
  #  CephOSDNearFull:
  #    limit: 80
  #    for: 2m
  #  CephOSDFlapping:
  #    osdUpRate: 10m
  #    severity: custom-severity-2
[...]
```
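
The distinction drawn in the notes above (`severityLevel` is an annotation, `severity` is a label) can be sketched as below. This is an illustration under assumptions: `ruleMetadata` is a hypothetical helper, and the annotation key `severity_level` and label key `severity` are assumed names for where the two values end up in the generated rule; the commit page shown here does not confirm them.

```go
package main

import "fmt"

// ruleMetadata is a hypothetical helper that places the two severity fields
// where the notes say they belong: severity becomes a label (usable for
// AlertManager routing), severityLevel becomes an annotation (shown in the
// Ceph dashboard UI). The map keys are assumptions.
func ruleMetadata(severityLevel, severity string) (labels, annotations map[string]string) {
	labels = map[string]string{"severity": severity}
	annotations = map[string]string{"severity_level": severityLevel}
	return labels, annotations
}

func main() {
	// Values from the CephMgrIsAbsent override in the example.
	labels, annotations := ruleMetadata("warning", "custom-severity")
	fmt.Println("labels:", labels)
	fmt.Println("annotations:", annotations)
}
```

Keeping the two values separate is what lets `severity` carry an arbitrary routing value while `severityLevel` stays within the three values the dashboard understands.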

3 changes: 3 additions & 0 deletions PendingReleaseNotes.md
@@ -6,3 +6,6 @@
Pr: https://github.com/rook/rook/pull/9550

## Features

### Ceph
- Prometheus Rule alerts can be customized by user preference.
20 changes: 20 additions & 0 deletions deploy/charts/rook-ceph-cluster/values.yaml
@@ -32,6 +32,26 @@ monitoring:
# enabling will also create RBAC rules to allow Operator to create ServiceMonitors
enabled: false
rulesNamespaceOverride:
# prometheus rule alerts values for overriding default prometheus rules values
# Notes:
# 1. Specific alerts can be disabled by setting disabled field to true
# 2. The difference between severityLevel and severity fields is:
#    severityLevel - is an annotation for marking warning/critical/error in ceph dashboard UI.
#    severity - is a label that can be used by Prometheus AlertManager for sending alerts based on this label.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2


# If true, create & use PSP resources. Set this to the same value as the rook-ceph chart.
pspEnable: true
26 changes: 26 additions & 0 deletions deploy/charts/rook-ceph/templates/resources.yaml
@@ -1593,6 +1593,32 @@ spec:
description: Prometheus based Monitoring settings
nullable: true
properties:
  alertRuleOverrides:
    additionalProperties:
      description: CephAlert basic customized alert
      properties:
        disabled:
          type: boolean
        for:
          type: string
        limit:
          type: integer
        namespace:
          type: string
        osdUpRate:
          type: string
        severity:
          type: string
        severityLevel:
          enum:
          - warning
          - critical
          - error
          type: string
      type: object
    description: AlertRuleOverrides points to a customized Ceph prometheus alerts
    nullable: true
    type: object
  enabled:
    description: Enabled determines whether to create the prometheus rules for the ceph cluster. If true, the prometheus types must exist or the creation will fail.
    type: boolean
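
The schema above restricts `severityLevel` to three values but types `for` as a free-form string. A client-side check along these lines could catch bad values before the resource is applied. This is a sketch only: `validateOverride` is a hypothetical helper, not part of the commit, and it uses Go's `time.ParseDuration`, which rejects some units Prometheus accepts (such as `d`, `w`, and `y`).

```go
package main

import (
	"fmt"
	"time"
)

// allowedSeverityLevels matches the enum in the CRD schema.
var allowedSeverityLevels = map[string]bool{
	"warning":  true,
	"critical": true,
	"error":    true,
}

// validateOverride is a hypothetical helper. Empty values are treated as
// "not set" and pass validation.
func validateOverride(severityLevel, forDuration string) error {
	if severityLevel != "" && !allowedSeverityLevels[severityLevel] {
		return fmt.Errorf("severityLevel %q is not one of warning, critical, error", severityLevel)
	}
	if forDuration != "" {
		// Note: stricter than Prometheus, which also accepts d, w, and y.
		if _, err := time.ParseDuration(forDuration); err != nil {
			return fmt.Errorf("for %q is not a parseable duration: %w", forDuration, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateOverride("warning", "1m"))
	fmt.Println(validateOverride("info", "1m"))
}
```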
19 changes: 19 additions & 0 deletions deploy/charts/rook-ceph/values.yaml
@@ -365,3 +365,22 @@ monitoring:
# requires Prometheus to be pre-installed
# enabling will also create RBAC rules to allow Operator to create ServiceMonitors
enabled: false
# prometheus rule alerts values for overriding default prometheus rules values
# Notes:
# 1. Specific alerts can be disabled by setting disabled field to true
# 2. The difference between severityLevel and severity fields is:
#    severityLevel - is an annotation for marking warning/critical/error in ceph dashboard UI.
#    severity - is a label that can be used by Prometheus AlertManager for sending alerts based on this label.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2
12 changes: 12 additions & 0 deletions deploy/examples/cluster-external.yaml
@@ -31,3 +31,15 @@ spec:
# externalMgrEndpoints:
#- ip: ip
# externalMgrPrometheusPort: 9283
# prometheus rule alerts values for overriding default prometheus rules values
# Notes:
# 1. Specific alerts can be disabled by setting disabled field to true
# 2. The difference between severityLevel and severity fields is:
#    severityLevel - is an annotation for marking warning/critical/error in ceph dashboard UI.
#    severity - is a label that can be used by Prometheus AlertManager for sending alerts based on this label.
#alertRuleOverrides:
#  PersistentVolumeUsageNearFull:
#    limit: 80
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
19 changes: 19 additions & 0 deletions deploy/examples/cluster.yaml
@@ -79,6 +79,25 @@ spec:
# If you have multiple rook-ceph clusters in the same k8s cluster, choose the same namespace (ideally, namespace with prometheus
# deployed) to set rulesNamespace for all the clusters. Otherwise, you will get duplicate alerts with multiple alert definitions.
rulesNamespace: rook-ceph
# prometheus rule alerts values for overriding default prometheus rules values
# Notes:
# 1. Specific alerts can be disabled by setting disabled field to true
# 2. The difference between severityLevel and severity fields is:
#    severityLevel - is an annotation for marking warning/critical/error in ceph dashboard UI.
#    severity - is a label that can be used by Prometheus AlertManager for sending alerts based on this label.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2
network:
# enable host networking
#provider: host
26 changes: 26 additions & 0 deletions deploy/examples/crds.yaml
@@ -1592,6 +1592,32 @@ spec:
description: Prometheus based Monitoring settings
nullable: true
properties:
  alertRuleOverrides:
    additionalProperties:
      description: CephAlert basic customized alert
      properties:
        disabled:
          type: boolean
        for:
          type: string
        limit:
          type: integer
        namespace:
          type: string
        osdUpRate:
          type: string
        severity:
          type: string
        severityLevel:
          enum:
          - warning
          - critical
          - error
          type: string
      type: object
    description: AlertRuleOverrides points to a customized Ceph prometheus alerts
    nullable: true
    type: object
  enabled:
    description: Enabled determines whether to create the prometheus rules for the ceph cluster. If true, the prometheus types must exist or the creation will fail.
    type: boolean

This file was deleted.

1 change: 0 additions & 1 deletion deploy/examples/monitoring/prometheus-ceph-v15-rules.yaml

This file was deleted.

This file was deleted.

1 change: 0 additions & 1 deletion deploy/examples/monitoring/prometheus-ceph-v16-rules.yaml

This file was deleted.

117 changes: 117 additions & 0 deletions deploy/examples/monitoring/prometheusrule-default-values.yaml
@@ -0,0 +1,117 @@
CephMgrIsAbsent:
  for: 5m
  namespace: ${operatorNamespace}
  severityLevel: critical
  severity: critical
CephMgrIsMissingReplicas:
  for: 5m
  severityLevel: warning
  severity: warning
CephMdsMissingReplicas:
  for: 5m
  severityLevel: warning
  severity: warning
CephMonQuorumAtRisk:
  for: 15m
  severityLevel: error
  severity: critical
CephMonQuorumLost:
  for: 5m
  severityLevel: critical
  severity: critical
CephMonHighNumberOfLeaderChanges:
  limit: 95
  for: 5m
  severityLevel: warning
  severity: warning
CephNodeDown:
  for: 30s
  severityLevel: error
  severity: critical
CephOSDCriticallyFull:
  limit: 80
  for: 40s
  severityLevel: error
  severity: critical
CephOSDFlapping:
  limit: 10
  osdUpRate: 5m
  for: 0s
  severityLevel: error
  severity: critical
CephOSDNearFull:
  limit: 75
  for: 40s
  severityLevel: warning
  severity: warning
CephOSDDiskNotResponding:
  for: 15m
  severityLevel: error
  severity: critical
CephOSDDiskUnavailable:
  for: 1m
  severityLevel: error
  severity: critical
CephOSDSlowOps:
  for: 30s
  severityLevel: warning
  severity: warning
CephDataRecoveryTakingTooLong:
  for: 2h
  severityLevel: warning
  severity: warning
CephPGRepairTakingTooLong:
  for: 1h
  severityLevel: warning
  severity: warning
PersistentVolumeUsageNearFull:
  limit: 75
  for: 5s
  severityLevel: warning
  severity: warning
PersistentVolumeUsageCritical:
  limit: 85
  for: 5s
  severityLevel: error
  severity: critical
CephClusterErrorState:
  for: 10m
  severityLevel: error
  severity: critical
CephClusterWarningState:
  for: 15m
  severityLevel: warning
  severity: warning
CephOSDVersionMismatch:
  for: 10m
  severityLevel: warning
  severity: warning
CephMonVersionMismatch:
  for: 10m
  severityLevel: warning
  severity: warning
CephClusterNearFull:
  limit: 75
  for: 5s
  severityLevel: warning
  severity: warning
CephClusterCriticallyFull:
  limit: 80
  for: 5s
  severityLevel: error
  severity: critical
CephClusterReadOnly:
  limit: 85
  for: 0s
  severityLevel: error
  severity: critical
CephPoolQuotaBytesNearExhaustion:
  limit: 70
  for: 1m
  severityLevel: warning
  severity: warning
CephPoolQuotaBytesCriticallyExhausted:
  limit: 90
  for: 1m
  severityLevel: critical
  severity: critical
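
The defaults file contains a `${operatorNamespace}` placeholder under `CephMgrIsAbsent`. One way such a placeholder could be resolved before the YAML is parsed is shown below. This is a sketch under assumptions: `expandDefaults` is a hypothetical helper built on the standard library's `os.Expand`, and the commit page shown here does not reveal how Rook actually performs the substitution.

```go
package main

import (
	"fmt"
	"os"
)

// expandDefaults is a hypothetical helper that replaces the
// ${operatorNamespace} placeholder in the raw defaults text.
func expandDefaults(raw, operatorNamespace string) string {
	return os.Expand(raw, func(name string) string {
		if name == "operatorNamespace" {
			return operatorNamespace
		}
		// Re-emit any other placeholder in braced form instead of
		// silently replacing it with an empty string.
		return "${" + name + "}"
	})
}

func main() {
	raw := "CephMgrIsAbsent:\n  for: 5m\n  namespace: ${operatorNamespace}\n"
	fmt.Print(expandDefaults(raw, "rook-ceph"))
}
```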
1 change: 1 addition & 0 deletions go.mod
@@ -15,6 +15,7 @@ require (
	github.com/hashicorp/vault-plugin-secrets-kv v0.9.0
	github.com/hashicorp/vault/api v1.1.2-0.20210713235431-1fc8af4c041f
	github.com/hashicorp/vault/sdk v0.2.2-0.20211101151547-6654f4b913f9
	github.com/imdario/mergo v0.3.12
	github.com/k8snetworkplumbingwg/network-attachment-definition-client v1.1.0
	github.com/kube-object-storage/lib-bucket-provisioner v0.0.0-20220105185820-c1da9586e05b
	github.com/libopenstorage/secrets v0.0.0-20210709082113-dde442ea20ec
23 changes: 23 additions & 0 deletions pkg/apis/ceph.rook.io/v1/types.go
100755 → 100644
@@ -304,6 +304,29 @@ type MonitoringSpec struct {
	// +kubebuilder:validation:Maximum=65535
	// +optional
	ExternalMgrPrometheusPort uint16 `json:"externalMgrPrometheusPort,omitempty"`
	// AlertRuleOverrides points to a customized Ceph prometheus alerts
	// +optional
	// +nullable
	AlertRuleOverrides map[string]*CephAlert `json:"alertRuleOverrides,omitempty"`
}

//CephAlert basic customized alert
type CephAlert struct {
	// +optional
	Disabled bool `json:"disabled,omitempty"`
	// +optional
	For string `json:"for,omitempty"`
	// +optional
	// +kubebuilder:validation:Enum=warning;critical;error
	SeverityLevel string `json:"severityLevel,omitempty"`
	// +optional
	Severity string `json:"severity,omitempty"`
	// +optional
	Namespace string `json:"namespace,omitempty"`
	// +optional
	Limit int `json:"limit,omitempty"`
	// +optional
	OsdUpRate string `json:"osdUpRate,omitempty"`
}

// ClusterStatus represents the status of a Ceph cluster
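
To see how the `alertRuleOverrides` map decodes into this type, the sketch below copies the `CephAlert` struct and its JSON tags from the hunk above and unmarshals a small JSON document. `decodeOverrides` is a hypothetical helper, and the JSON input is illustrative, built from the override examples elsewhere in this commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CephAlert is copied from the diff above so the example is self-contained.
type CephAlert struct {
	Disabled      bool   `json:"disabled,omitempty"`
	For           string `json:"for,omitempty"`
	SeverityLevel string `json:"severityLevel,omitempty"`
	Severity      string `json:"severity,omitempty"`
	Namespace     string `json:"namespace,omitempty"`
	Limit         int    `json:"limit,omitempty"`
	OsdUpRate     string `json:"osdUpRate,omitempty"`
}

// decodeOverrides is a hypothetical helper that decodes an
// alertRuleOverrides document keyed by alert name.
func decodeOverrides(data []byte) (map[string]*CephAlert, error) {
	var out map[string]*CephAlert
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	doc := []byte(`{"CephNodeDown":{"disabled":true},"CephOSDNearFull":{"limit":80,"for":"2m"}}`)
	overrides, err := decodeOverrides(doc)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for name, alert := range overrides {
		fmt.Printf("%s: %+v\n", name, *alert)
	}
}
```

Because the fields are plain values rather than pointers, an explicit `limit: 0` decodes to the same zero value as an omitted `limit`, so code that merges overrides onto defaults cannot tell the two apart. Alerts that are not listed are simply absent from the map.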