monitoring: customize prometheus rule alerts #9503

Closed
20 changes: 20 additions & 0 deletions Documentation/ceph-cluster-crd.md
@@ -186,6 +186,7 @@ If this value is empty, each pod will get an ephemeral directory to store their
* `ssl`: Whether to serve the dashboard via SSL, ignored on Ceph versions older than `13.2.2`
* `monitoring`: Settings for monitoring Ceph using Prometheus. To enable monitoring on your cluster see the [monitoring guide](ceph-monitoring.md#prometheus-alerts).
* `enabled`: Whether to enable prometheus based monitoring for this cluster
* `alertRuleOverrides`: Custom values that override the defaults of the Prometheus alert rules
* `externalMgrEndpoints`: external cluster manager endpoints
* `externalMgrPrometheusPort`: external prometheus manager module port. See [external cluster configuration](#external-cluster) for more details.
* `rulesNamespace`: Namespace to deploy prometheusRule. If empty, namespace of the cluster will be used.
@@ -1402,6 +1403,25 @@ spec:
#externalMgrEndpoints:
#- ip: 192.168.39.182
#externalMgrPrometheusPort: 9283
# Values that override the defaults of the Prometheus alert rules.
# Notes:
# 1. A specific alert can be disabled by setting its disabled field to true.
# 2. The severityLevel and severity fields differ as follows:
#    severityLevel - an annotation that marks the alert as warning/critical/error in the Ceph dashboard UI.
#    severity - a label that Prometheus Alertmanager can use to route alerts.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2
```

Choose the namespace carefully: if you have an existing cluster managed by Rook, you have likely already injected `common.yaml`.
19 changes: 19 additions & 0 deletions Documentation/ceph-monitoring.md
@@ -117,6 +117,25 @@ spec:
monitoring:
enabled: true
rulesNamespace: "rook-ceph"
# Values that override the defaults of the Prometheus alert rules.
# Notes:
# 1. A specific alert can be disabled by setting its disabled field to true.
# 2. The severityLevel and severity fields differ as follows:
#    severityLevel - an annotation that marks the alert as warning/critical/error in the Ceph dashboard UI.
#    severity - a label that Prometheus Alertmanager can use to route alerts.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error; marks the alert level in the UI
#    severity: custom-severity # can be any custom value; used as a label by Prometheus Alertmanager
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2
[...]
```

3 changes: 3 additions & 0 deletions PendingReleaseNotes.md
@@ -6,3 +6,6 @@
Pr: https://github.com/rook/rook/pull/9550

## Features

### Ceph
- Prometheus rule alerts can be customized through the `alertRuleOverrides` setting in the cluster's `monitoring` spec.
20 changes: 20 additions & 0 deletions deploy/charts/rook-ceph-cluster/values.yaml
@@ -32,6 +32,26 @@ monitoring:
# enabling will also create RBAC rules to allow Operator to create ServiceMonitors
enabled: false
rulesNamespaceOverride:
# Values that override the defaults of the Prometheus alert rules.
# Notes:
# 1. A specific alert can be disabled by setting its disabled field to true.
# 2. The severityLevel and severity fields differ as follows:
#    severityLevel - an annotation that marks the alert as warning/critical/error in the Ceph dashboard UI.
#    severity - a label that Prometheus Alertmanager can use to route alerts.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2


# If true, create & use PSP resources. Set this to the same value as the rook-ceph chart.
pspEnable: true
26 changes: 26 additions & 0 deletions deploy/charts/rook-ceph/templates/resources.yaml
@@ -1593,6 +1593,32 @@ spec:
description: Prometheus based Monitoring settings
nullable: true
properties:
  alertRuleOverrides:
    additionalProperties:
      description: CephAlert basic customized alert
      properties:
        disabled:
          type: boolean
        for:
          type: string
        limit:
          type: integer
        namespace:
          type: string
        osdUpRate:
          type: string
        severity:
          type: string
        severityLevel:
          enum:
          - warning
          - critical
          - error
          type: string
      type: object
    description: AlertRuleOverrides points to customized Ceph Prometheus alerts
    nullable: true
    type: object
  enabled:
    description: Enabled determines whether to create the prometheus rules for the ceph cluster. If true, the prometheus types must exist or the creation will fail.
    type: boolean
19 changes: 19 additions & 0 deletions deploy/charts/rook-ceph/values.yaml
@@ -365,3 +365,22 @@ monitoring:
# requires Prometheus to be pre-installed
# enabling will also create RBAC rules to allow Operator to create ServiceMonitors
enabled: false
# Values that override the defaults of the Prometheus alert rules.
# Notes:
# 1. A specific alert can be disabled by setting its disabled field to true.
# 2. The severityLevel and severity fields differ as follows:
#    severityLevel - an annotation that marks the alert as warning/critical/error in the Ceph dashboard UI.
#    severity - a label that Prometheus Alertmanager can use to route alerts.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2
12 changes: 12 additions & 0 deletions deploy/examples/cluster-external.yaml
@@ -31,3 +31,15 @@ spec:
# externalMgrEndpoints:
#- ip: ip
# externalMgrPrometheusPort: 9283
# Values that override the defaults of the Prometheus alert rules.
# Notes:
# 1. A specific alert can be disabled by setting its disabled field to true.
# 2. The severityLevel and severity fields differ as follows:
#    severityLevel - an annotation that marks the alert as warning/critical/error in the Ceph dashboard UI.
#    severity - a label that Prometheus Alertmanager can use to route alerts.
#alertRuleOverrides:
#  PersistentVolumeUsageNearFull:
#    limit: 80
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
19 changes: 19 additions & 0 deletions deploy/examples/cluster.yaml
@@ -79,6 +79,25 @@ spec:
# If you have multiple rook-ceph clusters in the same k8s cluster, choose the same namespace (ideally, namespace with prometheus
# deployed) to set rulesNamespace for all the clusters. Otherwise, you will get duplicate alerts with multiple alert definitions.
rulesNamespace: rook-ceph
# Values that override the defaults of the Prometheus alert rules.
# Notes:
# 1. A specific alert can be disabled by setting its disabled field to true.
# 2. The severityLevel and severity fields differ as follows:
#    severityLevel - an annotation that marks the alert as warning/critical/error in the Ceph dashboard UI.
#    severity - a label that Prometheus Alertmanager can use to route alerts.
#alertRuleOverrides:
#  CephNodeDown:
#    disabled: true
#  CephMgrIsAbsent:
#    for: 1m
#    severityLevel: warning # must be warning, critical, or error
#    severity: custom-severity
#  CephOSDNearFull:
#    limit: 80
#    for: 2m
#  CephOSDFlapping:
#    osdUpRate: 10m
#    severity: custom-severity-2
network:
# enable host networking
#provider: host
26 changes: 26 additions & 0 deletions deploy/examples/crds.yaml
@@ -1592,6 +1592,32 @@ spec:
description: Prometheus based Monitoring settings
nullable: true
properties:
  alertRuleOverrides:
    additionalProperties:
      description: CephAlert basic customized alert
      properties:
        disabled:
          type: boolean
        for:
          type: string
        limit:
          type: integer
        namespace:
          type: string
        osdUpRate:
          type: string
        severity:
          type: string
        severityLevel:
          enum:
          - warning
          - critical
          - error
          type: string
      type: object
    description: AlertRuleOverrides points to customized Ceph Prometheus alerts
    nullable: true
    type: object
  enabled:
    description: Enabled determines whether to create the prometheus rules for the ceph cluster. If true, the prometheus types must exist or the creation will fail.
    type: boolean

This file was deleted.

1 change: 0 additions & 1 deletion deploy/examples/monitoring/prometheus-ceph-v15-rules.yaml

This file was deleted.

This file was deleted.

1 change: 0 additions & 1 deletion deploy/examples/monitoring/prometheus-ceph-v16-rules.yaml

This file was deleted.

117 changes: 117 additions & 0 deletions deploy/examples/monitoring/prometheusrule-default-values.yaml
@@ -0,0 +1,117 @@
CephMgrIsAbsent:
  for: 5m
  namespace: ${operatorNamespace}
  severityLevel: critical
  severity: critical
CephMgrIsMissingReplicas:
  for: 5m
  severityLevel: warning
  severity: warning
CephMdsMissingReplicas:
  for: 5m
  severityLevel: warning
  severity: warning
CephMonQuorumAtRisk:
  for: 15m
  severityLevel: error
  severity: critical
CephMonQuorumLost:
  for: 5m
  severityLevel: critical
  severity: critical
CephMonHighNumberOfLeaderChanges:
  limit: 95
  for: 5m
  severityLevel: warning
  severity: warning
CephNodeDown:
  for: 30s
  severityLevel: error
  severity: critical
CephOSDCriticallyFull:
  limit: 80
  for: 40s
  severityLevel: error
  severity: critical
CephOSDFlapping:
  limit: 10
  osdUpRate: 5m
  for: 0s
  severityLevel: error
  severity: critical
CephOSDNearFull:
  limit: 75
  for: 40s
  severityLevel: warning
  severity: warning
CephOSDDiskNotResponding:
  for: 15m
  severityLevel: error
  severity: critical
CephOSDDiskUnavailable:
  for: 1m
  severityLevel: error
  severity: critical
CephOSDSlowOps:
  for: 30s
  severityLevel: warning
  severity: warning
CephDataRecoveryTakingTooLong:
  for: 2h
  severityLevel: warning
  severity: warning
CephPGRepairTakingTooLong:
  for: 1h
  severityLevel: warning
  severity: warning
PersistentVolumeUsageNearFull:
  limit: 75
  for: 5s
  severityLevel: warning
  severity: warning
PersistentVolumeUsageCritical:
  limit: 85
  for: 5s
  severityLevel: error
  severity: critical
CephClusterErrorState:
  for: 10m
  severityLevel: error
  severity: critical
CephClusterWarningState:
  for: 15m
  severityLevel: warning
  severity: warning
CephOSDVersionMismatch:
  for: 10m
  severityLevel: warning
  severity: warning
CephMonVersionMismatch:
  for: 10m
  severityLevel: warning
  severity: warning
CephClusterNearFull:
  limit: 75
  for: 5s
  severityLevel: warning
  severity: warning
CephClusterCriticallyFull:
  limit: 80
  for: 5s
  severityLevel: error
  severity: critical
CephClusterReadOnly:
  limit: 85
  for: 0s
  severityLevel: error
  severity: critical
CephPoolQuotaBytesNearExhaustion:
  limit: 70
  for: 1m
  severityLevel: warning
  severity: warning
CephPoolQuotaBytesCriticallyExhausted:
  limit: 90
  for: 1m
  severityLevel: critical
  severity: critical
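
How an `alertRuleOverrides` entry might combine with the default values listed above can be sketched in Go. This is only an illustration, not the PR's implementation: the merge code is not part of this excerpt, `mergeAlert` and `pick` are hypothetical helpers, and the rule that an empty override field falls back to the default is an assumption. The example values (`CephOSDNearFull` with a default `limit` of 75 and an override of 80) come from the files in this diff.

```go
package main

import "fmt"

// CephAlert mirrors the fields of the CephAlert type added to types.go
// in this PR (JSON tags omitted for brevity).
type CephAlert struct {
	Disabled      bool
	For           string
	SeverityLevel string
	Severity      string
	Namespace     string
	Limit         int
	OsdUpRate     string
}

// pick returns override when it is set, otherwise the default.
// Hypothetical helper, not part of the PR.
func pick(override, def string) string {
	if override != "" {
		return override
	}
	return def
}

// mergeAlert overlays an override on a default alert definition.
// Assumption: a zero-valued override field means "keep the default",
// so a limit of 0 cannot be expressed as an override in this sketch.
func mergeAlert(def, override CephAlert) CephAlert {
	merged := CephAlert{
		Disabled:      def.Disabled || override.Disabled,
		For:           pick(override.For, def.For),
		SeverityLevel: pick(override.SeverityLevel, def.SeverityLevel),
		Severity:      pick(override.Severity, def.Severity),
		Namespace:     pick(override.Namespace, def.Namespace),
		Limit:         def.Limit,
		OsdUpRate:     pick(override.OsdUpRate, def.OsdUpRate),
	}
	if override.Limit != 0 {
		merged.Limit = override.Limit
	}
	return merged
}

func main() {
	// Default for CephOSDNearFull, as listed in prometheusrule-default-values.yaml.
	def := CephAlert{Limit: 75, For: "40s", SeverityLevel: "warning", Severity: "warning"}
	// Override taken from the commented example in cluster.yaml.
	override := CephAlert{Limit: 80, For: "2m"}
	fmt.Printf("%+v\n", mergeAlert(def, override))
}
```

Under these assumptions the merged alert takes `limit` and `for` from the override and keeps both severity fields from the default.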
1 change: 1 addition & 0 deletions go.mod
@@ -15,6 +15,7 @@ require (
github.com/hashicorp/vault-plugin-secrets-kv v0.9.0
github.com/hashicorp/vault/api v1.1.2-0.20210713235431-1fc8af4c041f
github.com/hashicorp/vault/sdk v0.2.2-0.20211101151547-6654f4b913f9
github.com/imdario/mergo v0.3.12
github.com/k8snetworkplumbingwg/network-attachment-definition-client v1.1.0
github.com/kube-object-storage/lib-bucket-provisioner v0.0.0-20220105185820-c1da9586e05b
github.com/libopenstorage/secrets v0.0.0-20210709082113-dde442ea20ec
23 changes: 23 additions & 0 deletions pkg/apis/ceph.rook.io/v1/types.go
100755 → 100644
@@ -304,6 +304,29 @@ type MonitoringSpec struct {
	// +kubebuilder:validation:Maximum=65535
	// +optional
	ExternalMgrPrometheusPort uint16 `json:"externalMgrPrometheusPort,omitempty"`
	// AlertRuleOverrides points to customized Ceph Prometheus alerts
	// +optional
	// +nullable
	AlertRuleOverrides map[string]*CephAlert `json:"alertRuleOverrides,omitempty"`
}

// CephAlert basic customized alert
type CephAlert struct {
	// +optional
	Disabled bool `json:"disabled,omitempty"`
	// +optional
	For string `json:"for,omitempty"`
	// +optional
	// +kubebuilder:validation:Enum=warning;critical;error
	SeverityLevel string `json:"severityLevel,omitempty"`
	// +optional
	Severity string `json:"severity,omitempty"`
	// +optional
	Namespace string `json:"namespace,omitempty"`
	// +optional
	Limit int `json:"limit,omitempty"`
	// +optional
	OsdUpRate string `json:"osdUpRate,omitempty"`
}
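
The JSON tags on `CephAlert` determine how an `alertRuleOverrides` map is decoded. The following is a minimal sketch, assuming the overrides arrive as JSON: `decodeOverrides` and `validSeverityLevel` are hypothetical helpers, not functions from this PR, and the enum check simply restates the kubebuilder `Enum` marker in plain Go (in a real cluster that check is enforced by the CRD schema).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CephAlert copies the struct and JSON tags added to types.go in this PR.
type CephAlert struct {
	Disabled      bool   `json:"disabled,omitempty"`
	For           string `json:"for,omitempty"`
	SeverityLevel string `json:"severityLevel,omitempty"`
	Severity      string `json:"severity,omitempty"`
	Namespace     string `json:"namespace,omitempty"`
	Limit         int    `json:"limit,omitempty"`
	OsdUpRate     string `json:"osdUpRate,omitempty"`
}

// validSeverityLevel restates the enum warning;critical;error.
// An empty value is accepted because the field is optional.
func validSeverityLevel(level string) bool {
	switch level {
	case "", "warning", "critical", "error":
		return true
	}
	return false
}

// decodeOverrides parses a JSON object keyed by alert name and rejects
// entries whose severityLevel falls outside the enum. Hypothetical helper.
func decodeOverrides(data []byte) (map[string]*CephAlert, error) {
	overrides := map[string]*CephAlert{}
	if err := json.Unmarshal(data, &overrides); err != nil {
		return nil, err
	}
	for name, alert := range overrides {
		if alert == nil {
			continue // a JSON null entry decodes to a nil pointer
		}
		if !validSeverityLevel(alert.SeverityLevel) {
			return nil, fmt.Errorf("alert %s: invalid severityLevel %q", name, alert.SeverityLevel)
		}
	}
	return overrides, nil
}

func main() {
	data := []byte(`{"CephNodeDown":{"disabled":true},"CephOSDNearFull":{"limit":80,"for":"2m"}}`)
	overrides, err := decodeOverrides(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(overrides["CephNodeDown"].Disabled, overrides["CephOSDNearFull"].Limit)
}
```

Because the map values are pointers, an alert that is absent from the map can be distinguished from one that is present with all fields at their zero values.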

// ClusterStatus represents the status of a Ceph cluster