From 871be7621cc585a92a614d83f57c37bb09ef4008 Mon Sep 17 00:00:00 2001
From: Arun Kumar Mohan
Date: Fri, 15 Oct 2021 13:00:03 +0530
Subject: [PATCH] ceph: fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissingReplicas'

CephMgrIsAbsent
----------------
This alert initially used the following query,

    absent(up{job="rook-ceph-mgr"})

which fires when the 'up' series is not present, but it had two flaws:
a. it does not fire if 'up' returns a result with a value of ZERO
b. its result carries only the labels taken from the selector's
   equality matchers (here 'job'), so 'namespace' was missing

When the above query was replaced with the following,

    up{job="rook-ceph-mgr"} == 0

the new query had the following shortcoming:
a. whenever the mgr pod is completely down (e.g. 'replicas' is set to
   ZERO and 'mgr' is not coming up), 'up' does not return any result.

Thus we had to combine both queries to get results in both scenarios.
The combined expression is wrapped in label_replace() so that the
result always carries a 'namespace' label.

CephMgrIsMissingReplicas
------------------------
This alert previously used the query,

    sum(up{job="rook-ceph-mgr"}) by (namespace) < 1

which had the same structure as the above (Absent) query, but its
intention was to check the number of replicas for ceph mgr. It is now
changed to a kube-state-metrics query (kube_deployment_spec_replicas),
which reports the configured replica count.

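For the CephMgrIsAbsent case, the two scenarios described above can be
sketched roughly as follows. This is only an illustration: the sample
series and its label values (namespace "rook-ceph", the instance
address) are made up for the example and are not taken from a real
cluster.

```promql
# Scenario 1: the mgr target is still discovered, but the scrape fails.
# Hypothetical sample in the TSDB:
#   up{job="rook-ceph-mgr", namespace="rook-ceph", instance="10.0.0.1:9283"}  0
up{job="rook-ceph-mgr"} == 0        # returns the series above, with all its labels
absent(up{job="rook-ceph-mgr"})     # returns nothing, because the series exists

# Scenario 2: the mgr target is gone, so no 'up' series exists at all.
up{job="rook-ceph-mgr"} == 0        # returns nothing
absent(up{job="rook-ceph-mgr"})     # returns {job="rook-ceph-mgr"} 1, without 'namespace'

# Combined: 'or' yields whichever side has a result, so one of the two
# branches produces a sample in either scenario.
up{job="rook-ceph-mgr"} == 0 or absent(up{job="rook-ceph-mgr"})
```

Note that label_replace() with an empty source label and an empty regex
matches every series, so the 'namespace' value is set on the results of
both branches, not only on the absent() one.
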
Signed-off-by: Arun Kumar Mohan
---
 .../kubernetes/ceph/monitoring/prometheus-ceph-v14-rules.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cluster/examples/kubernetes/ceph/monitoring/prometheus-ceph-v14-rules.yaml b/cluster/examples/kubernetes/ceph/monitoring/prometheus-ceph-v14-rules.yaml
index e0cf837100b1a..37925ade11cd8 100644
--- a/cluster/examples/kubernetes/ceph/monitoring/prometheus-ceph-v14-rules.yaml
+++ b/cluster/examples/kubernetes/ceph/monitoring/prometheus-ceph-v14-rules.yaml
@@ -42,7 +42,7 @@ spec:
         severity_level: critical
         storage_type: ceph
       expr: |
-        up{job="rook-ceph-mgr"} == 0
+        label_replace((up{job="rook-ceph-mgr"} == 0 or absent(up{job="rook-ceph-mgr"})), "namespace", "openshift-storage", "", "")
       for: 5m
       labels:
         severity: critical
@@ -53,7 +53,7 @@ spec:
         severity_level: warning
         storage_type: ceph
       expr: |
-        sum(up{job="rook-ceph-mgr"}) by (namespace) < 1
+        sum(kube_deployment_spec_replicas{deployment=~"rook-ceph-mgr-.*"}) by (namespace) < 1
       for: 5m
       labels:
         severity: warning