
Fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissingReplicas' #96

Merged
merged 7 commits into ceph:master on Feb 2, 2022

Conversation

aruniiird
Contributor

CephMgrIsAbsent

This alert initially had the following query

absent(up{job="rook-ceph-mgr"})

which fires when the 'up' series is absent, but this had two flaws:
a. it will not fire if 'up' returns a result with a ZERO value
b. it carries no labels in the resulting metric, so 'namespace' was missing

When the above query was replaced with the following,

up{job="rook-ceph-mgr"} == 0

the new query had its own shortcoming:
a. whenever the mgr pod is completely down (for example, 'replicas' set to ZERO and
'mgr' not coming up), the 'up' query will not return any result.

Thus we had to combine both queries to get results in both scenarios.
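A minimal sketch of such a combined expression, joining the two cases with PromQL's 'or' operator (the exact merged rule may differ):

up{job="rook-ceph-mgr"} == 0 or absent(up{job="rook-ceph-mgr"})

The left side fires while 'up' reports ZERO; the right side fires when the series is missing entirely. Since 'absent()' returns a vector with no labels, the rule may still need a label_replace(..., "namespace", "<target-namespace>", "", "") wrapper to inject the missing 'namespace' label; the '<target-namespace>' value here is a hypothetical placeholder.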

CephMgrIsMissingReplicas

This alert's query was previously

sum(up{job="rook-ceph-mgr"}) < 1

which had the same structure as the above ('absent') query, but its
intention was to check the 'replicas' count for the Ceph mgr.
It is now changed to a kube query which handles the replica count, as sketched below.
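A minimal sketch of such a kube-based query, assuming kube-state-metrics is scraped and the mgr Deployment name matches 'rook-ceph-mgr-.*' (both are assumptions, not taken from this PR):

sum(kube_deployment_spec_replicas{deployment=~"rook-ceph-mgr-.*"}) by (namespace) < 1

Unlike the 'up'-based query, this keeps reporting (with its 'namespace' label intact) even when the mgr pod is scaled down to ZERO replicas.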

aruniiird and others added 7 commits October 6, 2021 17:11
Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
Instead of using the 'absent' query, we are trying to use 'up', which should
provide us with the needed 'namespace' label in the resulting metrics.

Signed-off-by: aruniiird <amohan@redhat.com>
Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
The following alerts,

CephMonHighNumberOfLeaderChanges
CephOSDDiskNotResponding
CephClusterWarningState

which are resolved automatically in most cases, are causing
unnecessary admin events. So we are increasing the
alert delay ('for' duration) to '15m', as sketched below.

Signed-off-by: aruniiird <amohan@redhat.com>
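For context, the delay is the 'for' duration on the alert rule; a minimal sketch of what such a rule looks like (the 'expr' shown is illustrative, not necessarily this repo's exact rule):

- alert: CephClusterWarningState
  expr: ceph_health_status == 1
  for: 15m
  labels:
    severity: warning

With 'for: 15m', the condition must hold continuously for 15 minutes before the alert fires, so short-lived, self-resolving conditions no longer generate admin events.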
Reverting the time delay of
'CephMonHighNumberOfLeaderChanges' back to 5m

Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
…raising misleading alert

Signed-off-by: Gowtham Shanmugasundaram <gshanmug@redhat.com>
…Replicas'


Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
@aruniiird
Contributor Author

Based on top of:
PR #91
PR #93
PR #94
PR #92

@aruniiird aruniiird changed the title Fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissingReplicas' [WIP] Fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissingReplicas' Oct 15, 2021
@aruniiird aruniiird changed the title [WIP] Fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissingReplicas' Fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissingReplicas' Feb 2, 2022
@umangachapagain umangachapagain merged commit eeb4045 into ceph:master Feb 2, 2022