Fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissingReplicas' #96

Merged
merged 7 commits into from
Feb 2, 2022

Commits on Oct 6, 2021

  1. Adding 'namespace' to the 'ceph_node_down' query

    Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
    aruniiird authored and GowthamShanmugam committed Oct 6, 2021
    937f993
  2. Change CephAbsentMgr to use 'up' query

    Instead of the 'absent' query, we use 'up', which should
    provide the needed 'namespace' field in the resulting metrics
    
    Signed-off-by: aruniiird <amohan@redhat.com>
    aruniiird authored and GowthamShanmugam committed Oct 6, 2021
    fcf1565
  3. Adding namespace field into other alert queries

    Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
    aruniiird authored and GowthamShanmugam committed Oct 6, 2021
    e0b35f4
  4. Increasing the auto-resolvable alerts' delay to 15m

    The following alerts resolve automatically in most cases and
    cause unnecessary admin events:
    
    CephMonHighNumberOfLeaderChanges
    CephOSDDiskNotResponding
    CephClusterWarningState
    
    So we are increasing their alert delay time to '15m'.
    
    Signed-off-by: aruniiird <amohan@redhat.com>
    aruniiird authored and GowthamShanmugam committed Oct 6, 2021
    ae51e28
  5. Reverting the time delay of 'CephMonHighNumberOfLeaderChanges'

    Reverting the time delay of
    'CephMonHighNumberOfLeaderChanges' back to 5m
    
    Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
    aruniiird authored and GowthamShanmugam committed Oct 6, 2021
    082c58a
  6. Bug 1970354: Handle empty ceph_version in ceph_mon_metadata to avoid …

    …raising misleading alert
    
    Signed-off-by: Gowtham Shanmugasundaram <gshanmug@redhat.com>
    GowthamShanmugam committed Oct 6, 2021
    a5fa42d

Commits on Oct 15, 2021

  1. Fixing the queries for alerts 'CephMgrIsAbsent' and 'CephMgrIsMissing…

    …Replicas'
    
    CephMgrIsAbsent
    ----------------
    This alert initially used the following query
    
    absent(up{job="rook-ceph-mgr"})
    
    which fires when the 'up' series is not present, but had two flaws
      a. it does not fire if 'up' returns a result with value ZERO
      b. its result does not carry the fields of the metric, so 'namespace' was missing
    
    When the above query was replaced with the following,
    
    up{job="rook-ceph-mgr"} == 0
    
    the new query had the following shortcoming
      a. whenever the mgr pod is completely down (for example 'replicas' set to ZERO and
    'mgr' not coming up), the 'up' query does not return any result.
    
    Thus we had to combine both queries to get results in both scenarios.
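    
    A minimal sketch of how the two queries could be combined. The exact
    expression used in the commit is not shown here, so the form below is
    an assumption for illustration only:
    
    ```promql
    # Left side: matches a mgr target that is scraped but reports down
    # (value 0); the series keeps its labels, including 'namespace'.
    # Right side: matches when no 'up' series exists for the job at all;
    # 'absent' returns a single series that carries only the 'job' label.
    up{job="rook-ceph-mgr"} == 0
      or
    absent(up{job="rook-ceph-mgr"})
    ```
    
    Note that the 'absent' branch still has no 'namespace' field, so a real
    rule would probably need to add it separately (for example with
    'label_replace').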
    
    CephMgrIsMissingReplicas
    ------------------------
    Previously this query was
    
    sum(up{job="rook-ceph-mgr"}) < 1
    
    which had the same structure as the above (Absent) query, but its
    intention was to check the number of 'replicas' for ceph mgr.
    It is now changed to a kube query which checks the replica count.
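    
    A hedged sketch of what such a kube query might look like, using the
    kube-state-metrics series 'kube_deployment_spec_replicas'. The
    deployment name pattern is an assumption, and the query in the commit
    may differ:
    
    ```promql
    # Sum the desired replica count of the mgr deployments per namespace
    # and match when it drops below one.
    # The 'rook-ceph-mgr-.*' pattern is a hypothetical deployment name match.
    sum by (namespace) (
      kube_deployment_spec_replicas{deployment=~"rook-ceph-mgr-.*"}
    ) < 1
    ```
    
    Because the deployment metric exists even when no mgr pod is running,
    this form still returns a result (with 'namespace') when 'replicas' is
    set to ZERO.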
    
    Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
    aruniiird committed Oct 15, 2021
    a32a4c3