ceph: increasing the auto-resolvable alerts' delay to 15m #8896
@@ -150,7 +150,7 @@ spec:
 storage_type: ceph
 expr: |
 label_replace((ceph_osd_in == 1 and ceph_osd_up == 0),"disk","$1","ceph_daemon","osd.(.*)") + on(ceph_daemon) group_left(host, device) label_replace(ceph_disk_occupation,"host","$1","exported_instance","(.*)")
-for: 1m
+for: 15m
Changing the `for` interval of a critical-severity alert like CephOSDDiskNotResponding
from 1 minute to 15 minutes seems a bit ... risky!
@aruniiird what's the rationale behind that change?
This change is mainly for the ODF Managed Service product. Their SREs (their customer support team) are getting action-required alerts that are then resolved automatically (either by Ceph itself or through OCS Operator reconciliation), so they want to increase the alert delay for the following alerts:
CephMonHighNumberOfLeaderChanges
CephOSDDiskNotResponding
CephClusterWarningState
As we don't have a separate alert mechanism for Managed Services, we are making the changes here.
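For readers weighing the trade-off discussed above, here is a minimal sketch of how a Prometheus-style `for` clause treats a condition that clears on its own. It is not code from this PR: the function name `firing_states` and the one-minute evaluation interval are assumptions made for illustration, and it only models the rule that an alert fires once its expression has been continuously true for at least the `for` duration.

```python
from typing import Iterable, List


def firing_states(condition: Iterable[bool], for_steps: int) -> List[bool]:
    """Return, per evaluation, whether the alert would be firing.

    condition: result of the alert expression at each evaluation
               (True means the expression returned a match).
    for_steps: the `for` duration divided by the evaluation interval
               (hypothetical 1-minute interval: `for: 15m` -> 15).
    """
    states = []
    held = 0  # consecutive evaluations for which the condition has been true
    for active in condition:
        # Any evaluation where the condition is false resets the pending timer.
        held = held + 1 if active else 0
        # The first true evaluation starts the timer; the alert fires once
        # the condition has held for `for_steps` further intervals.
        states.append(held > for_steps)
    return states


# An OSD that is reported down for five evaluations and then recovers.
blip = [False] + [True] * 5 + [False] * 3

fires_with_1m = any(firing_states(blip, for_steps=1))
fires_with_15m = any(firing_states(blip, for_steps=15))
```

Under these assumptions the five-minute outage fires with the old `for: 1m` value but stays pending and never fires with `for: 15m`. The flip side, which is the reviewer's concern, is that a genuine failure is also reported roughly 15 minutes later.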
@Mergifyio rebase
The following alerts, which in most cases are resolved automatically, are causing unnecessary admin events: CephMonHighNumberOfLeaderChanges, CephOSDDiskNotResponding, CephClusterWarningState. So we are increasing the alert delay time to '15m'. Signed-off-by: Arun Kumar Mohan <amohan@redhat.com>
ba59842 to a7f16c9
ceph: increasing the auto-resolvable alerts' delay to 15m (backport #8896)