Annotation convention #54

Open
devopsjonas opened this issue May 3, 2019 · 9 comments

Comments

@devopsjonas
Contributor

I think we are duplicating info in our current annotations:

For example

            annotations: {
              message: 'Network interface is down on {{ $labels.host }}',
              description: 'Interface {{ $labels.interface }} on host {{ $labels.host }} is down. Please check cabling, and network switch logs',
              storage_type: $._config.storageType,
              severity_level: 'warning',
            },

message and description carry almost the same information.

Maybe we should just use description instead?

Also, what is the storage_type field? Why does it matter when resolving the alert?

I think it would be best if we ended up with something that looks like https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution.

His example

alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
   severity: major
annotations:
   summary: Billing queue appears to be building up (consistently more than 100 jobs waiting)
   dashboard: https://grafana.monitoring.intra/dashboard/db/billing-overview
   impact: Billing is experiencing delays, causing orders to be marked as pending
   runbook: https://wiki.intra/runbooks/billing-queues-issues.html

Also, I think severity should be a label, not an annotation. Alertmanager can only use labels for alert routing: https://prometheus.io/docs/alerting/configuration/#route

So if it's an annotation, you can't route warning alerts to email and critical alerts to Opsgenie/PagerDuty.
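
For example, a minimal routing sketch, written as jsonnet like the rest of the mixin and assuming it is rendered to the Alertmanager YAML config (receiver names, addresses and keys are placeholders; global SMTP/PagerDuty settings are omitted):

{
  route: {
    receiver: 'email',  // default receiver
    routes: [
      // routing matches on the severity *label*; annotations are not usable here
      { match: { severity: 'warning' }, receiver: 'email' },
      { match: { severity: 'critical' }, receiver: 'pagerduty' },
    ],
  },
  receivers: [
    { name: 'email', email_configs: [{ to: 'oncall@example.com' }] },
    { name: 'pagerduty', pagerduty_configs: [{ service_key: 'REPLACE_ME' }] },
  ],
}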

@shtripat
Contributor

shtripat commented May 3, 2019

@devopsjonas The fields message and description are added for different purposes here. The message field is used for displaying a one-liner in the UI dashboard, and description would be used when a detailed listing of alerts is shown in the UI. @cloudbehl please add the UI perspective.

Regarding severity_level: this annotation is added for marking error, warning, etc. in the UI. The severity label would rather be used by Alertmanager, in case we use that functionality.

The field storage_type was used to allow Ceph-specific alerts in the OCP UI.

@devopsjonas
Contributor Author

devopsjonas commented May 3, 2019

Oh I see.

So our use case is a bit different from yours. Yours is rendering in the Ceph Manager UI, is that correct?
Ours is rendering in Opsgenie / PagerDuty / Zabbix / email.

This is how it currently looks:

[screenshot: alert-rendered]

Would you guys be OK if we added grafana_url, runbook_url & dashboard_url annotations, but only rendered them when they are not empty? By default nothing extra would be rendered for you, but for people who set them, they would be rendered.
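
A minimal sketch of how that could look in the mixin, assuming config fields like grafanaURL and runbookURL (the names, alert and expression are illustrative); jsonnet's computed field names make the annotation disappear entirely when the value is empty:

{
  _config+:: {
    grafanaURL: '',   // empty by default, so the annotation is not emitted
    runbookURL: '',
  },

  // Annotations that are only rendered when the configured value is non-empty.
  local urlAnnotations = {
    [if $._config.grafanaURL != '' then 'grafana_url']: $._config.grafanaURL,
    [if $._config.runbookURL != '' then 'runbook_url']: $._config.runbookURL,
  },

  exampleAlert:: {
    alert: 'CephNetworkInterfaceDown',  // illustrative alert
    expr: 'node_network_up == 0',
    labels: { severity: 'warning' },
    annotations: {
      message: 'Network interface is down on {{ $labels.host }}',
    } + urlAnnotations,
  },
}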

@shtripat
Contributor

shtripat commented May 3, 2019

At the moment these alerts are being used in the OCP UI. I would suggest adding these fields as and when you downstream this. Maybe having the fields upstream with blank values is fine.

@cloudbehl @anmolsachan @umangachapagain suggestions?

@devopsjonas
Contributor Author

devopsjonas commented May 3, 2019

@shtripat what is the OCP UI? Could you link me to its GitHub repo or project page?

Also, does this project plan to follow the Prometheus monitoring mixin structure?

Structure:

{
  grafanaDashboards+:: {
    "dashboard-name.json": {...},
  },
  prometheusAlerts+:: [...],
  prometheusRules+:: [...],
}

Could we add grafanaDashboards & prometheusRules if we need to?
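
For what it's worth, a rough sketch of what that could look like, assuming the groups-based shape most mixins use for rules (the dashboard name and recording rule here are made up):

{
  grafanaDashboards+:: {
    // hypothetical dashboard, purely for illustration
    'ceph-cluster.json': {
      title: 'Ceph Cluster',
      panels: [],
    },
  },

  prometheusRules+:: {
    groups+: [{
      name: 'ceph.rules',
      rules: [{
        // hypothetical recording rule, purely for illustration
        record: 'cluster:ceph_osd_up:count',
        expr: 'count(ceph_osd_up == 1)',
      }],
    }],
  },
}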

@cloudbehl
Collaborator

@devopsjonas OCP is OpenShift Container Platform.

Yes, you can add Grafana dashboards and Prometheus rules. The only point is that these should be generic enough so the community can use them. I think we can take care of that in reviews.

@devopsjonas
Contributor Author

@cloudbehl could you link me to the OCP UI code, and explain how to compile & run it? I would like to try the alerts from my PRs against it, just to see how they are rendered.

That said, we shouldn't be adding annotations that only the OCP dashboard understands. At least not in the ceph repo; or maybe rename this to ceph-ocp-mixin then?

IMO human-readable alerts should follow the convention described in https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution or something similar.

@cloudbehl
Collaborator

> @cloudbehl could you link me to the OCP UI code, and explain how to compile & run it? I would like to try the alerts from my PRs against it, just to see how they are rendered.

The OCP UI can be found here:
https://github.com/openshift/console

> That said, we shouldn't be adding annotations that only the OCP dashboard understands. At least not in the ceph repo; or maybe rename this to ceph-ocp-mixin then?

The idea for the repo is to have generic code that can be used by the community for the Ceph alerts and dashboards. Right now we are testing it with Kubernetes and OpenShift (OCP), but I would say it is not specific to the OCP use case.

I agree to a point that description and message are kind of the same and can be worked on. But our idea behind message and description is that message will have a short description, and description will have a detailed account of what happened, where, and what you can look at to correct the problem.

> IMO human-readable alerts should follow the convention described in https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution or something similar.

Ack! I think Summary and Impact in that example are the equivalents of our Message and Description. Also, I am not ready to add a Grafana link right now, as the URLs are dynamic and will not be the same in every use case/deployment.

@devopsjonas
Contributor Author

devopsjonas commented May 4, 2019

OK, so the mixin needs to be generic enough. But then the arguments you wrote in #54 (comment) don't apply, as all you are saying is that you need these fields in the OCP UI. This project needs to enable you guys to build the OCP UI, but it shouldn't have to adapt to its conventions.

Right now the model used in https://github.com/devopyio/ceph-monitoring-mixin is a bit different, as we run these alerts on top of Prometheus, Alertmanager, email and Zabbix.

I really think we should meet somewhere in the middle with our annotation & label convention.

We can drop the idea of grafana_url, dashboard_url and runbook_url (although the kubernetes mixin does this too: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/add-runbook-links.libsonnet).
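
For reference, a rough sketch of that pattern (not the actual kubernetes-mixin code; the config field name and URL pattern are made up), applied as an overlay on top of an existing mixin:

{
  _config+:: {
    runbookURLPattern: 'https://runbooks.example.com/%s',  // placeholder base URL
  },

  prometheusAlerts+:: {
    groups: [
      group {
        rules: [
          // only alerting rules get the extra annotation
          if 'alert' in rule then
            rule {
              annotations+: {
                runbook_url: $._config.runbookURLPattern % rule.alert,
              },
            }
          else
            rule
          for rule in group.rules
        ],
      }
      for group in super.groups
    ],
  },
}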

We didn't come up with our proposed annotation convention ourselves; we just blatantly copied it from the internet because we think it's really good.

Here are the fields we propose to change:

Message and Description

But we should really rethink message vs. description. Our experience shows that message and description end up with duplicated info.

In your case I think it's true too:

Example:

            alert: 'CephClusterCriticallyFull',
            annotations: {
              message: 'Storage cluster is critically full and needs immediate expansion',
              description: 'Storage cluster utilization has crossed 95%.',
              ...
            }

In this case message has more info than description.

In a lot of cases the description just adds "Please check the node immediately." or "Contact Support.":

https://github.com/ceph/ceph-mixins/blob/master/alerts/node.libsonnet#L18
https://github.com/ceph/ceph-mixins/blob/master/alerts/osd.libsonnet#L66
https://github.com/ceph/ceph-mixins/blob/master/alerts/monquorum.libsonnet#L18

IMO in 99% of cases we could merge these two into a single message field.
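
For the CephClusterCriticallyFull example above, a merged version might look something like this (the exact wording is only a suggestion):

{
  alert: 'CephClusterCriticallyFull',
  labels: {
    severity: 'critical',
  },
  annotations: {
    // one human-readable field instead of message + description
    message: 'Storage cluster utilization has crossed 95%; the cluster is critically full and needs immediate expansion.',
  },
}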

severity and severity_level

How is severity different from severity_level? Currently there are some alerts which have severity: critical but severity_level: error.

Can we please just use a severity label? In most cases the value is duplicated. In your UI you can just use that label, no?

Here are some examples:

            alert: 'CephClusterCriticallyFull',
            labels: {
              severity: 'critical',
            },
            annotations: {
              severity_level: 'error',
            },

https://github.com/ceph/ceph-mixins/blob/master/alerts/utilization.libsonnet#L30
https://github.com/ceph/ceph-mixins/blob/master/alerts/utilization.libsonnet#L20
https://github.com/ceph/ceph-mixins/blob/master/alerts/state.libsonnet#L20
...

storage_type

What does storage_type mean? Right now the project has a variable which defaults to ceph. But IMO it's not needed, as this is obviously a ceph mixin.

https://github.com/ceph/ceph-mixins/blob/master/config.libsonnet#L19

Is that for bluestore vs filestore? If that's the case, then we shouldn't be adding it everywhere, but only to those specific alerts.
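
If it really is about bluestore vs filestore, scoping it to the alerts where the backend matters could look roughly like this (the alert name, expression and metric are made up for illustration):

{
  // hypothetical alert that is genuinely specific to one storage backend
  alert: 'CephBluestoreSlowOps',
  expr: 'rate(ceph_bluestore_slow_ops_total[5m]) > 0',
  labels: {
    severity: 'warning',
  },
  annotations: {
    message: 'BlueStore on {{ $labels.ceph_daemon }} is reporting slow operations.',
    storage_type: 'bluestore',  // only set where the backend actually matters
  },
}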

So for now we will change our PRs to follow your convention.

But please 🙏 reconsider the annotations. I know open source is hard sometimes, and we think you guys did an amazing job on these alerts.

@shtripat
Contributor

shtripat commented May 6, 2019

Message and Description

The intention is very clear: use message for the short one-liner shown in the UI and description for the detailed listing of alerts. Even if the values are the same in some cases, I feel it makes sense to have both of them.

severity and severity_level

This was a specific requirement from the UI, to mark critical alerts as errors. @cloudbehl, what do you say? If the severity label can be utilized, we can get rid of the severity_level annotation.

storage_type

This was introduced for easy filtering of storage-specific alerts from the full list of alerts from Prometheus. In the absence of this annotation, what other label/annotation could be used to filter the storage alerts?
