Annotation convention #54

Open
devopsjonas opened this issue May 3, 2019 · 9 comments

Comments

@devopsjonas
Contributor

I think we are duplicating info in our current annotations:

For example

            annotations: {
              message: 'Network interface is down on {{ $labels.host }}',
              description: 'Interface {{ $labels.interface }} on host {{ $labels.host }} is down. Please check cabling, and network switch logs',
              storage_type: $._config.storageType,
              severity_level: 'warning',
            },

message and description carry almost the same information.

Maybe we should just use description instead?

Also, what is the storage_type field? Why does it matter when resolving the alert?

I think it would be best if we ended up with something that looks like https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution.

His example

alert: Lots_Of_Billing_Jobs_In_Queue
expr: sum(jobs_in_queue{service="billing-processing"}) > 100
for: 5m
labels:
   severity: major
annotations:
   summary: Billing queue appears to be building up (consistently more than 100 jobs waiting)
   dashboard: https://grafana.monitoring.intra/dashboard/db/billing-overview
   impact: Billing is experiencing delays, causing orders to be marked as pending
   runbook: https://wiki.intra/runbooks/billing-queues-issues.html

Also, I think severity should be a label, not an annotation. Alertmanager can only use labels for alert routing: https://prometheus.io/docs/alerting/configuration/#route

So if it's an annotation, you can't route warning alerts to email and critical alerts to Opsgenie/PagerDuty.
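
For example, a minimal routing sketch, written as jsonnet like the rest of the mixin and assuming it is rendered to the Alertmanager YAML config (receiver names, addresses and keys are placeholders; global SMTP/PagerDuty settings are omitted):

{
  route: {
    receiver: 'email',  // default receiver
    routes: [
      // routing matches on the severity *label*; annotations are not usable here
      { match: { severity: 'warning' }, receiver: 'email' },
      { match: { severity: 'critical' }, receiver: 'pagerduty' },
    ],
  },
  receivers: [
    { name: 'email', email_configs: [{ to: 'oncall@example.com' }] },
    { name: 'pagerduty', pagerduty_configs: [{ service_key: 'REPLACE_ME' }] },
  ],
}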

@shtripat
Contributor

shtripat commented May 3, 2019

@devopsjonas The fields message and description are added for different purposes here. The message field is used for displaying a one-liner in the UI dashboard, and description would be used when a detailed listing of alerts is shown in the UI. @cloudbehl please add the UI perspective.

Regarding severity_level: this annotation is added for marking error, warning, etc. in the UI. The severity label would rather be used by Alertmanager, in case we use that functionality.

The field storage_type was used to allow Ceph-specific alerts in the OCP UI.

@devopsjonas
Contributor Author

devopsjonas commented May 3, 2019

Oh I see.

So our use case is a bit different from yours. Yours is rendering in the Ceph Manager UI, is that correct?
Ours is rendering in Opsgenie / PagerDuty / Zabbix / email.

This is how it currently looks:

[screenshot: alert-rendered]

Would you guys be OK if we added grafana_url, runbook_url & dashboard_url annotations, but only rendered them when they are not empty? By default nothing extra would be rendered for you, but for people who set them, they would be rendered.
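
A minimal sketch of how that could look in the mixin, assuming config fields like grafanaURL and runbookURL (the names, alert and expression are illustrative); jsonnet's computed field names make the annotation disappear entirely when the value is empty:

{
  _config+:: {
    grafanaURL: '',   // empty by default, so the annotation is not emitted
    runbookURL: '',
  },

  // Annotations that are only rendered when the configured value is non-empty.
  local urlAnnotations = {
    [if $._config.grafanaURL != '' then 'grafana_url']: $._config.grafanaURL,
    [if $._config.runbookURL != '' then 'runbook_url']: $._config.runbookURL,
  },

  exampleAlert:: {
    alert: 'CephNetworkInterfaceDown',  // illustrative alert
    expr: 'node_network_up == 0',
    labels: { severity: 'warning' },
    annotations: {
      message: 'Network interface is down on {{ $labels.host }}',
    } + urlAnnotations,
  },
}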

@shtripat
Contributor

shtripat commented May 3, 2019

At the moment these alerts are being used in the OCP UI. I would suggest adding these fields as and when you downstream this. Maybe having the fields upstream with blank values is fine.

@cloudbehl @anmolsachan @umangachapagain suggestions?

@devopsjonas
Contributor Author

devopsjonas commented May 3, 2019

@shtripat what is the OCP UI? Could you link me to its GitHub repo or project page?

Also, does this project plan to follow the Prometheus monitoring mixin structure?

Structure:

{
  grafanaDashboards+:: {
    "dashboard-name.json": {...},
  },
  prometheusAlerts+:: [...],
  prometheusRules+:: [...],
}

Could we add grafanaDashboards & prometheusRules if we need to?
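
For what it's worth, a rough sketch of what that could look like, assuming the groups-based shape most mixins use for rules (the dashboard name and recording rule here are made up):

{
  grafanaDashboards+:: {
    // hypothetical dashboard, purely for illustration
    'ceph-cluster.json': {
      title: 'Ceph Cluster',
      panels: [],
    },
  },

  prometheusRules+:: {
    groups+: [{
      name: 'ceph.rules',
      rules: [{
        // hypothetical recording rule, purely for illustration
        record: 'cluster:ceph_osd_up:count',
        expr: 'count(ceph_osd_up == 1)',
      }],
    }],
  },
}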

@cloudbehl
Collaborator

@devopsjonas OCP is OpenShift Container Platform.

Yes, you can add Grafana dashboards and Prometheus rules. The only point is that these should be generic enough so the community can use them. I think we can take care of that in reviews.

@devopsjonas
Contributor Author

@cloudbehl could you link me to the OCP UI code, and explain how to compile & run it? I would like to try the alerts from my PRs against it, just to see how they are rendered.

That said, we shouldn't be adding annotations that only the OCP dashboard understands. At least not in the ceph repo; or maybe rename this to ceph-ocp-mixin then?

IMO human-readable alerts should follow the convention described in https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution or something similar.

@cloudbehl
Collaborator

> @cloudbehl could you link me to the OCP UI code, and explain how to compile & run it? I would like to try the alerts from my PRs against it, just to see how they are rendered.

The OCP UI can be found here:
https://github.com/openshift/console

> That said, we shouldn't be adding annotations that only the OCP dashboard understands. At least not in the ceph repo; or maybe rename this to ceph-ocp-mixin then?

The idea for the repo is to have generic code that can be used by the community for the Ceph alerts and dashboards. Right now we are testing it with Kubernetes and OpenShift (OCP), but I would say it is not specific to the OCP use case.

I agree to a point that description and message are kind of the same and can be worked on. But our idea behind message and description is that message will have a short description, and description will have a detailed account of what happened, where, and what you can look at to correct the problem.

> IMO human-readable alerts should follow the convention described in https://blog.pvincent.io/2017/12/prometheus-blog-series-part-5-alerting-rules/#provide-context-to-facilitate-resolution or something similar.

Ack! I think Summary and Impact in that example are the equivalents of our Message and Description. Also, I am not ready to add a Grafana link right now, as the URLs are dynamic and will not be the same in every use case/deployment.

@devopsjonas
Contributor Author

devopsjonas commented May 4, 2019

OK, so the mixin needs to be generic enough. But then the arguments you wrote in #54 (comment) don't apply, as all you are saying is that you need these fields in the OCP UI. This project needs to enable you guys to build the OCP UI, but it shouldn't have to adapt to its conventions.

Right now the model used in https://github.com/devopyio/ceph-monitoring-mixin is a bit different, as we run these alerts on top of Prometheus, Alertmanager, email and Zabbix.

I really think we should meet somewhere in the middle with our annotation & label convention.

We can drop the idea of grafana_url, dashboard_url and runbook_url (although the kubernetes mixin does this too: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/alerts/add-runbook-links.libsonnet).
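
For reference, a rough sketch of that pattern (not the actual kubernetes-mixin code; the config field name and URL pattern are made up), applied as an overlay on top of an existing mixin:

{
  _config+:: {
    runbookURLPattern: 'https://runbooks.example.com/%s',  // placeholder base URL
  },

  prometheusAlerts+:: {
    groups: [
      group {
        rules: [
          // only alerting rules get the extra annotation
          if 'alert' in rule then
            rule {
              annotations+: {
                runbook_url: $._config.runbookURLPattern % rule.alert,
              },
            }
          else
            rule
          for rule in group.rules
        ],
      }
      for group in super.groups
    ],
  },
}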

We didn't come up with our proposed annotation convention ourselves; we just blatantly copied it from the internet because we think it's really good.

Here are the fields we propose to change:

Message and Description

But we should really rethink message vs. description. Our experience shows that message and description end up with duplicated info.

In your case I think it's true too:

Example:

            alert: 'CephClusterCriticallyFull',
            annotations: {
              message: 'Storage cluster is critically full and needs immediate expansion',
              description: 'Storage cluster utilization has crossed 95%.',
              ...
            }

In this case message has more info than description.

In a lot of cases the description just adds "Please check the node immediately." or "Contact Support.":

https://github.com/ceph/ceph-mixins/blob/master/alerts/node.libsonnet#L18
https://github.com/ceph/ceph-mixins/blob/master/alerts/osd.libsonnet#L66
https://github.com/ceph/ceph-mixins/blob/master/alerts/monquorum.libsonnet#L18

IMO in 99% of cases we could merge these two into a single message field.
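
For the CephClusterCriticallyFull example above, a merged version might look something like this (the exact wording is only a suggestion):

{
  alert: 'CephClusterCriticallyFull',
  labels: {
    severity: 'critical',
  },
  annotations: {
    // one human-readable field instead of message + description
    message: 'Storage cluster utilization has crossed 95%; the cluster is critically full and needs immediate expansion.',
  },
}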

severity and severity_level

How is severity different from severity_level? Currently there are some alerts which have severity: critical but severity_level: error.

Can we please just use a severity label? In most cases the value is duplicated. In your UI you can just use that label, no?

Here are some examples:

            alert: 'CephClusterCriticallyFull',
            labels: {
              severity: 'critical',
            },
            annotations: {
              severity_level: 'error',
            },

https://github.com/ceph/ceph-mixins/blob/master/alerts/utilization.libsonnet#L30
https://github.com/ceph/ceph-mixins/blob/master/alerts/utilization.libsonnet#L20
https://github.com/ceph/ceph-mixins/blob/master/alerts/state.libsonnet#L20
...

storage_type

What does storage_type mean? Right now the project has a variable which defaults to ceph. But IMO it's not needed, as this is obviously a ceph mixin.

https://github.com/ceph/ceph-mixins/blob/master/config.libsonnet#L19

Is that for bluestore vs filestore? If that's the case, then we shouldn't be adding it everywhere, but only to those specific alerts.
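
If it really is about bluestore vs filestore, scoping it to the alerts where the backend matters could look roughly like this (the alert name, expression and metric are made up for illustration):

{
  // hypothetical alert that is genuinely specific to one storage backend
  alert: 'CephBluestoreSlowOps',
  expr: 'rate(ceph_bluestore_slow_ops_total[5m]) > 0',
  labels: {
    severity: 'warning',
  },
  annotations: {
    message: 'BlueStore on {{ $labels.ceph_daemon }} is reporting slow operations.',
    storage_type: 'bluestore',  // only set where the backend actually matters
  },
}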

So for now we will change our PRs to follow your convention.

But please 🙏 reconsider the annotations. I know open source is hard sometimes, and we think you guys did an amazing job on these alerts.

@shtripat
Contributor

shtripat commented May 6, 2019

Message and Description

The intention is very clear: use message for the short one-liner shown in the UI and description for the detailed listing of alerts. Even if the values are the same in some cases, I feel it makes sense to have both of them.

severity and severity_level

This was a specific requirement from the UI, to mark critical alerts as errors. @cloudbehl, what do you say? If the severity label can be utilized, we can get rid of the severity_level annotation.

storage_type

This was introduced for easy filtering of storage-specific alerts from the full list of alerts from Prometheus. In the absence of this annotation, what other label/annotation could be used to filter the storage alerts?
