
## Global configuration

If you notice a delay between an event and the first notification, read the following blog post on understanding alerting delays: https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html.

## Prometheus configuration

{% highlight yaml %}

# prometheus.yml
global:
  scrape_interval: 20s

  # A short evaluation_interval will check alerting rules very often.
  # It can be costly if you run Prometheus with 100+ alerts.
  evaluation_interval: 20s
  ...

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  ...

{% endhighlight %}
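The snippet above elides the alerting section that tells Prometheus where to send firing alerts. A minimal sketch, assuming a single Alertmanager running on the same host on its default port 9093:

{% highlight yaml %}
# prometheus.yml (continued)
alerting:
  alertmanagers:
    - static_configs:
        # assumption: Alertmanager listens locally on its default port
        - targets: ['localhost:9093']
{% endhighlight %}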

{% highlight yaml %}

# alerts/example-redis.yml
groups:
  - name: ExampleRedisGroup
    rules:
      - alert: ExampleRedisDown
        expr: redis_up{} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis instance down"
          description: "Whatever"

{% endhighlight %}
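The description above is just a placeholder. Alert annotations accept Go templating, so a slightly more informative annotations block could look like the following sketch (assuming the redis_exporter target exposes the usual instance label):

{% highlight yaml %} {% raw %}
        annotations:
          summary: "Redis instance down"
          # $labels holds the labels of the series that triggered the alert
          description: "Redis instance {{ $labels.instance }} has been down for more than 2 minutes."
{% endraw %} {% endhighlight %}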

## AlertManager configuration

{% highlight yaml %} {% raw %}

# alertmanager.yml
route:
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This ensures that multiple alerts for the same group that start
  # firing shortly after one another are batched together in the first
  # notification.
  group_wait: 10s

  # Once the first notification has been sent, wait 'group_interval' to send a
  # batch of new alerts that started firing for that group.
  group_interval: 30s

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend it.
  repeat_interval: 30m

  # A default receiver
  receiver: "slack"

  # All the above attributes are inherited by all child routes and can be
  # overwritten on each.
  routes:
    - receiver: "slack"
      group_wait: 10s
      match_re:
        severity: critical|warning
      continue: true

    - receiver: "pager"
      group_wait: 10s
      match_re:
        severity: critical
      continue: true

receivers:

{% endraw %} {% endhighlight %}
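The receivers list is left empty above. A minimal sketch of the two receivers referenced by the routes, with placeholder credentials (the webhook URL and PagerDuty key below are hypothetical):

{% highlight yaml %}
receivers:
  - name: "slack"
    slack_configs:
      # hypothetical incoming-webhook URL and channel
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        send_resolved: true

  - name: "pager"
    pagerduty_configs:
      # hypothetical PagerDuty integration key
      - service_key: "0123456789abcdef0123456789abcdef"
{% endhighlight %}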

## Reduce Prometheus server load

For expensive or frequent PromQL queries, Prometheus lets you precompute the result with recording rules.

{% highlight yaml %} {% raw %}
groups:
  # first define the recording rule
  - name: ExampleRecordedGroup
    rules:
      - record: job:rabbitmq_queue_messages_delivered_total:rate:5m
        expr: rate(rabbitmq_queue_messages_delivered_total[5m])

  # then use it in alerts
  - name: ExampleAlertingGroup
    rules:
      - alert: ExampleRabbitmqLowMessageDelivery
        expr: sum(job:rabbitmq_queue_messages_delivered_total:rate:5m) < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Low delivery rate in Rabbitmq queues"
{% endraw %} {% endhighlight %}

## Troubleshooting

If a notification takes too long to be triggered, check the following delays, which add up (a rough worst-case estimate is given after the list):

- scrape_interval = 20s (prometheus.yml)
- evaluation_interval = 20s (prometheus.yml)
- increase(mysql_global_status_slow_queries[1m]) > 0 (alerts/example-mysql.yml)
- for: 5m (alerts/example-mysql.yml)
- group_wait = 10s (alertmanager.yml)
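As a rough worst case with these example values (ignoring the time spent actually delivering the notification): up to 20s until the next scrape, up to 20s until the next rule evaluation, 5m of pending state from the for clause, and up to 10s of group_wait, i.e. close to six minutes between the slow query and the first notification. For later notifications of the same group, group_interval and repeat_interval add further delay.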

Also read: