
Assistance Needed with Prometheus and Alertmanager Configuration #3781

Open
Trio-official opened this issue Mar 29, 2024 · 3 comments

@Trio-official

I am encountering challenges with configuring Prometheus and Alertmanager for my application's alerting setup. Below are the configurations I am currently using:

prometheus.yml:
scrape_interval: 1h

rules.yml:

groups:
  - name: recording-rule
    interval: 1h
    rules:
      - record: myRecord
        expr: expression….. (a ratio of two metrics compared against a threshold)

  - name: alerting-rule
    interval: 4h
    rules:
      - alert: myAlert
        expr: max_over_time(myRecord[4h])
        labels:
          severity: warning
        annotations:
          summary: "summary"

alertmanager.yml:

group_by: ['alertname']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
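
For context, these keys live under the top-level route: block of alertmanager.yml; the fuller file looks roughly like this (the receiver name is only a placeholder):

route:
  receiver: my-receiver          # placeholder receiver name
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: my-receiver
    # actual notification integration (webhook, email, ...) omitted here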

Issues:

  • Inconsistent Alerting: Because the scrape interval and the recording-rule evaluation interval are both 1 hour, the scrape sometimes lands so far ahead of the rule evaluation that the sample has already gone stale by the time the rule runs. The recording rule then produces no value, and no alert fires even though the condition is satisfied.

  • Discrepancy in Firing Alerts: The number of firing alerts in Prometheus differs significantly from the number of alerts received by Alertmanager, causing inconsistency and confusion in alert handling.

  • Uncertainty in Alert Evaluation Timing: The alerting rule seems to be evaluated inconsistently, sometimes triggering alerts shortly after a service restart and at other times only after delays beyond the expected 4-hour interval.

Request for Assistance:

I am seeking guidance on configuring Prometheus and Alertmanager to achieve the following:

  • Ensuring the alerting expression is evaluated every 4 hours, checking the maximum of the recorded metric over that 4-hour window.
  • Ensuring the recording rule is evaluated every hour so that alerts are triggered accurately (see the sketch after this list).
  • Any insights or recommendations on addressing these challenges and achieving the desired configuration for our use case would be appreciated.
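
To make the intent concrete, here is a sketch of the rules above with comments on which setting controls what (the recording expression is still elided):

groups:
  - name: recording-rule
    interval: 1h     # group interval: evaluate the recording rule once per hour
    rules:
      - record: myRecord
        expr: expression…   # ratio of two metrics compared against a threshold

  - name: alerting-rule
    interval: 4h     # group interval: evaluate the alerting rule once every 4 hours
    rules:
      - alert: myAlert
        # the [4h] range selector is the look-back window: at each evaluation,
        # take the maximum of the myRecord samples from the previous 4 hours
        expr: max_over_time(myRecord[4h])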

Thanks in advance.

@TheMeier
Contributor

Hi there, generally the issue tracker is not the right place for questions like this. Please consider taking it to https://groups.google.com/g/prometheus-users or a similar forum.

Very likely the issue you are facing here is staleness. If you only scrape every hour, your metric will be stale (and thus non-existent) for 55 minutes of every hour.
From a prometheus-users post:

Either way, Prometheus is not going to handle hourly scraping well, the practical upper limit of scrape interval is 2 minutes. I would recommend changing the way your exporter works, I would probably do something like run it as a cron job and use the pushgateway or node_exporter textfile collector.
https://groups.google.com/g/prometheus-users/c/2DDL7FKMeVk/m/N5WJ8hUnAAAJ

So that basically means your configuration is not supported.
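
As a rough sketch of the approach from that thread, assuming a Pushgateway (the job name and address below are just examples): have the hourly cron job push its result to the Pushgateway, and let Prometheus scrape the Pushgateway at a normal interval. The Pushgateway keeps serving the last pushed value, so the series never goes stale between runs:

scrape_configs:
  - job_name: pushgateway               # example job name
    honor_labels: true                  # keep the job/instance labels set by the batch job
    scrape_interval: 1m                 # scrape often even though new data arrives hourly
    static_configs:
      - targets: ['pushgateway:9091']   # example address; 9091 is the default Pushgateway port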

Please close this issue as it is not a bug or feature request for alertmanager.

@Trio-official
Author

Trio-official commented Mar 30, 2024

Thank you for your prompt response and guidance on addressing the metric staleness issue.

Regarding your suggestion (in the linked thread) to use range selectors in the recording and alerting rules, e.g. max_over_time(metric[1h]), I confirm that I have already implemented this approach. However, the main challenge persists: the number of alerts generated by Prometheus does not match the number displayed in Alertmanager.

To illustrate, Prometheus may show approximately 25,000 alerts firing within a given period, yet Alertmanager often displays a significantly different count, such as 10,000 or 18,000, rather than the expected 25,000.

This inconsistency poses a significant challenge in our alert management process, leading to confusion and potentially causing critical alerts to be overlooked.

I would greatly appreciate any further insights or recommendations you may have to address this issue and ensure alignment between Prometheus and Alertmanager in terms of the number of alerts generated and displayed.

@grobinson-grafana
Contributor

As @TheMeier said https://groups.google.com/g/prometheus-users is the best place to ask such questions. Could you please close this issue?
