Have a way to mute alert until it's resolved to receive a resolved notification once it's fixed #3825

Open
freak12techno opened this issue Apr 30, 2024 · 10 comments

@freak12techno

Let's say one of the servers I'm monitoring has an outage and is inaccessible, but I don't know how long it's going to take to fix, so I mute the alert for a really long time.

With this approach, I won't receive any resolved notification, so to check whether the alert is fixed I need to go to my alerts list and see if it's still firing. And since I've muted it for a long time, I also need to remove the mute to be notified if it fires again.

What would be nice to have:

  • I have a server outage
  • an alert is triggered
  • I create a new mute and somehow specify that it should stay active until the alert is resolved
  • the server fixes itself
  • a resolved alert notification is dispatched
  • the mute is removed
  • if the server starts misbehaving again and a new alert is triggered, I receive an alert notification again
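
The flow in the list above can be modelled in a few lines. This is only a toy sketch to make the requested behaviour concrete: the `Silence` and `Notifier` classes, the label-equality matching, and the `handle` method are all hypothetical and are not Alertmanager's API.

```python
from dataclasses import dataclass, field


@dataclass
class Silence:
    """Hypothetical 'until resolved' silence (not an Alertmanager type)."""
    matchers: dict          # label name -> required value
    active: bool = True

    def matches(self, labels: dict) -> bool:
        # An expired silence matches nothing.
        return self.active and all(
            labels.get(name) == value for name, value in self.matchers.items()
        )


@dataclass
class Notifier:
    """Toy model of the requested behaviour; keeps everything in memory."""
    silences: list = field(default_factory=list)
    firing: dict = field(default_factory=dict)   # fingerprint -> labels
    sent: list = field(default_factory=list)     # (status, alertname) tuples

    def handle(self, status: str, labels: dict) -> None:
        fingerprint = tuple(sorted(labels.items()))
        if status == "firing":
            self.firing[fingerprint] = labels
            # Firing notifications are suppressed while a silence matches.
            if not any(s.matches(labels) for s in self.silences):
                self.sent.append(("firing", labels["alertname"]))
            return
        # Resolved: always notify, then expire silences that matched this
        # alert and no longer match any firing alert.
        self.firing.pop(fingerprint, None)
        self.sent.append(("resolved", labels["alertname"]))
        for s in self.silences:
            if s.matches(labels) and not any(
                s.matches(other) for other in self.firing.values()
            ):
                s.active = False


if __name__ == "__main__":
    n = Notifier()
    alert = {"alertname": "InstanceDown", "instance": "srv1"}
    n.handle("firing", alert)                      # notification sent
    n.silences.append(Silence({"instance": "srv1"}))
    n.handle("firing", alert)                      # re-sent alert, suppressed
    n.handle("resolved", alert)                    # notified, silence expires
    n.handle("firing", alert)                      # notified again
    print(n.sent)
```

The sketch ignores everything that makes the real problem hard (restarts, HA, silences matching many alerts), which is what the rest of the thread discusses.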

I'm pretty sure there are a lot of edge cases that would make this difficult, for example a mute that matches many active alerts, but it would still be really awesome to have.

Do you guys think it's manageable?

@grobinson-grafana
Contributor

Hi! 👋 It sounds to me like you want a silence to expire when it is no longer silencing any active alerts.

I think there are a couple of problems that we would need to solve to add such a feature. For example:

  1. Alertmanager does not persist alerts to disk, so if an Alertmanager is restarted all of its alerts will be lost; and because of this all of its silences will also be expired. This is undesirable because the alert might not have actually resolved. Prometheus will resend all alerts to Alertmanager after the resend delay and you may receive duplicate notifications.
  2. If Alertmanager is run in HA (high availability) and one Alertmanager becomes partitioned, then its alerts will resolve, as Prometheus will be unable to communicate with that Alertmanager. The partitioned Alertmanager will expire its silences as it has no more active alerts. When the partition recovers, the Alertmanager that expired its silences will gossip these expirations to the other Alertmanagers, expiring them on those Alertmanagers too.

@freak12techno
Author

@grobinson-grafana seems so.

For issues that you outlined:

  1. Can this be solved by saving alerts to disk every time Alertmanager is receiving one and loading it from disk if it's present, or do you think there are other caveats with this approach?
  2. I don't know a lot about Alertmanager HA internals, but let's say if there is a cluster of 3 nodes and one is partitioned and loses the mute given all the alerts there are resolved, once it goes back, won't other 2 nodes disagree and won't the consensus be that this mute isn't in fact removed? (and if there are 2/3 nodes partitioned, pretty sure expiring mutes aren't gonna be the biggest problem here lol)

@grobinson-grafana
Contributor

  1. Yes that's right! The problem is that Alertmanager is stateless, so some kind of embedded database will need to be evaluated and then all the code will need to be written to use it.
  2. Alertmanager doesn't use consensus for gossiping silences, it's a case of last write wins. Since the expiration was the most recent event, the other Alertmanagers will believe it to be the correct one.
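
To illustrate point 2, here is a simplified model of a last-write-wins merge. It is a sketch, not Alertmanager's actual code: the `SilenceState` type, its fields, and the `merge` function are assumptions made for illustration, with the "most recent update wins" rule taken from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SilenceState:
    """Hypothetical gossiped silence record."""
    id: str
    ends_at: datetime
    updated_at: datetime


def merge(local: dict, incoming: SilenceState) -> bool:
    """Last write wins: keep whichever copy was updated most recently.

    Returns True if the incoming record replaced (or created) the local one.
    """
    current = local.get(incoming.id)
    if current is None or incoming.updated_at > current.updated_at:
        local[incoming.id] = incoming
        return True
    return False


if __name__ == "__main__":
    t0 = datetime(2024, 4, 30, 12, 0, tzinfo=timezone.utc)
    t1 = t0 + timedelta(hours=1)
    # A healthy node still holds the long-running silence created at t0.
    healthy = {"s1": SilenceState("s1", ends_at=t0 + timedelta(days=30), updated_at=t0)}
    # The partitioned node expired it at t1 and gossips that after recovery.
    expired = SilenceState("s1", ends_at=t1, updated_at=t1)
    print(merge(healthy, expired), healthy["s1"].ends_at)
```

Because the expiration carries the newest timestamp, the healthy node accepts it, no matter how many other nodes still hold the older, unexpired copy.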

@freak12techno
Author

@grobinson-grafana

  1. From what I'm seeing, it shouldn't be difficult:
  • when creating/editing/deleting an alert, just dump whatever alerts there are to disk
  • when starting, load the alerts from the saved state if it's present
  • afaik it's not a proper database but basically an alerts snapshot, so there's no need to sync between the file and Alertmanager in cases other than the two above

Do you think that introduces new troubles?

  2. Was the team ever considering a consensus model? I wonder if it has any payoffs other than being more or less a requirement for the feature I'm proposing, and whether it adds more problems.
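
The snapshot idea from point 1 could look roughly like the sketch below. The file layout (a JSON list of alert dicts) and the function names `save_snapshot` and `load_snapshot` are assumptions for illustration only; Alertmanager has no such functions. The write goes to a temporary file first and is then renamed, so a crash mid-write should not leave a half-written snapshot behind.

```python
import json
import os
import tempfile


def save_snapshot(path: str, alerts: list) -> None:
    """Write all alerts to `path`, replacing the previous snapshot atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".alerts-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(alerts, f)
            f.flush()
            os.fsync(f.fileno())      # make sure the data is on disk
        os.replace(tmp_path, path)    # atomic rename over the old snapshot
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load_snapshot(path: str) -> list:
    """Load alerts at startup; a missing file means there is no saved state."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "alerts.json")
        save_snapshot(path, [{"labels": {"alertname": "InstanceDown"}}])
        print(load_snapshot(path))
```

Note that every call to `save_snapshot` rewrites the whole file, which becomes relevant once the number of alerts grows.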

@grobinson-grafana
Contributor

  1. Do you think you have time to work on this? I think the best place to start is to evaluate some of the embedded k/v stores such as bbolt to see which would be the most appropriate.
  2. Yes, but the current Alertmanager design is that alerts should continue to work even if all but one Alertmanager is down. If we add consensus then we need N/2+1 Alertmanagers to be up at all times.
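
As a quick worked example of the trade-off in point 2, the snippet below compares how many Alertmanagers can be down under each model. The function names are made up for illustration; the arithmetic is just the N/2+1 majority rule (with integer division) against the "all but one" property described above.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-node cluster: N/2 + 1 with integer division."""
    return n // 2 + 1


def tolerated_failures_with_consensus(n: int) -> int:
    """Nodes that may be down while a majority is still available."""
    return n - quorum(n)


def tolerated_failures_without_consensus(n: int) -> int:
    """Current design as described above: one surviving node is enough."""
    return n - 1


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(
            n,
            quorum(n),
            tolerated_failures_with_consensus(n),
            tolerated_failures_without_consensus(n),
        )
```

For a 3-node cluster this means tolerating 1 failure with consensus versus 2 without it.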

@freak12techno
Author

@grobinson-grafana for 1) I can try implementing it myself, but I'm not sure if I can manage 2) or if it's even feasible.

@grobinson-grafana
Contributor

Hi! 👋 Do you have time to evaluate some embedded k/v stores? That would be a fantastic contribution as we have discussed durable storage for Alertmanager in the past but haven't decided what to use.

For example, I know that Grafana Loki uses bbolt, but it would be nice to see a comparison of some other embedded databases. You could even include sqlite3. Alertmanager has avoided being dependent on other processes as it needs to operate even when these are unavailable, so that means no MySQL, PostgreSQL, memcache, redis, etc.

Second, it is not uncommon for users to have Alertmanager installations with 10,000s of alerts, so it would be nice to see some performance comparisons of different databases. I expect the workload to be write-heavy as reads will only happen at startup time.

@freak12techno
Author

@grobinson-grafana so I looked a bit into how it's done for silences. Apparently it's all serialised into some binary format and stored on disk as a single file. Do you think it makes sense to do it the same way for alerts here as well, or would it be better to do it via a proper db? Basically we only need to read from it once to load all the alerts when starting Alertmanager and to write to it once an alert is created/updated.

(One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.)

@grobinson-grafana
Contributor

> One issue I see with that approach is that if there are 10k inserts creating alerts, then every time it'll have to overwrite the whole file, which is not nice.

Yes! That's the issue! :) It works for silences because silences are not created very often and you don't tend to have very many of them. But alerts are very different: Alertmanager can be receiving 1000s of alerts per minute, because the EndsAt timestamp of each firing alert needs to be updated continually to stop it from resolving.
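
A rough back-of-the-envelope calculation shows how large the gap is. The numbers are assumptions chosen for illustration: 10,000 alerts inserted one at a time and 500 bytes per serialised alert. If the whole file is rewritten on every insert, the k-th insert writes k alerts, so the total is the sum 1 + 2 + ... + n alerts; an append-only log writes each alert once.

```python
def snapshot_bytes_written(n_alerts: int, alert_size: int) -> int:
    """Total bytes written if the whole file is rewritten on every insert.

    The k-th insert rewrites k alerts, so the total is
    alert_size * (1 + 2 + ... + n) = alert_size * n * (n + 1) / 2.
    """
    return alert_size * n_alerts * (n_alerts + 1) // 2


def append_bytes_written(n_alerts: int, alert_size: int) -> int:
    """Total bytes written if each insert only appends the new alert."""
    return alert_size * n_alerts


if __name__ == "__main__":
    n, size = 10_000, 500   # assumed values, not measurements
    print(snapshot_bytes_written(n, size))   # 25002500000 bytes, about 25 GB
    print(append_bytes_written(n, size))     # 5000000 bytes, about 5 MB
```

Under these assumptions the full-rewrite approach writes roughly 5,000 times more data, and that is before counting the repeated updates to already-firing alerts.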

@freak12techno
Author

@grobinson-grafana okay, from my point of view, sqlite3 here doesn't make a lot of sense as it adds another layer of complexity by having to deal with db schema, so I think this won't be the best approach here.

Among other k/v databases, besides bbolt that you've suggested, one cool option I found is https://github.com/dgraph-io/badger. It has quite a big community (more GitHub stars than bbolt), is used by a lot of projects, and seems to be maintained. I haven't used either of these in my projects, so I can mostly only judge by library popularity and whether it's maintained, and both look good on that front.

What do you think?
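
For comparison with the k/v options, the sketch below shows how small the schema could be if sqlite3 were used purely as a key/value store: a single table keyed by alert fingerprint. This is an illustration only, not a recommendation or a benchmark; the table name, column names, and helper functions are all assumptions, and it uses Python's standard `sqlite3` module rather than anything from Alertmanager.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a one-table store keyed by alert fingerprint."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS alerts ("
        "fingerprint TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return db


def put_alert(db: sqlite3.Connection, fingerprint: str, alert: dict) -> None:
    """Insert or overwrite a single alert without touching the others."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO alerts (fingerprint, body) VALUES (?, ?)",
            (fingerprint, json.dumps(alert)),
        )


def load_all(db: sqlite3.Connection) -> dict:
    """Read every alert back, as would happen once at startup."""
    rows = db.execute("SELECT fingerprint, body FROM alerts")
    return {fingerprint: json.loads(body) for fingerprint, body in rows}


if __name__ == "__main__":
    db = open_store()
    put_alert(db, "abc123", {"alertname": "InstanceDown", "endsAt": "2024-04-30T12:00:00Z"})
    put_alert(db, "abc123", {"alertname": "InstanceDown", "endsAt": "2024-04-30T12:05:00Z"})
    print(load_all(db))
```

Whether this is simpler or faster than bbolt or badger for a write-heavy workload would still need the performance comparison requested earlier in the thread.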


2 participants