Implementing delays and generic filter functionality #1475

Open
middelthun opened this issue Mar 5, 2021 · 1 comment · May be fixed by #1660
Labels
blocked (This issue can't be resolved unless something else is done), question (Further information is requested)

Comments

@middelthun
Contributor

Hi, Nick!

This will be quite a long post, but please bear with me. We are planning a fairly large PR, so I'd like to give you an early heads up before things are set in stone.

Our company needs a plugin that can delay alerts meeting certain (variable) criteria. This has several use cases, but as an example, we monitor several services that are occasionally unstable due to dependencies on infrastructure outside of our own. We want to receive the alerts as they occur, but we don't want them to be active and open until they have existed for a certain amount of time. If something flapped briefly, we don't really care. This could of course be implemented on the monitoring end of things (we mainly use Telegraf and Kapacitor), but given the large number of stock Telegraf plugins we use, the amount of custom code grows very quickly if we go down that road. Hence we would much prefer to have this functionality within Alerta.

Pressure to implement this has grown over time, and it has been the most requested feature within our organization for a very long time - which is why we are now dedicating programming resources to fix it.

The basic idea is to implement delays in almost the same way as blackouts. Delay rules would be managed with CRUD operations via the API, with an added timeout value per rule that defines how long the alert should be delayed if it matches. When an event reaches Alerta, the plugin will check a table for matching rules, and shelve the alert (for a "timeout" time, as defined by the rules) if it matches the criteria. If the alert matches multiple rules, the timeouts are combined using avg, min or max (configurable).
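To make the matching and timeout aggregation concrete, here is a minimal sketch in plain Python. It is not Alerta code: the `DelayRule` class, its fields, and the `shelve_timeout` helper are hypothetical names, and the matching logic is deliberately simpler than what blackouts do today.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Dict, List, Optional


@dataclass
class DelayRule:
    """Hypothetical delay rule; fields loosely mirror blackout attributes."""
    environment: str
    timeout: int  # seconds a matching alert stays shelved
    service: Optional[str] = None
    resource: Optional[str] = None
    event: Optional[str] = None

    def matches(self, alert: Dict[str, Any]) -> bool:
        # A rule field left as None acts as a wildcard.
        if alert.get("environment") != self.environment:
            return False
        if self.service is not None and self.service not in alert.get("service", []):
            return False
        if self.resource is not None and self.resource != alert.get("resource"):
            return False
        if self.event is not None and self.event != alert.get("event"):
            return False
        return True


# Configurable ways of combining the timeouts of all matching rules.
COMBINE = {"avg": mean, "min": min, "max": max}


def shelve_timeout(alert: Dict[str, Any], rules: List[DelayRule],
                   mode: str = "max") -> Optional[float]:
    """Return the delay in seconds, or None if no rule matches."""
    timeouts = [rule.timeout for rule in rules if rule.matches(alert)]
    if not timeouts:
        return None
    return float(COMBINE[mode](timeouts))
```

With one rule of 60 seconds for the whole environment and one of 180 seconds for a specific service, an alert matching both would be shelved for 60, 120 or 180 seconds depending on whether `mode` is `"min"`, `"avg"` or `"max"`.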

However, as we went through the design phase I realised that this should probably be generalised. Duplicating code with just minor differences between them (all those blackout*.py vs delay*.py files) doesn't seem like a very attractive idea.

As a result, we've changed our strategy a bit, and intend to implement this as a generic filter feature.

Each filter looks roughly like the current blackouts, and will be designed so that blackouts can eventually use the exact same system.

Let me backtrack, and describe it in more detail.

Filters will have a database table, almost identical to the current blackouts table. There can be an arbitrary number of different filter types (a text column in the table), e.g. "blackout" or "delay". Each filter type can have a set of filter-specific data that will be stored in a custom type array (key + value), which can be used by plugins. For instance, start_time, end_time and duration are important for a blackout, but they don't make as much sense for a delay. A delay needs a timeout, to know how long a matching alert should be auto-shelved.
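As a rough illustration of that record shape, the sketch below models one row of the proposed table as a Python dataclass. The class name `Filter`, the matching fields, and the `filters_of_type` helper are assumptions for illustration only; the real schema has not been decided in this issue.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Filter:
    """Hypothetical row in the generic filter table."""
    type: str                      # the text column, e.g. "blackout" or "delay"
    environment: str
    service: List[str] = field(default_factory=list)
    resource: Optional[str] = None
    event: Optional[str] = None
    group: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    # Filter-specific key/value data, interpreted only by the owning plugin.
    attributes: Dict[str, Any] = field(default_factory=dict)


def filters_of_type(filters: List[Filter], filter_type: str) -> List[Filter]:
    """Select the filters a given plugin is responsible for."""
    return [f for f in filters if f.type == filter_type]
```

A blackout would then carry something like `attributes={"start_time": ..., "end_time": ..., "duration": 3600}`, while a delay only needs `attributes={"timeout": 300}`; both share the same matching columns.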

Generalising this part of the code has some big advantages. Blackout specific code can be more detached from Alerta core, since it can just use the generic filter feature. Blackout is a plugin after all, so this makes a lot of sense in my eyes. The strongest selling point, though, is probably that it will be possible for anyone to create a plugin that utilizes the same, flexible alert matching system that blackout currently has.

For example: The delay feature we are currently focusing on is probably useful for a lot of people, but we have another planned plugin which is more specific to us. Incoming events in our monitoring stack have a source-host-tag (the host that generated the alert), and we want to regulate which sources are allowed to deliver alarms on behalf of others. This is easily done, in a very flexible way, using the same filter system.
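A minimal sketch of such a source check is shown below. Everything in it is hypothetical: the `may_report` function, the shape of the `allowed` mapping, the use of glob patterns, and the rule that a host may always report about itself are assumptions made for illustration, not a description of the planned plugin.

```python
from fnmatch import fnmatch
from typing import Dict, List


def may_report(source_host: str, resource: str,
               allowed: Dict[str, List[str]]) -> bool:
    """Decide whether source_host may deliver an alert about resource.

    allowed maps a source host to glob patterns of resources it may
    report on behalf of (hypothetical filter-specific data).
    """
    if source_host == resource:
        return True  # assumption: a host may always report about itself
    return any(fnmatch(resource, pattern)
               for pattern in allowed.get(source_host, []))
```

For instance, with `allowed = {"monitor01": ["web-*"]}`, an alert about `web-03` sent by `monitor01` would be accepted, while one about `db-01` from the same source would be rejected.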

Another example: If someone wants to have built-in blackout scheduling (#960 #1413), they can just make a plugin for it. The filter could contain filter-specific variables like "schedule", or whatever they'd like. The functionality may not fit in Alerta core, but if someone wants it anyway, they can code against the filter feature themselves, and add it as their own custom plugin.

I'm sure creative heads can think of a lot of other use cases for such plugins, both for the general public and specifically for their own setup.

The main challenge I see for a more dynamic filter feature is CRUD permissions. Each filter type should have its own set of permissions. The current set of permissions is not exactly flexible, and it doesn't make sense to have a set of (read|write|admin):filters permissions. That would just be way too generic. I believe this is a solvable problem though, and we'll happily help out, but we won't be taking that on in this iteration. However, in order for the generic filter feature to actually be useful for plugins, permissions must be reworked.

Which leads me to our current design decisions.

  • We will not do anything with blackouts. They can live side-by-side with the filter as is, and be deprecated at a later point in time.
  • We will focus on the delay implementation now, and when the code is considered stable, and the time is right, blackouts may follow in a controlled manner.
  • We will not do anything with permissions. (At least not now.) Small changes, one at a time. For now we will probably just add a fixed set of delays-permissions, to make the delay plugin work, and ensure that the filter feature works as intended. If a custom plugin wants to utilize the filter feature, however, it will need its own set of CRUD permissions. In other words, the biggest gain won't come to fruition until permissions are reworked. I'm sure you have some ideas here, and we can play ball.
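To show what per-type permissions could look like once they are reworked, here is a small sketch. The scope naming (`read:delays`, `write:delays`, and so on), the `scope_for` and `is_allowed` helpers, and the assumption that admin implies write and write implies read are all illustrative; none of this is an existing Alerta API or a decided design.

```python
from typing import Iterable

# Ordered from weakest to strongest.
ACTIONS = ("read", "write", "admin")


def scope_for(action: str, filter_type: str) -> str:
    """Build a hypothetical per-type scope name, e.g. 'write:delays'."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return f"{action}:{filter_type}s"


def is_allowed(user_scopes: Iterable[str], action: str, filter_type: str) -> bool:
    """Check whether any held scope covers the action for this filter type.

    Assumption: a stronger action on the same filter type implies the
    weaker ones (admin covers write, write covers read).
    """
    held = set(user_scopes)
    wanted = ACTIONS.index(action)
    return any(scope_for(a, filter_type) in held for a in ACTIONS[wanted:])
```

Under this scheme a user holding `write:delays` could read and edit delay rules but would have no access to filters of any other type, which is the separation a single `(read|write|admin):filters` scope cannot express.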

The basic design is mostly ready now, and development has started. It shouldn't take us too long to put this in place, since most of the code is almost identical to blackouts. Once we have produced something more tangible, we can have a look at the gritty details.

Management considers this an important task, and we have a dedicated team working on making it happen. Naturally we will take on necessary changes to the frontend and the like as the need arises.

Hopefully our ideas seem sound to you. We've tried to streamline them along Alerta's general design, but your thoughts on the subject are highly appreciated.

I'll happily take questions and feedback.

--
Øystein Middelthun
Senior Systems Consultant
Basefarm

@satterly
Member

satterly commented Mar 6, 2021

Thanks for the in-depth description of your proposed enhancement. However, there are a lot of unanswered questions.

Are you on the Slack channel? I think it would be better to discuss some of the design decisions there.

@satterly added the blocked and question labels Nov 9, 2021
@sixcare linked a pull request that will close this issue Dec 15, 2021
Projects
Status: 🛠 In Progress