RFE: A system monitoring and alerting role #47

myllynen · 2020-11-10T16:59:33Z

I would like to see a role added that configures essential system monitoring and automatically alerts the administrator when something goes wrong. The areas to monitor, the thresholds that raise alerts, and the alerting methods should all be configurable. Once the role is configured, the administrator should not need to manually monitor or read anything to confirm that a system is behaving as expected, and should receive a notification when issues arise.

In practice, at least the following alerting methods could be considered:

D-Bus
email
HTTP POST (this would probably also cover chat)
SNMP
SMS
syslog

The following areas could be monitored with configurable thresholds, e.g., a default limit of 90% for the disk-full case:

CPU usage - e.g., detect CPU hogs on non-dedicated systems where no process should utilize CPU for a long time
memory usage - e.g., monitor how much memory and swap is used and how much swap-in/swap-out activity there is
disk usage - e.g., monitor that no partition is getting full
network connectivity - e.g., monitor that gateway, DNS, NTP servers are pingable and no packet loss detected
application issues - e.g., generic cases like process segfaulting constantly or a service failing to start
security violations - e.g., a high number of failed SSH login attempts, SELinux AVCs, DDoS, or sudo failures
hardware failures - e.g., IO errors from storage or current hardware not matching a predefined configuration

The user could select one or more alerting methods; local syslog could be the default since it's probably the easiest to set up correctly. The default set of areas to monitor and the default thresholds could be determined after consulting people and organizations that maintain and support production systems.
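To make the "configurable threshold, syslog by default" idea concrete, here is a minimal Python sketch of the disk-full case. It is only an illustration of the behaviour described above, not how PCP/pmie or any role implements it; the function names and the `DEFAULT_DISK_LIMIT_PERCENT` constant are hypothetical.

```python
import shutil
import syslog

# Hypothetical default, taken from the 90% disk-full example above.
DEFAULT_DISK_LIMIT_PERCENT = 90.0


def used_percent(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 for an empty device)."""
    return 100.0 * used / total if total else 0.0


def over_limit(total: int, used: int,
               limit: float = DEFAULT_DISK_LIMIT_PERCENT) -> bool:
    """True when usage has reached or passed the configured limit."""
    return used_percent(total, used) >= limit


def check_mount(path: str = "/",
                limit: float = DEFAULT_DISK_LIMIT_PERCENT) -> bool:
    """Check one mount point; log a warning to local syslog if over the limit."""
    usage = shutil.disk_usage(path)
    alert = over_limit(usage.total, usage.used, limit)
    if alert:
        percent = used_percent(usage.total, usage.used)
        syslog.syslog(
            syslog.LOG_WARNING,
            f"disk usage on {path} is {percent:.1f}% (limit {limit:.0f}%)",
        )
    return alert
```

Calling `check_mount("/var", limit=80)` would apply a stricter, user-chosen limit to a single mount point; other alerting methods would replace the `syslog.syslog` call.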

Implementation-wise, one potential candidate would be PCP/pmie, at least for the CPU/memory/storage/network related areas. PCP/pmie uses the same PCP infrastructure as the existing metrics role to detect anomalies, is fully configurable, allows calling external scripts on events, and is nowadays a standard component in most distributions. However, it should be tested how PCP/pmie itself behaves in the situations where an alert should be raised, e.g., when a disk is full.

Later on it could be considered whether adding optional remediation scripts would be helpful or possible.

Thanks.

richm · 2020-11-10T17:39:43Z

@myllynen could the metrics role (which uses PCP for its implementation) be used for this? https://github.com/linux-system-roles/metrics

myllynen · 2020-11-10T17:46:01Z

As mentioned, I think PCP/pmie would be one (perhaps a very promising) candidate for implementing this, since you could then reuse parts of what the metrics role currently uses, and alerting and metrics would share the same building blocks at runtime.

However, I'm not sure whether it would make sense to expand the scope of the current metrics role. I was under the impression that it lets the user investigate and study what has happened and what is currently happening on the system.

It could be that the alerting role provides a notification to the user, and the functionality set up by the metrics role is then used to investigate and diagnose the situation further.

Thanks.

richm · 2020-11-12T01:59:44Z

Is there already some sort of product provided with Fedora/EL that does this alerting? The purpose of the linux-system-roles projects is to provide Ansible roles/modules to manage components provided with the operating system. We don't really have a mandate for creating solutions. I suppose we could provide some sort of example playbook in the metrics role that shows how to configure something like this.

myllynen · 2020-11-12T08:51:53Z

PCP/pmie, which I mentioned, is a tool that can do this sort of alerting. It doesn't cover all the cases I listed above out of the box, but it could handle the CPU/memory/disk/network related ones. It supports alerting over syslog natively, and doing an HTTP POST would probably take a one- or few-line shell script (which PCP/pmie can call based on configuration). Alerting over SNMP (or even SMS) is not supported, but PCP/pmie could be extended later (or some additional scripts created) if there's a real need for that. I wouldn't consider SNMP/SMS support a blocker at this point.

So I think configuring PCP/pmie would fit within the scope of linux-system-roles without the need to create new components or solutions. I'm not sure whether it would make sense to configure this as part of metrics, since this could cover non-metric aspects (like security), and on occasion the user might want to receive alerts without otherwise setting up metrics. Of course, if the metrics configuration were flexible enough to take these into account, that might be an option.

Thanks.

richm · 2020-11-12T14:58:16Z

Can someone work up a playbook which sets up something like this using the metrics role? @natoscott or @andreasgerstmayr is this something that one of you could do?

natoscott · 2020-11-14T07:18:20Z

@richm @myllynen the default metrics role invocation sets up the PCP pmie utility with some default performance rules, and default event handling (syslog) when those rules evaluate to true.

So the playbook is the minimal case, like:

- hosts: all
  roles:
    - linux-system-roles.metrics

For the parts of the alerting Marko is interested in that can be driven by performance metrics, I agree pmie is a good option. I think we'd want to extend the metrics role a little to provide more customisable alerting options. Currently logging to syslog is the only option, but the PCP pmieconf tool allows us to configure this to do other things as well (or instead), so we'd need to expose that configurability at the metrics role interface.

myllynen · 2020-11-16T09:09:40Z

In principle I don't have anything against this being part of the metrics role, and it might indeed be a good fit from PCP's point of view. However, in theory it could be seen as breaking abstraction at the linux-system-roles level, in the sense that if alternative alerting or metrics implementations are supported later, those might not be as tightly coupled as the PCP components are today. But I'll of course leave this for you to decide. Since we should not aim to replace well-established tools here but to provide basic alerting for common cases with easy setup, the role usage should not get too complicated either. Thanks.

richm · 2020-11-16T17:35:22Z

@myllynen so in OpenShift they have this operator paradigm: the operator publishes metrics, which are scraped by Prometheus, and an alerting engine tells the operator to change the node or application configuration based on the metrics/monitoring information. For example, the application has hit a CPU/memory threshold for 5 minutes over a 1 hour period, so provision a new node and allocate a new pod to it. Do we have any sort of integration like that between pmie and Ansible Tower, such that we could fire off Ansible playbooks based on feedback from PCP on a node? @mprovenc
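As a rough illustration of the "threshold for 5 minutes over a 1 hour period" condition, the following Python sketch keeps a sliding window of samples and reports when enough of them exceeded the threshold. It assumes one sample per minute and reads the condition as "at least 5 of the last 60 samples were over the threshold", which is only one possible interpretation. The `WindowedThreshold` class is hypothetical and is not part of PCP, Prometheus, or OpenShift.

```python
from collections import deque


class WindowedThreshold:
    """Track recent samples and flag repeated threshold breaches.

    Assumes the caller adds exactly one sample per minute, so a window
    of 60 samples corresponds to a 1 hour period.
    """

    def __init__(self, threshold: float, min_breaches: int = 5,
                 window: int = 60) -> None:
        self.threshold = threshold
        self.min_breaches = min_breaches
        # Only the most recent `window` breach flags are retained.
        self.breaches = deque(maxlen=window)

    def add(self, value: float) -> bool:
        """Record one sample; return True if the alert condition holds."""
        self.breaches.append(value > self.threshold)
        return sum(self.breaches) >= self.min_breaches
```

Feeding CPU utilisation samples into `WindowedThreshold(threshold=90.0).add(...)` once a minute would return `True` from the fifth breaching sample onwards, until the breaches age out of the one-hour window.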

myllynen · 2020-11-16T20:06:16Z

> [...] cpu/memory threshold for 5 minutes over a 1 hour period, so provision a new node and allocate a new pod to the new node. Do we have any sort of integration like that with pmie and Ansible Tower such that we could fire off Ansible playbooks based on feedback from pcp from a node? @mprovenc

Yes, definitely. Monitoring for conditions like this and then executing a certain action is the main functionality of pmie. The action in this case would be calling a shell script that sends an HTTP POST to Ansible Tower to run a playbook against the host.
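A minimal sketch of such an action, written in Python rather than shell, might look like the following. It assumes the Tower/AWX REST API endpoint `/api/v2/job_templates/<id>/launch/`, a bearer token for authentication, and a job template that prompts for `limit` on launch; the URL, template ID, token, and function names are all placeholders or hypothetical, and how pmie would pass the host name to the script is left open.

```python
import json
import urllib.request


def build_launch_request(tower_url: str, template_id: int,
                         token: str, host: str) -> urllib.request.Request:
    """Build (but do not send) a POST that launches a job template.

    The `limit` field restricts the run to the affected host; this only
    takes effect if the template allows `limit` to be set at launch.
    """
    url = f"{tower_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/"
    body = json.dumps({"limit": host}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def launch(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending keeps the part that talks to the network small, so the payload can be inspected or logged before anything is launched.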

richm · 2020-11-16T20:19:25Z

> The action in this case would be calling a shell script that does HTTP POST towards Ansible Tower to run a playbook against the host

@myllynen Do you know if anyone is doing something like this now? If so, I would like to collect some of those use cases.

myllynen · 2020-11-17T11:27:25Z

> Do you know if anyone is doing something like this now? If so, I would like to collect some of those use cases.

I'm not aware of such uses yet. I think there's been a bit of a chicken-and-egg issue with pmie: it is not much used because it is not widely known, and it is not widely known because it is not much used. But perhaps we could come up with the most fundamental cases for alerting by asking for feedback, e.g., from Red Hat Support. It doesn't have to be a complete list in any way (I don't think you can even have one), but the initial version needs something to get the basics in place and to make sure the role strikes an appropriate balance between flexibility and ease of use. Also, at some point we probably need to draw the line on what's out of scope for the role and should be configured directly with pmie.

natoscott · 2020-11-18T03:44:58Z

> [...] but for the initial version something to have basics in place and make sure the role has an appropriate balance of flexibility and straightforwardness to use it.

I expect the simplest approach will be to leave the individual rules to the underlying PCP role, as is the case now, and not expose details of individual rules at the metrics role level. This is consistent with the way we handle recording of metrics with pmlogger: the PCP role has to come up with an ideal set of metrics for any given setup, and at the higher level (metrics role) we just enable logging with high-level parameters (like sampling and retention intervals).

> Also, at some point we probably need to draw the line what's out of scope for the role and should be configured directly with pmie.

In terms of rules and inference, probably the functionality to expose at the metrics role level is the choice of alerting mechanism.

This was referenced Dec 18, 2020:

RFE: New PMIE action methods performancecopilot/pcp#1180 (Closed)
RFE: Additional PMIE rules performancecopilot/pcp#1181 (Open)
