Add retrieval of VRRP instance status information via CLI option #2001

Open
ChrLau opened this issue Sep 21, 2021 · 13 comments
ChrLau commented Sep 21, 2021

Hi,

Currently the status of a keepalived VRRP instance cannot be extracted in an easy, general way.
The state changes are logged to the logfile (for example /var/log/messages), but as these logs get rotated, the state cannot reliably be extracted from there either.
Another option is to create local files containing the state via the keepalived notify parameter in the vrrp_sync_group.

It would be handy to have, for example, /proc/net/vrrp_INSTANCENAME_status, which outputs only the state (MASTER, BACKUP or FAULT).

Where does it help?
I have my keepalived loadbalancers submit their state into a key-value store (etcd). This etcd is queried by our Rundeck to get a list of all loadbalancers in the selected state (MASTER or BACKUP), so that actions are only performed on loadbalancers in the desired state. (We only have 1 VRRP instance per loadbalancer.)
For example: only do package updates and a reboot on loadbalancers that are in the BACKUP state.
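
A minimal sketch of how such a notify-to-etcd hook could look (the script path, etcd endpoint and key layout are illustrative placeholders; keepalived passes type, name, state and priority as arguments to notify scripts):

#!/bin/sh
# Illustrative notify script, e.g. configured as
#   notify /usr/local/bin/vrrp-state-to-etcd.sh
# in the vrrp_sync_group or vrrp_instance block.
# keepalived invokes it as: <script> TYPE NAME STATE PRIORITY
TYPE="$1"      # "GROUP" or "INSTANCE"
NAME="$2"      # name of the group or instance
STATE="$3"     # MASTER, BACKUP or FAULT

# Publish the state so that external tooling (e.g. Rundeck) can query it.
# Endpoint and key path are placeholders.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd.example.com:2379 \
    put "/loadbalancers/$(hostname -s)/vrrp_state" "$STATE"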

Here it would be handy to have an entry which is maintained by keepalived itself, as the daemon is the authoritative source for this kind of information. Setups via the notify parameters can fail due to human error.

Alternatives:
Of course some kind of CLI option like keepalived --status $VRRP_INSTANCENAME which outputs the state of the VRRP instance would also be useful.

@ChrLau ChrLau changed the title Add IPVS status information entry to /proc/net/ip_vs* Add VRRP instance status information entries under /proc Sep 21, 2021
pqarmitage (Collaborator) commented
@ChrLau Probably the best way to query the status currently is to use SNMP. For example, executing:

export MIBS="+KEEPALIVED-MIB"
snmpwalk -v2c -c public localhost KEEPALIVED-MIB::vrrpInstanceState.1

will produce output:

KEEPALIVED-MIB::vrrpInstanceState.1 = INTEGER: master(2)

(to see all available OIDs execute snmpwalk -v2c -c public localhost KEEPALIVED-MIB::keepalived)

or you can use the RFC based MIB:

MIBS="+VRRPV3-MIB" snmpwalk -v2c -c public localhost VRRPV3-MIB::vrrpv3MIB

shows all available output.
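
For use in a script, a single value can also be fetched without the surrounding OID text, for example (a sketch assuming the net-snmp tools and the same community string as above; output formatting may vary between net-snmp versions):

# print only the value of the state OID, e.g. "master"
MIBS="+KEEPALIVED-MIB" snmpget -v2c -c public -Oqv localhost KEEPALIVED-MIB::vrrpInstanceState.1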

The man page for proc states: "The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures." So I don't think it is appropriate to use /proc (or /sys) for a userspace program.

A little over 4 years ago I did start experimenting with implementing FUSE (filesystem in userspace) to do the sort of thing you are requesting, though not mounted under /proc. I seem to remember that there was some difficulty with passing control to the FUSE library, but I think I can see a way around that now.

@ChrLau Would using SNMP work for you, at least for now? Implementing a filesystem based solution could take a little while.

ChrLau (Author) commented Sep 22, 2021

Hi @pqarmitage Oh dammit, yes, you are right about /proc being for the kernel. When I opened the ticket I was under the false impression that this information is provided by IPVS, which of course is a kernel module, so /proc would be an option. But it is of course provided by VRRP, and that's not a kernel module... Sorry for the confusion.

So let's stick to the alternative solution: the command line option. ;-)
I don't necessarily need a filesystem-based solution. The core requirement is an easy way to get the state of a VRRP instance from keepalived (the authoritative source) itself.

Regarding SNMP: it is an option I could implement, yes.
The only "problem" I have is that anyone needs to be able to execute "keepalived --status $VRRPINSTANCE" via SSH or the like, no matter whether they are junior helpdesk support staff or a seasoned IT architect.
But many people don't really understand SNMP: what an OID is, or why they need a MIB file to make sense of all these "dotted numbers" (as a former colleague put it). Hence I try to limit the number of technologies used, if possible.
The advantage of SNMP is of course that it will always return the state, as I have found enough discussion on the internet suggesting that the notify parameter isn't always executed in some circumstances (I still have to test and verify this).

In general I think that keepalived would really benefit from integrating such a CLI option, as I see this as a constantly recurring topic for people who are new to loadbalancing/keepalived and try to get familiar with it or want to implement monitoring solutions.
I edited the title of this issue (again) to make it more precise.

@ChrLau ChrLau changed the title Add VRRP instance status information entries under /proc Add retrieval of VRRP instance status information via CLI option Sep 22, 2021
mister2d commented
A healthcheck/liveness/readiness probe would be ideal for cloud and on-prem environments. This would aid service discovery and monitoring.

pqarmitage (Collaborator) commented
@ChrLau You state "As I found enough discussion on the internet that the notify parameter isn't always executed in some circumstances (still have to test&verify this)". So far as I am aware the notify scripts are always executed; I am not aware of any outstanding issue relating to this. There is, however, a problem with notify scripts: if two state transitions occur in very quick succession, both scripts are run by keepalived, but due to kernel scheduling there is no guarantee of the order in which they will be executed. For example, if a VRRP instance becomes master and then almost immediately reverts to backup, a notify_master script and a notify_backup script will both be run, but the notify_backup script might execute before the notify_master script. It was for this reason that I implemented the notify_fifo feature, where the order of delivery of messages to the FIFO is guaranteed to be correct.
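
For anyone who wants to try the FIFO route, a minimal sketch (the paths and script name are illustrative, and the exact line format written to the FIFO may differ between keepalived versions):

# keepalived.conf (sketch)
global_defs {
    notify_fifo /run/keepalived/notify.fifo
    notify_fifo_script /usr/local/bin/read-notify-fifo.sh
}

#!/bin/sh
# /usr/local/bin/read-notify-fifo.sh (sketch)
# keepalived starts this script with the FIFO path as its argument and writes
# one line per event, roughly of the form: TYPE "NAME" STATE PRIORITY
FIFO="$1"
while read -r type name state priority; do
    logger -t vrrp-fifo "$type $name -> $state (priority $priority)"
done < "$FIFO"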

pqarmitage (Collaborator) commented
@mister2d I think SNMP provides the functionality you are looking for.

mister2d commented Sep 27, 2021

@mister2d I think SNMP provides the functionality you are looking for.

Not exactly. I was hoping for something less legacy (OIDs) and simpler (an API endpoint).

Something like an HTTP /health endpoint listening on 127.0.0.1 would do.

ChrLau (Author) commented Oct 13, 2021

@ChrLau You state As I found enough discussion on the internet that the notify parameter isn't always executed in some circumstances (still have to test&verify this). So far as I am aware the notify scripts are always executed; I am not aware of any outstanding issue relating to this.

@pqarmitage I finally had the time to test my planned setup and noticed that the notify_stop parameter isn't working with the keepalived versions from Debian Stretch (version 1.3.2) and Debian Buster (version 2.0.10). Nothing is logged upon stopping keepalived, neither via the notify parameter nor via notify_stop.
But as we plan to run this on Debian Bullseye (which ships keepalived version 2.1.5), where notify_stop works as expected, this is not that much of an issue.
Most likely this was the "issue" I vaguely remembered regarding the notify parameters.

I left a comment with a usable workaround in #185 regarding this, as I quickly found that issue (among some StackOverflow threads ;-) ) and maybe the information helps someone else.

wydrych commented Oct 14, 2021

I do support the need for a modern API to get the full runtime status of keepalived (via HTTP/REST, CLI, etc.). Notify scripts provide information about state changes only (no other stats), and if a notification is lost, it is lost forever. SNMP requires spawning snmpd just for this purpose (which is overkill and may not be allowed on some systems due to security requirements). And signalling with SIGJSON (or USR1/USR2) is restricted to root (plus the file keepalived writes is readable by root only) and requires a monitoring script to wait an arbitrary number of seconds, hoping that the state file has been written fully.
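
For reference, the signal-based workflow described above looks roughly like this (a sketch; it assumes keepalived was built with JSON support, and the pid file path and output location may differ between distributions and versions):

# must be run as root
SIGJSON=$(keepalived --signum=JSON)            # translate the signal name to a number
kill -"$SIGJSON" "$(cat /run/keepalived.pid)"  # ask keepalived to dump its state
sleep 2                                        # no way to know when the dump is complete
cat /tmp/keepalived.json                       # written with root-only permissions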

It would be much more convenient to just curl keepalived over a TCP or Unix socket to get all the state and stats (even the form of the current JSON or data/stats dumps would suffice), or to call a CLI command that gets the current status and prints it to stdout.

pqarmitage (Collaborator) commented
Does a request to https://router1.keepalived.org:4433/vrrp/instance/VI_2/state returning:

{
  "instance": "VI_2",
  "state": "Master"
}

look to be the right sort of thing?

I have also experimented with https://router1.keepalived.org:4433/vrrp/instances/name returning:

[
  {
    "instance": "VI_1"
  },
  {
    "instance": "VI_2"
  },
  {
    "instance": "VI_6"
  }
]

and https://router1.keepalived.org:4433/vrrp/instances/state:

[
  {
    "instance": "VI_1",
    "state": "Master"
  },
  {
    "instance": "VI_2",
    "state": "Master"
  },
  {
    "instance": "VI_6",
    "state": "Master"
  }
]

and https://router1.keepalived.org:4433/vrrp/instances:

[
  {
    "instance": "VI_1",
    "state": "Master",
    "interface": "vrrp.253@eth0,
    "priority": 200,
    "number of config faults": 0,
    "last state transition": "2021-12-08T11:46:08.803281Z",
    "source ip address": "10.1.0.3"
  },
  {
    "instance": "VI_2",
    "state": "Master",
    "interface": "vrrp.252@eth0,
    "priority": 200,
    "number of config faults": 0,
    "last state transition": "2021-12-08T11:46:08.802212Z",
    "source ip address": "10.1.0.3"
  },
  {
    "instance": "VI_6",
    "state": "Master",
    "interface": "vrrp6.253@eth0,
    "priority": 200,
    "number of config faults": 0,
    "last state transition": "2021-12-08T11:46:08.798882Z",
    "source ip address": "fe80::4005:5dff:fe72:f1a1"
  }
]

and https://router1.keepalived.org:4433/checker/virtual_servers:

[
  {
    "name": "10.0.0.1:TCP:80",
    "real_servers": [
      {
        "real_server": "192.168.0.1:80",
        "alive": true,
        "active": true,
        "weight": 1
      },
      {
        "real_server": "192.168.0.2:80",
        "alive": true,
        "active": true,
        "weight": 1
      }
    ],
    "scheduler": "rr"
  },
  {
    "name": "2001:224:69dd:135::210:TCP:80",
    "real_servers": [
      {
        "real_server": "[2001:224:69dd:235::210]:80",
        "alive": true,
        "active": true,
        "weight": 1
      },
      {
        "real_server": "[2001:224:69dd:235::211]:80",
        "alive": true,
        "active": true,
        "weight": 1
      }
    ],
    "scheduler": "rr"
  }
]

These are all examples, and can easily be added to, both in terms of what fields are contained in the output and additional URLs.

The port to connect to will be configurable, as will the server name. It will only use https, and valid certificates for the server name will be required (these can be obtained from letsencrypt.org, for example). All requests will need to be authenticated using basic authentication to ensure that data cannot be inappropriately leaked. It currently only implements HTTP/1.1, but I may add HTTP/2 support.
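
Consuming this from a script would then presumably look something like the following (hostname, port and credentials are placeholders based on the examples above):

curl --basic -u monitor:secret \
    https://router1.keepalived.org:4433/vrrp/instance/VI_2/state

which would return the JSON shown above, e.g. {"instance": "VI_2", "state": "Master"}.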

So far I have only implemented GET requests, but I plan to also implement POST, PUT and DELETE as appropriate. I also haven't yet implemented versioning. Please note this will not be released until authentication is implemented.

I would welcome any and all feedback, since it is probably easier to make any modifications before the functionality is released, rather than make subsequent changes, although as I indicated above adding additional fields and URLs should be quite straightforward.

elfranne commented
A healthy status would be useful; with your example https://router1.keepalived.org:4433/vrrp/instances:

[
  {
    "healthy": true,
    "reason": ""
  },
  [
    {
      "instance": "VI_1",
      "state": "Master",
      "interface": "vrrp.253@eth0",
      "priority": 200,
      "number of config faults": 0,
      "last state transition": "2021-12-08T11:46:08.803281Z",
      "source ip address": "10.1.0.3"
    },
    {
      "instance": "VI_2",
      "state": "Master",
      "interface": "vrrp.252@eth0",
      "priority": 200,
      "number of config faults": 0,
      "last state transition": "2021-12-08T11:46:08.802212Z",
      "source ip address": "10.1.0.3"
    },
    {
      "instance": "VI_6",
      "state": "Master",
      "interface": "vrrp6.253@eth0",
      "priority": 200,
      "number of config faults": 0,
      "last state transition": "2021-12-08T11:46:08.798882Z",
      "source ip address": "fe80::4005:5dff:fe72:f1a1"
    }
  ]
]

But it adds logic that might not fit all deployments...

pqarmitage (Collaborator) commented
@elfranne Many thanks for your response. Could you please explain what circumstances you would expect to lead to "healthy": false?

elfranne commented
@pqarmitage, if the backup instance is offline, for example. And for the real_servers, if a backend does not respond anymore, healthy would be set to false (this could also be based on a percentage or a maximum number of offline backends).
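
As a purely illustrative sketch of that idea (the healthy/reason fields are invented for the example, not part of the proposed output above), a virtual server entry with a failed backend might look like:

{
  "name": "10.0.0.1:TCP:80",
  "healthy": false,
  "reason": "1 of 2 real servers down",
  "real_servers": [
    { "real_server": "192.168.0.1:80", "alive": true },
    { "real_server": "192.168.0.2:80", "alive": false }
  ]
}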

ChrLau (Author) commented Jul 19, 2022

@pqarmitage Wow, I'm impressed by what came out of my ticket. And yes, an API would be absolutely stunning to have, as it makes it much easier to integrate keepalived (and consequently ipvsadm too).

Regarding the "healthy": true/false topic: I remember Apache Solr using a more layered approach. See https://solr.apache.org/guide/8_11/cluster-node-management.html

You could use the same approach, which would allow for a more flexible configuration of health status levels. I'm on your side when you say "But it adds logic that might not fit all deployments...", as the VRRP part can be used very differently.

So "healthy": green could mean: all instances are up, all configured nodes are up and there is 1 master for each instance.
"healthy": yellow could mean: all instances are up and there is 1 master for each instance, but one (or more) instances have no backup node, etc.
