Health-check-API.asciidoc

File metadata and controls

177 lines (150 loc) · 8.05 KB

Introduction

There are multiple scenarios in which monitoring cluster health is crucial. A common example is using Infinispan together with a monitoring and log management platform (Kibana, Splunk, Zabbix, etc.). Another interesting example is the Cloud: a health check is often required to plug an instance into a load balancer (in Kubernetes this is called a Readiness probe). Finally, the health check API is used for the Kubernetes Rolling update procedure, which is necessary for implementing Infinispan Rolling Upgrades (see ISPN-6673).

Use cases

The health check API should address the following use cases:

  • Monitoring tools that use the JMX, CLI or REST interface

  • The Rolling upgrade procedure with Kubernetes (Readiness and liveness probes)

Exposed properties

The health check API should expose:

  • number of machines in the cluster

  • cluster name

  • overall cluster status: UNHEALTHY, HEALTHY or REBALANCING (the cluster is operating but a rebalance is in progress, so the set of nodes should not be changed)

  • per cache status

  • log tail (last X lines)
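
To illustrate how the overall status could relate to the per cache status, here is a minimal sketch. The aggregation rule (the worst status wins) and the names HealthStatus and overall_status are assumptions made for this example; they are not taken from the Infinispan implementation.

```python
from enum import Enum


class HealthStatus(Enum):
    """The three statuses named in the list above."""
    HEALTHY = "HEALTHY"
    REBALANCING = "REBALANCING"
    UNHEALTHY = "UNHEALTHY"


def overall_status(cache_statuses):
    """Fold a {cache name: HealthStatus} mapping into one cluster status.

    Assumed rule (hypothetical, not from the design): the worst status wins.
    An empty mapping is reported as HEALTHY.
    """
    statuses = set(cache_statuses.values())
    if HealthStatus.UNHEALTHY in statuses:
        return HealthStatus.UNHEALTHY
    if HealthStatus.REBALANCING in statuses:
        return HealthStatus.REBALANCING
    return HealthStatus.HEALTHY
```

With this rule a single rebalancing cache is enough to report the whole cluster as REBALANCING, which matches the "don't change the nodes" intent of that status.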

The implementation

The Health API is implemented as Java classes and exposed through different interfaces (including MBeans, WildFly resources, etc.). This approach keeps the API itself and the layers that expose it loosely coupled.

Embedded mode

The Health API can be easily accessed in embedded mode by using embeddedCacheManager.getHealth(). The implementation is heavily inspired by Cache Manager statistics since the functionality is very similar.

Exposing the Health API from an embedded application might be a bit problematic (especially over REST). One option is to use dedicated tools for this, such as Jolokia or Spring Boot Metrics.

WildFly integration

The Health API is plugged into the WildFly integration bits (/server/integration/infinispan) as Metrics (runtime-only resources which are read-only). Again, the main inspiration was the previously implemented Metrics and Statistics.

WildFly exposes all management resources via its HTTP interface. Unfortunately, since the Health API is exposed at runtime, it needs to be invoked using the HTTP POST method (GET allows retrieving only standard attributes and resources).

Example query
curl --digest -L -D - "http://localhost:9990/management/subsystem/datagrid-infinispan/cache-container/clustered/health/HEALTH?operation=resource&include-runtime=true&json.pretty=1" --header "Content-Type: application/json" -u ispnadmin:ispnadmin
HTTP/1.1 401 Unauthorized
Connection: keep-alive
WWW-Authenticate: Digest realm="ManagementRealm",domain="/management",nonce="AuZzFxz7uC4NMTQ3MDgyNTU1NTQ3OCfIJBHXVpPHPBdzGUy7Qts=",opaque="00000000000000000000000000000000",algorithm=MD5,qop="auth"
Content-Length: 77
Content-Type: text/html
Date: Wed, 10 Aug 2016 10:39:15 GMT

HTTP/1.1 200 OK
Connection: keep-alive
Authentication-Info: nextnonce="AuZzFxz7uC4NMTQ3MDgyNTU1NTQ3OCfIJBHXVpPHPBdzGUy7Qts=",qop="auth",rspauth="b518c3170e627bd732055c382ce5d970",cnonce="NGViOWM0NDY5OGJmNjY0MjcyOWE4NDkyZDU3YzNhYjY=",nc=00000001
Content-Type: application/json; charset=utf-8
Content-Length: 1927
Date: Wed, 10 Aug 2016 10:39:15 GMT

{
    "cache-health" : "HEALTHY",
    "cluster-health" : ["test", "HEALTHY"],
    "cluster-name" : "clustered",
    "free-memory" : 96778,
    "log-tail" : [
        "2016-08-10 11:54:14,706 INFO  [org.infinispan.server.endpoint] (MSC service thread 1-5) DGENDPT10001: HotRodServer listening on 127.0.0.1:11222",
        "2016-08-10 11:54:14,706 INFO  [org.infinispan.server.endpoint] (MSC service thread 1-1) DGENDPT10001: MemcachedServer listening on 127.0.0.1:11211",
        "2016-08-10 11:54:14,785 INFO  [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___protobuf_metadata cache from clustered container",
        "2016-08-10 11:54:14,800 INFO  [org.jboss.as.clustering.infinispan] (MSC service thread 1-6) DGISPN0001: Started ___script_cache cache from clustered container",
        "2016-08-10 11:54:15,159 INFO  [org.jboss.as.clustering.infinispan] (MSC service thread 1-5) DGISPN0001: Started ___hotRodTopologyCache cache from clustered container",
        "2016-08-10 11:54:15,210 INFO  [org.infinispan.rest.NettyRestServer] (MSC service thread 1-6) ISPN012003: REST server starting, listening on 127.0.0.1:8080",
        "2016-08-10 11:54:15,210 INFO  [org.infinispan.server.endpoint] (MSC service thread 1-6) DGENDPT10002: REST mapped to /rest",
        "2016-08-10 11:54:15,306 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management",
        "2016-08-10 11:54:15,307 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990",
        "2016-08-10 11:54:15,307 INFO  [org.jboss.as] (Controller Boot Thread) WFLYSRV0025: Infinispan Server 9.0.0-SNAPSHOT (WildFly Core 2.2.0.CR9) started in 8681ms - Started 196 of 237 services (121 services are lazy, passive or on-demand)"
    ],
    "number-of-cpus" : 8,
    "number-of-nodes" : 1,
    "total-memory" : 235520
}

Response codes:

  • 401 Unauthorized if the user credentials are incorrect

  • 200 OK if everything is fine
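
As a rough sketch of how a readiness check might consume this response, the snippet below fetches the resource with HTTP Digest authentication and decides whether the node is ready. The URL and credentials are copied from the example query. The function names, the KNOWN_STATUSES set and the rule that anything other than HEALTHY (including REBALANCING) means "not ready" are assumptions made for illustration only.

```python
import json
import urllib.request

# Copied from the example query above; adjust for a real server.
HEALTH_URL = (
    "http://localhost:9990/management/subsystem/datagrid-infinispan/"
    "cache-container/clustered/health/HEALTH"
    "?operation=resource&include-runtime=true"
)

# Status names from the "Exposed properties" section.
KNOWN_STATUSES = {"HEALTHY", "UNHEALTHY", "REBALANCING"}


def fetch_health(url=HEALTH_URL, user="ispnadmin", password="ispnadmin"):
    """Fetch the health resource; the 401 digest challenge is answered automatically."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPDigestAuthHandler(password_mgr)
    )
    with opener.open(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def reported_statuses(health):
    """Collect status values from "cluster-health" and "cache-health".

    The example response shows one field as a plain string and the other as a
    [name, status] list, so both shapes are accepted and names are skipped.
    """
    found = []
    for key in ("cluster-health", "cache-health"):
        value = health.get(key) or []
        items = [value] if isinstance(value, str) else value
        found.extend(item for item in items if item in KNOWN_STATUSES)
    return found


def is_ready(health):
    """Assumed rule: ready only if at least one status is reported and all are HEALTHY."""
    statuses = reported_statuses(health)
    return bool(statuses) and all(status == "HEALTHY" for status in statuses)
```

A probe script could call is_ready(fetch_health()) and exit with a non-zero code when the result is False.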

API Query support

The WildFly HTTP Management API allows querying a whole subtree of resources as well as single attributes.

How others implemented Health API

Elasticsearch
$ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "testcluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}
Spring Actuator metrics
$ curl -XGET 'http://localhost:8080/metrics'
{
    "counter.status.200.root": 20,
    "counter.status.200.metrics": 3,
    "counter.status.200.star-star": 5,
    "counter.status.401.root": 4,
    "gauge.response.star-star": 6,
    "gauge.response.root": 2,
    "gauge.response.metrics": 3,
    "classes": 5808,
    "classes.loaded": 5808,
    "classes.unloaded": 0,
    "heap": 3728384,
    "heap.committed": 986624,
    "heap.init": 262144,
    "heap.used": 52765,
    "nonheap": 0,
    "nonheap.committed": 77568,
    "nonheap.init": 2496,
    "nonheap.used": 75826,
    "mem": 986624,
    "mem.free": 933858,
    "processors": 8,
    "threads": 15,
    "threads.daemon": 11,
    "threads.peak": 15,
    "threads.totalStarted": 42,
    "uptime": 494836,
    "instance.uptime": 489782,
    "datasource.primary.active": 5,
    "datasource.primary.usage": 0.25
}
ZooKeeper
$ echo mntr | nc localhost 2185

              zk_version  3.4.0
              zk_avg_latency  0
              zk_max_latency  0
              zk_min_latency  0
              zk_packets_received 70
              zk_packets_sent 69
              zk_outstanding_requests 0
              zk_server_state leader
              zk_znode_count   4
              zk_watch_count  0
              zk_ephemerals_count 0
              zk_approximate_data_size    27
              zk_followers    4                   - only exposed by the Leader
              zk_synced_followers 4               - only exposed by the Leader
              zk_pending_syncs    0               - only exposed by the Leader
              zk_open_file_descriptor_count 23    - only available on Unix platforms
              zk_max_file_descriptor_count 1024   - only available on Unix platforms