Introduce a degraded state and a pluggable combiner for global state computation #130

rmannibucau · 2018-07-09T08:44:04Z

The goal is to be able to be able to flag as "DEGRADED" a state, which means a system is not "UP" but not "DOWN". Interest is to be able to prioritize failure handling and identify systems which are responding but not at the best.

The combiner would be used if implemented (using cdi resolution rule) in the endpoint to compute the global state, it can likely be a BinaryOperator or BiFunction<State, State, State>.

derekm · 2018-10-19T17:33:39Z

+1 on @rmannibucau's recommendation for a 3rd state.

The draft Health Check Response Format RFC that wants to align with MP Health 2.0 calls for a 3rd "warn" state. (The RFC uses "pass" and "fail", but allows "up" and "down" aliases for Java compatibility. So far "warn" has no aliases.)

Whether underlying states roll up to "warn" instead of "down" depends on whether the check failures are in liveness checks, readiness checks or non-liveness/non-readiness health checks (when failures are only in the last category of checks, the overall state is "warn" or "degraded", but still live and ready).

See PR discussions here: inadarei/rfc-healthcheck#25
and here: #142

aguibert · 2019-12-04T18:06:19Z

+1 on this proposal, but similar to @derekm I would also like to see the status name being warn as opposed to the originally suggested DEGRADED because it aligns with the draft RFC

rmannibucau · 2019-12-04T18:32:29Z

No strong take on the naming, degraded is what I used by the past but if there is an actual final rfc we should just align on it IMHO

derekm · 2019-12-04T21:51:54Z

The RFC is still in draft stages, but this community should definitely be involved in helping to specify that RFC.

The RFC calls for Map<String, List<Check>> checks whereas MP Health calls for List<Check> checks. In my projects, I put MP Health into a mode where it serves out Map<String, Check> checks.

The rationale for Map<String, List<Check>> format to the checks field is, for example, a CPU stats healthcheck with results per-core or an HA dependency with results per-instance. In MP Health, I guess we'd stuff that stuff into a Check's Map<String, Object> data.

Now is the time to figure out where the RFC is over-specified and where MP Health is under-specified, such that the two ultimately align. The rename from outcome/state to status/status after the Health 1.0 release coincidentally put MP Health more in-line with the RFC, which was a nice happenstance.

rmannibucau · 2019-12-04T21:58:23Z

Sirona and most competitors use a list (https://github.com/apache/sirona/blob/192ba25c8951185367a5c54ffb34d4702a4e15ac/api/src/main/java/org/apache/sirona/status/NodeStatusReporter.java#L28) just for semantic reasons I guess. Using a map make it no more balanced and no more sortable (which is bad for some languages interoperability).

derekm · 2019-12-04T22:16:26Z

I use Hammock, which has: https://github.com/hammock-project/hammock/blob/master/util-health/src/main/java/ws/ament/hammock/health/HealthCheckManager.java#L79 via MP Config setting hammock.health.output.format=map.

We like being able to deference via checks["check-i-am-looking-for"] instead of searching an array by inspecting each object in it. The map forces check names to be unique, which a list does not do (this might actually be another rationale for the RFC's Map<String, List<Check>> format, such that checks with the same name extend the list instead of overwrite each other).

I don't understand what balanced or sortable mean in this context -- but really I only meant to encourage others to get seriously involved in rfc-healthcheck discussions. A healthcheck RFC will benefit everyone, and the people here getting involved in the RFC will benefit the RFC.

rmannibucau · 2019-12-05T05:38:50Z

A list has an order, a map lost it - not in java but think to a js consumer for example.

Also your point about name uniqueness in a map kind of encourages a list to not loose checks IMO. Discriminator can end up in "data" part and user can user whatever "selector" he needs - as in css where class is not an id but class + attribute can be.

Agree mp should join rfc but guess it should be done as "one". Any procedure there? Maybe eclipse can help?

derekm · 2019-12-05T15:55:32Z

Maybe it depends on the implementation, but I don't think list-order is reliable across application reboots... Internally, we have a few interfaces that are LinkedHashMap where we want to preserve key ordering, but those are special cases, and with key-based access in MP Health, I don't really see a need to preserve key-order. FWIW, checks["check-i-am-looking-for"] seems a lot more reliable than checks[3] (which is hopefully still the check I am looking for).

Nobody relies on list order, because it is unreliable. Everybody searches the list, which is unnecessary & even irresponsible: https://github.com/OpenLiberty/guide-microprofile-health/blob/master/start/src/test/java/it/io/openliberty/guides/health/HealthITUtil.java#L57

derekm · 2019-12-05T16:12:16Z

Agree mp should join rfc but guess it should be done as "one". Any procedure there? Maybe eclipse can help?

I think the IETF is more like Apache & less like Eclipse, where it is more about individual contributors and less about organizations. Certainly, Eclipse can hire someone and assign them to participate if it sees fit. But anyone interested in the future of MP Health is free to get involved in evolving @inadarei's draft: https://github.com/inadarei/rfc-healthcheck

rmannibucau · 2019-12-05T16:30:02Z

@derekm we have @priority so no real reason ordering can't be used and even if not deterministic between reboot, it is once started vs a map does not give any portability guarantee on that

derekm · 2019-12-05T17:08:47Z

@derekm we have @priority so no real reason ordering can't be used and even if not deterministic between reboot, it is once started vs a map does not give any portability guarantee on that

What do you mean by portability? Everyone gives key-based access to maps/JSON objects...

You would want the RFC to revert checks to a list?

rmannibucau · 2019-12-05T17:24:41Z

@derekm list are portable accross languages and container impl, maps are sadly not so list is the structure loosing the less data.

cescoffier · 2019-12-06T09:03:07Z

We have been discussed a long time ago about a third type of health check (that we named wellness at that time) that would allow such kind of flexibility.

The goal of this endpoint is to provide more details information about the current state (so degraded is perfectly fine) and allow to start raising an alarm which is not a full-stop yet.

xstefank · 2019-12-06T09:22:51Z

@cescoffier wellness endpoint is still on the roadmap not sure when we will get it in though. However, I would expect that health checks will be reused so we can't introduce degraded state only for one type of check.

derekm · 2019-12-06T15:30:06Z

The draft RFC says this about WARN/DEGRADED: https://inadarei.github.io/rfc-healthcheck/#status

status: (required) indicates whether the service status is acceptable or not. API publishers SHOULD use following values for the field:

“pass”: healthy (acceptable aliases: “ok” to support Node’s Terminus and “up” for Java’s SpringBoot),

“fail”: unhealthy (acceptable aliases: “error” to support Node’s Terminus and “down” for Java’s SpringBoot), and

“warn”: healthy, with some concerns.

The value of the status field is case-insensitive and is tightly related with the HTTP response code returned by the health endpoint. ... In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response.

WARN doesn't have any "acceptable aliases", maybe it should! Otherwise, I think the semantics are the same as intended here.

antoinesd · 2020-01-07T15:30:15Z

Should be addressed when designing wellness endpoint

antoinesd added this to the 2.2 milestone Sep 3, 2019

antoinesd modified the milestones: 2.2, 3.0 Jan 7, 2020

xstefank added the wellness label Jan 7, 2020

antoinesd mentioned this issue May 19, 2020

Discussion about wellness endpoint #253

Open

pvdbosch mentioned this issue May 27, 2020

monitoring guidelines belgif/rest-guide#13

Closed

antoinesd modified the milestones: 3.0, 3.1 Jun 10, 2020

xstefank removed this from the 3.1 milestone May 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce a degraded state and a pluggable combiner for global state computation #130

Introduce a degraded state and a pluggable combiner for global state computation #130

rmannibucau commented Jul 9, 2018

derekm commented Oct 19, 2018 •

edited

aguibert commented Dec 4, 2019

rmannibucau commented Dec 4, 2019

derekm commented Dec 4, 2019

rmannibucau commented Dec 4, 2019

derekm commented Dec 4, 2019

rmannibucau commented Dec 5, 2019

derekm commented Dec 5, 2019

derekm commented Dec 5, 2019

rmannibucau commented Dec 5, 2019

derekm commented Dec 5, 2019

rmannibucau commented Dec 5, 2019

cescoffier commented Dec 6, 2019

xstefank commented Dec 6, 2019

derekm commented Dec 6, 2019

antoinesd commented Jan 7, 2020

Introduce a degraded state and a pluggable combiner for global state computation #130

Introduce a degraded state and a pluggable combiner for global state computation #130

Comments

rmannibucau commented Jul 9, 2018

derekm commented Oct 19, 2018 • edited

aguibert commented Dec 4, 2019

rmannibucau commented Dec 4, 2019

derekm commented Dec 4, 2019

rmannibucau commented Dec 4, 2019

derekm commented Dec 4, 2019

rmannibucau commented Dec 5, 2019

derekm commented Dec 5, 2019

derekm commented Dec 5, 2019

rmannibucau commented Dec 5, 2019

derekm commented Dec 5, 2019

rmannibucau commented Dec 5, 2019

cescoffier commented Dec 6, 2019

xstefank commented Dec 6, 2019

derekm commented Dec 6, 2019

antoinesd commented Jan 7, 2020

derekm commented Oct 19, 2018 •

edited