Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a degraded state and a pluggable combiner for global state computation #130

Open
rmannibucau opened this issue Jul 9, 2018 · 16 comments
Labels

Comments

@rmannibucau
Copy link

The goal is to be able to be able to flag as "DEGRADED" a state, which means a system is not "UP" but not "DOWN". Interest is to be able to prioritize failure handling and identify systems which are responding but not at the best.

The combiner would be used if implemented (using cdi resolution rule) in the endpoint to compute the global state, it can likely be a BinaryOperator or BiFunction<State, State, State>.

@derekm
Copy link
Contributor

derekm commented Oct 19, 2018

+1 on @rmannibucau's recommendation for a 3rd state.

The draft Health Check Response Format RFC that wants to align with MP Health 2.0 calls for a 3rd "warn" state. (The RFC uses "pass" and "fail", but allows "up" and "down" aliases for Java compatibility. So far "warn" has no aliases.)

Whether underlying states roll up to "warn" instead of "down" depends on whether the check failures are in liveness checks, readiness checks or non-liveness/non-readiness health checks (when failures are only in the last category of checks, the overall state is "warn" or "degraded", but still live and ready).

See PR discussions here: inadarei/rfc-healthcheck#25
and here: #142

@antoinesd antoinesd added this to the 2.2 milestone Sep 3, 2019
@aguibert
Copy link

aguibert commented Dec 4, 2019

+1 on this proposal, but similar to @derekm I would also like to see the status name being warn as opposed to the originally suggested DEGRADED because it aligns with the draft RFC

@rmannibucau
Copy link
Author

No strong take on the naming, degraded is what I used by the past but if there is an actual final rfc we should just align on it IMHO

@derekm
Copy link
Contributor

derekm commented Dec 4, 2019

The RFC is still in draft stages, but this community should definitely be involved in helping to specify that RFC.

The RFC calls for Map<String, List<Check>> checks whereas MP Health calls for List<Check> checks. In my projects, I put MP Health into a mode where it serves out Map<String, Check> checks.

The rationale for Map<String, List<Check>> format to the checks field is, for example, a CPU stats healthcheck with results per-core or an HA dependency with results per-instance. In MP Health, I guess we'd stuff that stuff into a Check's Map<String, Object> data.

Now is the time to figure out where the RFC is over-specified and where MP Health is under-specified, such that the two ultimately align. The rename from outcome/state to status/status after the Health 1.0 release coincidentally put MP Health more in-line with the RFC, which was a nice happenstance.

@rmannibucau
Copy link
Author

Sirona and most competitors use a list (https://github.com/apache/sirona/blob/192ba25c8951185367a5c54ffb34d4702a4e15ac/api/src/main/java/org/apache/sirona/status/NodeStatusReporter.java#L28) just for semantic reasons I guess. Using a map make it no more balanced and no more sortable (which is bad for some languages interoperability).

@derekm
Copy link
Contributor

derekm commented Dec 4, 2019

I use Hammock, which has: https://github.com/hammock-project/hammock/blob/master/util-health/src/main/java/ws/ament/hammock/health/HealthCheckManager.java#L79 via MP Config setting hammock.health.output.format=map.

We like being able to deference via checks["check-i-am-looking-for"] instead of searching an array by inspecting each object in it. The map forces check names to be unique, which a list does not do (this might actually be another rationale for the RFC's Map<String, List<Check>> format, such that checks with the same name extend the list instead of overwrite each other).

I don't understand what balanced or sortable mean in this context -- but really I only meant to encourage others to get seriously involved in rfc-healthcheck discussions. A healthcheck RFC will benefit everyone, and the people here getting involved in the RFC will benefit the RFC.

@rmannibucau
Copy link
Author

A list has an order, a map lost it - not in java but think to a js consumer for example.

Also your point about name uniqueness in a map kind of encourages a list to not loose checks IMO. Discriminator can end up in "data" part and user can user whatever "selector" he needs - as in css where class is not an id but class + attribute can be.

Agree mp should join rfc but guess it should be done as "one". Any procedure there? Maybe eclipse can help?

@derekm
Copy link
Contributor

derekm commented Dec 5, 2019

Maybe it depends on the implementation, but I don't think list-order is reliable across application reboots... Internally, we have a few interfaces that are LinkedHashMap where we want to preserve key ordering, but those are special cases, and with key-based access in MP Health, I don't really see a need to preserve key-order. FWIW, checks["check-i-am-looking-for"] seems a lot more reliable than checks[3] (which is hopefully still the check I am looking for).

Nobody relies on list order, because it is unreliable. Everybody searches the list, which is unnecessary & even irresponsible: https://github.com/OpenLiberty/guide-microprofile-health/blob/master/start/src/test/java/it/io/openliberty/guides/health/HealthITUtil.java#L57

@derekm
Copy link
Contributor

derekm commented Dec 5, 2019

Agree mp should join rfc but guess it should be done as "one". Any procedure there? Maybe eclipse can help?

I think the IETF is more like Apache & less like Eclipse, where it is more about individual contributors and less about organizations. Certainly, Eclipse can hire someone and assign them to participate if it sees fit. But anyone interested in the future of MP Health is free to get involved in evolving @inadarei's draft: https://github.com/inadarei/rfc-healthcheck

@rmannibucau
Copy link
Author

@derekm we have @priority so no real reason ordering can't be used and even if not deterministic between reboot, it is once started vs a map does not give any portability guarantee on that

@derekm
Copy link
Contributor

derekm commented Dec 5, 2019

@derekm we have @priority so no real reason ordering can't be used and even if not deterministic between reboot, it is once started vs a map does not give any portability guarantee on that

What do you mean by portability? Everyone gives key-based access to maps/JSON objects...

You would want the RFC to revert checks to a list?

@rmannibucau
Copy link
Author

@derekm list are portable accross languages and container impl, maps are sadly not so list is the structure loosing the less data.

@cescoffier
Copy link
Contributor

We have been discussed a long time ago about a third type of health check (that we named wellness at that time) that would allow such kind of flexibility.

The goal of this endpoint is to provide more details information about the current state (so degraded is perfectly fine) and allow to start raising an alarm which is not a full-stop yet.

@xstefank
Copy link
Member

xstefank commented Dec 6, 2019

@cescoffier wellness endpoint is still on the roadmap not sure when we will get it in though. However, I would expect that health checks will be reused so we can't introduce degraded state only for one type of check.

@derekm
Copy link
Contributor

derekm commented Dec 6, 2019

The draft RFC says this about WARN/DEGRADED: https://inadarei.github.io/rfc-healthcheck/#status

status: (required) indicates whether the service status is acceptable or not. API publishers SHOULD use following values for the field:

  • “pass”: healthy (acceptable aliases: “ok” to support Node’s Terminus and “up” for Java’s SpringBoot),
  • “fail”: unhealthy (acceptable aliases: “error” to support Node’s Terminus and “down” for Java’s SpringBoot), and
  • “warn”: healthy, with some concerns.

The value of the status field is case-insensitive and is tightly related with the HTTP response code returned by the health endpoint. ... In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response.

WARN doesn't have any "acceptable aliases", maybe it should! Otherwise, I think the semantics are the same as intended here.

@antoinesd antoinesd modified the milestones: 2.2, 3.0 Jan 7, 2020
@antoinesd
Copy link
Contributor

Should be addressed when designing wellness endpoint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants