Caching for health checks #157

jvcode1 · 2019-02-27T18:55:34Z

Some checks may be be expensive and it is not necessary to execute them each time the health endpoint is called. Also in case of system problems some checks may execute slower which could lead into a vicious cycle. Monitoring clients or users are bombarding the system in case of a failure to see if the problem is still there.

Therefore we introduced a cache, so that monitoring checks are not executed again when they are called within a specific timeframe. In my opinion microprofile health could deliver a huge benefit if such a behaviour would be configurable per check.

For instance by defining a expiration time for each check:

@Health(expirationTime=30)
@ApplicationScoped
public class SomeCheck implements HealthCheck {

The text was updated successfully, but these errors were encountered:

kenfinnigan · 2019-02-27T19:07:57Z

Did you consider that within the check itself, as opposed to the spec defining it?

jvcode1 · 2019-02-27T20:42:59Z

I'm not sure if I understood your question. Are you implying to implement this behavior in the check itself? Isn't that the point to have a framework to be released from implementing commonly used things?

kenfinnigan · 2019-02-27T20:45:14Z

That's what I meant, yes.

My thought here is that I don't know how beneficial such a cache would be, as a check could report as good during the timeout window when in fact it isn't. And if the timeout is tuned small enough to prevent that, then why worry about a cache at all?

jvcode1 · 2019-02-27T21:18:19Z

If a check reports good during the timeout window although it isn't then you get the result only delayed if the problem still exists (assuming you have a monitoring which calls the endpoint regularly). If the problem only occurred during the timeout and not after that, I would consider this as a temporary fluctuation which is not relevant.

My point is that there may be some expensive checks which lasts several seconds. What if you have multiple requests at the same time? Is it really necessary to execute the same check at the same time in multiple threads? Or some checks run amok? I've seen such situation multiple times. The monitoring worsened the problem by exhausting thread pools and cpu resources. Of course there are a lot of other possible countermeasures, but from my experience caching is a simple and easy way which works in most of the cases.

kenfinnigan · 2019-02-27T21:26:01Z

I'm curious as to the use case where you would potentially be executing multiple health check requests at, or near, the same time

jvcode1 · 2019-02-27T21:40:06Z

For instance because there are several monitoring systems for different use cases (and no we can't get rid of them). Also the functionality is integrated in our client and can be accessed by administrative users. And also several persons, mostly from the application maintenance department, know how to call the endpoint.

DariusX · 2020-10-15T04:40:56Z

We have a similar situation.

Some health checks too expensive for liveness
But, we don't want them in readiness, because we want to terminate the Pod, not just stop traffic

Our solution was to

include them as readiness
save UP/DOWN to global map: name --> state
have liveness check that global state

It would be nice if the framework could support:

running some healthchecks only if their earlier result has "expired"
running them asynchronously (probably an optional implementation issue)
returning the latest non-expired response

antoinesd added this to the 2.2 milestone Sep 3, 2019

antoinesd modified the milestones: 2.2, 3.0 Jan 7, 2020

xstefank modified the milestones: 3.0, 3.1 May 19, 2020

xstefank removed this from the 3.1 milestone May 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Caching for health checks #157

Caching for health checks #157

jvcode1 commented Feb 27, 2019

kenfinnigan commented Feb 27, 2019

jvcode1 commented Feb 27, 2019

kenfinnigan commented Feb 27, 2019

jvcode1 commented Feb 27, 2019

kenfinnigan commented Feb 27, 2019

jvcode1 commented Feb 27, 2019

DariusX commented Oct 15, 2020 •

edited

Caching for health checks #157

Caching for health checks #157

Comments

jvcode1 commented Feb 27, 2019

kenfinnigan commented Feb 27, 2019

jvcode1 commented Feb 27, 2019

kenfinnigan commented Feb 27, 2019

jvcode1 commented Feb 27, 2019

kenfinnigan commented Feb 27, 2019

jvcode1 commented Feb 27, 2019

DariusX commented Oct 15, 2020 • edited

DariusX commented Oct 15, 2020 •

edited