Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching for health checks #157

Open
jvcode1 opened this issue Feb 27, 2019 · 7 comments
Open

Caching for health checks #157

jvcode1 opened this issue Feb 27, 2019 · 7 comments

Comments

@jvcode1
Copy link

jvcode1 commented Feb 27, 2019

Some checks may be be expensive and it is not necessary to execute them each time the health endpoint is called. Also in case of system problems some checks may execute slower which could lead into a vicious cycle. Monitoring clients or users are bombarding the system in case of a failure to see if the problem is still there.

Therefore we introduced a cache, so that monitoring checks are not executed again when they are called within a specific timeframe. In my opinion microprofile health could deliver a huge benefit if such a behaviour would be configurable per check.

For instance by defining a expiration time for each check:

@Health(expirationTime=30)
@ApplicationScoped
public class SomeCheck implements HealthCheck {
@kenfinnigan
Copy link

Did you consider that within the check itself, as opposed to the spec defining it?

@jvcode1
Copy link
Author

jvcode1 commented Feb 27, 2019

I'm not sure if I understood your question. Are you implying to implement this behavior in the check itself? Isn't that the point to have a framework to be released from implementing commonly used things?

@kenfinnigan
Copy link

That's what I meant, yes.

My thought here is that I don't know how beneficial such a cache would be, as a check could report as good during the timeout window when in fact it isn't. And if the timeout is tuned small enough to prevent that, then why worry about a cache at all?

@jvcode1
Copy link
Author

jvcode1 commented Feb 27, 2019

If a check reports good during the timeout window although it isn't then you get the result only delayed if the problem still exists (assuming you have a monitoring which calls the endpoint regularly). If the problem only occurred during the timeout and not after that, I would consider this as a temporary fluctuation which is not relevant.

My point is that there may be some expensive checks which lasts several seconds. What if you have multiple requests at the same time? Is it really necessary to execute the same check at the same time in multiple threads? Or some checks run amok? I've seen such situation multiple times. The monitoring worsened the problem by exhausting thread pools and cpu resources. Of course there are a lot of other possible countermeasures, but from my experience caching is a simple and easy way which works in most of the cases.

@kenfinnigan
Copy link

I'm curious as to the use case where you would potentially be executing multiple health check requests at, or near, the same time

@jvcode1
Copy link
Author

jvcode1 commented Feb 27, 2019

For instance because there are several monitoring systems for different use cases (and no we can't get rid of them). Also the functionality is integrated in our client and can be accessed by administrative users. And also several persons, mostly from the application maintenance department, know how to call the endpoint.

@antoinesd antoinesd added this to the 2.2 milestone Sep 3, 2019
@antoinesd antoinesd modified the milestones: 2.2, 3.0 Jan 7, 2020
@xstefank xstefank modified the milestones: 3.0, 3.1 May 19, 2020
@DariusX
Copy link

DariusX commented Oct 15, 2020

We have a similar situation.

  • Some health checks too expensive for liveness
  • But, we don't want them in readiness, because we want to terminate the Pod, not just stop traffic

Our solution was to

  • include them as readiness
  • save UP/DOWN to global map: name --> state
  • have liveness check that global state

It would be nice if the framework could support:

  • running some healthchecks only if their earlier result has "expired"
  • running them asynchronously (probably an optional implementation issue)
  • returning the latest non-expired response

@xstefank xstefank removed this from the 3.1 milestone May 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants