Certain combinations of check periods and check intervals prevent service status updates. #9984

Open
yoshi314 opened this issue Jan 31, 2024 · 3 comments · May be fixed by #10070
Labels
area/checks Check execution and results

Comments

@yoshi314

yoshi314 commented Jan 31, 2024

Describe the bug

To Reproduce

I had a service that checked SSL certificate validity. I mistakenly configured it with

check_interval = 24h
check_period = "workhours"

which meant it would only execute between 07:00 and 21:00 on business days.

This check was set up to go critical if the certificate was 7 days away from expiration.

For a while this worked fine, but then I noticed that my check was stuck in a warning state, saying I had 23 days left until Feb 2, which (as of yesterday) was obviously incorrect: the certificate was expiring in 2 days. So I had a stale service, even though it seemed to be checked on schedule every 24 hours.

Upon inspection, the check had scheduled itself to run at ~4am and had apparently stopped updating the service state for a few weeks, since that time was outside of its time period. A manual reschedule of the check fixed the issue temporarily.

It appeared as if the check was running on schedule, just not updating service state.
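To illustrate the drift described above, here is a minimal toy model in Python. It is not Icinga 2's scheduler: it simply assumes that the next slot is always the previous slot plus `check_interval`, and that "workhours" means Monday to Friday, 07:00-21:00. The function names and the start date are hypothetical.

```python
from datetime import datetime, timedelta


def in_workhours(t: datetime) -> bool:
    """Toy stand-in for the "workhours" period: Mon-Fri, 07:00-21:00 (assumed)."""
    return t.weekday() < 5 and 7 <= t.hour < 21


def naive_schedule(start: datetime, interval: timedelta, runs: int):
    """Advance by a fixed interval and record whether each slot is inside the period."""
    t = start
    slots = []
    for _ in range(runs):
        slots.append((t, in_workhours(t)))
        t += interval
    return slots


# Hypothetical starting point: a slot that has landed at 04:00 on a Monday.
slots = naive_schedule(datetime(2024, 1, 8, 4, 0), timedelta(hours=24), 21)
executed = [t for t, inside in slots if inside]
print(len(executed))  # 0: a 24h step from 04:00 never reaches the 07:00-21:00 window
```

Under these assumptions, every slot stays at 04:00, so three weeks of "on schedule" slots produce no state update at all, which matches the stale state reported here.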

I had to reconfigure the check so that the time period is assigned to notifications instead of to the check itself.

Expected behavior

I assumed that icinga2 would try to schedule the check within the time period window, especially if it was running late, or that it would update the service state at the beginning of the check_period.

I've seen this on 2.14.1 and 2.14.2 on Debian 10 (packages from icinga repository).

Maybe some kind of warning at config reload would suffice if check_interval may cause the service to "fall out" of its check_period window.

@xeiss

xeiss commented Apr 9, 2024

I have also seen this behavior in my environment (2.14.2 on Debian 12). Going by my config history, this problem has existed for me since 2019. As a workaround, I have not used check_interval = 24h together with check_period = "workhours" since then.
Over time, the check ends up planned outside the check_period time period, and then it never goes back to work until it is manually rescheduled.

I also have a production example: an HTTPS cert check on a printer that goes into deep sleep at night, so the check fails. I have to manually reschedule it to daytime, and then it works again for months until the check time slides back into the night. With a working check_period = "workhours" I could mitigate this problem completely.

Possible solution: the scheduler should keep the check_period in mind and only plan checks for times inside the next time period, even if that means the 24h check is planned for 28h later, or 70h later if it spans a weekend.
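The proposal could be sketched roughly as follows. This is only an illustration of the idea, not Icinga 2 code: `next_check` is a hypothetical helper, and the window (Monday to Friday, 07:00-21:00) is an assumption taken from the original report.

```python
from datetime import datetime, timedelta

WINDOW_START, WINDOW_END = 7, 21  # assumed "workhours": Mon-Fri, 07:00-21:00


def in_period(t: datetime) -> bool:
    return t.weekday() < 5 and WINDOW_START <= t.hour < WINDOW_END


def next_check(last_run: datetime, interval: timedelta) -> datetime:
    """Return last_run + interval, pushed forward to the next window opening if needed."""
    candidate = last_run + interval
    if in_period(candidate):
        return candidate
    # Opening time of the window on the candidate's own day.
    opening = candidate.replace(hour=WINDOW_START, minute=0, second=0, microsecond=0)
    if opening < candidate:
        # Today's opening has already passed, so look at tomorrow.
        opening += timedelta(days=1)
    while opening.weekday() >= 5:
        # Skip Saturday and Sunday.
        opening += timedelta(days=1)
    return opening


day = timedelta(hours=24)
tuesday_3am = datetime(2024, 1, 9, 3, 0)
friday_9am = datetime(2024, 1, 12, 9, 0)
print(next_check(tuesday_3am, day) - tuesday_3am)  # 1 day, 4:00:00 (28h)
print(next_check(friday_9am, day) - friday_9am)    # 2 days, 22:00:00 (70h)
```

The two sample inputs reproduce the 28h and 70h figures mentioned above: a slot that would fall at 03:00 is moved to 07:00, and one that would fall on Saturday is moved to Monday 07:00.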

@Al2Klimov Al2Klimov added the area/checks Check execution and results label May 14, 2024
@Al2Klimov
Member

The scheduler should have the check_period in mind and should plan only checks when they are inside the next timeperiod

@julianbrost This would also solve our little overdue problem.

@Al2Klimov
Member

Manual reschedule of the check fixed the issue temporarily.

This is indeed a legit workaround. You could also make the check interval just 1h shorter than your check window: 13h, I guess.
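A quick way to sanity-check the 13h suggestion is a small simulation. It assumes a simplified model in which slots advance by a fixed interval, and a 14h window (Monday to Friday, 07:00-21:00); the helper names are hypothetical and this is not how Icinga 2 is implemented.

```python
from datetime import datetime, timedelta


def in_workhours(t: datetime) -> bool:
    """Assumed window: Mon-Fri, 07:00-21:00 (14 hours long)."""
    return t.weekday() < 5 and 7 <= t.hour < 21


def days_with_a_run(start: datetime, interval: timedelta, days: int) -> set:
    """Dates on which at least one fixed-interval slot falls inside the window."""
    end = start + timedelta(days=days)
    hit = set()
    t = start
    while t < end:
        if in_workhours(t):
            hit.add(t.date())
        t += interval
    return hit


start = datetime(2024, 1, 8, 4, 0)  # Monday 04:00, outside the window
business_days = {
    (start + timedelta(days=d)).date()
    for d in range(28)
    if (start + timedelta(days=d)).weekday() < 5
}

# A 13h step cannot jump over a 14h window, so every business day gets a run.
print(days_with_a_run(start, timedelta(hours=13), 28) == business_days)  # True
# A 24h step from 04:00 never lands inside the window.
print(len(days_with_a_run(start, timedelta(hours=24), 28)))  # 0
```

The trade-off in this model is that the check runs roughly twice a day instead of once, with some slots still falling outside the window.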
