prometheus.remote_write: mark component unhealthy if sending samples fails #823

Draft
wants to merge 1 commit into
base: main

captncraig
Contributor

It accomplishes this by observing the log entries from the remote storage writer. If we see "non-recoverable error" messages, we assume there is a problem (usually a bad token or some kind of networking or configuration issue). It definitely means no samples are getting through.
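As a rough illustration of the log-observation idea, here is a minimal sketch of a logger wrapper that forwards every entry and fires a callback when the message contains the marker text. It is not the code in this PR: the `Logger` interface is defined locally to mirror the key-value `Log(keyvals ...interface{}) error` shape used by go-kit style loggers, and `interceptLogger`, `countFailures`, the `"msg"` key, and the sample entries are all hypothetical names and data chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Logger is a local stand-in for a key-value logger interface
// (same shape as go-kit's log.Logger). Assumption for this sketch.
type Logger interface {
	Log(keyvals ...interface{}) error
}

// LoggerFunc adapts a plain function to the Logger interface.
type LoggerFunc func(keyvals ...interface{}) error

func (f LoggerFunc) Log(keyvals ...interface{}) error { return f(keyvals...) }

// interceptLogger (hypothetical) forwards every entry to next and calls
// onFailure whenever the value stored under the "msg" key contains marker.
type interceptLogger struct {
	next      Logger
	marker    string
	onFailure func()
}

func (l *interceptLogger) Log(keyvals ...interface{}) error {
	// Entries are alternating key/value pairs; look for the "msg" key.
	for i := 0; i+1 < len(keyvals); i += 2 {
		key, ok := keyvals[i].(string)
		if !ok || key != "msg" {
			continue
		}
		if msg, ok := keyvals[i+1].(string); ok && strings.Contains(msg, l.marker) {
			l.onFailure()
		}
	}
	return l.next.Log(keyvals...)
}

// countFailures (hypothetical helper) runs entries through the wrapper
// and reports how many of them matched the marker.
func countFailures(entries [][]interface{}) int {
	failures := 0
	discard := LoggerFunc(func(...interface{}) error { return nil })
	l := &interceptLogger{
		next:      discard,
		marker:    "non-recoverable error",
		onFailure: func() { failures++ },
	}
	for _, e := range entries {
		_ = l.Log(e...)
	}
	return failures
}

func main() {
	// Made-up sample entries, not real output from the remote storage writer.
	n := countFailures([][]interface{}{
		{"level", "error", "msg", "non-recoverable error", "err", "example failure"},
		{"level", "info", "msg", "some unrelated message"},
	})
	fmt.Println("failures seen:", n)
}
```

In a real component, the `onFailure` callback would be where the health status gets flipped to unhealthy.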

Detecting a recovery is a bit harder. There is no clear log message from the Prometheus code (even at debug level) to indicate that sending has resumed. If we also hooked into the sample append hooks, we could possibly find a combination of metrics that indicates recovery, but for now I am just assuming that if we don't see an error log for 2 minutes (fairly arbitrary, may need tuning), the component has recovered. Even a flapping health status is better than the constant false positives we have now.
