tstringer-fn changed the title from "kubelet endpoint contains IP addresses of NotReady nodes" to "kubelet endpoint contains IP addresses of nodes with Ready condition Unknown" on Apr 25, 2024.
Is there an existing issue for this?
What happened?
Description
We have found that the kubelet controller does not check node conditions before adding node addresses to the kubelet endpoint. Normally this is not an issue: the endpoint target is marked as "down" and scraping fails because the node is not ready.
A much larger symptom occurs when an IP address is reused from a down/NotReady node in the cluster (we have seen this exact scenario in GKE). For instance, node1 can be NotReady with IP address 1.2.3.4 while the underlying provisioner creates a new node, node2, that reuses IP address 1.2.3.4. In this scenario, because the kubelet controller doesn't check node status, it adds two endpoint addresses with the same IP address. An example of subsets might look like this:
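(A hypothetical reconstruction; the object name, namespace, and port are illustrative, and the node names follow the example above.)

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: kubelet
  namespace: kube-system
subsets:
- addresses:
  - ip: 1.2.3.4
    targetRef:
      kind: Node
      name: node1   # NotReady, but still listed
  - ip: 1.2.3.4
    targetRef:
      kind: Node
      name: node2   # Ready, reusing the same IP address
  ports:
  - name: https-metrics
    port: 10250
    protocol: TCP
```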
In this case, node1 is down (NotReady) and node2 is provisioned with the same IP address. Currently, prometheus-operator adds both node1 and node2 to the kubelet endpoint, as you can see above. Both subsets are therefore scraped, and both scrapes succeed because node2 responds at 1.2.3.4 in each case. The resulting time series carry labels from each subset's metadata, so if metric1 is scraped from node2, it is scraped twice and the only difference between the time series is the node label.
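To illustrate, the duplicate scrape would produce series along these lines (hypothetical metric name and sample value; the label set assumes the usual endpoint relabeling):

```
metric1{node="node1",instance="1.2.3.4:10250"} 42
metric1{node="node2",instance="1.2.3.4:10250"} 42
```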
This is an instance where we end up with wrong, duplicated data.
Steps to Reproduce
Unfortunately, it is difficult to reproduce the issue. You'd have to set a node as NotReady (which isn't impossible) and then have a node provisioner create another node with the same IP address as the NotReady node.

Expected Result
Prometheus operator should not add a NotReady node's IP address to the kubelet endpoints.
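For comparison, a sketch of the subsets we would expect once NotReady nodes are filtered out (same hypothetical names and port as above):

```yaml
subsets:
- addresses:
  - ip: 1.2.3.4
    targetRef:
      kind: Node
      name: node2   # only the Ready node remains
  ports:
  - name: https-metrics
    port: 10250
    protocol: TCP
```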
Actual Result

Prometheus operator adds a NotReady node's IP address to the kubelet endpoint.

Prometheus Operator Version
Kubernetes Version
Kubernetes Cluster Type
GKE
How did you deploy Prometheus-Operator?
helm chart: prometheus-community/kube-prometheus-stack
Manifests
No response
prometheus-operator log output
Anything else?
No response