Only add node IPs to kubelet endpoint if they have a known Ready condition #6549
Description
We have found that the kubelet controller does not perform any node condition checks before adding node addresses to the kubelet endpoint. Normally this is not an issue: the endpoint target just ends up being marked as "down" and scraping fails because the node is not ready.
A much larger symptom occurs when an IP address from a down/NotReady node is reused in the cluster (we have seen this exact scenario in GKE). For instance, node1 can be NotReady with IP address 1.2.3.4, and the underlying provisioner creates a new node, node2, reusing IP address 1.2.3.4. In this scenario, because the kubelet controller doesn't check the node status, it will add two endpoint addresses with the same IP address. An example of the resulting subsets might look like this:
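The exact subsets from our cluster are not reproduced here; the following is a purely hypothetical sketch in terms of the core/v1 Go types (node names and the IP address are invented, and only the relevant fields are shown):

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// Hypothetical illustration only: two endpoint addresses end up sharing the
// same IP while referencing different nodes, because the NotReady node was
// never filtered out before its address was added.
var duplicateSubsets = []v1.EndpointSubset{{
	Addresses: []v1.EndpointAddress{
		{IP: "1.2.3.4", TargetRef: &v1.ObjectReference{Kind: "Node", Name: "node1"}},
		{IP: "1.2.3.4", TargetRef: &v1.ObjectReference{Kind: "Node", Name: "node2"}},
	},
}}
```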
In this case, node1 is down (NotReady) and node2 is provisioned with the same IP address. Currently, prometheus-operator adds both node1 and node2 to the kubelet endpoint, as shown above. Both of these subsets are then scraped, and both scrapes succeed because node2 answers on 1.2.3.4 in both cases. The resulting time series carry labels based on each subset's metadata, so if metric1 is exposed by node2 it is scraped twice, and the only difference between the two series is the node label.
This gives us data that is both wrong and duplicated.
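To make the intent of the change concrete, below is a minimal sketch of the kind of check this PR is about. It is not the actual operator code; the function names are illustrative, and it only demonstrates the idea of skipping nodes whose Ready condition is missing or Unknown before their addresses are added to the kubelet endpoint:

```go
package example

import (
	v1 "k8s.io/api/core/v1"
)

// nodeReadyConditionKnown reports whether the node exposes a Ready condition
// whose status is known (True or False, rather than Unknown or absent).
func nodeReadyConditionKnown(node v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status != v1.ConditionUnknown
		}
	}
	// No Ready condition reported at all: treat the node state as unknown.
	return false
}

// filterNodes keeps only the nodes whose addresses should be added to the
// kubelet endpoint, dropping those with an unknown Ready condition.
func filterNodes(nodes []v1.Node) []v1.Node {
	kept := make([]v1.Node, 0, len(nodes))
	for _, n := range nodes {
		if nodeReadyConditionKnown(n) {
			kept = append(kept, n)
		}
	}
	return kept
}
```

In the duplicate-IP scenario above, the stale node's Ready condition would typically flip to Unknown once its kubelet stops reporting, so its address would be dropped while the new node's address is kept.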
Type of change
What type of changes does your code introduce to the Prometheus operator? Put an `x` in the box that applies.

- `CHANGE` (fix or feature that would cause existing functionality to not work as expected)
- `FEATURE` (non-breaking change which adds functionality)
- `BUGFIX` (non-breaking change which fixes an issue)
- `ENHANCEMENT` (non-breaking change which improves existing functionality)
- `NONE` (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Verification
To verify this change, I made a slight modification to the kind-with-registry.sh script so that it creates 3 local kind nodes.

kind-with-registry.sh (modified)

The verification then does the following:

- Build and push the operator image to the local registry:
  `TAG=latest IMAGE_OPERATOR=localhost:5001/prometheus-operator/prometheus-operator make image && docker push localhost:5001/prometheus-operator/prometheus-operator:latest`
- Run the `kind-with-registry.sh` script to create a three node cluster.
- Stop the kubelet on the `kind-worker2` node:
  `docker exec kind-worker2 /bin/bash -c "systemctl stop kubelet"`
- Wait for the `kind-worker2` node to stop being ready.
- Port-forward the Prometheus service:
  `kubectl port-forward svc/kps-kube-prometheus-stack-prometheus 9090`
- Check the health of the kubelet scrape pool targets:
  `curl -s 'localhost:9090/api/v1/targets?scrapePool=serviceMonitor%2Fdefault%2Fkps-kube-prometheus-stack-kubelet%2F0' | jq '.data.activeTargets[] | "\(.scrapeUrl): \(.health)"' -r`
Before the fix, we can see the node is down:
And we can see that this endpoint IP is still being scraped, and that the target is down.
After the fix, we can still see the node is down:
But now we can see that the NotReady node is no longer added to the kubelet endpoint and is not being scraped.
Changelog entry
Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.