
Only add node IPs to kubelet endpoint if they have a known Ready condition #6549

Open · wants to merge 4 commits into main

Conversation

@tstringer-fn (Author)

Description

We have found that the kubelet controller does not check node conditions before adding node addresses to the kubelet endpoint. Normally this is not a problem: the endpoint target is simply marked as "down" because scraping fails while the node is not ready.

A much larger problem occurs when the IP address of a down/NotReady node is reused by a new node in the cluster (we have seen this exact scenario in GKE). For instance, node1 can be NotReady with IP address 1.2.3.4 while the underlying provisioner creates a new node, node2, that reuses IP address 1.2.3.4.

In this scenario, because the kubelet controller doesn't check node status, it adds two endpoint addresses with the same IP address. The resulting subsets might look like this:

  - ip: 1.2.3.4
    targetRef:
      kind: Node
      name: node1
      uid: uid1
  - ip: 1.2.3.4
    targetRef:
      kind: Node
      name: node2
      uid: uid2

In this case, node1 is down (NotReady) and node2 is provisioned with the same IP address. Currently, prometheus-operator adds both node1 and node2 to the kubelet endpoint, as shown above. Both subsets are then scraped, and both scrapes succeed because node2 answers every request to 1.2.3.4. The resulting time series carry labels derived from each target's metadata, so a metric exposed by node2 is scraped twice, and the only difference between the two series is the node label.

This gives us duplicate and incorrect data.
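
For illustration, here is a minimal Go sketch of the idea behind this change (helper names are hypothetical, the k8s.io/api/core/v1 types are assumed, and this is not the PR's actual diff): a node's addresses are only kept when its Ready condition is present and True.

package kubelet

import (
	v1 "k8s.io/api/core/v1"
)

// nodeReadyStatus returns the status of the node's Ready condition, or
// ConditionUnknown if the condition is not reported at all.
func nodeReadyStatus(node *v1.Node) v1.ConditionStatus {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status
		}
	}
	return v1.ConditionUnknown
}

// readyNodes keeps only the nodes whose Ready condition is True before their
// addresses are added to the kubelet endpoint.
func readyNodes(nodes []v1.Node) []v1.Node {
	ready := make([]v1.Node, 0, len(nodes))
	for i := range nodes {
		if nodeReadyStatus(&nodes[i]) == v1.ConditionTrue {
			ready = append(ready, nodes[i])
		}
	}
	return ready
}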

Type of change

What type of changes does your code introduce to the Prometheus operator? Put an x in the boxes that apply.

  • CHANGE (fix or feature that would cause existing functionality to not work as expected)
  • FEATURE (non-breaking change which adds functionality)
  • BUGFIX (non-breaking change which fixes an issue)
  • ENHANCEMENT (non-breaking change which improves existing functionality)
  • NONE (if none of the other choices apply. Example, tooling, build system, CI, docs, etc.)

Verification

To verify this change I made a slight modification to the kind-with-registry.sh script so that it creates 3 local kind nodes.

kind-with-registry.sh (modified)
#!/bin/sh
set -o errexit

# 1. Create registry container unless it already exists
reg_name='kind-registry'
reg_port='5001'
if [ "$(docker inspect -f '{{.State.Running}}' "${reg_name}" 2>/dev/null || true)" != 'true' ]; then
  docker run \
    -d --restart=always -p "127.0.0.1:${reg_port}:5000" --network bridge --name "${reg_name}" \
    registry:2
fi

# 2. Create kind cluster with containerd registry config dir enabled
# TODO: kind will eventually enable this by default and this patch will
# be unnecessary.
#
# See:
# https://github.com/kubernetes-sigs/kind/issues/2875
# https://github.com/containerd/containerd/blob/main/docs/cri/config.md#registry-configuration
# See: https://github.com/containerd/containerd/blob/main/docs/hosts.md
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".registry]
    config_path = "/etc/containerd/certs.d"
EOF

# 3. Add the registry config to the nodes
#
# This is necessary because localhost resolves to loopback addresses that are
# network-namespace local.
# In other words: localhost in the container is not localhost on the host.
#
# We want a consistent name that works from both ends, so we tell containerd to
# alias localhost:${reg_port} to the registry container when pulling images
REGISTRY_DIR="/etc/containerd/certs.d/localhost:${reg_port}"
for node in $(kind get nodes); do
  docker exec "${node}" mkdir -p "${REGISTRY_DIR}"
  cat <<EOF | docker exec -i "${node}" cp /dev/stdin "${REGISTRY_DIR}/hosts.toml"
[host."http://${reg_name}:5000"]
EOF
done

# 4. Connect the registry to the cluster network if not already connected
# This allows kind to bootstrap the network but ensures they're on the same network
if [ "$(docker inspect -f='{{json .NetworkSettings.Networks.kind}}' "${reg_name}")" = 'null' ]; then
  docker network connect "kind" "${reg_name}"
fi

# 5. Document the local registry
# https://github.com/kubernetes/enhancements/tree/master/keps/sig-cluster-lifecycle/generic/1755-communicating-a-local-registry
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-registry-hosting
  namespace: kube-public
data:
  localRegistryHosting.v1: |
    host: "localhost:${reg_port}"
    help: "https://kind.sigs.k8s.io/docs/user/local-registry/"
EOF

The verification script then does the following:

  1. Build and push my changes to a local registry: TAG=latest IMAGE_OPERATOR=localhost:5001/prometheus-operator/prometheus-operator make image && docker push localhost:5001/prometheus-operator/prometheus-operator:latest
  2. Run the above kind-with-registry.sh script to create a three node cluster.
  3. Stop kubelet on kind-worker2 node: docker exec kind-worker2 /bin/bash -c "systemctl stop kubelet"
  4. Wait for kind-worker2 node to stop being ready:
while true; do
	echo "Waiting for not Ready status"
	STATUS=$(kubectl get no kind-worker2 -o json | jq '.status.conditions[] | select(.type == "Ready") | .status' -r)
	echo "Node status: $STATUS"
	if [[ "$STATUS" != "True" ]]; then
		echo "✅ NotReady status found"
		kubectl get no kind-worker2
		break
	fi
done
  5. Install the kube-prometheus-stack helm chart but specify the new local image:
helm upgrade \
	--install \
	kps \
	--set prometheusOperator.image.registry=localhost:5001 \
	--set prometheusOperator.image.repository=prometheus-operator \
	--set prometheusOperator.image.tag=latest \
	--set prometheusOperator.image.pullPolicy=Always \
	prometheus-community/kube-prometheus-stack
  6. Port-forward the Prometheus service: kubectl port-forward svc/kps-kube-prometheus-stack-prometheus 9090
  7. Check the Prometheus kubelet targets: curl -s 'localhost:9090/api/v1/targets?scrapePool=serviceMonitor%2Fdefault%2Fkps-kube-prometheus-stack-kubelet%2F0' | jq '.data.activeTargets[] | "\(.scrapeUrl): \(.health)"' -r

Before the fix, we can see the node is down:

kubectl get no -o wide
NAME                 STATUS     ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION                        CONTAINER-RUNTIME
kind-control-plane   Ready      control-plane   2m58s   v1.29.2   192.168.228.4   <none>        Debian GNU/Linux 12 (bookworm)   6.7.11-orbstack-00143-ge6b82e26cd22   containerd://1.7.13
kind-worker          Ready      <none>          2m36s   v1.29.2   192.168.228.5   <none>        Debian GNU/Linux 12 (bookworm)   6.7.11-orbstack-00143-ge6b82e26cd22   containerd://1.7.13
kind-worker2         NotReady   <none>          2m34s   v1.29.2   192.168.228.2   <none>        Debian GNU/Linux 12 (bookworm)   6.7.11-orbstack-00143-ge6b82e26cd22   containerd://1.7.13

And we can see that this node's IP is added to the kubelet endpoint, scraped, and reported as down.

curl -s 'localhost:9090/api/v1/targets?scrapePool=serviceMonitor%2Fdefault%2Fkps-kube-prometheus-stack-kubelet%2F0' | jq '.data.activeTargets[] | "\(.scrapeUrl): \(.health)"' -r
https://192.168.228.4:10250/metrics: up
https://192.168.228.5:10250/metrics: up
https://192.168.228.2:10250/metrics: down

After the fix, we can still see the node is down:

kubectl get no -o wide
NAME                 STATUS     ROLES           AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION                        CONTAINER-RUNTIME
kind-control-plane   Ready      control-plane   7m37s   v1.29.2   192.168.228.4   <none>        Debian GNU/Linux 12 (bookworm)   6.7.11-orbstack-00143-ge6b82e26cd22   containerd://1.7.13
kind-worker          Ready      <none>          7m15s   v1.29.2   192.168.228.2   <none>        Debian GNU/Linux 12 (bookworm)   6.7.11-orbstack-00143-ge6b82e26cd22   containerd://1.7.13
kind-worker2         NotReady   <none>          7m13s   v1.29.2   192.168.228.5   <none>        Debian GNU/Linux 12 (bookworm)   6.7.11-orbstack-00143-ge6b82e26cd22   containerd://1.7.13

But now we can see that the NotReady node is not added to the kubelet endpoint and is therefore not scraped.

curl -s 'localhost:9090/api/v1/targets?scrapePool=serviceMonitor%2Fdefault%2Fkps-kube-prometheus-stack-kubelet%2F0' | jq '.data.activeTargets[] | "\(.scrapeUrl): \(.health)"' -r
https://192.168.228.4:10250/metrics: up
https://192.168.228.2:10250/metrics: up

Changelog entry

Please put a one-line changelog entry below. This will be copied to the changelog file during the release process.

Change prometheus-operator behavior to only add Ready nodes' endpoint addresses to the kubelet endpoint for scraping.

tstringer-fn requested a review from a team as a code owner on April 25, 2024 16:19
@tstringer-fn (Author)

This closes #6548.

@simonpasquier (Contributor)

This would be a CHANGE. To reduce the impact, I'd prefer to skip nodes which have an unknown status and for which another node reports the same IP address.
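
A rough sketch of that suggestion (hypothetical helper names, not the code in this PR): a node is dropped only when its Ready condition is missing or Unknown and another node reports the same internal IP.

package kubelet

import (
	v1 "k8s.io/api/core/v1"
)

// internalIP returns the node's first InternalIP address, or "" if none.
func internalIP(node *v1.Node) string {
	for _, addr := range node.Status.Addresses {
		if addr.Type == v1.NodeInternalIP {
			return addr.Address
		}
	}
	return ""
}

// readyUnknown reports whether the node's Ready condition is missing or Unknown.
func readyUnknown(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionUnknown
		}
	}
	return true
}

// filterDuplicateUnknownNodes drops a node only when its readiness is unknown
// and another node reports the same IP address; everything else is kept.
func filterDuplicateUnknownNodes(nodes []v1.Node) []v1.Node {
	ipCount := map[string]int{}
	for i := range nodes {
		ipCount[internalIP(&nodes[i])]++
	}
	kept := make([]v1.Node, 0, len(nodes))
	for i := range nodes {
		if readyUnknown(&nodes[i]) && ipCount[internalIP(&nodes[i])] > 1 {
			continue
		}
		kept = append(kept, nodes[i])
	}
	return kept
}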

@tstringer-fn (Author)

@simonpasquier, thanks for the feedback! I've made the changes to reflect that. It's worth noting that because this new implementation uses a map, the original tests required sorting for their assertions. It doesn't appear that the order of IP addresses matters outside of the tests, so I changed the tests to no longer fail on ordering. Please let me know if that was a faulty assumption.

Thanks again!
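
For reference, a hypothetical test snippet (illustrative names and values only, not the PR's actual tests) showing how assertions can be made order-insensitive by sorting the endpoint addresses before comparing:

package kubelet

import (
	"sort"
	"testing"

	"github.com/stretchr/testify/require"
	v1 "k8s.io/api/core/v1"
)

// sortAddresses orders endpoint addresses by IP so that assertions do not
// depend on map iteration order.
func sortAddresses(addrs []v1.EndpointAddress) {
	sort.Slice(addrs, func(i, j int) bool { return addrs[i].IP < addrs[j].IP })
}

func TestAddressesIgnoreOrder(t *testing.T) {
	got := []v1.EndpointAddress{{IP: "192.168.228.5"}, {IP: "192.168.228.4"}}
	want := []v1.EndpointAddress{{IP: "192.168.228.4"}, {IP: "192.168.228.5"}}

	sortAddresses(got)
	sortAddresses(want)
	require.Equal(t, want, got)
}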

@simonpasquier (Contributor) left a comment

I'd rather preserve the order of the list to avoid unnecessary updates to the endpoints object.
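
One way to keep the order stable (a sketch only, not the final code) is to filter the existing address slice in place, so an unchanged node set produces an identical endpoints object and no spurious update:

package kubelet

import (
	v1 "k8s.io/api/core/v1"
)

// filterAddressesInPlace removes the addresses that do not satisfy keep while
// preserving the original slice order, avoiding unnecessary endpoints updates
// when nothing has changed.
func filterAddressesInPlace(addrs []v1.EndpointAddress, keep func(v1.EndpointAddress) bool) []v1.EndpointAddress {
	out := addrs[:0]
	for _, a := range addrs {
		if keep(a) {
			out = append(out, a)
		}
	}
	return out
}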
