
kubelet endpoint contains IP addresses of nodes with Ready condition Unknown #6548

Open

tstringer-fn opened this issue Apr 25, 2024 · 0 comments
Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Description

We have found that the kubelet controller does not check node conditions before adding node addresses to the kubelet Endpoints object. Normally this is not a problem: the endpoint target is simply marked as "down" and the scrape fails because the node is not ready.

A much larger problem occurs when the IP address of a down/NotReady node is reused elsewhere in the cluster (we have seen this exact scenario in GKE). For instance, node1 can be NotReady with IP address 1.2.3.4 while the underlying provisioner creates a new node, node2, that reuses IP address 1.2.3.4.

In this scenario, because the kubelet controller doesn't check node status, it adds two endpoint addresses with the same IP address. The subsets might look like this:

- addresses:
  - ip: 1.2.3.4
    targetRef:
      kind: Node
      name: node1
      uid: uid1
  - ip: 1.2.3.4
    targetRef:
      kind: Node
      name: node2
      uid: uid2

In this case, node1 is down (NotReady) and node2 has been provisioned with the same IP address. prometheus-operator currently adds both node1 and node2 to the kubelet endpoint, as shown above. As a result, both subsets are scraped and both scrapes succeed, because node2 answers on 1.2.3.4 in both cases. The resulting time series carry labels derived from each subset's metadata: if metric1 is exposed by node2, it is scraped twice, and the only difference between the two series is the node label.

The result is incorrect, duplicated data.
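For illustration only (the metric name and label set here are hypothetical, and the port assumes the default kubelet port 10250), the duplicated series would differ only in the node label even though both are actually served by node2:

  metric1{instance="1.2.3.4:10250", node="node1", ...}
  metric1{instance="1.2.3.4:10250", node="node2", ...}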

Steps to Reproduce

Unfortunately, the issue is difficult to reproduce. You would have to force a node into the NotReady state (which is possible) and then have the node provisioner create another node with the same IP address as the NotReady node.

Expected Result

Prometheus operator should not add a NotReady node's IP address to the kubelet endpoints.
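As a minimal sketch of the expected behaviour (illustrative Go using k8s.io/api/core/v1, not the operator's actual code), the kubelet controller could skip any node whose Ready condition is not True before adding its address:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nodeReady reports whether the node's Ready condition is True.
// Nodes with Ready=False or Ready=Unknown (e.g. the NotReady node whose
// IP address has been reused) would be filtered out before their
// addresses are added to the kubelet Endpoints object.
func nodeReady(node *v1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady {
			return cond.Status == v1.ConditionTrue
		}
	}
	return false
}

func main() {
	node1 := &v1.Node{Status: v1.NodeStatus{Conditions: []v1.NodeCondition{
		{Type: v1.NodeReady, Status: v1.ConditionUnknown},
	}}}
	fmt.Println(nodeReady(node1)) // false: node1's address would be skipped
}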

Actual Result

Prometheus operator adds a NotReady node's IP address to the kubelet endpoint.

Prometheus Operator Version

v0.66.0

Kubernetes Version

clientVersion:
  buildDate: "2023-06-14T09:56:58Z"
  compiler: gc
  gitCommit: 11902a838028edef305dfe2f96be929bc4d114d8
  gitTreeState: clean
  gitVersion: v1.26.6
  goVersion: go1.19.10
  major: "1"
  minor: "26"
  platform: darwin/arm64
kustomizeVersion: v4.5.7
serverVersion:
  buildDate: "2024-03-28T09:16:53Z"
  compiler: gc
  gitCommit: 5f47ffc3348d96baa3e5738450751c53f877c571
  gitTreeState: clean
  gitVersion: v1.26.15-gke.1090000
  goVersion: go1.21.8 X:boringcrypto
  major: "1"
  minor: "26"
  platform: linux/amd64

Kubernetes Cluster Type

GKE

How did you deploy Prometheus-Operator?

helm chart: prometheus-community/kube-prometheus-stack

Manifests

No response

prometheus-operator log output

N/A

Anything else?

No response

@tstringer-fn added the kind/bug and needs-triage labels on Apr 25, 2024
@tstringer-fn changed the title from "kubelet endpoint contains IP addresses of NotReady nodes" to "kubelet endpoint contains IP addresses of nodes with Ready condition Unknown" on Apr 25, 2024
@simonpasquier added the kind/enhancement label and removed the kind/bug and needs-triage labels on Apr 29, 2024