[enhancement] ignore the update events of heartbeat time changed when reconcile node events #14070

Open · microyahoo opened this issue Apr 13, 2024 · 5 comments · May be fixed by #14186
microyahoo commented Apr 13, 2024

Is this a bug report or feature request?

  • Bug Report

I consider this an enhancement rather than a bug report.

I deployed a new cluster and found that the ceph cluster predicate for the node watcher periodically prints networkfences.csiaddons.openshift.io CRD not found, skip creating networkFence. The predicate function watches node events to trigger reconciles, and the controller-runtime client then handles node failures when the ROOK_WATCH_FOR_NODE_FAILURE env var is set to true. The log message pops up whenever the networkfences csiaddons CRD is not found.

In order to figure out why the update event keeps being triggered and the handleNodeFailure method is called periodically, I hacked the code to print the diff between ObjectNew and ObjectOld, and found that when the cluster is stable, the only changes are the resource version and the heartbeat times. My question, therefore, is: can we ignore this kind of update event and only handle real node failures or other meaningful changes?

{
  "metadata": {
    "resourceVersion": "14127"
  },
  "status": {
    "$setElementOrder/conditions": [
      {
        "type": "NetworkUnavailable"
      },
      {
        "type": "MemoryPressure"
      },
      {
        "type": "DiskPressure"
      },
      {
        "type": "PIDPressure"
      },
      {
        "type": "Ready"
      }
    ],
    "conditions": [
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "MemoryPressure"
      },
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "DiskPressure"
      },
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "PIDPressure"
      },
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "Ready"
      }
    ]
  }
}
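
For illustration, here is a minimal sketch of what such a filter could look like as a controller-runtime update predicate. The package and function names (nodeChangedPredicate) are assumptions for this sketch, not Rook's actual code:

package predicates

import (
    "reflect"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// nodeChangedPredicate drops node update events in which only the
// resourceVersion, the conditions' lastHeartbeatTime, and the
// managed-fields timestamps changed, so a stable cluster no longer
// triggers a reconcile on every kubelet heartbeat.
func nodeChangedPredicate() predicate.Funcs {
    return predicate.Funcs{
        UpdateFunc: func(e event.UpdateEvent) bool {
            oldNode, okOld := e.ObjectOld.(*corev1.Node)
            newNode, okNew := e.ObjectNew.(*corev1.Node)
            if !okOld || !okNew {
                return true // not a Node update; let it through
            }
            // Work on copies and zero out the fields we want to ignore.
            oldCopy, newCopy := oldNode.DeepCopy(), newNode.DeepCopy()
            oldCopy.ResourceVersion, newCopy.ResourceVersion = "", ""
            oldCopy.ManagedFields, newCopy.ManagedFields = nil, nil
            for i := range oldCopy.Status.Conditions {
                oldCopy.Status.Conditions[i].LastHeartbeatTime = metav1.Time{}
            }
            for i := range newCopy.Status.Conditions {
                newCopy.Status.Conditions[i].LastHeartbeatTime = metav1.Time{}
            }
            // Reconcile only if something else still differs, e.g. a
            // condition status flipping (Ready: True -> False).
            return !reflect.DeepEqual(oldCopy, newCopy)
        },
    }
}

Comparing normalized copies instead of raw objects means periodic kubelet heartbeats no longer enqueue reconciles, while a real condition change (for example Ready flipping to False) still does.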

Deviation from expected behavior:

Expected behavior:

How to reproduce it (minimal and precise):

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of kubectl commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook kubectl Plugin

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
@microyahoo microyahoo added the bug label Apr 13, 2024
@travisn travisn added this to To do in v1.14 via automation Apr 15, 2024
parth-gr (Member) commented:

@microyahoo I also see the type changing over time, which seems right to watch for.

microyahoo commented Apr 16, 2024

Hi @parth-gr, I compared the current and the modified objects. The diff is as follows:

~ cat current | python3 -m json.tool > current.json
~ cat modified | python3 -m json.tool > modified.json
~ vimdiff current.json modified.json

[screenshot: vimdiff of current.json and modified.json]

parth-gr (Member) commented:

Okay, then we need to create a custom resource comparator that ignores these differences when deciding whether to reconcile.
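
As a rough sketch of where such a comparator would plug in, assuming the hypothetical nodeChangedPredicate from the sketch above and controller-runtime v0.15+ builder signatures (Rook's actual controller wiring differs):

// Hypothetical wiring sketch, not Rook's actual controller setup.
package predicates

import (
    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/builder"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

func setupNodeWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&cephv1.CephCluster{}).
        // A real controller would map node events to the owning CephCluster
        // with a custom handler; EnqueueRequestForObject keeps the sketch short.
        Watches(
            &corev1.Node{},
            &handler.EnqueueRequestForObject{},
            builder.WithPredicates(nodeChangedPredicate()), // drop heartbeat-only updates
        ).
        Complete(r)
}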

parth-gr (Member) commented:

/assign


Thanks for taking this issue! Let us know if you have any questions!

parth-gr added a commit to parth-gr/rook that referenced this issue May 9, 2024
Currently the node predicate watches resourceVersion and
heartbeat changes, which are not critical; they trigger the
reconcile logic and unnecessarily grow the predicate cache.
With this PR we skip reconciling the CephCluster when the
update is not crucial.

closes: rook#14070

Signed-off-by: parth-gr <paarora@redhat.com>
@parth-gr parth-gr linked a pull request May 9, 2024 that will close this issue