[enhancement] ignore the update events of heartbeat time changed when reconcile node events #14070

Open · microyahoo opened this issue Apr 13, 2024 · 5 comments · May be fixed by #14186
microyahoo commented Apr 13, 2024

Is this a bug report or feature request?

  • Bug Report

I consider this an enhancement rather than a bug report.

I deployed a new cluster and found that the ceph cluster predicate for the node watcher periodically prints networkfences.csiaddons.openshift.io CRD not found, skip creating networkFence. The predicate function watches node events to trigger reconciles, and the controller-runtime client then handles node failures when the ROOK_WATCH_FOR_NODE_FAILURE env var is set to true. The log message pops up whenever the networkfences csiaddons CRD is not found.

In order to figure out why the update event keeps being triggered and the handleNodeFailure method is called periodically, I hacked the code to print the diff between ObjectNew and ObjectOld, and found that when the cluster is stable, the only changes are the resource version and the heartbeat times. My question, therefore, is: can we ignore this kind of update event and only handle real node failures or other meaningful changes?

{
  "metadata": {
    "resourceVersion": "14127"
  },
  "status": {
    "$setElementOrder/conditions": [
      {
        "type": "NetworkUnavailable"
      },
      {
        "type": "MemoryPressure"
      },
      {
        "type": "DiskPressure"
      },
      {
        "type": "PIDPressure"
      },
      {
        "type": "Ready"
      }
    ],
    "conditions": [
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "MemoryPressure"
      },
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "DiskPressure"
      },
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "PIDPressure"
      },
      {
        "lastHeartbeatTime": "2024-04-12T10:36:54Z",
        "type": "Ready"
      }
    ]
  }
}
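
For illustration, here is a minimal sketch of what such a filter could look like as a controller-runtime update predicate. The package and function names (nodeChangedPredicate) are assumptions for this sketch, not Rook's actual code:

package predicates

import (
    "reflect"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// nodeChangedPredicate drops node update events in which only the
// resourceVersion, the conditions' lastHeartbeatTime, and the
// managed-fields timestamps changed, so a stable cluster no longer
// triggers a reconcile on every kubelet heartbeat.
func nodeChangedPredicate() predicate.Funcs {
    return predicate.Funcs{
        UpdateFunc: func(e event.UpdateEvent) bool {
            oldNode, okOld := e.ObjectOld.(*corev1.Node)
            newNode, okNew := e.ObjectNew.(*corev1.Node)
            if !okOld || !okNew {
                return true // not a Node update; let it through
            }
            // Work on copies and zero out the fields we want to ignore.
            oldCopy, newCopy := oldNode.DeepCopy(), newNode.DeepCopy()
            oldCopy.ResourceVersion, newCopy.ResourceVersion = "", ""
            oldCopy.ManagedFields, newCopy.ManagedFields = nil, nil
            for i := range oldCopy.Status.Conditions {
                oldCopy.Status.Conditions[i].LastHeartbeatTime = metav1.Time{}
            }
            for i := range newCopy.Status.Conditions {
                newCopy.Status.Conditions[i].LastHeartbeatTime = metav1.Time{}
            }
            // Reconcile only if something else still differs, e.g. a
            // condition status flipping (Ready: True -> False).
            return !reflect.DeepEqual(oldCopy, newCopy)
        },
    }
}

Comparing normalized copies instead of raw objects means periodic kubelet heartbeats no longer enqueue reconciles, while a real condition change (for example Ready flipping to False) still does.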

Deviation from expected behavior:

Expected behavior:

How to reproduce it (minimal and precise):

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of kubectl commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook kubectl Plugin

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
@microyahoo microyahoo added the bug label Apr 13, 2024
@travisn travisn added this to To do in v1.14 via automation Apr 15, 2024
parth-gr (Member) commented:

@microyahoo I also see the type changing over time, which seems right to watch for.

microyahoo commented Apr 16, 2024

Hi @parth-gr, I compared the current and the modified objects. The diff is as follows:

~ cat current | python3 -m json.tool > current.json
~ cat modified | python3 -m json.tool > modified.json
~ vimdiff current.json modified.json

[screenshot: vimdiff of current.json and modified.json]

parth-gr (Member) commented:

Okay, then we need to create a custom resource comparator that ignores these differences when deciding whether to reconcile.
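
As a rough sketch of where such a comparator would plug in, assuming the hypothetical nodeChangedPredicate from the sketch above and controller-runtime v0.15+ builder signatures (Rook's actual controller wiring differs):

// Hypothetical wiring sketch, not Rook's actual controller setup.
package predicates

import (
    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/builder"
    "sigs.k8s.io/controller-runtime/pkg/handler"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

func setupNodeWatch(mgr ctrl.Manager, r reconcile.Reconciler) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&cephv1.CephCluster{}).
        // A real controller would map node events to the owning CephCluster
        // with a custom handler; EnqueueRequestForObject keeps the sketch short.
        Watches(
            &corev1.Node{},
            &handler.EnqueueRequestForObject{},
            builder.WithPredicates(nodeChangedPredicate()), // drop heartbeat-only updates
        ).
        Complete(r)
}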

parth-gr (Member) commented:

/assign


Thanks for taking this issue! Let us know if you have any questions!

parth-gr added a commit to parth-gr/rook that referenced this issue May 9, 2024
Currently the node predicate watches resourceVersion and
heartbeat changes, which are not critical; they trigger the
reconcile logic and unnecessarily grow the predicate cache.
With this PR we skip reconciling the CephCluster when the
update is not crucial.

closes: rook#14070

Signed-off-by: parth-gr <paarora@redhat.com>
@parth-gr parth-gr linked a pull request May 9, 2024 that will close this issue