throttle events #3252

Open
wants to merge 1 commit into main

Conversation

@jeswanke (Contributor) commented Feb 2, 2024

Resolves ACM-8455

The magic of the ACM UI architecture is that if there's a change in kube, it is automatically and instantly reflected in the UI. The downside is when the UI is monitoring lots of resources and events happen all at once, or when a resource is thrashing its results, or both.

To fix this problem, when an event passes into the throttler (a minimal sketch follows the list):

  1. If that policy has never had an event, the event is just sent on; the magic lives on.
  2. However, if the same policy sends another event within 10 seconds, the throttle holds onto it.
  3. If nothing more happens after 10 seconds, that held event is sent on.
  4. However, if lots of events happen, we use standard deviation to see whether the compliance is thrashing (a big standard deviation = thrashing).
    a. If thrashing, the compliance is set to NonCompliant.
  5. The event is then only sent if it doesn't match the compliance in the event queue; in other words, the client doesn't know the current state of the compliance.
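For readers following along, here is a minimal sketch of the throttling logic described in steps 1-5. The throttleEvents(resource, cacheResource) signature matches the simulation code quoted later in this thread; the window length, the standard-deviation threshold, the per-policy key, and the Throttle/throttleCache shapes are illustrative assumptions, not the PR's actual implementation.

// Illustrative sketch only, not the code in this PR.
type CompliantType = 'Compliant' | 'NonCompliant'

interface IResource {
  metadata: { name?: string; namespace?: string }
  status?: { compliant?: CompliantType }
}

interface Throttle {
  timer?: ReturnType<typeof setTimeout>
  lastSent?: CompliantType // the compliance the client currently knows about
  pending?: IResource // the latest event held back during the window
  samples: number[] // 1 = Compliant, 0 = NonCompliant, used for the std-dev check
}

const THROTTLE_WINDOW_MS = 10 * 1000
const throttleCache = new Map<string, Throttle>()

function stdDev(samples: number[]): number {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length
  const variance = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length
  return Math.sqrt(variance)
}

export function throttleEvents(resource: IResource, cacheResource: (arg: IResource) => void): void {
  const key = `${resource.metadata.namespace}/${resource.metadata.name}`
  let throttle = throttleCache.get(key)

  // 1. First event ever for this policy: send it straight through.
  if (!throttle) {
    throttle = { samples: [], lastSent: resource.status?.compliant }
    throttleCache.set(key, throttle)
    cacheResource(resource)
    return
  }

  // 2. Another event within the window: hold it and record a compliance sample.
  throttle.pending = resource
  throttle.samples.push(resource.status?.compliant === 'Compliant' ? 1 : 0)

  if (!throttle.timer) {
    const held = throttle
    held.timer = setTimeout(() => {
      held.timer = undefined
      const pending = held.pending
      if (!pending) return

      // 3./4. After 10 quiet seconds the held event goes out; many mixed samples
      // (a large standard deviation) mean the compliance is thrashing.
      if (held.samples.length > 2 && stdDev(held.samples) > 0.3) {
        if (pending.status) pending.status.compliant = 'NonCompliant' // 4a
      }
      held.samples = []

      // 5. Only forward the event if the client's last known compliance is stale.
      if (pending.status?.compliant !== held.lastSent) {
        held.lastSent = pending.status?.compliant
        cacheResource(pending)
      }
      held.pending = undefined
    }, THROTTLE_WINDOW_MS)
  }
}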

There is a policy thrasher in eventUtils.ts that is accessed through event.ts:
import { WatchEvent, ServerSideEventData, thrashEvents, purgeEvents, throttleEvents } from '../lib/eventUtils'

export function startWatching(): void {
  ServerSideEvents.eventFilter = eventFilter

  for (const definition of definitions) {
    void listAndWatch(definition)
  }
  // thrashEvents(cacheResource) <<<<<<<<<<<<<< uncomment me
}

Signed-off-by: John Swanke <jswanke@redhat.com>

openshift-ci bot commented Feb 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jeswanke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Feb 2, 2024

@jeswanke: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      Commit   Details  Required  Rerun command
ci/prow/unit-tests-sonarcloud  465bda9  link     true      /test unit-tests-sonarcloud
ci/prow/check                  465bda9  link     true      /test check
ci/prow/pr-image-mirror-mce    465bda9  link     true      /test pr-image-mirror-mce
ci/prow/pr-image-mirror        465bda9  link     true      /test pr-image-mirror
ci/prow/images                 465bda9  link     true      /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@KevinFCormier (Contributor) left a comment

I have a number of concerns with this approach.

  1. When resources are thrashing, it is indicative of a bug in controllers or a configuration problem, and it causes over-utilization of cluster resources that should be addressed. It is not ideal that this causes problems in the UI, but on the other hand, symptoms in the UI have often been the canary in the coal mine. Just last week I identified a problem with ManagedCluster resources thrashing because of symptoms seen in the UI: https://issues.redhat.com/browse/ACM-9688
  2. If we decide that the UI should be resilient to thrashing, it should be implemented generally, because any resource thrashing can cause issues. This implementation is only focused on policies, and specifically on the compliance state. I am not 100% certain that compliance was what was changing on the policies for this customer case, and it's not clear to me exactly what happens if some other part of a policy is rapidly changing.
  3. I am concerned about manipulating the compliance state based on statistical analysis. I modified your simulation code to try something out and found a potential bug. Suppose a policy is thrashing and then the issue is resolved, such that the resource immediately stops updating and the final update puts it in a compliant state. See the code I used below. The policy remained noncompliant in the UI, and the purgeEvents code never removed the throttle (nor sent a final update). So there is a memory leak. I know the backend is already unbounded in memory usage, but we don't want to make matters any worse. I also wonder if there is potential for a final update to be sent but the compliance to be altered based on statistics, rather than on the last seen value.
  4. There is no connection of the throttleCache to a deletion event. Is it possible that purgeEvents might call cacheResource for a resource that has been deleted?
const THRASH_MAX = 10
const name = 'thrasher-policy'
export function thrashEvents(cacheResource: (arg: IResource) => void) {
  let resourceVersion = 705556
  const thrashPolicy = (i: number, compliant: CompliantType) => {
    resourceVersion = resourceVersion + 1
    const policy = structuredClone(mockedPolicy)
    policy.metadata.name = `${name}-${i + 1}`
    policy.metadata.resourceVersion = resourceVersion.toString()
    policy.metadata.uid = `018a0bd7-9130-4246-ba59-5b6c30a54${i.toString().padStart(3, '0')}`
    policy.status.compliant = compliant
    policy.status.status[0].compliant = compliant
    console.log(`THROTTLE POLICY: ${compliant}`)
    throttleEvents(policy, cacheResource)
  }

  // Original configuration
  // setTimeout(() => {
  //   setInterval(() => {
  //     for (let i = 0; i < 50 * 70; i++) {
  //       thrashPolicy(i, Math.random() < 0.5 && i > 2000 ? 'NonCompliant' : 'Compliant')
  //     }
  //   }, 3000)
  // }, 60000)

  // One policy that thrashes for awhile then stabilizes on Compliant
  let thrashCount = 1
  function nextThrash() {
    thrashPolicy(0, Math.random() < 0.5 ? 'NonCompliant' : 'Compliant')
    thrashCount = thrashCount + 1
    if (thrashCount < THRASH_MAX) {
      setTimeout(nextThrash, 3000)
    } else {
      thrashPolicy(0, 'Compliant')
    }
  }
  function startThrash() {
    thrashPolicy(0, 'NonCompliant')
    setTimeout(nextThrash, 3000)
  }
  setTimeout(startThrash, 60000)
}
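Purely as an illustration of one way to address points 3 and 4 (not the purgeEvents code in this PR), a purge pass could flush any still-held event so the final observed compliance reaches the client, and then drop the throttle entry, with a deletion event removing the entry as well. This reuses the hypothetical Throttle/throttleCache shapes from the sketch in the PR description above.

// Sketch only; assumes the Throttle/throttleCache shapes from the earlier sketch,
// not the actual purgeEvents implementation in this PR.
export function purgeEvents(cacheResource: (arg: IResource) => void): void {
  for (const [key, throttle] of throttleCache) {
    // Flush a still-pending event so the client ends up with the last observed
    // compliance (e.g. the final 'Compliant' in the test above) instead of a stale value.
    if (throttle.pending && throttle.pending.status?.compliant !== throttle.lastSent) {
      cacheResource(throttle.pending)
    }
    if (throttle.timer) clearTimeout(throttle.timer)
    // Drop the entry so throttleCache does not grow without bound; a deletion
    // event for the resource should also remove its entry here.
    throttleCache.delete(key)
  }
}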

@jeswanke (Contributor, Author) commented Feb 4, 2024

OK, how about this:

  1. The UI shows that the resource is thrashing: instead of indicating Compliant/NonCompliant, we add a Fault or Error state to indicate a thrashing policy. I bet that would have helped out in your case, and it would definitely help here, since all the SRE can report is 7.5 GB of background data over an hour.
  2. To be cautious I would start with just policy; it seems suspicious that the bug mentions policy a lot, and that's the kind of thing that was made to change a lot.
    1. and 4. are bugs that can be fixed.
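Item 1 above proposes surfacing thrashing directly as a new compliance value. Purely as an illustration (the names and the mixed-sample heuristic below are placeholders, not anything in this PR), the UI-side value might look like:

// Hypothetical sketch: a UI-side status that reports a thrashing policy as 'Fault'
// instead of flipping between Compliant and NonCompliant.
type UICompliance = 'Compliant' | 'NonCompliant' | 'Fault'

function uiCompliance(recent: ('Compliant' | 'NonCompliant')[], latest: 'Compliant' | 'NonCompliant'): UICompliance {
  // If the recent window contains a mix of both values, report Fault rather than the latest flip.
  const compliantCount = recent.filter((v) => v === 'Compliant').length
  const mixed = compliantCount > 0 && compliantCount < recent.length
  return recent.length > 4 && mixed ? 'Fault' : latest
}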

@KevinFCormier (Contributor)

> OK, how about this:
>
>   1. The UI shows that the resource is thrashing: instead of indicating Compliant/NonCompliant, we add a Fault or Error state to indicate a thrashing policy. I bet that would have helped out in your case, and it would definitely help here, since all the SRE can report is 7.5 GB of background data over an hour.
>   2. To be cautious I would start with just policy; it seems suspicious that the bug mentions policy a lot, and that's the kind of thing that was made to change a lot.
>     1. and 4. are bugs that can be fixed.

@mprahl, can we get your thoughts on this? We had a case with a customer where misconfigured policies were thrashing back and forth between compliant and non-compliant. Do you encounter cases like this often? How are they usually discovered? UI symptoms have helped identify thrashing in policies and other resources in the past, so I'm not sure that we want to mask these problems by making the UI tolerate them. John proposes surfacing this in the UI (for policies, specifically) while avoiding sending every update, so that we don't run into performance issues. I think this might be useful, but it would have more user-visible impacts: a new compliance status value that would need to be handled in various places in the UI, documentation on what this status means, and so on. I think that would need to be handled as a story, meaning it would have to wait until the next release.

My proposal for handling this bug was only to stop the ACM console from listening for updates and consuming data when the tab is not in use, something that Chrome has recently implemented anyway. I considered the excessive amount of data a separate issue that we do not want to mask, because a configuration issue like this needs to be solved for the health of the cluster.
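For illustration of the tab-visibility idea (the '/events' endpoint and the close/reconnect approach are assumptions, not the console's actual wiring), the browser's Page Visibility API makes this straightforward:

// Sketch only: stop consuming the server-side event stream while the tab is hidden
// and reconnect when it becomes visible again.
let eventSource: EventSource | undefined

function connect(): void {
  eventSource = new EventSource('/events') // hypothetical stream endpoint
  eventSource.onmessage = (event: MessageEvent<string>) => {
    // Process the server-side event payload here.
    console.log(JSON.parse(event.data))
  }
}

document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') {
    // Tab is not in use: close the stream so no background data is consumed.
    eventSource?.close()
    eventSource = undefined
  } else if (!eventSource) {
    // Tab is visible again: reconnect and let the normal resync catch up.
    connect()
  }
})

connect()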

@openshift-merge-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
