throttle events #3252

Open
wants to merge 1 commit into main

Conversation

@jeswanke (Contributor) commented Feb 2, 2024

Resolves ACM-8455

The magic of the ACM UI architecture is that if there's a change in kube, it is automatically and instantly reflected in the UI. The downside is when the UI is monitoring lots of resources and events happen all at once, or when a resource is thrashing its results, or both.

To fix this problem, when an event passes into the throttler (a minimal sketch follows the list):

  1. If that policy has never had an event, the event is just sent on; the magic lives on.
  2. However, if the same policy sends another event within 10 seconds, the throttle holds onto it.
  3. If nothing more happens after 10 seconds, that held event is sent on.
  4. However, if lots of events happen, we use standard deviation to see whether the compliance is thrashing (a big standard deviation = thrashing).
    a. If thrashing, the compliance is set to NonCompliant.
  5. The event is then only sent if it doesn't match the compliance in the event queue; in other words, the client doesn't know the current state of the compliance.
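For readers following along, here is a minimal sketch of the throttling logic described in steps 1-5. The throttleEvents(resource, cacheResource) signature matches the simulation code quoted later in this thread; the window length, the standard-deviation threshold, the per-policy key, and the Throttle/throttleCache shapes are illustrative assumptions, not the PR's actual implementation.

// Illustrative sketch only, not the code in this PR.
type CompliantType = 'Compliant' | 'NonCompliant'

interface IResource {
  metadata: { name?: string; namespace?: string }
  status?: { compliant?: CompliantType }
}

interface Throttle {
  timer?: ReturnType<typeof setTimeout>
  lastSent?: CompliantType // the compliance the client currently knows about
  pending?: IResource // the latest event held back during the window
  samples: number[] // 1 = Compliant, 0 = NonCompliant, used for the std-dev check
}

const THROTTLE_WINDOW_MS = 10 * 1000
const throttleCache = new Map<string, Throttle>()

function stdDev(samples: number[]): number {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length
  const variance = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length
  return Math.sqrt(variance)
}

export function throttleEvents(resource: IResource, cacheResource: (arg: IResource) => void): void {
  const key = `${resource.metadata.namespace}/${resource.metadata.name}`
  let throttle = throttleCache.get(key)

  // 1. First event ever for this policy: send it straight through.
  if (!throttle) {
    throttle = { samples: [], lastSent: resource.status?.compliant }
    throttleCache.set(key, throttle)
    cacheResource(resource)
    return
  }

  // 2. Another event within the window: hold it and record a compliance sample.
  throttle.pending = resource
  throttle.samples.push(resource.status?.compliant === 'Compliant' ? 1 : 0)

  if (!throttle.timer) {
    const held = throttle
    held.timer = setTimeout(() => {
      held.timer = undefined
      const pending = held.pending
      if (!pending) return

      // 3./4. After 10 quiet seconds the held event goes out; many mixed samples
      // (a large standard deviation) mean the compliance is thrashing.
      if (held.samples.length > 2 && stdDev(held.samples) > 0.3) {
        if (pending.status) pending.status.compliant = 'NonCompliant' // 4a
      }
      held.samples = []

      // 5. Only forward the event if the client's last known compliance is stale.
      if (pending.status?.compliant !== held.lastSent) {
        held.lastSent = pending.status?.compliant
        cacheResource(pending)
      }
      held.pending = undefined
    }, THROTTLE_WINDOW_MS)
  }
}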

There is a policy thrasher in eventUtils.ts that is accessed through event.ts:
import { WatchEvent, ServerSideEventData, thrashEvents, purgeEvents, throttleEvents } from '../lib/eventUtils'

export function startWatching(): void {
  ServerSideEvents.eventFilter = eventFilter

  for (const definition of definitions) {
    void listAndWatch(definition)
  }
  // thrashEvents(cacheResource) <<<<<<<<<<<<<< uncomment me
}

Signed-off-by: John Swanke <jswanke@redhat.com>

openshift-ci bot commented Feb 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jeswanke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Feb 2, 2024

@jeswanke: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                      Commit   Details  Required  Rerun command
ci/prow/unit-tests-sonarcloud  465bda9  link     true      /test unit-tests-sonarcloud
ci/prow/check                  465bda9  link     true      /test check
ci/prow/pr-image-mirror-mce    465bda9  link     true      /test pr-image-mirror-mce
ci/prow/pr-image-mirror        465bda9  link     true      /test pr-image-mirror
ci/prow/images                 465bda9  link     true      /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@KevinFCormier (Contributor) left a comment

I have a number of concerns with this approach.

  1. When resources are thrashing, it is indicative of a bug in controllers or a configuration problem, and it causes over-utilization of cluster resources that should be addressed. It is not ideal that this causes problems in the UI, but on the other hand, symptoms in the UI have often been the canary in the coal mine. Just last week I identified a problem with ManagedCluster resources thrashing because of symptoms seen in the UI: https://issues.redhat.com/browse/ACM-9688
  2. If we decide that the UI should be resilient to thrashing, it should be implemented generally, because any resource thrashing can cause issues. This implementation is only focused on policies, and specifically on the compliance state. I am not 100% certain that compliance was what was changing on the policies for this customer case, and it's not clear to me exactly what happens if some other part of a policy is rapidly changing.
  3. I am concerned about manipulating the compliance state based on statistical analysis. I modified your simulation code to try something out and found a potential bug. Suppose a policy is thrashing and then the issue is resolved, such that the resource immediately stops updating and the final update puts it in a compliant state. See the code I used below. The policy remained noncompliant in the UI, and the purgeEvents code never removed the throttle (nor sent a final update). So there is a memory leak. I know the backend is already unbounded in memory usage, but we don't want to make matters any worse. I also wonder if there is potential for a final update to be sent but the compliance to be altered based on statistics, rather than on the last seen value.
  4. There is no connection of the throttleCache to a deletion event. Is it possible that purgeEvents might call cacheResource for a resource that has been deleted?
const THRASH_MAX = 10
const name = 'thrasher-policy'
export function thrashEvents(cacheResource: (arg: IResource) => void) {
  let resourceVersion = 705556
  const thrashPolicy = (i: number, compliant: CompliantType) => {
    resourceVersion = resourceVersion + 1
    const policy = structuredClone(mockedPolicy)
    policy.metadata.name = `${name}-${i + 1}`
    policy.metadata.resourceVersion = resourceVersion.toString()
    policy.metadata.uid = `018a0bd7-9130-4246-ba59-5b6c30a54${i.toString().padStart(3, '0')}`
    policy.status.compliant = compliant
    policy.status.status[0].compliant = compliant
    console.log(`THROTTLE POLICY: ${compliant}`)
    throttleEvents(policy, cacheResource)
  }

  // Original configuration
  // setTimeout(() => {
  //   setInterval(() => {
  //     for (let i = 0; i < 50 * 70; i++) {
  //       thrashPolicy(i, Math.random() < 0.5 && i > 2000 ? 'NonCompliant' : 'Compliant')
  //     }
  //   }, 3000)
  // }, 60000)

  // One policy that thrashes for awhile then stabilizes on Compliant
  let thrashCount = 1
  function nextThrash() {
    thrashPolicy(0, Math.random() < 0.5 ? 'NonCompliant' : 'Compliant')
    thrashCount = thrashCount + 1
    if (thrashCount < THRASH_MAX) {
      setTimeout(nextThrash, 3000)
    } else {
      thrashPolicy(0, 'Compliant')
    }
  }
  function startThrash() {
    thrashPolicy(0, 'NonCompliant')
    setTimeout(nextThrash, 3000)
  }
  setTimeout(startThrash, 60000)
}
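Purely as an illustration of one way to address points 3 and 4 (not the purgeEvents code in this PR), a purge pass could flush any still-held event so the final observed compliance reaches the client, and then drop the throttle entry, with a deletion event removing the entry as well. This reuses the hypothetical Throttle/throttleCache shapes from the sketch in the PR description above.

// Sketch only; assumes the Throttle/throttleCache shapes from the earlier sketch,
// not the actual purgeEvents implementation in this PR.
export function purgeEvents(cacheResource: (arg: IResource) => void): void {
  for (const [key, throttle] of throttleCache) {
    // Flush a still-pending event so the client ends up with the last observed
    // compliance (e.g. the final 'Compliant' in the test above) instead of a stale value.
    if (throttle.pending && throttle.pending.status?.compliant !== throttle.lastSent) {
      cacheResource(throttle.pending)
    }
    if (throttle.timer) clearTimeout(throttle.timer)
    // Drop the entry so throttleCache does not grow without bound; a deletion
    // event for the resource should also remove its entry here.
    throttleCache.delete(key)
  }
}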

@jeswanke (Contributor, Author) commented Feb 4, 2024

OK, how about this:

  1. The UI shows that the resource is thrashing: instead of indicating Compliant/NonCompliant, we add a Fault or Error state to indicate a thrashing policy. I bet that would have helped out in your case, and it would definitely help here, since all the SRE can report is 7.5 GB of background data over an hour.
  2. To be cautious I would start with just policy; it seems suspicious that the bug mentions policy a lot, and that's the kind of thing that was made to change a lot.
    1. and 4. are bugs that can be fixed.
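Item 1 above proposes surfacing thrashing directly as a new compliance value. Purely as an illustration (the names and the mixed-sample heuristic below are placeholders, not anything in this PR), the UI-side value might look like:

// Hypothetical sketch: a UI-side status that reports a thrashing policy as 'Fault'
// instead of flipping between Compliant and NonCompliant.
type UICompliance = 'Compliant' | 'NonCompliant' | 'Fault'

function uiCompliance(recent: ('Compliant' | 'NonCompliant')[], latest: 'Compliant' | 'NonCompliant'): UICompliance {
  // If the recent window contains a mix of both values, report Fault rather than the latest flip.
  const compliantCount = recent.filter((v) => v === 'Compliant').length
  const mixed = compliantCount > 0 && compliantCount < recent.length
  return recent.length > 4 && mixed ? 'Fault' : latest
}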

@KevinFCormier (Contributor)

> OK, how about this:
>
>   1. The UI shows that the resource is thrashing: instead of indicating Compliant/NonCompliant, we add a Fault or Error state to indicate a thrashing policy. I bet that would have helped out in your case, and it would definitely help here, since all the SRE can report is 7.5 GB of background data over an hour.
>   2. To be cautious I would start with just policy; it seems suspicious that the bug mentions policy a lot, and that's the kind of thing that was made to change a lot.
>     1. and 4. are bugs that can be fixed.

@mprahl, can we get your thoughts on this? We had a case with a customer where misconfigured policies were thrashing back and forth between compliant and non-compliant. Do you encounter cases like this often? How are they usually discovered? UI symptoms have helped identify thrashing in policies and other resources in the past, so I'm not sure that we want to mask these problems by making the UI tolerate them. John proposes surfacing this in the UI (for policies, specifically) while avoiding sending every update, so that we don't run into performance issues. I think this might be useful, but it would have more user-visible impacts: a new compliance status value that would need to be handled in various places in the UI, documentation on what this status means, and so on. I think that would need to be handled as a story, meaning it would have to wait until the next release.

My proposal for handling this bug was only to stop the ACM console from listening for updates and consuming data when the tab is not in use, something that Chrome has recently implemented anyway. I considered the excessive amount of data a separate issue that we do not want to mask, because a configuration issue like this needs to be solved for the health of the cluster.
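For illustration of the tab-visibility idea (the '/events' endpoint and the close/reconnect approach are assumptions, not the console's actual wiring), the browser's Page Visibility API makes this straightforward:

// Sketch only: stop consuming the server-side event stream while the tab is hidden
// and reconnect when it becomes visible again.
let eventSource: EventSource | undefined

function connect(): void {
  eventSource = new EventSource('/events') // hypothetical stream endpoint
  eventSource.onmessage = (event: MessageEvent<string>) => {
    // Process the server-side event payload here.
    console.log(JSON.parse(event.data))
  }
}

document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'hidden') {
    // Tab is not in use: close the stream so no background data is consumed.
    eventSource?.close()
    eventSource = undefined
  } else if (!eventSource) {
    // Tab is visible again: reconnect and let the normal resync catch up.
    connect()
  }
})

connect()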

@openshift-merge-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
