throttle events #3252
Conversation
Signed-off-by: John Swanke <jswanke@redhat.com>
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jeswanke. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@jeswanke: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
I have a number of concerns with this approach.
- When resources are thrashing, it is indicative of a bug in controllers or a configuration problem, and it causes over-utilization of cluster resources that should be addressed. It is not ideal that this causes problems in the UI, but on the other hand, symptoms in the UI have often been the canary in the coal mine. Just last week I identified a problem with `ManagedCluster` resources thrashing because of symptoms seen in the UI: https://issues.redhat.com/browse/ACM-9688
- If we decide that the UI should be resilient to thrashing, it should be implemented generally, because any resource thrashing can cause issues. This implementation is only focused on policies, and specifically on the compliance state. I am not 100% certain that compliance was what was changing on the policies for this customer case, and it's not clear to me exactly what happens if some other part of a policy is rapidly changing.
- I am concerned about manipulating the compliance state based on statistical analysis. I modified your simulation code to try something out and found a potential bug. Suppose a policy is thrashing and then the issue is resolved, such that the resource immediately stops updating and the final update puts it in a compliant state. See the code I used below. The policy remained noncompliant in the UI, and the purgeEvents code never removed the throttle (nor sent a final update). So there is a memory leak. I know the backend is already unbounded in memory usage, but we don't want to make matters any worse. I also wonder if there is potential for a final update to be sent but the compliance to be altered based on statistics, rather than on the last seen value.
- There is no connection of the throttleCache to a deletion event. Is it possible that purgeEvents might call cacheResource for a resource that has been deleted?
```ts
const THRASH_MAX = 10
const name = 'thrasher-policy'

export function thrashEvents(cacheResource: (arg: IResource) => void) {
  let resourceVersion = 705556
  const thrashPolicy = (i: number, compliant: CompliantType) => {
    resourceVersion = resourceVersion + 1
    const policy = structuredClone(mockedPolicy)
    policy.metadata.name = `${name}-${i + 1}`
    policy.metadata.resourceVersion = resourceVersion.toString()
    policy.metadata.uid = `018a0bd7-9130-4246-ba59-5b6c30a54${i.toString().padStart(3, '0')}`
    policy.status.compliant = compliant
    policy.status.status[0].compliant = compliant
    console.log(`THROTTLE POLICY: ${compliant}`)
    throttleEvents(policy, cacheResource)
  }

  // Original configuration
  // setTimeout(() => {
  //   setInterval(() => {
  //     for (let i = 0; i < 50 * 70; i++) {
  //       thrashPolicy(i, Math.random() < 0.5 && i > 2000 ? 'NonCompliant' : 'Compliant')
  //     }
  //   }, 3000)
  // }, 60000)

  // One policy that thrashes for a while, then stabilizes on Compliant
  let thrashCount = 1
  function nextThrash() {
    thrashPolicy(0, Math.random() < 0.5 ? 'NonCompliant' : 'Compliant')
    thrashCount = thrashCount + 1
    if (thrashCount < THRASH_MAX) {
      setTimeout(nextThrash, 3000)
    } else {
      thrashPolicy(0, 'Compliant')
    }
  }
  function startThrash() {
    thrashPolicy(0, 'NonCompliant')
    setTimeout(nextThrash, 3000)
  }
  setTimeout(startThrash, 60000)
}
```
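One way to avoid the leak described in the review (a buffered final update that is never delivered, leaving the policy stuck noncompliant in the UI) is a trailing-edge throttle that always retains and eventually emits the last-seen event. The sketch below is illustrative only, under assumed names: `TrailingThrottle`, `offer`, and `flush` are hypothetical and not the PR's actual `throttleCache`/`purgeEvents` implementation.

```typescript
// Hypothetical sketch: a trailing-edge throttle that never drops the last
// event for a resource, so a final "Compliant" update always reaches the UI.
interface ThrottledEvent {
  uid: string
  compliant: string
  receivedAt: number // ms timestamp, passed in for testability
}

class TrailingThrottle {
  private lastEmitted = new Map<string, number>()
  private pending = new Map<string, ThrottledEvent>()

  constructor(private windowMs: number) {}

  // Returns the event if it should be emitted now; otherwise buffers it,
  // keeping only the newest value per resource.
  offer(event: ThrottledEvent): ThrottledEvent | undefined {
    const last = this.lastEmitted.get(event.uid)
    if (last === undefined || event.receivedAt - last >= this.windowMs) {
      this.lastEmitted.set(event.uid, event.receivedAt)
      this.pending.delete(event.uid)
      return event
    }
    this.pending.set(event.uid, event)
    return undefined
  }

  // Called periodically (e.g. from a purge loop): emits buffered last-seen
  // values whose window has elapsed, so the final state is never lost.
  flush(now: number): ThrottledEvent[] {
    const emitted: ThrottledEvent[] = []
    for (const [uid, event] of this.pending) {
      if (now - (this.lastEmitted.get(uid) ?? 0) >= this.windowMs) {
        this.lastEmitted.set(uid, now)
        this.pending.delete(uid)
        emitted.push(event)
      }
    }
    return emitted
  }

  // Drop all state for a deleted resource, addressing the concern that the
  // throttle cache has no connection to deletion events.
  delete(uid: string): void {
    this.lastEmitted.delete(uid)
    this.pending.delete(uid)
  }
}
```

Because `flush` always emits the newest buffered event rather than a statistically derived value, the compliance shown after the resource stabilizes matches the last value actually seen.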
ok, how about this:
@mprahl can we get your thoughts on this? We had a case with a customer where misconfigured policies were thrashing back and forth between compliant and non-compliant. Do you encounter cases like this often? How are they usually discovered? UI symptoms have helped identify thrashing in policies and other resources in the past, so I'm not sure that we want to mask these problems by making the UI tolerate them.

John proposes having this surfaced in the UI (for policies, specifically) while avoiding sending every update so that we don't run into performance issues. I think this might be useful, but it would have more user-visible impacts: a new compliance status value that would need to be handled in various places in the UI, documentation on what this status means, etc. I think that would need to be handled as a story, meaning it would have to wait until the next release.

My proposal for handling this bug was only to stop the ACM console from listening for updates and consuming data when the tab is not in use, something that Chrome has recently implemented anyway. The excessive amount of data I consider a separate issue that we do not want to mask, because a configuration issue like this needs to be solved for the health of the cluster.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
resolves ACM-8455
The magic of the ACM UI architecture is that if there's a change in kube, it will be automatically and instantly reflected in the UI. The downside is that the UI may be monitoring lots of resources when events happen all at once, or a resource may be thrashing its results, or both.
To fix this problem, if an event passes into the throttler:
a. if thrashing, the compliance is set to NonCompliant
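As a rough illustration of the thrash-detection step described above (not the PR's actual implementation), a throttler could count updates per resource inside a time window and pin the displayed compliance to NonCompliant while the count exceeds a threshold. `classify`, `THRASH_THRESHOLD`, and `WINDOW_MS` are assumed names for this sketch.

```typescript
// Hypothetical sketch of thrash detection: if a policy updates more than
// THRASH_THRESHOLD times inside a window, report it as 'NonCompliant'
// while still remembering the last-seen value so a final update can
// restore the real state once the resource stabilizes.
const THRASH_THRESHOLD = 5
const WINDOW_MS = 10_000

interface WindowState {
  count: number
  windowStart: number
  lastCompliant: string
}

const windows = new Map<string, WindowState>()

// Returns the compliance value the UI should show for this update.
function classify(uid: string, compliant: string, now: number): string {
  let state = windows.get(uid)
  if (!state || now - state.windowStart >= WINDOW_MS) {
    // Start a fresh window for this resource
    state = { count: 0, windowStart: now, lastCompliant: compliant }
    windows.set(uid, state)
  }
  state.count += 1
  state.lastCompliant = compliant // always track the true last-seen value
  // While thrashing, pin the displayed state to 'NonCompliant'
  return state.count > THRASH_THRESHOLD ? 'NonCompliant' : compliant
}
```

Keeping `lastCompliant` separate from the pinned display value is what lets a final, post-thrash update show the resource's real state instead of a statistically derived one.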
There is a policy thrasher in eventUtils.ts that is accessed through event.ts:
```ts
import { WatchEvent, ServerSideEventData, thrashEvents, purgeEvents, throttleEvents } from '../lib/eventUtils'

export function startWatching(): void {
  ServerSideEvents.eventFilter = eventFilter
  for (const definition of definitions) {
    void listAndWatch(definition)
  }
  // thrashEvents(cacheResource) <<<<<<<<<<<<<< uncomment me
}
```
Signed-off-by: John Swanke jswanke@redhat.com