Add client side event spam filtering #47367

Merged
merged 1 commit into kubernetes:master on Sep 4, 2017

Conversation

derekwaynecarr
Member

@derekwaynecarr derekwaynecarr commented Jun 12, 2017

What this PR does / why we need it:
Add client side event spam filtering to stop excessive traffic to api-server from internal cluster components.

This PR defines a per source+object event budget with a burst of 25 and a refill rate of 1 token every 5 minutes.
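For illustration, here is a minimal, self-contained sketch of such a per-key token bucket, using golang.org/x/time/rate rather than client-go's internals; the type names and key format are assumptions for the example, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// spamFilter keeps one token bucket per event source+object key.
// Burst 25 with a refill of 1 token every 5 minutes mirrors the budget
// described above; the real change lives in client-go's EventCorrelator,
// this is only an illustrative stand-in.
type spamFilter struct {
	mu      sync.Mutex
	buckets map[string]*rate.Limiter
	burst   int
	refill  rate.Limit
}

func newSpamFilter() *spamFilter {
	return &spamFilter{
		buckets: make(map[string]*rate.Limiter),
		burst:   25,
		refill:  rate.Every(5 * time.Minute), // 1 token every 5 minutes
	}
}

// Filter returns true when the event identified by key should be dropped.
func (f *spamFilter) Filter(key string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	lim, ok := f.buckets[key]
	if !ok {
		lim = rate.NewLimiter(f.refill, f.burst)
		f.buckets[key] = lim
	}
	return !lim.Allow()
}

func main() {
	f := newSpamFilter()
	key := "kubelet,node-1/default/pod-a" // hypothetical source + involved object key
	dropped := 0
	for i := 0; i < 100; i++ {
		if f.Filter(key) {
			dropped++
		}
	}
	fmt.Printf("dropped %d of 100 events\n", dropped) // expect 75 of 100 dropped
}
```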

I tested this PR in the following scenarios:

Scenario 1: Node with 50 crash-looping pods

# create 50 crash-looping pods on a single node
$ kubectl run bad --image=busybox --replicas=50 --command -- derekisbad

Before:

  • POST events with peak of 1.7 per second, long-tail: 0.2 per second
  • PATCH events with peak of 5 per second, long-tail: 5 per second

After:

  • POST events with peak of 1.7 per second, long-tail: 0.2 per second
  • PATCH events with peak of 3.6 per second, long-tail: 0.2 per second

Observation:

Scenario 2: replication controller limited by quota

$ kubectl create quota my-quota --hard=pods=1
$ kubectl run nginx --image=nginx --replicas=50

Before:

  • POST events not relevant as aggregation worked well here.
  • PATCH events with peak and long-tail of 13.6 per second

After:

  • POST events not relevant as aggregation worked well here.
  • PATCH events with peak of 0.35 per second, long-tail: 0 per second

Which issue this PR fixes
fixes #47366

Special notes for your reviewer:
This was a significant problem in a Kubernetes 1.5 cluster we are running where events were co-located in a single etcd. It was normal for this cluster to have large numbers of unhealthy pods as well as denials by quota.

Release note:

add support for client-side spam filtering of events

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2017
@derekwaynecarr derekwaynecarr changed the title Add client side event spam filtering WIP: Add client side event spam filtering Jun 12, 2017
@derekwaynecarr derekwaynecarr added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jun 12, 2017
@derekwaynecarr derekwaynecarr added this to the v1.7 milestone Jun 12, 2017
@derekwaynecarr
Member Author

Marking do-not-merge until I add a unit test.

While I think server-side spam filtering is important, this prevents our internal agents from abusing the apiserver. We have experienced a significant amount of abuse in a few clusters, both intentional (miners creating replica sets with huge replica counts) and unintentional (user pods in constant crash-loop backoff). In either scenario, if an event keeps recurring, TTL doesn't help, and we need to reduce the frequency of traffic from our agents. This client-side spam detection does that.

/cc @eparis @sjenning @liggitt @smarterclayton

@derekwaynecarr
Member Author

This is a production stability problem for us right now, so I am marking the 1.7 milestone.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 12, 2017
@smarterclayton
Contributor

Rationale:

  1. In production clusters we see large numbers (hundreds) of event sources that persist for multiple days, with only 5-second interruptions.
  2. The difference between an event you've seen once, and 5 times, is significant. The difference between an event you've seen 6100 times and 6200 times is not.
  3. Uncontrolled pathological event spam is surprisingly common from infra components, so client side normalization is reasonable.
  4. Setting a maximum rate of recurrence is equivalent to pod backoff.

@@ -362,11 +471,11 @@ func NewEventCorrelator(clock clock.Clock) *EventCorrelator {

// EventCorrelate filters, aggregates, counts, and de-duplicates all incoming events
func (c *EventCorrelator) EventCorrelate(newEvent *v1.Event) (*EventCorrelateResult, error) {
aggregateEvent, ckey := c.aggregator.EventAggregate(newEvent)
Contributor

How expensive is EventAggregate? I assume that's why it was lower before - maybe we should filter twice, instead of once?

Member Author

EventAggregate is cheap. We had it lower before because I had not thought through spam detection well enough when it was first written. We need spam filtering to run after the event we plan to send to the server has been aggregated, so the filter operates on the aggregated event itself.
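A minimal sketch of that ordering, using simplified stand-in types rather than client-go's real EventCorrelator API; the aggregate and allow callbacks are assumptions for illustration only:

```go
package main

import "fmt"

// event is a simplified stand-in for v1.Event.
type event struct {
	key     string // source + involved object
	message string
}

// correlate aggregates first and only then charges the spam budget, so the
// filter operates on the event that would actually be sent to the apiserver.
func correlate(e event, aggregate func(event) event, allow func(string) bool) (event, bool) {
	agg := aggregate(e)
	if !allow(agg.key) {
		return event{}, false // over budget: drop the event entirely
	}
	return agg, true
}

func main() {
	identity := func(e event) event { return e }
	alwaysAllow := func(string) bool { return true }
	if out, ok := correlate(event{key: "kubelet/pod-a", message: "BackOff"}, identity, alwaysAllow); ok {
		fmt.Println("send:", out.message)
	}
}
```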

Contributor

@sjenning sjenning left a comment

Looks good. Just some readability recommendations.

if interval > maxInterval {
	// enough time has transpired, we create a new record
	record = spamRecord{}
}
Contributor

This is somewhat confusing. I'd move L138-146 inside the if block at L134 and only overwrite record if interval < maxInterval. That way we don't create a spamRecord on L123, lose it on L135, and make another new one on L145.
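A hedged sketch of the restructuring being suggested; the cache lookup shape, the lastUpdated field name, and maxInterval value are assumptions inferred from the snippet above, not the PR's actual code:

```go
package main

import (
	"fmt"
	"time"
)

// spamRecord and the cache map are simplified stand-ins for the real types.
type spamRecord struct {
	lastUpdated time.Time
	count       int
}

const maxInterval = 2 * time.Minute // 120s, per the discussion below

// lookup reuses a cached record only while it is still within maxInterval;
// a stale entry is simply replaced by a fresh record on the next write, so
// there is no need to construct a record before the freshness check.
func lookup(cache map[string]spamRecord, key string, now time.Time) spamRecord {
	if cached, found := cache[key]; found && now.Sub(cached.lastUpdated) < maxInterval {
		return cached
	}
	return spamRecord{lastUpdated: now}
}

func main() {
	cache := map[string]spamRecord{
		"kubelet/pod-a": {lastUpdated: time.Now().Add(-10 * time.Minute), count: 7},
	}
	rec := lookup(cache, "kubelet/pod-a", time.Now())
	fmt.Println(rec.count) // stale entry: prints 0, a fresh record
}
```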

Contributor

Also, if the record is outside the interval, we should remove it from the cache. Unless the eventKey is the same for the new record we are adding, then it'll just be an overwrite. Looking at getEventKey(), it looks like the key would be the same for repeated records. Just need to check.


maxSyncInterval := time.Duration(f.syncIntervalInSeconds) * time.Second
syncInterval := now.Time.Sub(record.lastSynced.Time)
if syncInterval > maxSyncInterval {
Contributor

Invert the if condition and set filter = true below, and you can remove L158. Right now we are setting it to false, then true, then false again.
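A hedged sketch of the inverted form being suggested, under the assumption that filter = true means the event is suppressed; only lastSynced and the interval comparison come from the snippet above, the rest is illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// record is a simplified stand-in carrying only the field used here.
type record struct{ lastSynced time.Time }

// shouldFilter defaults to not filtering and only flips the flag while we
// are still inside the sync interval, instead of assigning it false, then
// true, then false again.
func shouldFilter(r record, now time.Time, maxSyncInterval time.Duration) bool {
	filter := false
	if now.Sub(r.lastSynced) <= maxSyncInterval {
		filter = true // still within the interval: suppress this event
	}
	return filter
}

func main() {
	r := record{lastSynced: time.Now().Add(-30 * time.Second)}
	fmt.Println(shouldFilter(r, time.Now(), time.Minute)) // true: suppressed
}
```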

@derekwaynecarr derekwaynecarr changed the title WIP: Add client side event spam filtering Add client side event spam filtering Jun 13, 2017
@derekwaynecarr derekwaynecarr removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jun 13, 2017
@derekwaynecarr
Member Author

derekwaynecarr commented Jun 13, 2017

@smarterclayton @sjenning -- updated per review comment.

The values chosen for spam filtering appear to work well in the pathological replication controller scenario. We may need to further tweak the interval when I look at some of the other worst offenders we encountered. We also NEED to stop a replication controller from going from 0 to 500 pods concurrently in every sync interval; while those requests do not cause a write when denied by quota, we need to stop making calls just to be told no.

Either way, for the pathological scenario:

kubectl create quota my-quota --hard=pods=1
kubectl run nginx --image=nginx --replicas=1000
...wait ~5 minutes...
kubectl get events
1m         7m          4807      nginx-3181653891         ReplicaSet                            Warning   FailedCreate              replicaset-controller   Error creating: pods "nginx-3181653891-" is forbidden: exceeded quota: my-quota, requested: pods=1, used: pods=1, limited: pods=1

Previously, this created 4807 PATCH requests to the server. With client-side spam filtering:

$ kubectl get --raw /metrics | grep summary_count | grep events
...
apiserver_request_latencies_summary_count{resource="events",subresource="",verb="PATCH"} 24

@eparis
Contributor

eparis commented Jun 13, 2017

Actual data we see in one 300-node cluster:

Event Type     Events/Sec
FailedSync     58.1
BackOff        50.5
FailedMount    7.5
FailedCreate   6.3
Unhealthy      3.4

But over that same time period, if I break FailedSync and BackOff down per metadata.name (instead of lumping them all together), we see a much less severe worst case. (Notice this is per minute while the above is per second.)

Event Type     Events/Pod/Min
FailedSync     4.8
BackOff        4.8

So, if I understand correctly, this would never hit either of the top 2 contributors to events in this cluster.

@derekwaynecarr
Member Author

I looked at more of the data that showed our event spam.

Nodes are prone to induce event spam:

  • A pod that always fails its readiness probe (but not its liveness probe) appeared to cause us ~9k probe failure events per day. At the default 10s interval, this PR reduces the amount of PATCH event traffic back to the apiserver for this failure event by a factor of 6.
  • A single pod reported a large number of "BackOff" events with reason "Back-off restarting failed docker container" 48k times over ~7d. The MaxContainerBackoff appears to be 300s. I think it's worth moving the maxInterval from 120s to 300s to ensure…

@derekwaynecarr
Member Author

derekwaynecarr commented Jun 13, 2017

@eparis -- I need to parse through more of the FailedSync events with you tomorrow to see what is possible, so I can try to have a good reproduction. I think the Unhealthy events, if tied to readiness probes (like the one I was looking at), should benefit from this PR.

For example, I am seeing a ~6x reduction in traffic with:

apiVersion: v1
kind: Pod
metadata:
  name: goproxy
  labels:
    app: goproxy
spec:
  containers:
  - name: goproxy
    image: gcr.io/google_containers/goproxy:0.1
    ports:
    - containerPort: 8080
    readinessProbe:
      tcpSocket:
        port: 8900
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      tcpSocket:
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20

@derekwaynecarr
Member Author

In other news:

$ kubectl run bad --image=busybox --command -- derekisbad

gives you a number of terrifically horrible events reported back now with the CRI.

Dissecting them more, it's hard to know what to do here when there are many similar bad pods. Deleting the pods in crash-loop backoff just causes more of them to come from the backing job. The FailedSync events are the trickiest.

The start (or failed start) of any pod can cause a lot of events.

Events:
  FirstSeen	LastSeen	Count	From			SubObjectPath		Type		Reason			Message
  ---------	--------	-----	----			-------------		--------	------			-------
  1m		1m		1	kubelet, 127.0.0.1				Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "default-token-1h1m4" 
  1m		1m		1	default-scheduler				Normal		Scheduled		Successfully assigned bad-2463393876-v8b31 to 127.0.0.1
  1m		1m		1	kubelet, 127.0.0.1	spec.containers{bad}	Normal		Created			Created container with id b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358
  1m		1m		1	kubelet, 127.0.0.1				Warning		FailedSync		Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  1m	1m	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358 with error: rpc error: code = 2 desc = failed to start container "b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  1m	1m	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id 882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302 with error: rpc error: code = 2 desc = failed to start container "882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  1m	1m	1	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Created		Created container with id 882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302
  1m	1m	1	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "882d555cf0460b1850265a4f0ee66152f908898329319a66a5ac18e3b8b9f302": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  57s	57s	1	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Created		Created container with id 30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a
  57s	57s	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id 30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a with error: rpc error: code = 2 desc = failed to start container "30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  57s	57s	1	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "30c23189223a5c3ac14b47006398b5098616f5798b7a0403786b7094c154337a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  42s	42s	1	kubelet, 127.0.0.1		Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with CrashLoopBackOff: "Back-off 20s restarting failed container=bad pod=bad-2463393876-v8b31_default(1d5ee9b4-4fee-11e7-a69c-c85b76cda386)"

  1m	30s	4	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Pulling		pulling image "busybox"
  1m	29s	4	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Pulled		Successfully pulled image "busybox"
  29s	29s	1	kubelet, 127.0.0.1	spec.containers{bad}	Normal	Created		Created container with id 20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a
  29s	29s	1	kubelet, 127.0.0.1	spec.containers{bad}	Warning	Failed		Failed to start container with id 20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a with error: rpc error: code = 2 desc = failed to start container "20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}
  29s	29s	1	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with rpc error: code = 2 desc = failed to start container "20a4498e1821e27f913f9b1685d79530735dab18150f0d731a9ce4117946374a": Error response from daemon: {"message":"invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"exec: \\\\\\\"derekisbad\\\\\\\": executable file not found in $PATH\\\"\\n\""}: "Start Container Failed"

  42s	2s	4	kubelet, 127.0.0.1	spec.containers{bad}	Warning	BackOff		Back-off restarting failed container
  28s	2s	3	kubelet, 127.0.0.1				Warning	FailedSync	Error syncing pod, skipping: failed to "StartContainer" for "bad" with CrashLoopBackOff: "Back-off 40s restarting failed container=bad pod=bad-2463393876-v8b31_default(1d5ee9b4-4fee-11e7-a69c-c85b76cda386)"

A number of those events just suck. Also, I am not sure which users will really care about the container ID versus just the container name. And man, do those OCI errors stink coming back from runc...

I am going to see if I can clean this up more. That said, depending on what is in your pod, it's normal to get a fair number of events on pod start. One way to handle this is to try to get the kubelet itself to log fewer events if the pod has a large restart count.

I will wait for more sample data from @eparis to compare, as that cluster was also on Kubernetes 1.5 and may vary as well.

@dchen1107
Member

We have talked a couple of times about cleaning up those events, together with the logging spam from the kubelet, but never got around to it. Introducing a client-side event filter / aggregator before sending to the server is a good plan to me.

On the other hand, we should re-evaluate the events exposed by the kubelet today. IMHO, we fail to export a lot of valuable events to the users, for example system OOM kills.

@yujuhong
Contributor

A number of those events just suck. Also, I am not sure which users will really care about the container ID versus just the container name. And man, do those OCI errors stink coming back from runc...

We do report the container ID in the ContainerStatus though, and it is useful for debugging in some cases. Maybe we can trim the ID to make it shorter?
I'm not sure what we can do about the OCI errors other than some regex magic. It's ugly, though it does tell me where the error comes from (which is probably what is of interest to the users).
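For the trimming idea above, something like the 12-character short form used by the docker CLI would work; this is only a sketch, not existing kubelet code:

```go
package main

import "fmt"

// shortID truncates a full 64-character container ID to the familiar
// 12-character short form; the function name and the idea of applying it
// to event messages are assumptions, not an existing kubelet helper.
func shortID(id string) string {
	if len(id) > 12 {
		return id[:12]
	}
	return id
}

func main() {
	fmt.Println(shortID("b95f26009dfbbb4a9889a4274f6fef35bf6cd3ef7b2bac4040584ccc0ac51358"))
	// Output: b95f26009dfb
}
```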

@dchen1107
Member

re: #47367 (comment)

We just talked about this at sig-node. Can we push the containerID to logging, not into the event? Events are for the user, not necessarily for debuggers like us.

@yujuhong
Contributor

We just talked about this at sig-node. Can we push the containerID to logging, not into the event? Events are for the user, not necessarily for debuggers like us.

Then why do we even report them in the ContainerStatus? :-)

@derekwaynecarr
Member Author

@caesarxuchao - this PR is active, and I think the referenced API does not sufficiently fix the situation as I see it, but I have not had sufficient time to fully verify that theory.

/test pull-kubernetes-e2e-gce-etcd3

@derekwaynecarr
Member Author

@caesarxuchao - the new Event API will help in the long run, but the existing API will exist for an extended period of time. As a result, I think this change should go in to help protect clusters that will continue to use the existing API. For reference, we have deployed this in a large OpenShift cluster that had previously suffered from pathological event abuse, and it has made a significant improvement to cluster stability.

@mml
Contributor

mml commented Aug 2, 2017

Can we guard this feature behind a flag that is off by default, and then remove this code once the new event API is sufficiently widely rolled out?

@derekwaynecarr
Member Author

@mml - guarding this behind something like a feature gate is fine in the interim. It lets clusters with the largest scale concerns enable it if need be. Let me see if there is an easy way to plumb that through.

@mml
Contributor

mml commented Aug 16, 2017

@derekwaynecarr Awesome, thanks.

@derekwaynecarr
Member Author

@mml -- it is non-trivial to plumb a kube feature gate at this level of the stack, as it seems like bad practice for client-go to reference a kube feature gate. I do not see a compelling reason for this not to be the default behavior, and it is consistent with the conversations/comments discussed here (https://docs.google.com/document/d/13BeJlrEcJhSKgsHOHWmHdJGqXrjSm_o9XxOtwcN6yNg/edit).

@liggitt
Member

liggitt commented Aug 24, 2017

it seems like bad practice for client-go to reference a kube feature gate.

agree

I do not see a compelling reason for this not to be the default behavior, and it is consistent with the conversations/comments discussed here

Also agree. This is an evolution of a feature already present and intended to prevent event overload, which was not filtering sufficiently.

@smarterclayton
Contributor

Given agreement and that this is a production issue for large clusters, I'm inclined to approve. Given a choice between write exhaustion and dropped events, I'd move to dropped events.

/approve

based on criteria and general consensus on the thread

@smarterclayton
Contributor

However, I would still like to see an lgtm from the reviewer.

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 31, 2017
@mml
Contributor

mml commented Aug 31, 2017

My concern with this on by default is that there is no easy workaround if it really does create a problem, and such problems will be hard to find with test plans. AFAICT, we don't even expose knobs for the burst or refill rate.

I would really like us to prioritize making this more configurable or less necessary in 1.9. Can we please file followup bugs about that now?

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 31, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, mml, smarterclayton

Associated issue: 47366

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@smarterclayton
Contributor

Agree this needs a bit more of a knob. There is an issue for per-client, server-side rate limiting that is being worked on: #50925. As for soak, we've been running this in production for a month now on 4 large clusters and have not yet observed lost events (although we do deserve to file an AAR on this that evaluates outgoing vs. limited events). This reduced event traffic to a negligible concern, since the vast majority of events were redundant and repeating.

@dims
Member

dims commented Sep 4, 2017

/test all

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@dims
Member

dims commented Sep 4, 2017

/retest

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot
Contributor

k8s-ci-robot commented Sep 4, 2017

@derekwaynecarr: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kops-aws b62fa1d link /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 870406b into kubernetes:master Sep 4, 2017
k8s-github-robot pushed a commit that referenced this pull request Oct 13, 2017
Automatic merge from submit-queue (batch tested with PRs 51840, 53542, 53857, 53831, 53702). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

kubelet sync pod throws more detailed events

**What this PR does / why we need it**:
If there are errors in the kubelet sync pod iteration, it is difficult to determine the problem.

This provides more specific events for errors that occur in the syncPod iteration to help perform problem isolation.

Fixes #53900

**Special notes for your reviewer**:
It is safer to dispatch more specific events now that we have an event budget per object enforced via #47367

**Release note**:
```release-note
kubelet provides more specific events when unable to sync pod
```
k8s-github-robot pushed a commit that referenced this pull request Dec 15, 2017
Automatic merge from submit-queue (batch tested with PRs 56308, 54304, 56364, 56388, 55853). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Send events on certain service repair controller errors

**What this PR does / why we need it**:

This PR enables sending events when the api-server service IP and port allocator repair controllers find an error repairing a cluster ip or a port respectively.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes #54303

**Special notes for your reviewer**:

In case of an error, events will be emitted [every 3 minutes](https://github.com/kubernetes/kubernetes/blob/master/pkg/master/controller.go#L93) for each failed Service. Even so, event spam protection has been merged (#47367) to mitigate the risk of excessive events.

**Release note**:

```release-note
api-server provides specific events when unable to repair a service cluster ip or node port
```
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

event spam causes excessive apiserver traffic and frequent snapshots