etcd watch events starvation / lost multiplexed on a single watch stream #17529

chaochn47 · 2024-03-04T22:36:46Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via security@etcd.io.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

Watch events starvation

When there are hundreds of watch initialized on a key range with prevKV option and multiplexed on the same watch stream, the watch response could be delayed as long as several minutes. And the out of sync watcher was not cancelled by providing a WatchResponse with CompactRevision not equal to 0 first.

It was also observed etcd_debugging_mvcc_events_total on one etcd server were accumulated at

etcd_debugging_mvcc_pending_events_total <etcd-0> 8.740897e+06 and

there were 722 slow watchers.

etcd_debugging_mvcc_slow_watcher_total <etcd-0> 722

Watch events lost

moveVictims is to synchronize slow watchers to send out pending events in the watchableStore. The provider of victim watchers are either from synced watcher in notify or syncWatcher background routine due to the <-sws.watchStream.Chan() is clogged and the pending events could not be sent out. The reason of this blocked channel could be the consumer of this channel is slow and cannot kept up with the events generator.

It seems like eb could be lost due to this newVictims only carries the last wb's eb for one specified watch. But this theory needs to be further confirmed with additional logging.

[Edited: The above theory is not correct since victims are added either from synced watchers or unsynced watchers, they are mutually exclusive..]

Environment set up

The etcd clients are running on 2 VMs but they somehow connects the same etcd server but handled by different gRPC server streams. The client runs compaction every minute and the revision in the compact request is one minute old one cached by last compaction response. There are 3 etcd servers.

Log and metrics

I added some custom logging to print the watches that only have progress notify enabled which is what k8s watch cache reflector would use in ListAndWatch. The other pod watches were observed not having progress notify enabled.

It is obvious that the pod deletion event was sent out at 02:22:03 in one watchStream while the same event was sent out at 02:24:53 in the other watchStream.

sh-4.2$ sudo grep "server watch stream sent DELETE event" /var/log/etcd.log | grep "simple-djbbj-28490538-txh64"
{
    "level":"info",
    "ts":"2024-03-03T02:22:03.189388Z",
    "caller":"v3rpc/watch.go:494",
    "msg":"server watch stream sent DELETE event",
    "deleted-object-key":"/registry/pods/kube-stress/simple-djbbj-28490538-txh64",
    "watch-id":0,
    "progress-notify-initialized":true
}
{
    "level":"info",
    "ts":"2024-03-03T02:24:53.568948Z",
    "caller":"v3rpc/watch.go:494",
    "msg":"server watch stream sent DELETE event",
    "deleted-object-key":"/registry/pods/kube-stress/simple-djbbj-28490538-txh64",
    "watch-id":0,
    "progress-notify-initialized":true
}

# Total number of unsynced slow watchers.

etcd_debugging_mvcc_slow_watcher_total <etcd-0> 722
etcd_debugging_mvcc_slow_watcher_total <etcd-1> 0
etcd_debugging_mvcc_slow_watcher_total <etcd-2> 0

# Total number of watchers.

etcd_debugging_mvcc_watcher_total <etcd-0> 3679
etcd_debugging_mvcc_watcher_total <etcd-1> 1468
etcd_debugging_mvcc_watcher_total <etcd-2> 1468

# Total number of watch streams.

etcd_debugging_mvcc_watch_stream_total <etcd-0> 35
etcd_debugging_mvcc_watch_stream_total <etcd-1> 35
etcd_debugging_mvcc_watch_stream_total <etcd-2> 35

# Total number of events sent by this member.

etcd_debugging_mvcc_events_total <etcd-0> 9.28455158e+08 
etcd_debugging_mvcc_events_total <etcd-1> 2.980666813e+09
etcd_debugging_mvcc_events_total <etcd-2> 2.980702668e+09

# Total number of pending events to be sent.

etcd_debugging_mvcc_pending_events_total <etcd-0> 8.740897e+06
etcd_debugging_mvcc_pending_events_total <etcd-1> 17423
etcd_debugging_mvcc_pending_events_total <etcd-2> 11792

What did you expect to happen?

Based on the millions of accumulated pending events on server watchStream, I would like to understand the bottleneck on consuming the watch events, especially on sendLoop which runs on a single thread processing hundreds of watches event generation.

What's the maximum limit of number of watches per watchStream supports? We should clearly document this limit and shard the upcoming watchers to another watchStream if the limit is breached.

How can we reproduce it (as minimally and precisely as possible)?

I don't have a well-written test case to simulate the observed clogged pending events and we should be able to replicate the traffic and have a reproduce with the etcd test framework.

Here are the steps I followed by the reported issue in k8s.

Create a 1.27+ k8s cluster with the largest control plane instance instance type you can have
Launch a 500 worker nodes and wait for them to be ready
Create a 2000 cronjobs to generate object churn (translates to key create, update, delete in etcd), make sure updating your controller manager configuration to keep up with the progress of cronjob.

--kube-api-qps=400
--kube-api-burst=400
--concurrent-job-syncs=400
--concurrent-cron-job-syncs=400
--concurrent-gc-syncs=400

Deploy falco daemon set with version 3.6.0 which is a C++ based library that triggers direct etcd watch.

helm install falco falcosecurity/falco \
     --version 3.6.0 \
    --create-namespace \
    --namespace falco \
    --values falco-chart-values.yaml

Scale the worker nodes from 500 to 800

Anything else we need to know?

Follow up of kubernetes/kubernetes#123072 and possibly related to kubernetes/kubernetes#123448

Etcd version (please run commands below)

$ etcd --version
3.5.10

</details>


### Etcd configuration (command line flags or environment variables)

<details>

### etcd-0
/usr/bin/etcd --name <etcd-0> --data-dir /mnt/xvds/datadir --wal-dir /mnt/xvds/datadir/member/wal --snapshot-count 10000 --initial-cluster-token c5145f33-2671-41ed-a320-25ce17fbb3d3 --advertise-client-urls http://<etcd-0-IP>:2379 --listen-client-urls http://0.0.0.0:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://<etcd-0-IP>:2380 --metrics extensive --quota-backend-bytes 10485760000 --max-request-bytes 1572864 --grpc-keepalive-interval 2h0m0s --pre-vote=true --logger zap --experimental-watch-progress-notify-interval 5s --initial-cluster-state new --initial-cluster <etcd-0>=http://<etcd-0-IP>:2380,<etcd-1>=http://<etcd-1-IP>:2380,<etcd-2>=http://<etcd-2-IP>:2380 --log-outputs /var/log/etcd.log --log-rotation-config-json {"maxsize": 100, "maxage": 0, "maxbackups": 0, "localtime": false, "compress": false}

### etcd-1
/usr/bin/etcd --name <etcd-1> --data-dir /mnt/xvds/datadir --wal-dir /mnt/xvds/datadir/member/wal --snapshot-count 10000 --initial-cluster-token c5145f33-2671-41ed-a320-25ce17fbb3d3 --advertise-client-urls http://<etcd-1-IP>:2379 --listen-client-urls http://0.0.0.0:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://<etcd-1-IP>:2380 --metrics extensive --quota-backend-bytes 10485760000 --max-request-bytes 1572864 --grpc-keepalive-interval 2h0m0s --pre-vote=true --logger zap --experimental-watch-progress-notify-interval 5s --initial-cluster-state new --initial-cluster <etcd-0>=http://<etcd-0-IP>:2380,<etcd-1>=http://<etcd-1-IP>:2380,<etcd-2>=http://<etcd-2-IP>:2380 --log-outputs /var/log/etcd.log --log-rotation-config-json {"maxsize": 100, "maxage": 0, "maxbackups": 0, "localtime": false, "compress": false}

### etcd-2
/usr/bin/etcd --name <etcd-2> --data-dir /mnt/xvds/datadir --wal-dir /mnt/xvds/datadir/member/wal --snapshot-count 10000 --initial-cluster-token c5145f33-2671-41ed-a320-25ce17fbb3d3 --advertise-client-urls http://<etcd-2-IP>:2379 --listen-client-urls http://0.0.0.0:2379 --listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://<etcd-2-IP>:2380 --metrics extensive --quota-backend-bytes 10485760000 --max-request-bytes 1572864 --grpc-keepalive-interval 2h0m0s --pre-vote=true --logger zap --experimental-watch-progress-notify-interval 5s --initial-cluster-state new --initial-cluster <etcd-0>=http://<etcd-0-IP>:2380,<etcd-1>=http://<etcd-1-IP>:2380,<etcd-2>=http://<etcd-2-IP>:2380 --log-outputs /var/log/etcd.log --log-rotation-config-json {"maxsize": 100, "maxage": 0, "maxbackups": 0, "localtime": false, "compress": false}

</details>

### Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

<details>

```console
$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

The text was updated successfully, but these errors were encountered:

chaochn47 · 2024-03-04T22:44:42Z

May I get some early feedback on this issue if the analysis makes some sense to you? @serathius @ahrtr @fuweid @tjungblu @siyuanfoundation @jmhbnz @MadhavJivrajani while I am still working on the actual root cause.

/cc k8s folks @mengqiy @shyamjvs @dims

chaochn47 · 2024-03-06T01:36:04Z

I have a reliable repro using etcd test framework alone without k8s. #17535...

chaochn47 · 2024-03-06T07:35:08Z

Please use #17535 for reproduce. Here is an example.

etcd server log

{
    "level": "info",
    "ts": "2024-03-06T07:21:38.691929Z",
    "caller": "mvcc/watchable_store.go:371",
    "msg": "watcher has skipped events",
    "watcher-next-rev-to-accept": 2726,
    "watcher-min-rev-of-next-events-batch-to-sent": 184694,
    "watch-id": 0
}

Note: The simulated watch cache client is the first watcher so its watch id should be 0. Also I am using the customized key range start key /registry/pod to locate the watcher on server side.

var watchKeyPrefix = "/registry/pods/"
var watchCacheWatchKeyPrefix = "/registry/pod"

Test log

   2723    2024-03-06T07:21:25.354Z   watch-cache     got watch response      {"event-type": "DELETE", "key": "/registry/pods/17/7", "rev": 2724}
   2724     2024-03-06T07:21:25.354Z  watch-cache     got watch response      {"event-type": "PUT", "key": "/registry/pods/35/7", "rev": 2725}
   2725     2024-03-06T07:21:38.695Z  watch-cache     got watch response      {"event-type": "DELETE", "key": "/registry/pods/64/463", "rev": 184694}
   2726     2024-03-06T07:21:38.695Z  watch-cache     got watch response      {"event-type": "PUT", "key": "/registry/pods/59/463", "rev": 184695}

The problem is:

If the shared watch response channel is full and this watcher next event revision is compacted, this slow/unsync'd watcher would skip all the events pending to send out due to this logic.

etcd/server/mvcc/watcher_group.go

Lines 253 to 260 in 4a5e9d1

    
           if w.minRev < compactRev { 
        
           	select { 
        
           	case w.ch <- WatchResponse{WatchID: w.id, CompactRevision: compactRev}: 
        
           		w.compacted = true 
        
           		wg.delete(w) 
        
           	default: 
        
           		// retry next time 
        
           	}

So I would propose prioritize the watch response with CompactRevision is set over the other watch responses so client is guaranteed to observe this watch terminal error instead of some lost events. This approach aligns with the documented watch API semantics.

Compact_Revision - set to the minimum historical revision available to etcd if a watcher tries watching at a compacted revision. This happens when creating a watcher at a compacted revision or the watcher cannot catch up with the progress of the key-value store. The watcher will be canceled; creating new watches with the same start_revision will fail.

ref. https://etcd.io/docs/v3.5/learning/api/#watch-streams

If this approach makes sense to you, I will create a PR to fix it.

chaochn47 · 2024-03-06T07:39:34Z

Regarding why the channel of watchStream is full, it could be due to consumer of the channel is slower than provider. In this case, it is watchStream sendLoop.

I have dumped the CPU profile of the etcd process, it turns out with enough QPS of mutation requests, the most time-consuming part in the sendLoop (which is single threaded) is prevKV read. In contrast, gRPC send is pretty quick (around 10ms)

The read transaction buffer copy takes half a second if I interpret the graph correctly.

serathius · 2024-03-06T09:14:52Z

Expect copying transaction buffer comes from WithPrevKV #16839 which we have already seen causing problems.

fuweid · 2024-03-06T09:35:34Z

Thanks @chaochn47 added comment in your reproduce pull request #17535 (comment)

So I would propose prioritize the watch response with CompactRevision is set over the other watch responses so client is guaranteed to observe this watch terminal error instead of some lost events.

Agree. Based on my local test, at least we should skip to update minRev if the watcher needs to send ErrCompacted.

// server/mvcc/watchable_store.go

 package mvcc


func (s *watchableStore) syncWatchers() int {
       // ...
       for w := range wg.watchers {
                if w.minRev < compactionRev {
                      // skip it and retry it later since w.ch is full now
                      continue
                }
		w.minRev = curRev + 1
                // ...
       }
}

gRPC send is pretty quick (around 10ms)

gRPC.Send just puts the data in the buffer~.

ahrtr · 2024-03-06T14:56:21Z

If the shared watch response channel is full and this watcher next event revision is compacted, this slow/unsync'd watcher would skip all the events pending to send out due to this logic.

Do you mean it may cause lost of event? Is it possible to create a simple test (probably leverage failpoint?) to reproduce this?

Regarding the etcd watch events starvation due to high load, did not get time to dig it so far. It's a performance enhancement. Per my immediate feeling, it isn't a low hang fruit.

serathius · 2024-03-06T15:37:32Z

I think the underlying issue is not performance based, but the issue of the slow watchers never being dropped as mentioned in kubernetes/kubernetes#123448. A single watcher can just open a watch to all keys and never read a event. This would lead etcd to buffer gigabytes of memory, allowing clients to DOS the etcd.

The performance comes only because of reproductions needing to work in sensible time. I think the issue happens also with low throughput but takes hours instead of minutes.

fuweid · 2024-03-06T16:02:39Z

Do you mean it may cause lost of event?

the syncWatcher updated the w.minRev to the value that is higher than compactionRev.
The syncWatcher doesn't have chance to send compactRevision when channel is full.
And then next round, the w.minRev is higher than compactionRev so that syncWatcher won't send compactRevision again. The current logic doesn't retry to resend compactRevision to client. The watch stream should be canceled but it's still valid.

#17535 already reproduces the issue.

chaochn47 · 2024-03-06T17:40:42Z

the syncWatcher updated the w.minRev to the value that is higher than compactionRev.
The syncWatcher doesn't have chance to send compactRevision when channel is full.
And then next round, the w.minRev is higher than compactionRev so that syncWatcher won't send compactRevision again. The current logic doesn't retry to resend compactRevision to client. The watch stream should be canceled but it's still valid.

Right. The slow watcher should have been cancelled but we coded like it has caught up with the progress of store..

There are two issues.

First is watch starvation for hundreds of watcher multiplexed in the same gRPC stream with prevKV enabled.
Second is the cascading failure caused by the first one, it causes the slow watcher fails to tell client it needs to be dropped if its progress cannot keep up with etcd store progress and also the slow watcher skips events. It would cause cache built on top of etcd watch contains already "DELETED" data, which is APIServer watchcache lost events kubernetes/kubernetes#123072.

To resolve first issue, either shard watchers to another gRPC stream (which is done by kubernetes/kubernetes#123532) or improve the performance of stream watch multiplex with prevKV enabled (tracked in #16839).

To resolve second issue, we should prioritize sending out compacted watch response to client and tracked by current report.

ahrtr · 2024-03-16T11:23:25Z

Reopen to track the backport efforts.

likakuli · 2024-03-20T03:13:25Z

etcd/server/etcdserver/api/v3rpc/watch.go

Lines 382 to 391 in 4a5e9d1

    
           for i := range evs { 
        
           	events[i] = &evs[i] 
        
           	if needPrevKV { 
        
           		opt := mvcc.RangeOptions{Rev: evs[i].Kv.ModRevision - 1} 
        
           		r, err := sws.watchable.Range(evs[i].Kv.Key, nil, opt) 
        
           		if err == nil && len(r.KVs) != 0 { 
        
           			events[i].PrevKv = &(r.KVs[0]) 
        
           		} 
        
           	} 
        
           }

Just have some questions about the current logic, can correct me if my understanding is incorrect @fuweid @chaochn47

why not consider the corner case there? e.g. err!=nil || len(r.KVs)==0
why do Range here? if there are 500 watch requests for same key with PrevKV required, then seems the Range logic here will be executed 500 times for a event. why not do this in the notify function of watchableStore

serathius · 2024-03-21T07:31:40Z

@likakuli Please see #16839 for the answer

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

chaochn47 added the type/bug label Mar 4, 2024

This was referenced Mar 7, 2024

Watch starvation can cause OOMs #16839

Open

Fix watch event loss #17555

Merged

serathius pinned this issue Mar 13, 2024

serathius added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 13, 2024

ahrtr closed this as completed in #17555 Mar 16, 2024

ahrtr reopened this Mar 16, 2024

This was referenced Mar 18, 2024

Define an official performance validation suite for etcd #16467

Open

[release-3.4] backport fix watch event loss after compaction #17555 #17610

Merged

[release-3.5] backport fix watch event loss after compaction #17612

Merged

serathius added a commit to serathius/etcd that referenced this issue Apr 1, 2024

Draft reproduce issue etcd-io#17529

a0b9888

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius mentioned this issue Apr 1, 2024

Reproduce #17529 in robustness tests #17680

Draft

serathius added a commit to serathius/etcd that referenced this issue Apr 7, 2024

Draft reproduce issue etcd-io#17529

b8adab8

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

serathius added a commit to serathius/etcd that referenced this issue Apr 7, 2024

Draft reproduce issue etcd-io#17529

1a02e04

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcd watch events starvation / lost multiplexed on a single watch stream #17529

etcd watch events starvation / lost multiplexed on a single watch stream #17529

chaochn47 commented Mar 4, 2024 •

edited

chaochn47 commented Mar 4, 2024 •

edited

chaochn47 commented Mar 6, 2024 •

edited

chaochn47 commented Mar 6, 2024 •

edited

chaochn47 commented Mar 6, 2024 •

edited

serathius commented Mar 6, 2024

fuweid commented Mar 6, 2024 •

edited

ahrtr commented Mar 6, 2024

serathius commented Mar 6, 2024

fuweid commented Mar 6, 2024

chaochn47 commented Mar 6, 2024 •

edited

ahrtr commented Mar 16, 2024

likakuli commented Mar 20, 2024 •

edited

serathius commented Mar 21, 2024

etcd watch events starvation / lost multiplexed on a single watch stream #17529

etcd watch events starvation / lost multiplexed on a single watch stream #17529

Comments

chaochn47 commented Mar 4, 2024 • edited

Bug report criteria

What happened?

Watch events starvation

Watch events lost

Environment set up

Log and metrics

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Etcd version (please run commands below)

Relevant log output

chaochn47 commented Mar 4, 2024 • edited

chaochn47 commented Mar 6, 2024 • edited

chaochn47 commented Mar 6, 2024 • edited

etcd server log

Test log

chaochn47 commented Mar 6, 2024 • edited

serathius commented Mar 6, 2024

fuweid commented Mar 6, 2024 • edited

ahrtr commented Mar 6, 2024

serathius commented Mar 6, 2024

fuweid commented Mar 6, 2024

chaochn47 commented Mar 6, 2024 • edited

ahrtr commented Mar 16, 2024

likakuli commented Mar 20, 2024 • edited

serathius commented Mar 21, 2024

chaochn47 commented Mar 4, 2024 •

edited

chaochn47 commented Mar 4, 2024 •

edited

chaochn47 commented Mar 6, 2024 •

edited

chaochn47 commented Mar 6, 2024 •

edited

chaochn47 commented Mar 6, 2024 •

edited

fuweid commented Mar 6, 2024 •

edited

chaochn47 commented Mar 6, 2024 •

edited

likakuli commented Mar 20, 2024 •

edited