
grpc-proxy has resolver-ttl broken since 3.4.28 #17887

Closed

RodionGork opened this issue Apr 26, 2024 · 6 comments · Fixed by #17916

@RodionGork

What happened?

When using several instances of etcd grpc-proxy start ..., if one of them goes down, this is supposed to be reflected in etcdctl member list after the interval specified by --resolver-ttl elapses. This was broken by changes introduced in v3.4.28 (it worked well before).

Please see the reproduction steps below; with the attached docker-compose file it should be easy to reproduce.
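For context, this is roughly how the resolver registration is supposed to behave (a simplified sketch using the clientv3 naming/endpoints API; the prefix, key, and addresses are illustrative, not the actual grpc-proxy code): each proxy registers its advertised client URL under the resolver prefix with a lease of --resolver-ttl seconds and keeps that lease alive, so when a proxy dies the key expires and the other proxies should drop it from their member list.

// Simplified sketch (assumed, not the actual grpc-proxy code): a proxy
// registers its advertised client URL under the resolver prefix with a
// lease of --resolver-ttl seconds and keeps the lease alive while it runs.
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/naming/endpoints"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-main:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Lease TTL matches --resolver-ttl (30 seconds in the compose file).
	lease, err := cli.Grant(context.TODO(), 30)
	if err != nil {
		log.Fatal(err)
	}

	em, err := endpoints.NewManager(cli, "etcd-discovery")
	if err != nil {
		log.Fatal(err)
	}
	// Key and Addr are illustrative values for etcd-proxy-a.
	err = em.AddEndpoint(context.TODO(),
		"etcd-discovery/etcd-proxy-a:2379",
		endpoints.Endpoint{Addr: "http://etcd-proxy-a:2379"},
		clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// While the proxy is alive it keeps the lease alive; once it dies, the
	// key expires after the TTL and the other proxies should observe a
	// Delete update and remove the member from their lists.
	ch, kaerr := cli.KeepAlive(context.TODO(), lease.ID)
	if kaerr != nil {
		log.Fatal(kaerr)
	}
	for range ch {
	}
}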

Cause

My colleague Peter Zhu identified the problem:

In cluster.go we monitor additions and deletions of members; the deletion handling in particular is here:
https://github.com/etcd-io/etcd/blob/v3.4.28/proxy/grpcproxy/cluster.go#L110
This function has changed slightly since v3.4.27, but the general idea is the same.

The real culprit seems to be where the event is issued:
https://github.com/etcd-io/etcd/blob/v3.4.28/clientv3/naming/endpoints/endpoints_impl.go#L155

Unless we are mistaken, iup.Addr is not set here, even though it is used a few lines below.
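To illustrate the suspected failure mode, here is a minimal standalone sketch (not the actual proxy code; the key and address values are made up): if the proxy keys its endpoint map by Endpoint.Addr, but the Delete update arrives with only Key populated and an empty Addr, the delete becomes a no-op and the dead proxy stays in the list.

// Minimal sketch of the suspected failure mode (not the actual proxy code):
// the map is keyed by Endpoint.Addr, but Delete updates arrive with an
// empty Addr, so the delete is a no-op and the stale member lingers.
package main

import (
	"fmt"

	"go.etcd.io/etcd/clientv3/naming/endpoints"
)

func main() {
	umap := make(map[string]endpoints.Endpoint)

	// Add update: Addr is populated, so the entry is stored under it.
	add := endpoints.Update{
		Op:       endpoints.Add,
		Key:      "etcd-discovery/etcd-proxy-b:2379",
		Endpoint: endpoints.Endpoint{Addr: "http://etcd-proxy-b:2379"},
	}
	umap[add.Endpoint.Addr] = add.Endpoint

	// Delete update as observed in v3.4.28: Key is set, but Endpoint.Addr
	// is empty, so this deletes umap[""] instead of the real entry.
	del := endpoints.Update{
		Op:  endpoints.Delete,
		Key: "etcd-discovery/etcd-proxy-b:2379",
	}
	delete(umap, del.Endpoint.Addr)

	fmt.Println(len(umap)) // prints 1: the dead proxy is still listed
}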

What did you expect to happen?

After the TTL elapses, all other proxies should show an updated member list, without the node that was shut down.

How can we reproduce it (as minimally and precisely as possible)?

Consider the attached file, which contains a simple Docker-based setup to reproduce the issue:

  • Unpack the zip file; it contains a Dockerfile and compose.yaml.
  • Run docker-compose up; it will launch an etcd-main container and two proxies, etcd-proxy-a and etcd-proxy-b.

Now enter one of the proxies and list advertised members:

docker exec -it etcd-etcd-proxy-a-1 /bin/bash
root@6b1e5e88e4b9:/# /etcdctl member list

0, started, 72c909135c09, , http://etcd-proxy-b:2379, false
0, started, 6b1e5e88e4b9, , http://etcd-proxy-a:2379, false

Enter the other proxy (preferably in a different terminal window) and kill it (or shut it down in any other way, it doesn't matter):

docker exec -it etcd-etcd-proxy-b-1 /bin/bash
root@72c909135c09:/# kill 1

Switch back to the window where the first proxy runs and list the members again. It will still report two proxies, even after the TTL of 30 seconds (specified in the compose file) elapses.
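A quick way to confirm where the stale data lives is to check the etcd backend directly (a diagnostic sketch; the resolver prefix "etcd-discovery" is an assumed illustrative value, use whatever --resolver-prefix your setup passes): after the TTL elapses, the lease-attached key of the dead proxy is gone from etcd itself, yet the surviving proxy still reports it, which points at the proxy's in-memory endpoint map rather than at the store.

// Diagnostic sketch (assumed prefix "etcd-discovery"): list the keys under
// the resolver prefix to see what the backend actually holds.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-main:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	resp, err := cli.Get(context.TODO(), "etcd-discovery/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
	// Expected after the TTL: only the surviving proxy's key remains here,
	// even though the proxy's member list still shows both.
}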

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.4.28

Etcd configuration (command line flags or environment variables)

nothing special

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response

@RodionGork
Author

File for reproducing the issue

etcd-test.zip

@ivanvc
Member

ivanvc commented Apr 28, 2024

I could replicate this with any version >= 3.4.28, as @RodionGork stated. Another way to replicate this issue is with a Procfile containing:

etcd1: bin/etcd --listen-client-urls=http://0.0.0.0:2379 --listen-peer-urls=http://0.0.0.0:2380 --advertise-client-urls=http://127.0.0.1:2379 --initial-advertise-peer-urls=http://127.0.0.1:2380 --initial-cluster=default=http://127.0.0.1:2380
etcd2: bin/etcd grpc-proxy start --endpoints=127.0.0.1:2379 --listen-addr=0.0.0.0:22379 --advertise-client-url=http://127.0.0.1:22379 --resolver-prefix=etcd-discovery --resolver-ttl=10
etcd3: bin/etcd grpc-proxy start --endpoints=127.0.0.1:2379 --listen-addr=0.0.0.0:23379 --advertise-client-url=http://127.0.0.1:23379 --resolver-prefix=etcd-discovery --resolver-ttl=10

Then, running:

$ goreman run stop etcd3
$ ./bin/etcdctl --endpoints=127.0.0.1:22379 mem l
0, started, <hostname>, , http://127.0.0.1:22379, false
0, started, <hostname>, , http://127.0.0.1:23379, false

After the 10s timeout, it still shows the two hosts.

I think the issue comes from this backport: e61f1d8

The right implementation is the following patch:

diff --git a/proxy/grpcproxy/cluster.go b/proxy/grpcproxy/cluster.go
index 338827d46..cd25e1867 100644
--- a/proxy/grpcproxy/cluster.go
+++ b/proxy/grpcproxy/cluster.go
@@ -105,9 +105,9 @@ func (cp *clusterProxy) monitor(wc endpoints.WatchChannel) {
                        for _, up := range updates {
                                switch up.Op {
                                case endpoints.Add:
-                                       cp.umap[up.Endpoint.Addr] = up.Endpoint
+                                       cp.umap[up.Key] = up.Endpoint
                                case endpoints.Delete:
-                                       delete(cp.umap, up.Endpoint.Addr)
+                                       delete(cp.umap, up.Key)
                                }
                        }
                        cp.umu.Unlock()
@@ -162,12 +162,12 @@ func (cp *clusterProxy) membersFromUpdates() ([]*pb.Member, error) {
        cp.umu.RLock()
        defer cp.umu.RUnlock()
        mbs := make([]*pb.Member, 0, len(cp.umap))
-       for addr, upt := range cp.umap {
+       for _, upt := range cp.umap {
                m, err := decodeMeta(fmt.Sprint(upt.Metadata))
                if err != nil {
                        return nil, err
                }
-               mbs = append(mbs, &pb.Member{Name: m.Name, ClientURLs: []string{addr}})
+               mbs = append(mbs, &pb.Member{Name: m.Name, ClientURLs: []string{upt.Addr}})
        }
        return mbs, nil
 }

After applying this patch, I confirmed that the issue is no longer reproducible. My question to @ahrtr / @serathius is whether we want to write an integration or e2e test along with the fix. If we do, could you advise which one it should be?
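For reference, here is a rough sketch of what such a test might assert (hypothetical; an actual test would use etcd's test framework, a running cluster, and a clean resolver prefix): a Delete update from the endpoints watch carries the Key the endpoint was registered under, which is exactly what the patched code now keys the map on.

// Hypothetical test sketch: Delete updates from the endpoints watch carry
// the registration Key, so keying the proxy's map by Key is reliable.
package naming_test

import (
	"context"
	"testing"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/naming/endpoints"
)

func TestDeleteUpdateCarriesKey(t *testing.T) {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		t.Fatal(err)
	}
	defer cli.Close()

	em, err := endpoints.NewManager(cli, "etcd-discovery")
	if err != nil {
		t.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	wch, err := em.NewWatchChannel(ctx)
	if err != nil {
		t.Fatal(err)
	}

	key := "etcd-discovery/127.0.0.1:23379"
	if err := em.AddEndpoint(ctx, key, endpoints.Endpoint{Addr: "http://127.0.0.1:23379"}); err != nil {
		t.Fatal(err)
	}
	<-wch // consume the Add update (assumes no other keys under the prefix)

	if err := em.DeleteEndpoint(ctx, key); err != nil {
		t.Fatal(err)
	}
	ups := <-wch
	if len(ups) != 1 || ups[0].Op != endpoints.Delete || ups[0].Key != key {
		t.Fatalf("expected Delete update for %q, got %+v", key, ups)
	}
}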

@ahrtr
Member

ahrtr commented Apr 28, 2024

Thanks @RodionGork for raising this issue.

@ivanvc we need to backport #15835 to 3.4; it's basically the same as your proposal above. Previously, the reason the PR wasn't backported to 3.4 was that 3.4 had a different implementation of the gRPC name resolver/load balancer; refer to #15835 (comment). But that reason is no longer valid now that #16800 has been merged.

@ivanvc
Member

ivanvc commented Apr 29, 2024

/assign

I'll backport it tomorrow.

@ivanvc
Member

ivanvc commented Apr 29, 2024

Opened PR #17896. I'll update the CHANGELOG once it's merged.

@RodionGork
Author

Friends, thanks for the speedy reaction (and sorry for the delay on my side) - your explanations and patch definitely help!
