
grpc-proxy has resolver-ttl broken since 3.4.28 #17887

Closed

RodionGork opened this issue Apr 26, 2024 · 6 comments · Fixed by #17916

@RodionGork

What happened?

When using several instances of etcd grpc-proxy start ..., if one of them goes down, this is supposed to be reflected in etcdctl member list after the interval specified by --resolver-ttl elapses. This was broken by changes introduced in v3.4.28 (it worked well before).

Please see the reproduction steps below; with the attached docker-compose file it should be easy to reproduce.
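For context, this is roughly how the resolver registration is supposed to behave (a simplified sketch using the clientv3 naming/endpoints API; the prefix, key, and addresses are illustrative, not the actual grpc-proxy code): each proxy registers its advertised client URL under the resolver prefix with a lease of --resolver-ttl seconds and keeps that lease alive, so when a proxy dies the key expires and the other proxies should drop it from their member list.

// Simplified sketch (assumed, not the actual grpc-proxy code): a proxy
// registers its advertised client URL under the resolver prefix with a
// lease of --resolver-ttl seconds and keeps the lease alive while it runs.
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/naming/endpoints"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-main:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Lease TTL matches --resolver-ttl (30 seconds in the compose file).
	lease, err := cli.Grant(context.TODO(), 30)
	if err != nil {
		log.Fatal(err)
	}

	em, err := endpoints.NewManager(cli, "etcd-discovery")
	if err != nil {
		log.Fatal(err)
	}
	// Key and Addr are illustrative values for etcd-proxy-a.
	err = em.AddEndpoint(context.TODO(),
		"etcd-discovery/etcd-proxy-a:2379",
		endpoints.Endpoint{Addr: "http://etcd-proxy-a:2379"},
		clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// While the proxy is alive it keeps the lease alive; once it dies, the
	// key expires after the TTL and the other proxies should observe a
	// Delete update and remove the member from their lists.
	ch, kaerr := cli.KeepAlive(context.TODO(), lease.ID)
	if kaerr != nil {
		log.Fatal(kaerr)
	}
	for range ch {
	}
}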

Cause

My colleague Peter Zhu identified the problem:

In cluster.go we monitor additions and deletions of members; the deletion handling in particular is here:
https://github.com/etcd-io/etcd/blob/v3.4.28/proxy/grpcproxy/cluster.go#L110
This function has changed slightly since v3.4.27, but the general idea is the same.

The real culprit seems to be where the event is issued:
https://github.com/etcd-io/etcd/blob/v3.4.28/clientv3/naming/endpoints/endpoints_impl.go#L155

Unless we are mistaken, iup.Addr is not set here, even though it is used a few lines below.
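To illustrate the suspected failure mode, here is a minimal standalone sketch (not the actual proxy code; the key and address values are made up): if the proxy keys its endpoint map by Endpoint.Addr, but the Delete update arrives with only Key populated and an empty Addr, the delete becomes a no-op and the dead proxy stays in the list.

// Minimal sketch of the suspected failure mode (not the actual proxy code):
// the map is keyed by Endpoint.Addr, but Delete updates arrive with an
// empty Addr, so the delete is a no-op and the stale member lingers.
package main

import (
	"fmt"

	"go.etcd.io/etcd/clientv3/naming/endpoints"
)

func main() {
	umap := make(map[string]endpoints.Endpoint)

	// Add update: Addr is populated, so the entry is stored under it.
	add := endpoints.Update{
		Op:       endpoints.Add,
		Key:      "etcd-discovery/etcd-proxy-b:2379",
		Endpoint: endpoints.Endpoint{Addr: "http://etcd-proxy-b:2379"},
	}
	umap[add.Endpoint.Addr] = add.Endpoint

	// Delete update as observed in v3.4.28: Key is set, but Endpoint.Addr
	// is empty, so this deletes umap[""] instead of the real entry.
	del := endpoints.Update{
		Op:  endpoints.Delete,
		Key: "etcd-discovery/etcd-proxy-b:2379",
	}
	delete(umap, del.Endpoint.Addr)

	fmt.Println(len(umap)) // prints 1: the dead proxy is still listed
}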

What did you expect to happen?

After the TTL elapses, all other proxies should show an updated member list, without the node that was shut down.

How can we reproduce it (as minimally and precisely as possible)?

Consider the attached file, which contains a simple Docker-based setup to reproduce the issue:

  • Unpack the zip file; it contains a Dockerfile and compose.yaml.
  • Run docker-compose up; it will launch an etcd-main container and two proxies, etcd-proxy-a and etcd-proxy-b.

Now enter one of the proxies and list advertised members:

docker exec -it etcd-etcd-proxy-a-1 /bin/bash
root@6b1e5e88e4b9:/# /etcdctl member list

0, started, 72c909135c09, , http://etcd-proxy-b:2379, false
0, started, 6b1e5e88e4b9, , http://etcd-proxy-a:2379, false

Enter the other proxy (preferably in a different terminal window) and kill it (or shut it down in any other way, it doesn't matter):

docker exec -it etcd-etcd-proxy-b-1 /bin/bash
root@72c909135c09:/# kill 1

Switch back to the window where the first proxy runs and list the members again. It will still report two proxies, even after the TTL of 30 seconds (specified in the compose file) elapses.
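A quick way to confirm where the stale data lives is to check the etcd backend directly (a diagnostic sketch; the resolver prefix "etcd-discovery" is an assumed illustrative value, use whatever --resolver-prefix your setup passes): after the TTL elapses, the lease-attached key of the dead proxy is gone from etcd itself, yet the surviving proxy still reports it, which points at the proxy's in-memory endpoint map rather than at the store.

// Diagnostic sketch (assumed prefix "etcd-discovery"): list the keys under
// the resolver prefix to see what the backend actually holds.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://etcd-main:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	resp, err := cli.Get(context.TODO(), "etcd-discovery/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
	// Expected after the TTL: only the surviving proxy's key remains here,
	// even though the proxy's member list still shows both.
}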

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.4.28

Etcd configuration (command line flags or environment variables)

nothing special

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response

@RodionGork
Author

File for reproducing the issue

etcd-test.zip

@ivanvc
Member

ivanvc commented Apr 28, 2024

I could replicate this with any version >= 3.4.28, as @RodionGork stated. Another way to replicate this issue is with a Procfile containing:

etcd1: bin/etcd --listen-client-urls=http://0.0.0.0:2379 --listen-peer-urls=http://0.0.0.0:2380 --advertise-client-urls=http://127.0.0.1:2379 --initial-advertise-peer-urls=http://127.0.0.1:2380 --initial-cluster=default=http://127.0.0.1:2380
etcd2: bin/etcd grpc-proxy start --endpoints=127.0.0.1:2379 --listen-addr=0.0.0.0:22379 --advertise-client-url=http://127.0.0.1:22379 --resolver-prefix=etcd-discovery --resolver-ttl=10
etcd3: bin/etcd grpc-proxy start --endpoints=127.0.0.1:2379 --listen-addr=0.0.0.0:23379 --advertise-client-url=http://127.0.0.1:23379 --resolver-prefix=etcd-discovery --resolver-ttl=10

Then, running:

$ goreman run stop etcd3
$ ./bin/etcdctl --endpoints=127.0.0.1:22379 mem l
0, started, <hostname>, , http://127.0.0.1:22379, false
0, started, <hostname>, , http://127.0.0.1:23379, false

After the 10s timeout, it still shows the two hosts.

I think the issue comes from this backport: e61f1d8

The right implementation is the following patch:

diff --git a/proxy/grpcproxy/cluster.go b/proxy/grpcproxy/cluster.go
index 338827d46..cd25e1867 100644
--- a/proxy/grpcproxy/cluster.go
+++ b/proxy/grpcproxy/cluster.go
@@ -105,9 +105,9 @@ func (cp *clusterProxy) monitor(wc endpoints.WatchChannel) {
                        for _, up := range updates {
                                switch up.Op {
                                case endpoints.Add:
-                                       cp.umap[up.Endpoint.Addr] = up.Endpoint
+                                       cp.umap[up.Key] = up.Endpoint
                                case endpoints.Delete:
-                                       delete(cp.umap, up.Endpoint.Addr)
+                                       delete(cp.umap, up.Key)
                                }
                        }
                        cp.umu.Unlock()
@@ -162,12 +162,12 @@ func (cp *clusterProxy) membersFromUpdates() ([]*pb.Member, error) {
        cp.umu.RLock()
        defer cp.umu.RUnlock()
        mbs := make([]*pb.Member, 0, len(cp.umap))
-       for addr, upt := range cp.umap {
+       for _, upt := range cp.umap {
                m, err := decodeMeta(fmt.Sprint(upt.Metadata))
                if err != nil {
                        return nil, err
                }
-               mbs = append(mbs, &pb.Member{Name: m.Name, ClientURLs: []string{addr}})
+               mbs = append(mbs, &pb.Member{Name: m.Name, ClientURLs: []string{upt.Addr}})
        }
        return mbs, nil
 }

After applying this patch, I confirmed that the issue is no longer reproducible. My question to @ahrtr / @serathius is whether we want to write an integration or e2e test along with the fix. If we do, could you advise which one it should be?
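For reference, here is a rough sketch of what such a test might assert (hypothetical; an actual test would use etcd's test framework, a running cluster, and a clean resolver prefix): a Delete update from the endpoints watch carries the Key the endpoint was registered under, which is exactly what the patched code now keys the map on.

// Hypothetical test sketch: Delete updates from the endpoints watch carry
// the registration Key, so keying the proxy's map by Key is reliable.
package naming_test

import (
	"context"
	"testing"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/clientv3/naming/endpoints"
)

func TestDeleteUpdateCarriesKey(t *testing.T) {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		t.Fatal(err)
	}
	defer cli.Close()

	em, err := endpoints.NewManager(cli, "etcd-discovery")
	if err != nil {
		t.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	wch, err := em.NewWatchChannel(ctx)
	if err != nil {
		t.Fatal(err)
	}

	key := "etcd-discovery/127.0.0.1:23379"
	if err := em.AddEndpoint(ctx, key, endpoints.Endpoint{Addr: "http://127.0.0.1:23379"}); err != nil {
		t.Fatal(err)
	}
	<-wch // consume the Add update (assumes no other keys under the prefix)

	if err := em.DeleteEndpoint(ctx, key); err != nil {
		t.Fatal(err)
	}
	ups := <-wch
	if len(ups) != 1 || ups[0].Op != endpoints.Delete || ups[0].Key != key {
		t.Fatalf("expected Delete update for %q, got %+v", key, ups)
	}
}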

@ahrtr
Member

ahrtr commented Apr 28, 2024

Thanks @RodionGork for raising this issue.

@ivanvc we need to backport #15835 to 3.4; it's basically the same as your proposal above. Previously, the reason the PR wasn't backported to 3.4 was that 3.4 had a different implementation of the gRPC name resolver/load balancer; refer to #15835 (comment). But that reason is no longer valid now that #16800 has been merged.

@ivanvc
Member

ivanvc commented Apr 29, 2024

/assign

I'll backport it tomorrow.

@ivanvc
Member

ivanvc commented Apr 29, 2024

Opened PR #17896. I'll update the CHANGELOG once it's merged.

@RodionGork
Author

Friends, thanks for the speedy reaction (and sorry for the delay on my side) - your explanations and patch definitely help!
