grpc-proxy has resolver-ttl broken since 3.4.28 #17887
Comments
File for reproducing the issue
I could replicate this with any version >= 3.4.28, as @RodionGork stated. Another way of replicating this issue is with a
Then, running:
After the 10s timeout, it still shows the two hosts. I think the issue comes from this backport: e61f1d8
The right implementation is the following patch:
diff --git a/proxy/grpcproxy/cluster.go b/proxy/grpcproxy/cluster.go
index 338827d46..cd25e1867 100644
--- a/proxy/grpcproxy/cluster.go
+++ b/proxy/grpcproxy/cluster.go
@@ -105,9 +105,9 @@ func (cp *clusterProxy) monitor(wc endpoints.WatchChannel) {
 			for _, up := range updates {
 				switch up.Op {
 				case endpoints.Add:
-					cp.umap[up.Endpoint.Addr] = up.Endpoint
+					cp.umap[up.Key] = up.Endpoint
 				case endpoints.Delete:
-					delete(cp.umap, up.Endpoint.Addr)
+					delete(cp.umap, up.Key)
 				}
 			}
 			cp.umu.Unlock()
@@ -162,12 +162,12 @@ func (cp *clusterProxy) membersFromUpdates() ([]*pb.Member, error) {
 	cp.umu.RLock()
 	defer cp.umu.RUnlock()
 	mbs := make([]*pb.Member, 0, len(cp.umap))
-	for addr, upt := range cp.umap {
+	for _, upt := range cp.umap {
 		m, err := decodeMeta(fmt.Sprint(upt.Metadata))
 		if err != nil {
 			return nil, err
 		}
-		mbs = append(mbs, &pb.Member{Name: m.Name, ClientURLs: []string{addr}})
+		mbs = append(mbs, &pb.Member{Name: m.Name, ClientURLs: []string{upt.Addr}})
 	}
 	return mbs, nil
 }
After applying this patch, I confirmed that the issue is no longer reproducible. My question to @ahrtr / @serathius is whether we want to write an integration or e2e test along with the fix. If we do, can you advise which one it should be?
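To make the effect of the patch easier to see, here is a minimal, self-contained sketch that compares the two keying strategies. The types `Op`, `Endpoint`, and `Update`, the helper names `applyByAddr` and `applyByKey`, and the sample keys and addresses are simplified stand-ins invented for illustration, not the real etcd types. The sketch also assumes, as the issue report suggests, that a delete update arrives with its `Key` set but with an empty `Addr`.

```go
package main

import "fmt"

// Op, Endpoint and Update are hypothetical, simplified stand-ins for the
// types used by the proxy; they exist only for this illustration.
type Op int

const (
	Add Op = iota
	Delete
)

type Endpoint struct {
	Addr string
}

type Update struct {
	Op       Op
	Key      string
	Endpoint Endpoint
}

// applyByAddr mirrors the pre-patch logic: entries are keyed by Endpoint.Addr.
func applyByAddr(m map[string]Endpoint, ups []Update) {
	for _, up := range ups {
		switch up.Op {
		case Add:
			m[up.Endpoint.Addr] = up.Endpoint
		case Delete:
			// With an empty Addr this deletes the "" key, which is a no-op.
			delete(m, up.Endpoint.Addr)
		}
	}
}

// applyByKey mirrors the patched logic: entries are keyed by Update.Key.
func applyByKey(m map[string]Endpoint, ups []Update) {
	for _, up := range ups {
		switch up.Op {
		case Add:
			m[up.Key] = up.Endpoint
		case Delete:
			delete(m, up.Key)
		}
	}
}

func main() {
	ups := []Update{
		{Op: Add, Key: "proxy/a", Endpoint: Endpoint{Addr: "10.0.0.1:23790"}},
		{Op: Add, Key: "proxy/b", Endpoint: Endpoint{Addr: "10.0.0.2:23790"}},
		// Assumption: the delete update carries the key but no address.
		{Op: Delete, Key: "proxy/b"},
	}

	byAddr := map[string]Endpoint{}
	byKey := map[string]Endpoint{}
	applyByAddr(byAddr, ups)
	applyByKey(byKey, ups)

	fmt.Println("members when keyed by Addr:", len(byAddr))
	fmt.Println("members when keyed by Key: ", len(byKey))
}
```

Under these assumptions the map keyed by address keeps both entries, while the map keyed by the update key drops the removed proxy, which matches the behaviour described in this thread.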
Thanks @RodionGork for raising this issue. @ivanvc we need to backport #15835 to 3.4; it is basically the same as your proposal above. Previously, the PR wasn't backported to 3.4 because 3.4 had a different implementation of the gRPC name resolver/load balancer; refer to #15835 (comment). That reason is no longer valid now that #16800 is merged.
/assign I'll backport it tomorrow.
Opened PR #17896. I'll update the CHANGELOG once it's merged.
Friends, thanks for the speedy reaction (and sorry for the delay on my side). Your explanations and patch definitely help!
Bug report criteria
What happened?
When using several instances of etcd grpc-proxy start ..., if one of them goes down, this is supposed to be reflected in etcdctl member list after the interval specified by --resolver-ttl elapses. This was broken by changes introduced in v3.4.28 (it worked quite well before).
Please kindly see the reproducing steps below; with the attached docker-compose file it should be easy.
Cause
My colleague Peter Zhu identified the problem:
Here in cluster.go we monitor additions and deletions of members; in particular, here is the deletion:
https://github.com/etcd-io/etcd/blob/v3.4.28/proxy/grpcproxy/cluster.go#L110
This function has changed slightly since v3.4.27, but the general sense is the same.
The real culprit seemingly is where the event is issued:
https://github.com/etcd-io/etcd/blob/v3.4.28/clientv3/naming/endpoints/endpoints_impl.go#L155
As you can see, iup.Addr is not set here, though it is used a few lines below, unless we are mistaken.
What did you expect to happen?
After the TTL elapses, all other proxies should show the updated list, without the node that was shut down.
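For the list to shrink as expected, the delete notification has to identify the entry it removes. The following sketch models the producer side described under "Cause": a put event carries a JSON value from which the address can be decoded, while a delete event carries only the key. The `event` and `update` types, the `toUpdate` helper, and the sample key and address are hypothetical and greatly simplified; they are not the actual code in endpoints_impl.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a hypothetical, simplified watch event: deletes carry a key
// but no value.
type event struct {
	isDelete bool
	key      string
	value    []byte
}

// update is a hypothetical, simplified endpoint update.
type update struct {
	Delete bool
	Key    string
	Addr   string
}

// toUpdate converts a watch event into an update. For deletes there is no
// value to decode, so Addr stays empty and only Key identifies the entry.
func toUpdate(ev event) (update, error) {
	up := update{Delete: ev.isDelete, Key: ev.key}
	if ev.isDelete {
		return up, nil
	}
	var payload struct {
		Addr string
	}
	if err := json.Unmarshal(ev.value, &payload); err != nil {
		return update{}, err
	}
	up.Addr = payload.Addr
	return up, nil
}

func main() {
	events := []event{
		{key: "proxy/b", value: []byte(`{"Addr":"10.0.0.2:23790"}`)},
		{isDelete: true, key: "proxy/b"},
	}
	for _, ev := range events {
		up, err := toUpdate(ev)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("delete=%v key=%q addr=%q\n", up.Delete, up.Key, up.Addr)
	}
}
```

In this model a consumer that looks entries up by address cannot match the delete update to anything, whereas the key is present on both the put and the delete.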
How can we reproduce it (as minimally and precisely as possible)?
Consider the attached file, which contains a simple docker-based setup to reproduce the issue:
docker-compose up - it will launch the etcd-main container and two proxies, etcd-proxy-a and etcd-proxy-b
Now enter one of the proxies and list advertised members:
Enter the other proxy (preferably in a different terminal window) and kill it (or shut it down in some other way, it doesn't matter):
Switch back to the window where the first proxy runs and retry listing members. It will still report 2 proxies, even after the TTL of 30 seconds elapses (it is specified in the compose file).
Anything else we need to know?
No response
Etcd version (please run commands below)
v3.4.28
Etcd configuration (command line flags or environment variables)
nothing special
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
No response
Relevant log output
No response