New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix remove member failed. #17793
Fix remove member failed. #17793
Conversation
Hi @mneverov. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
server/etcdserver/server.go
Outdated
lg.Warn( | ||
"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum", | ||
zap.String("local-member-id", s.MemberID().String()), | ||
zap.String("requested-member-remove", id.String()), | ||
zap.Int("active-peers", active), | ||
zap.Int("active-peers", numConnectedSince(s.r.transport, since, s.MemberID(), s.cluster.Members())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a fan of this calculation because it repeats what is done inside isConnectedToQuorumSince
and the number of connected member may be different at the time of the second calculation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you can reuse isConnectedToQuorumSince
here. The old (active - 1) < 1+((len(m)-1)/2)
ensures that a quorum can still be reached after deleting this member. isConnectedToQuorumSince
only ensures quorum at the moment.
e.g. if numConnectedSince() = 2
for a 3 node cluster, (active - 1) < 1+((len(m)-1)/2) == false
while isConnectedToQuorumSince == true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks.
fed1bc0
to
67cca55
Compare
Signed-off-by: Max Neverov <neverov.max@gmail.com>
…es quorum after a member is deleted. Signed-off-by: Max Neverov <neverov.max@gmail.com>
77c9801
to
c64c996
Compare
Signed-off-by: Max Neverov <neverov.max@gmail.com>
The test framework waits until etcd server process is started. |
|
Signed-off-by: Max Neverov <neverov.max@gmail.com>
@serathius |
@mneverov Thanks for your work! |
@siyuanfoundation thanks for the suggestion, I have not. Will try and amend the PR |
@siyuanfoundation I ran the test on my machine with |
/retest |
@siyuanfoundation actually, I just ran the test on main branch and it didn't fail either :(. |
/retest |
hmm... I cannot get the failure anymore. I also cannot reproduce the failure on the main branch on my machine either. I guess we can just give it a try. |
Fixes #17653
Why the test failed?
The quorum was not reached because one member was "down".
Why was one node "down"?
It wasn't, actually. The node connected outside
HealthInterval
:2024-03-25T17:44:33.8754342Z {"Time":"2024-03-25T17:44:33.830596472Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/common","Test":"TestMemberRemove/StrictReconfigCheck/WaitForQuorum/PeerAutoTLS","Output":"/home/runner/work/etcd/etcd/bin/etcd (TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0) (58753): {\"level\":\"info\",\"ts\":\"2024-03-25T17:44:33.825435Z\",\"caller\":\"rafthttp/peer_status.go:53\",\"msg\":\"peer became active\",\"peer-id\":\"c61f33e98087dec2\"}\r\n"}
The remove command was issued later than 5 seconds interval:
Another log message:
"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum","requested-member-remove":"c61f33e98087dec2","active-peers":2
.Proposed fix consists of two parts:
isLearner: false
.isConnectedToQuorumSince
: 2 >= (3/2)+1 == 2 >= 2 == true, i.e. quorum is reachedwhereas current equation
(active - 1) < 1+((len(m)-1)/2)
: 2 - 1 < 1 + ((3-1)/2) == 1 < 1 + 1 == true, i.e. quorum is not reachedApart from that as members be gone at any moment we can just add retry in test instead of all the above :).