
Fix remove member failed. #17793

Merged
merged 4 commits into etcd-io:main on Apr 30, 2024
Conversation

mneverov
Contributor

@mneverov mneverov commented Apr 15, 2024

Fixes #17653

Why did the test fail?
The quorum was not reached because one member was "down".

Why was one node "down"?
It wasn't, actually. The node connected outside the HealthInterval:
2024-03-25T17:44:33.8754342Z {"Time":"2024-03-25T17:44:33.830596472Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/common","Test":"TestMemberRemove/StrictReconfigCheck/WaitForQuorum/PeerAutoTLS","Output":"/home/runner/work/etcd/etcd/bin/etcd (TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0) (58753): {\"level\":\"info\",\"ts\":\"2024-03-25T17:44:33.825435Z\",\"caller\":\"rafthttp/peer_status.go:53\",\"msg\":\"peer became active\",\"peer-id\":\"c61f33e98087dec2\"}\r\n"}

The remove command was issued more than the 5-second interval later:

2024-03-25T17:44:39.1538189Z {"Time":"2024-03-25T17:44:39.13219045Z","Action":"output","Package":"go.etcd.io/etcd/tests/v3/common","Test":"TestMemberRemove/StrictReconfigCheck/WaitForQuorum/PeerAutoTLS","Output":"/home/runner/work/etcd/etcd/bin/etcdctl (/home/runner/work/etcd/etcd/bin/etcdctl_--endpoints=http://localhost:20000_member_list_-w_json) (58781):
{\"header\":
  {
    \"cluster_id\":5772825840910905659, \"member_id\":47555275972384877, \"raft_term\":2},
    \"members\":[
      {\"ID\":47555275972384877,\"name\":\"TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0\",\"peerURLs\":[\"https://localhost:20001\"],\"clientURLs\":[\"http://localhost:20000\"]},
      {\"ID\":14276186421764546242,\"name\":\"TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2\",\"peerURLs\":[\"https://localhost:20011\"],\"clientURLs\":[\"http://localhost:20010\"]},
      {\"ID\":15493022389670793866,\"name\":\"TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1\",\"peerURLs\":[\"https://localhost:20006\"],\"clientURLs\":[\"http://localhost:20005\"]}
    ]
  }\r\n"
}

Another log message:
"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum","requested-member-remove":"c61f33e98087dec2","active-peers":2.

The proposed fix consists of three parts:

  1. Reuse isConnectedToQuorumSince:
  • The other usage of isConnectedToQuorumSince passes all the cluster members, not only the voting members. I wonder which is correct. In the case of the current test it doesn't matter, since all members are joined with isLearner: false.
  • The equation used to decide whether quorum is reached is different. Consider numConnectedSince = 2 and total members = 3 (the situation in the test):
    isConnectedToQuorumSince: 2 >= (3/2)+1 == 2 >= 2 == true, i.e. quorum is reached,
    whereas the current equation (active - 1) < 1+((len(m)-1)/2) gives: 2 - 1 < 1 + ((3-1)/2) == 1 < 1 + 1 == true, i.e. quorum is not reached.
  2. Use an additional pause, as in the snapshot member test.
  3. An esoteric case: when the check time is exactly `activeSince + HealthInterval`, then ... && t.Before(since) returns false, but it should be true.

Apart from that, since members can be gone at any moment, we could just add a retry in the test instead of all of the above :).

@k8s-ci-robot

Hi @mneverov. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lg.Warn(
	"rejecting member remove request; local member has not been connected to all peers, reconfigure breaks active quorum",
	zap.String("local-member-id", s.MemberID().String()),
	zap.String("requested-member-remove", id.String()),
-	zap.Int("active-peers", active),
+	zap.Int("active-peers", numConnectedSince(s.r.transport, since, s.MemberID(), s.cluster.Members())),
)
Contributor Author


Not a fan of this calculation: it repeats what is done inside isConnectedToQuorumSince, and the number of connected members may be different by the time of the second calculation.

Contributor


I don't think you can reuse isConnectedToQuorumSince here. The old check (active - 1) < 1+((len(m)-1)/2) ensures that a quorum can still be reached after removing this member; isConnectedToQuorumSince only ensures quorum at the current moment.
e.g. if numConnectedSince() = 2 for a 3-node cluster, the (active - 1) < 1+((len(m)-1)/2) check rejects the removal while isConnectedToQuorumSince == true.

Contributor Author


I see, thanks.

Signed-off-by: Max Neverov <neverov.max@gmail.com>
…es quorum after a member is deleted.

Signed-off-by: Max Neverov <neverov.max@gmail.com>
@mneverov mneverov changed the title Fix remove member failed. [WIP] Fix remove member failed. Apr 17, 2024
@serathius serathius marked this pull request as draft April 17, 2024 08:20
Signed-off-by: Max Neverov <neverov.max@gmail.com>
@mneverov
Contributor Author

The test framework waits until the etcd server process has started.
Before the embedded server serves clients (and logs the expected line), it publishes the new member via raft and notifies that the server is "ready". Peers then connect to each other via read and write streams, which updates the peerStatus.since field.
This happens asynchronously (in separate goroutines), so the test's 5-second sleep is not enough in some cases.
So the fix is simply to increase the sleep time, as is done in the integration test.

@mneverov mneverov changed the title [WIP] Fix remove member failed. Fix remove member failed. Apr 18, 2024
@mneverov mneverov marked this pull request as ready for review April 18, 2024 09:28
@serathius
Member

2024-04-18T10:24:26.8479182Z --- FAIL: TestMemberRemove (121.89s)
2024-04-18T10:24:26.8480235Z     before.go:36: Changing working directory to: /tmp/TestMemberRemove2003188041/001
2024-04-18T10:24:26.8481790Z     --- FAIL: TestMemberRemove/StrictReconfigCheck/WaitForQuorum/PeerAutoTLS (11.71s)
2024-04-18T10:24:26.8484055Z         logger.go:146: 2024-04-18T10:18:57.388Z	INFO	starting server...	{"name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2"}
2024-04-18T10:24:26.8491941Z         logger.go:146: 2024-04-18T10:18:57.388Z	INFO	starting server...	{"name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0"}
2024-04-18T10:24:26.8508584Z         logger.go:146: 2024-04-18T10:18:57.388Z	INFO	spawning process	{"args": ["/home/runner/actions-runner/_work/etcd/etcd/bin/etcd", "--name=TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2", "--listen-client-urls=http://localhost:20010", "--advertise-client-urls=http://localhost:20010", "--listen-peer-urls=https://localhost:20011", "--initial-advertise-peer-urls=https://localhost:20011", "--initial-cluster-token=new", "--data-dir", "/tmp/TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS2091114865/003", "--snapshot-count=10000", "--peer-auto-tls", "--initial-cluster-token=new", "--initial-cluster=TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0=https://localhost:20001,TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1=https://localhost:20006,TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2=https://localhost:20011", "--initial-cluster-state=new"], "working-dir": "/tmp/TestMemberRemove2003188041/001", "name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2", "environment-variables": ["ETCD_VERIFY=all", "EXPECT_DEBUG=true", "PATH=/home/runner/actions-runner/_work/_tool/go/1.22.2/arm64/bin:/home/runner/go/bin:/home/runner/actions-runner/_work/_tool/go/1.22.2/arm64/bin:/home/runner/.dotnet/tools:/home/runner/.cargo/bin:/home/runner/go/bin:/home/runner/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "ETCD_UNSUPPORTED_ARCH=arm64"]}
2024-04-18T10:24:26.8521844Z         logger.go:146: 2024-04-18T10:18:57.388Z	INFO	starting server...	{"name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1"}
2024-04-18T10:24:26.8538790Z         logger.go:146: 2024-04-18T10:18:57.388Z	INFO	spawning process	{"args": ["/home/runner/actions-runner/_work/etcd/etcd/bin/etcd", "--name=TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1", "--listen-client-urls=http://localhost:20005", "--advertise-client-urls=http://localhost:20005", "--listen-peer-urls=https://localhost:20006", "--initial-advertise-peer-urls=https://localhost:20006", "--initial-cluster-token=new", "--data-dir", "/tmp/TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS2091114865/002", "--snapshot-count=10000", "--peer-auto-tls", "--initial-cluster-token=new", "--initial-cluster=TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0=https://localhost:20001,TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1=https://localhost:20006,TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2=https://localhost:20011", "--initial-cluster-state=new"], "working-dir": "/tmp/TestMemberRemove2003188041/001", "name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1", "environment-variables": ["ETCD_VERIFY=all", "EXPECT_DEBUG=true", "PATH=/home/runner/actions-runner/_work/_tool/go/1.22.2/arm64/bin:/home/runner/go/bin:/home/runner/actions-runner/_work/_tool/go/1.22.2/arm64/bin:/home/runner/.dotnet/tools:/home/runner/.cargo/bin:/home/runner/go/bin:/home/runner/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "ETCD_UNSUPPORTED_ARCH=arm64"]}
2024-04-18T10:24:26.8565661Z         logger.go:146: 2024-04-18T10:18:57.388Z	INFO	spawning process	{"args": ["/home/runner/actions-runner/_work/etcd/etcd/bin/etcd", "--name=TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0", "--listen-client-urls=http://localhost:20000", "--advertise-client-urls=http://localhost:20000", "--listen-peer-urls=https://localhost:20001", "--initial-advertise-peer-urls=https://localhost:20001", "--initial-cluster-token=new", "--data-dir", "/tmp/TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS2091114865/001", "--snapshot-count=10000", "--peer-auto-tls", "--initial-cluster-token=new", "--initial-cluster=TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0=https://localhost:20001,TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1=https://localhost:20006,TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2=https://localhost:20011", "--initial-cluster-state=new"], "working-dir": "/tmp/TestMemberRemove2003188041/001", "name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0", "environment-variables": ["ETCD_VERIFY=all", "EXPECT_DEBUG=true", "PATH=/home/runner/actions-runner/_work/_tool/go/1.22.2/arm64/bin:/home/runner/go/bin:/home/runner/actions-runner/_work/_tool/go/1.22.2/arm64/bin:/home/runner/.dotnet/tools:/home/runner/.cargo/bin:/home/runner/go/bin:/home/runner/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "ETCD_UNSUPPORTED_ARCH=arm64"]}
2024-04-18T10:24:26.8578398Z         logger.go:146: 2024-04-18T10:19:00.355Z	INFO	started server.	{"name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-2", "pid": 82982}
2024-04-18T10:24:26.8581276Z         logger.go:146: 2024-04-18T10:19:00.355Z	INFO	started server.	{"name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-0", "pid": 82984}
2024-04-18T10:24:26.8584141Z         logger.go:146: 2024-04-18T10:19:00.365Z	INFO	started server.	{"name": "TestMemberRemoveStrictReconfigCheckWaitForQuorumPeerAutoTLS-test-1", "pid": 82983}
2024-04-18T10:24:26.8586384Z         testutil.go:56: ---> Test failed: test timed out after 7.02172588s, err: context deadline exceeded

Signed-off-by: Max Neverov <neverov.max@gmail.com>
@mneverov
Contributor Author

@serathius
According to the logs it took 3 seconds for etcd to start, then the test slept for 7 seconds, hence ExecuteUntil failed because the timeout was configured to be 10 seconds.
I increased the timeout.

@siyuanfoundation
Contributor

@mneverov Thanks for your work!
But I am still getting failures with the increased timeout. Have you tried to run the tests locally with --count=20 for example?

@mneverov
Contributor Author

> @mneverov Thanks for your work! But I am still getting failures with the increased timeout. Have you tried to run the tests locally with --count=20 for example?

@siyuanfoundation thanks for the suggestion, I have not. Will try and amend the PR

@mneverov
Contributor Author

@siyuanfoundation I ran the test on my machine with count=100 and timeout=3h and it didn't fail. Most of the time the tests finished in 10.5 seconds.
Is the error you get the same as in the issue? Could you please check that you ran the test from my branch?
My assumption is that hardware performance may affect the time to start the cluster, and since the timeout is defined for the whole test, a slow cluster start may lead to a context timeout.
In that case the context for the whole test (including cluster start) could be more generous, say 30 seconds, and testutils.ExecuteUntil could use a child context with its own timeout (14 seconds, as it is now in the current branch). WDYT?

@serathius
Member

/retest

@mneverov
Contributor Author

@siyuanfoundation actually, I just ran the test on main branch and it didn't fail either :(.
I used PASSES="e2e" ./scripts/test.sh -test.v -test.run TestMemberRemove -test.count=100 -test.failfast -test.timeout=3h.
Could you please attach logs from the failure on your machine?

@serathius
Member

/retest

@siyuanfoundation
Contributor

hmm... I cannot get the failure anymore. I cannot reproduce the failure on the main branch on my machine either. I guess we can just give it a try.

@ahrtr ahrtr merged commit d31caae into etcd-io:main Apr 30, 2024
44 checks passed
Development

Successfully merging this pull request may close these issues.

TestMemberRemove failed for linux-386-e2e
5 participants