
Question about LeaseRead #166

Open
boringhello opened this issue Feb 15, 2024 · 9 comments
Labels
question Further information is requested

Comments


boringhello commented Feb 15, 2024

In the Raft paper, an optimization for Lease Read using clocks plus heartbeats is mentioned. When the leader sends a heartbeat, it first records a timestamp called "start". If a majority of the nodes respond to the heartbeat, the leader's lease is assumed to be valid until the time point "start + election timeout / clock drift bound".
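In code form, the paper's rule is roughly this (my own sketch, assuming import "time"; the names are illustrative, not from this repo):

// Illustrative sketch of the dissertation's lease rule. The leader records
// `start` before broadcasting heartbeats; once a majority acknowledges them,
// lease reads are safe until the expiry below. clockDriftBound > 1, e.g. 1.1
// if clocks are assumed to drift apart by at most 10%.
func leaseExpiry(start time.Time, electionTimeout time.Duration, clockDriftBound float64) time.Time {
    return start.Add(time.Duration(float64(electionTimeout) / clockDriftBound))
}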
But when I read the etcd LeaseRead code, it seems that LeaseRead depends on checkQuorum. If the leader receives heartbeat responses from a majority within the first half of an election_timeout round, MsgCheckQuorum will not step the leader down in that round; only in the next round does the leader step down to follower, because a majority of the nodes is no longer recentActive. In that case the lease seems to last 1.5 election_timeouts. This confused me.
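Concretely, the tick-side logic I'm reading looks roughly like this (a simplified paraphrase of raft.go, not the exact source):

// The leader only re-evaluates its quorum once per electionTimeout ticks.
func (r *raft) tickHeartbeat() {
    r.heartbeatElapsed++
    r.electionElapsed++
    if r.electionElapsed >= r.electionTimeout {
        r.electionElapsed = 0
        if r.checkQuorum {
            // Self-addressed message, handled by the leader's step function.
            r.Step(pb.Message{From: r.id, Type: pb.MsgCheckQuorum})
        }
    }
    // ... heartbeat broadcast elided ...
}

// On MsgCheckQuorum, the leader steps down only if a quorum was NOT marked
// RecentActive since the previous check, and the RecentActive flags (except
// the leader's own) are then reset for the next round. So a follower marked
// active just before a partition keeps the leader, and hence its lease,
// alive through one more full electionTimeout round.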


boringhello commented Feb 16, 2024

func TestLeaseRead(t *testing.T) {
    n1 := newTestRaft(1, 10, 1, newTestMemoryStorage(withPeers(1, 2, 3)))
    n2 := newTestRaft(2, 10, 1, newTestMemoryStorage(withPeers(1, 2, 3)))
    n3 := newTestRaft(3, 10, 1, newTestMemoryStorage(withPeers(1, 2, 3)))
    n1.checkQuorum = true
    n2.checkQuorum = true
    n3.checkQuorum = true
    n1.readOnly.option = ReadOnlyLeaseBased
    n2.readOnly.option = ReadOnlyLeaseBased
    n3.readOnly.option = ReadOnlyLeaseBased
    nt := newNetwork(n1, n2, n3)

    // Elect n1 as leader.
    nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})
    if n1.state != StateLeader {
        t.Fatalf("node 1 state: %s, want %s", n1.state, StateLeader)
    }
    if n2.state != StateFollower {
        t.Fatalf("node 2 state: %s, want %s", n2.state, StateFollower)
    }
    if n3.state != StateFollower {
        t.Fatalf("node 3 state: %s, want %s", n3.state, StateFollower)
    }

    // Replicate one entry (marking n2 and n3 RecentActive), then partition
    // the leader from both followers.
    nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("foo")}}})
    nt.cut(1, 2)
    nt.cut(1, 3)

    // A lease-based read is served locally, without contacting the quorum.
    n1.Step(pb.Message{From: 1, Type: pb.MsgReadIndex, Entries: []pb.Entry{{Data: []byte("123")}}})
    msgs := n1.readStates
    if len(msgs) != 1 {
        t.Errorf("len(msgs) = %d, want 1", len(msgs))
    }

    // Tick the partitioned leader past one election timeout (10 ticks). The
    // first MsgCheckQuorum passes because n2 and n3 were RecentActive before
    // the partition, so n1 keeps its lease until the second check at tick 20.
    for i := 0; i < 17; i++ {
        n1.tick()
    }
    // Tick the followers past their election timeout so they are no longer
    // inside the leader lease and may campaign or grant votes.
    for i := 0; i < n2.electionTimeout; i++ {
        n2.tick()
    }
    for i := 0; i < n3.electionTimeout; i++ {
        n3.tick()
    }

    // n2 campaigns and can win with n3's vote while n1 still holds its lease...
    nt.send(pb.Message{From: 2, To: 2, Type: pb.MsgHup})

    // ...yet n1 still serves another lease-based read.
    n1.Step(pb.Message{From: 1, Type: pb.MsgReadIndex, Entries: []pb.Entry{{Data: []byte("123")}}})
    msgs = n1.readStates
    if len(msgs) != 2 {
        t.Errorf("len(msgs) = %d, want 2", len(msgs))
    }
    if n2.state == StateLeader {
        t.Errorf("n2 became leader while n1 still held the lease")
    }
    if n3.state == StateLeader {
        t.Errorf("n3 became leader while n1 still held the lease")
    }
}

pav-kv added the question (Further information is requested) label on Feb 16, 2024
boringhello commented

The test code provided above is a reproduction of the issue I described: the observed result is that n2 became the leader while n1 still held the lease, allowing for a leaseRead operation. It is also possible that there is an issue with the test I wrote, as I am not familiar with this project. Can someone help me understand this question?


MrDXY commented Mar 11, 2024

Hey @boringhello, I found your question very interesting. I'm new to this topic, but I'll share my ideas since I'm also interested in it.
I believe the main issue is having two leaders serving at the same time. If we enable the leader lease and a network partition happens, it will take up to two election timeout durations for the leader to realize it has lost the quorum and step down to follower in the current etcd-raft.
[Timeline diagram: time1 → time2 → time3, with one ElectionTimeout between each pair of points.]

Let's look at it from n1's perspective:

  • At time1, n1, n2, and n3 are working fine, with n1 as the leader. However, just as n1 receives HeartBeatResponses from n2 and n3, a network partition occurs.
    • It takes the first ElectionTimeout to reach time2.
  • At time2, n1 marks n2 and n3 as inactive followers and sends them heartbeats, waiting for HeartBeatResponses to re-mark them as active.
    • It takes the second ElectionTimeout to reach time3.
  • At time3, n1 checks n2 and n3 and finds that they are still inactive. At this point, n1 realizes that it has lost the quorum.

Between time2 and time3, there is a window in which two different leaders can each serve lease reads. Even if you implement a new Lease Read algorithm that includes a clock drift bound and allows clients to keep track of the latest index corresponding to the results they have seen, the problem will still persist. This approach only reduces the occurrence of the problem; it doesn't fix it.

I agree that taking two ElectionTimeouts doesn't make much sense, and there is room for improvement. I'm not sure if my idea is right, but a possible solution is to track each follower's HeartbeatClock in the ProgressTracker. The HeartbeatClock increases with the leader's clock, and when we receive the follower's HeartBeatResponse, we reset that HeartbeatClock to 0. When a HeartbeatClock reaches the ElectionTimeout, we can mark the corresponding RecentActive as false. Before responding to a LeaseRead request, the leader needs to check QuorumActive first.
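A rough sketch of that idea (hypothetical code; none of these types or functions exist in the repo, and the names are mine):

// Each follower's progress gets a per-follower clock that advances with the
// leader's ticks and resets on every heartbeat response.
type Progress struct {
    RecentActive   bool
    HeartbeatClock int // leader ticks since this follower's last HeartBeatResponse
}

// Advance every follower's clock on each leader tick; a follower that has
// been silent for a full electionTimeout is no longer RecentActive.
func tickFollowerClocks(prs map[uint64]*Progress, self uint64, electionTimeout int) {
    for id, pr := range prs {
        if id == self {
            continue // the leader is always active from its own perspective
        }
        pr.HeartbeatClock++
        if pr.HeartbeatClock >= electionTimeout {
            pr.RecentActive = false
        }
    }
}

// Reset the clock whenever a HeartBeatResponse arrives.
func onHeartbeatResp(pr *Progress) {
    pr.RecentActive = true
    pr.HeartbeatClock = 0
}

// Before serving a lease-based read, the leader would first check that a
// quorum is still RecentActive, and otherwise reject or fall back to ReadIndex.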


boringhello commented Mar 13, 2024

Thanks for your idea. But it seems that if the response is delayed in the network, it may still be wrong. I am not sure why etcd raft is designed like this. Maybe this is a tradeoff.

boringhello commented

@pav-kv, can you have a look at this? It seems that the LeaseRead implementation has some issues.


pav-kv commented Mar 13, 2024

Last time I looked at leases in this repo, they did not appear correct/usable, or to provide an at-most-one-leaseholder guarantee. I don't have time to look at this in detail at the moment.

For leases to work properly, there needs to be some "global" logical time notion shared by all the nodes. There must be a guarantee that any logical timestamp is owned by at most one lease. Ticks are a local heuristic, and ticks are not synchronized across nodes, so it's hard to tell if these heuristics give any guarantees.
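To make the "at most one lease per timestamp" requirement concrete, here is an illustrative sketch (not code from this repo or from CRDB):

// Each lease covers a half-open interval [Start, Exp) of a shared logical clock.
type Lease struct {
    Holder     uint64
    Start, Exp uint64 // timestamps on the shared logical clock
}

// canGrant reports whether `next` can be granted after `prev` without any
// timestamp ever being owned by two leases. maxDrift is the assumed bound on
// clock disagreement between nodes; local raft ticks provide no such bound,
// which is why tick-based leases are hard to make safe.
func canGrant(prev, next Lease, maxDrift uint64) bool {
    return next.Start >= prev.Exp+maxDrift
}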

boringhello commented

Are you saying that LeaseRead has been abandoned in this project?


pav-kv commented Mar 13, 2024

I'm not aware of who uses it. In CRDB, we have our own leases on top of raft, based on hybrid logical clocks.

boringhello commented Mar 13, 2024

boringhello commented Mar 13, 2024

So why not remove the LeaseRead feature if it is not useful/correct?

craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Apr 9, 2024
120613: raft: remove unused read-only requests r=nvanbenschoten a=pav-kv

This PR removes the read-only requests from `pkg/raft`. We don't use them in CRDB, and the leases implementation is known to be incorrect (e.g. see etcd-io/raft#166 and etcd-io/etcd#10082).

Epic: none
Release note: none
