[specs][doc] FizzBee spec for Venice's LeaderFollower protocol #958

Open · wants to merge 3 commits into base: main

Conversation

@jayaprabhakar commented Apr 23, 2024

Summary

FizzBee spec for Venice's LeaderFollower protocol, a one-to-one translation from the TLA+ syntax to the FizzBee syntax.

How was this PR tested?

Manually ran it with the fizz model checker:

fizz specs/fizzbee/LeaderFollower/VeniceLeaderFollower.fizz 

Does this PR introduce any user-facing changes?

  • No.

This is a literal translation of TLA+ syntax to FizzBee syntax.
@ZacAttack (Contributor)

Looks great! Could I also trouble you to include a readme file with the PR for how to run the spec/install fizz?

@jayaprabhakar (Author)

Sure. Added the README file pointing to the instructions at https://github.com/fizzbee-io/fizzbee.

At present there is no precompiled binary; I have included the full instructions to compile from source, which I have tested on a MacBook and on Ubuntu on EC2.

@FelixGV (Contributor) left a comment

Thanks for sharing this! It is my first time reading FizzBee (and I am not that familiar with TLA+ either, so please bear with me...). I left a few comments, some of which may be for Zac, others which may be about how to model a given situation in FizzBee itself... I'd be interested to hear your thoughts on these. Thanks again!

Comment on lines 82 to 86
if nodes[id]['state'] == 'LEADER':
    if nodes[id]['vtOffset'] >= len(version_topic):
        real_time_consume(id)
    else:
        version_topic_consume(id)
@FelixGV (Contributor)

This seems to mean: "if the leader's VT offset is below the end offset of the VT, then consume the VT, else consume the RT"

This makes sense, I suppose, though in practice (within the Venice code) this particular decision only happens as part of the transition from follower to leader, and once it is determined ("somehow") that the leader can begin consuming from the RT, it will not go back on this decision (unless it is demoted to follower). In this spec, however, I don't see this being a "sticky" decision; it seems to be evaluated every time.

I guess the question is not necessarily for @jayaprabhakar but perhaps for @ZacAttack, since Jay may be just translating what is spelled out in the TLA+ spec. Zac, does the above make sense? Is the spec intentionally modeling a simplification of the real logic, or does this somehow map precisely to the system's behavior, and I'm just missing some detail?

@ZacAttack (Contributor)

The above makes sense. The TLA+ spec as written is currently evaluating whether or not it's caught up based on this, and it was written that way at the time as a simplification of the Venice code.

It might be more interesting to instead model it as a state. Like, the state of a node is either caught up or not caught up, and that state flips only after it's read through the VT to topic switch.
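
A minimal plain-Python sketch of that suggestion (not exact FizzBee syntax; the caught_up field and the leader_step / version_topic_consume helpers are hypothetical names layered on the spec's dict layout):

def real_time_consume(node):
    pass  # stand-in for the spec's real-time consumption action

def version_topic_consume(node, version_topic):
    key, value, _offset = version_topic[node['vtOffset']]
    node['persistedRecords'][key] = value
    node['vtOffset'] += 1
    # Flip the flag once, when the VT is fully drained; only a
    # demotion back to follower would reset it.
    if node['vtOffset'] >= len(version_topic):
        node['caught_up'] = True

def leader_step(node, version_topic):
    if node['caught_up']:  # sticky: not re-derived from offsets each time
        real_time_consume(node)
    else:
        version_topic_consume(node, version_topic)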

@jayaprabhakar (Author)

Thanks. I noticed this. In the current spec, the leader's rtOffset (nodes[nodeId]['rtOffset']) is updated but not its vtOffset (nodes[nodeId]['vtOffset']). That would imply that when a leader is demoted, the newly demoted node would take a long time to catch up. So the two ways to avoid a slow and unnecessary catch-up were either to evaluate it continuously, like the current spec does, or to use a local state variable. I wasn't sure what the design decision was.

So I decided to translate the TLA+ spec as it is.

From the discussion, I assume we just need a state variable to mark the follower as caught up.
(FYI, changing it by updating vtOffset + local state reduces the possible states and the model checking time significantly.)

@jayaprabhakar (Author)

On the other side, when the newly promoted leader catches up with the version topic, it still starts consuming the real-time topic from where that node left off when it was last demoted, instead of picking up the offset from the latest version topic message.

That is clearly impractical. Can I also change that to use the offset from the latest version topic?

@ZacAttack (Contributor)

Yeah, a state variable is good enough. The way it works in production is that the follower just keeps a high watermark based on what it's consumed from the VT. So the offset from the latest version topic is, I think, good.
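
As a rough illustration of that (plain Python, not FizzBee; rt_resume_offset is a hypothetical helper), the promotion path could derive the RT resume point from the RT offset embedded in the last VT record rather than from the node's own stale rtOffset:

def rt_resume_offset(node, version_topic):
    if version_topic:
        _key, _value, last_rt_offset = version_topic[-1]
        return last_rt_offset + 1  # first RT record not yet reflected in the VT
    return node['rtOffset']        # empty VT: fall back to the local offset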

Comment on lines 108 to 112
nodes[nodeId]['rtOffset'] += 1
nodes[nodeId]['persistedRecords'][key] = value

# Append the consumed event to the version topic
version_topic.append((key, value, offset))
@FelixGV (Contributor)

This may be a minor detail, but in today's implementation the order of operations is: VT append, local persistence, update RT offset. The reason we do it this way is so that we have idempotency in case we crash at any point in between the operations (and so, perhaps this function should not have the atomic keyword?).

We are actually considering changing the ordering of these operations (and it might be interesting to model it both ways), but that change (#910) is not integrated (or even necessarily fully agreed upon...) yet.
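
For reference, a plain-Python sketch of today's ordering as described above (illustrative only; the inline comments mark where a crash could interleave once the atomic keyword is dropped):

def real_time_consume(node, real_time_topic, version_topic):
    key, value = real_time_topic[node['rtOffset']]
    version_topic.append((key, value, node['rtOffset']))  # 1. VT append
    # <- a crash here leaves the VT ahead; replay must be idempotent
    node['persistedRecords'][key] = value                 # 2. local persistence
    # <- a crash here leaves persistence ahead of the checkpoint
    node['rtOffset'] += 1                                 # 3. update RT offset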

@ZacAttack (Contributor) commented May 18, 2024

So actually, you touch on a really interesting facet that probably tells us we should consider either a refinement of the leader-follower spec or tweaking it a bit. The way TLA+ works is that you define states. Currently, the model is very simplistic. It defines the leader actions of consuming from RT, updating local state, and producing to VT as a single state. If we wanted to vet the ordering of these actions, and make sure the order is safe in the face of calamity, we should probably model all of them as separate states in the spec.

@jayaprabhakar (Author)

I can change this, but to model it correctly I need one additional piece of information: how do you handle duplicate message publication to the topic?
Kafka doesn't have built-in deduplication like that of SQS or other queueing services, but I know Kafka Streams and client-side libraries handle it in a way that this never becomes an issue.
Let me know how I should model that. Can I assume the deduplication is somehow handled, without actually implementing it?

The scenario:

  1. RealTimeConsume
     i. VT append (k1, v1, 0)
     ii. Crash
  2. RealTimeConsume
     i. VT append (k1, v1, 0)
     ii. Crash

This can go on infinitely, leading to an infinite state space. So I could quickly add a way to dedupe: if an exact duplicate message is sent, the append would ignore it and we would proceed with the remaining steps.

Correct me if I missed something.

@ZacAttack (Contributor)

So in Venice, the way we essentially deal with this is write collision resolution (last writer wins). So I think your suggestion of having the append ignore exact duplicates is the right way to go.
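
A sketch of that append-ignore behavior (plain Python; vt_append_dedup is a hypothetical name). Dropping an exact duplicate of the tail record is enough to break the crash-replay loop in the scenario above, since the replay re-sends the same record immediately:

def vt_append_dedup(version_topic, record):
    if version_topic and version_topic[-1] == record:
        return  # exact duplicate of the last append: ignore it
    version_topic.append(record)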

Comment on lines 121 to 122
nodes[nodeId]['vtOffset'] += 1
nodes[nodeId]['persistedRecords'][key] = value
@FelixGV (Contributor)

In a similar vein to the above, we do the persistence first and the VT offset increment (i.e. checkpointing) after, so that we tolerate crashing in between the two operations (and we may want to remove atomic...).

@ZacAttack (Contributor)

Same as the above response: because the model here is very simplistic (high level?), the line ordering doesn't matter too much. It'd be better to split each of these into their own states.
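
A plain-Python sketch of the follower ordering described above, with the two writes as separate steps (illustrative; follower_consume is a hypothetical name):

def follower_consume(node, version_topic):
    key, value, _offset = version_topic[node['vtOffset']]
    node['persistedRecords'][key] = value  # 1. persist locally first
    # <- a crash here re-reads and re-persists the same record on restart,
    #    which is safe under last-writer-wins upserts
    node['vtOffset'] += 1                  # 2. checkpoint afterwards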

1. Maintain a bool variable newly_promoted to indicate a new leader
   that has not caught up on previous version topics.
2. Continue to update vtOffset and rtOffset for each node, irrespective
   of whether the node is a leader or follower.
3. Make the leader and follower consumption serial instead of atomic
   (see the sketch after this list):
   - The leader will first post to version_topic, then update the local datastore, then update the offsets.
   - Similarly, the follower will first update its local datastore and then update the offsets.
   The system can crash at any of these points, and the design is verified to work even if the system crashes in between steps.
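
A rough plain-Python rendering of the revised leader action described in the list above (illustrative, not the actual FizzBee action; leader_consume is a hypothetical name):

def leader_consume(node, real_time_topic, version_topic):
    if node['newly_promoted']:
        if node['vtOffset'] < len(version_topic):
            key, value, _off = version_topic[node['vtOffset']]
            node['persistedRecords'][key] = value  # drain the VT first
            node['vtOffset'] += 1
            return
        node['newly_promoted'] = False             # caught up; flips only once
    key, value = real_time_topic[node['rtOffset']]
    version_topic.append((key, value, node['rtOffset']))  # 1. post to version_topic
    node['persistedRecords'][key] = value                 # 2. update local datastore
    node['rtOffset'] += 1                                 # 3. update the offsets
    node['vtOffset'] = len(version_topic)                 #    (own append is consumed)

In the serial (non-atomic) form, each of the numbered steps is a separate yield point, so the model checker can crash the node between any two of them.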
@jayaprabhakar (Author)

@FelixGV @ZacAttack I have made the changes you asked for in this commit: c721c5e


Please take a look and let me know.
