
Towards Jepsen-like tests #162

stevana opened this issue Oct 3, 2017 · 7 comments

stevana commented Oct 3, 2017

Jepsen is a framework for testing distributed systems. It has been used to find
bugs in many systems, e.g. Kafka, Cassandra, RabbitMQ.

It is also based on performing randomly generated actions from multiple
threads and then using linearisation to ensure correctness. However, it
differs in that it has a thread, called the Nemesis, which is solely
dedicated to fault injection. Injected faults include skewed clocks,
GC/IO pauses (killall -s STOP/CONT), lost packets, timeouts, and network
partitions between the distributed servers.

In order to account for the faults, Jepsen has a richer notion of
history than ours, which includes the possibility of operations failing
or timing out. When an action fails, we know for sure that it did not
change the state of the system, whereas if an operation timed out we
don't know for sure (the state could have changed, but the server's
acknowledgement never reached us).

It would be neat if we could perform similar tests. Exactly how this is
supposed to work is still not clear. Some answers can perhaps be found
in the original linearizability paper or in the many blog posts and
talks by "Aphyr", the author of Jepsen, some of which are linked to in
the README.

Perhaps a good start would be to focus only on failed actions. This
issue already seems to pop up in some of our examples, see the
discussion in #159, and it does not require a Nemesis thread.

A first step might be to change the type of Semantics to account for possible
failures:

{-# LANGUAGE RankNTypes #-}  -- needed for the forall in Semantics

data Result resp err = Ok resp | Fail err

type Semantics act m err = forall resp. act Concrete resp -> m (Result resp err)

(This can later be extended to deal with timeouts.)

The ResponseEvent constructor will need to be changed accordingly, and
linearise as well. I guess the right thing to do in linearise, when an
action fails, is to neither update the model nor check the
post-condition. We could also change the post-condition to make
assertions about err in addition to resp, but maybe this is a
refinement we can make when we see the need for it.
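To make that concrete, here's a minimal sketch of the step linearise
would perform per event. The names (stepModel, transition,
postcondition) are illustrative, not the library's actual API:

-- Using Result from above. On Ok we check the post-condition and
-- advance the model as before; on Fail we do neither, since a failed
-- action did not change the state of the system.
stepModel
  :: (model -> act -> resp -> model)  -- transition function
  -> (model -> act -> resp -> Bool)   -- post-condition
  -> model
  -> act
  -> Result resp err
  -> Maybe model                      -- Nothing = post-condition failed
stepModel transition postcondition model act result = case result of
  Ok resp
    | postcondition model act resp -> Just (transition model act resp)
    | otherwise                    -> Nothing
  Fail _err                        -> Just model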

Thoughts?


stevana commented Oct 4, 2017

There's also the case where an action that returns a reference fails. In that case, subsequent actions that use the reference, and whose precondition requires the reference to exist in the model, will fail with the precondition being false.
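A hypothetical sketch of the situation, with made-up action and model types (Spawn creates a reference, Use consumes it):

import Data.Set (Set)
import qualified Data.Set as Set

newtype Ref = Ref Int
  deriving (Eq, Ord, Show)

data Action = Spawn | Use Ref

newtype Model = Model (Set Ref)

-- If Spawn fails, its reference is never added to the model, so the
-- precondition of any later Use of that reference is False.
precondition :: Model -> Action -> Bool
precondition _          Spawn   = True
precondition (Model rs) (Use r) = r `Set.member` rs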


stevana commented Jul 5, 2018

Crashes

If an operation does not complete for some reason (perhaps because it timed out or a critical component crashed) that operation has no completion time, and must, in general, be considered concurrent with every operation after its invocation. It may or may not execute.

A process with an operation in this state is effectively stuck, and can never invoke another operation again. If it were to invoke another operation, it would violate our single-threaded constraint: processes only do one thing at a time.

Source: https://jepsen.io/consistency


kderme commented Apr 9, 2019

@stevana my understanding is that #208 is a prerequisite for Jepsen-like tests, right?


stevana commented Apr 9, 2019

@kderme: No, I don't think so. (It might make some tests easier to write, but shouldn't be necessary.)

stevana added a commit that referenced this issue May 31, 2019
stevana added a commit that referenced this issue Jun 11, 2019
stevana added a commit that referenced this issue Jun 19, 2019

stevana commented Jul 25, 2019

The above PR adds the ability to complete histories. Completing a
history is needed when exceptions are thrown in the semantics, e.g.
because of request timeouts (as will happen with partitions). We don't
want to catch those exceptions, because doing so would lead to a
non-deterministic model: if a timeout happens for a write-like command,
we have no way of telling whether the request to the node timed out
(the database didn't change) or whether the response back from the node
timed out (the database was updated). Thus, if we caught the exception
we would need to update the model with both possibilities, otherwise
later read operations would fail.

What completing a history does is simply append a response, at the end
of the history, for every thread/pid/worker that crashed or threw an
exception. In the case of a write-like command the response is a simple
"Ack". Because of the way linearisability checking works, the checker
will then try all possible interleavings of this write-like command
with all the later read-like commands, and check whether some
interleaving is consistent. So the linearisability checker handles and
hides the non-determinism of the model!
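
A minimal sketch of completion over simplified history types (the
actual types in the library carry more information, and the "Ack"
response would be command-specific):

import Data.List (foldl')
import qualified Data.Map.Strict as Map

type Pid = Int

data Event act resp
  = Invocation Pid act
  | Response   Pid resp

-- Append a default response for every pid whose last invocation never
-- received a response, i.e. whose worker crashed or threw an exception.
completeHistory :: resp -> [Event act resp] -> [Event act resp]
completeHistory ack history =
    history ++ [ Response pid ack | pid <- pendingPids ]
  where
    pendingPids = Map.keys (Map.filter id (foldl' step Map.empty history))
    step m (Invocation pid _) = Map.insert pid True  m
    step m (Response   pid _) = Map.insert pid False m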

What's still not clear to me is: can we find bugs without completing, e.g.
maybe with partitions and some (simpler) version of accounting for this
non-determinism in the model? Or by somehow controlling the fault
injection in a more precise way, so that we know exactly whether the
timeout happened on the way to or from the node, and thus whether the
database was updated or not?


stevana commented Jul 25, 2019

@kderme is currently working on adding an example which uses rqlite (a distributed version of sqlite that uses Raft for consensus), hopefully this can serve as a test bed for experimenting with distributed systems and fault injection. See the following work-in-progress branch.

As a first experiment, the idea is to try to trigger a stale read in the weak read consistency mode by either stopping and restarting nodes or by causing partitions (perhaps using blockade).


stevana commented Jan 26, 2020

What's still not clear to me is: can we find bugs without completing, e.g.
maybe with partitions and some (simpler) version of accounting for this
non-determinism in the model? [...]

I think the answer is yes and that there's a trade-off here:

  1. Simple model that doesn't account for faults, needs completion, will be slow
  2. Complex model that accounts for faults, doesn't need completion, will be fast(er)

I've also learned why Jepsen doesn't have a completion function: if an operation crashes, Jepsen advances the model purely from the request. This isn't possible in our case, as our transition function also involves the response.

Let's make things a bit more concrete with an example. Consider a simple counter that starts at 0 and can be incremented:

thread 1, request: increment
thread 2, request: increment
thread 1, response: ok
thread 2, response: timeout

At this point the value of the counter could be 1 or 2, depending on whether the request on the second thread timed out on its way to the server (the counter didn't get updated) or the response timed out on its way back to the client (the counter got updated). Let's continue the execution:

thread 1, request: read
thread 3, request: read
thread 1, response: read -> 1
thread 3, response: read -> 2

This seems weird: how can read return two different values without an increment happening in between? Remember that crashed operations (e.g. the timed-out increment on the second thread) are concurrent with all operations after them, so the effect of the increment could take place between the responses of the two reads, thus making the history linearise.

Also note that the second thread cannot be used again, as that would break the "single-threaded constraint: processes only do one thing at a time", as per the comment above.
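
For concreteness, here's a sketch of what option 2's fault-aware model could look like for this counter example (all names are illustrative):

-- The counter's known value plus the number of timed-out increments
-- that may or may not still take effect.
data Counter = Counter
  { known   :: Int
  , pending :: Int
  } deriving Show

initial :: Counter
initial = Counter 0 0

incrementOk, incrementTimeout :: Counter -> Counter
incrementOk      c = c { known   = known   c + 1 }
incrementTimeout c = c { pending = pending c + 1 }

-- A read of v is consistent if v lies between the known value and the
-- known value plus all pending increments; observing v means that
-- v - known of the pending increments must have taken effect by now.
readValue :: Int -> Counter -> Maybe Counter
readValue v (Counter k p)
  | k <= v && v <= k + p = Just (Counter v (p - (v - k)))
  | otherwise            = Nothing

-- The example history above: reading 1 and then 2 linearises, whereas
-- reading 2 and then 1 does not.
demo :: (Maybe Counter, Maybe Counter)
demo = ( readValue 1 s >>= readValue 2   -- Just (Counter 2 0)
       , readValue 2 s >>= readValue 1 ) -- Nothing
  where
    s = incrementTimeout (incrementOk initial)

With such a model the timed-out increment never needs to be completed: the non-determinism lives in the model itself rather than being explored by the linearisability checker.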
