test: improve `TestConcurrentReadAndWrite` to verify read/write linerizability #455

ahrtr · 2023-04-07T09:12:56Z

Validate the linearizablity: each reading transaction should read the same value written by previous writing transaction on the same key.

ahrtr · 2023-04-07T10:28:29Z

cc @ptabor @serathius

concurrent_test.go

serathius · 2023-04-07T14:29:14Z

concurrent_test.go

+	return rs
+}
+
+func analyzeHistoryRecords(rs historyRecords) error {


This is not a linearization validation as it doesn't account for operation execution time. It depends on Txid for order, making it very limited. For example it doesn't detect stale reads as View can just return old transation id, even though newer transaction was already committed.

I'm not sure what concurrency issues this approach is mean to find or can find any. Please let me know if you need help with integrating linearizability checker https://github.com/anishathalye/porcupine like in robustness tests.

Depending on Txid is enough, because in bboltDB txid is strictly incremental. Any data written by a transaction should be visible to following reading transaction, otherwise it must be an issue. My understanding is that porcupine might generate false alarm, but in this test it will never generate false alarm unless it's a test bug. Also we are testing only local boltDB library instead of a distributed system, so I don't see any reason to use porcupine.

Note that there is no stale read, a read transaction should always see the same data during the transaction lifecycle, and it's exactly what repeatable reading means. Otherwise it must be an issue.

In summary, there are two points:

A read transaction should always see the same data view during its lifecycle.

Any data written by a writing transaction should be visible to any following reading transactions (with txid >= previous writing txid)

Also try to check whether it will result in data corruption under concurrent transaction operations.

You are mistaking serializability and linearizability. https://jepsen.io/consistency
Txid gives you serializability, but linearizability is real time property, not bound to any counter.

My understanding is that porcupine might generate false alarm, but in this test it will never generate false alarm unless it's a test bug

No, depending on history porcupine might take a long time to validate it, causing test to timeout. However, assuming that tests is correctly written and operation history doesn't include a lot of failed request it should not be a problem

You are mistaking serializability and linearizability.

Note that in bboltDB, only one writing transaction is allowed at a time. So in practice serializability is the same thing as linearizability.

Note I am planning to add multiple writing transaction in followup PR.

I can rename all *linearizable* to *serializable* when I start to support multiple writing transactions. thx for pointing this out.

I will not be opposed to testing just serializability, but I think you need to add some additional validation to so it makes sense.

Linearizability bounds function execution by client observed time. Serializability no longer cares about time, but cares about the order. Operations observed by client should be bound by order. Meaning client should observe increase in transaction ID.

I know that you do validation on transaction ID, but not on the order observed by each client. I would recommend adding client side validation that transaction ID is always growing for each client, before we sort it.

I would recommend adding client side validation that transaction ID is always growing for each client, before we sort it.

Agreed. It's a good idea. Actually we also see such issue (txid isn't somehow strictly incremental). see #402 (comment)

Let me do it in a separate PR. This PR is already too big.

Ups, I mistook Serializable vs Sequential. Still validating sequential here would be useful and ok to be done in followup.

ahrtr · 2023-04-08T05:16:52Z

Addressed some your comments, and I will resolve other comments in followup PR, thx. @serathius PTAL

cc @ptabor as well.

…ablity Signed-off-by: Benjamin Wang <wachao@vmware.com>

…ion history locally firstly Signed-off-by: Benjamin Wang <wachao@vmware.com>

ahrtr · 2023-04-14T05:36:31Z

@ptabor @serathius PTAL, thx

concurrent_test.go

serathius · 2023-04-14T09:04:03Z

concurrent_test.go

+		t.Errorf("The history records are not linearizable:\n %v", err)
+	}
+
+	saveDataIfFailed(t, db, records)


defer it to ensure it is executed even on fatal or panic?

If somehow panicking during test, then the records hasn't been collected yet. It needs big change. Let me think about it separately. thx

concurrent_test.go

Signed-off-by: Benjamin Wang <wachao@vmware.com>

concurrent_test.go

serathius

There are a lot of follow up, which is not to my liking.

Still, as whole the PR is an improvement. Please make sure to do followups as postponing work is a bad example for other contributors. We usually ask other contributors to do all cleanups in the PR.

ahrtr · 2023-04-14T10:30:29Z

Thank you!

BoltDB test (to reproduce the data corruption issue) is my top priority. Some followup is already my planned work.

ahrtr · 2023-04-14T11:44:12Z

do followups as postponing work is a bad example

Based on my previous experience, I am definitely not the blocker; instead, lack of active review is.

serathius reviewed Apr 7, 2023

View reviewed changes