Rework `BufferedRandomAccessFile` #907

Calvin-L · 2024-04-10T21:54:18Z

This commit makes major alterations to BufferedRandomAccessFile to bring it in line with the general RandomAccessFile contract:

Invariant V1 has been strengthened to include a contract for the length field.
Invariants V2-V5 have been altered to account for cases where curr > length, which the RandomAccessFile contract allows.
Additionally, invariant V5 has been adjusted to clarify that dirty=true does not imply that there are unflushed bytes. The implication is only one-way: if there are unflushed bytes, then dirty=true.
The hi and maxHi fields have been removed. These largely duplicate information that can be quickly recomputed, meaning they add additional complexity to the class invariants. Without those fields, invariant V6 can be removed.
The close() method now closes the underlying file descriptor even if flush() throws an exception.

Due to the complexity of this change, I have added a TLA+ specification for BufferedRandomAccessFile to enable model checking of its design.

To ensure that the code is correct, this commit also adds a substantial amount of test infrastructure:

Several new regression tests have been added. These are a little cryptic because they were randomly generated, but they do reflect true behaviors of the RandomAccessFile contract.
A fuzz test has been added. This test ensures that many sequences of random operations behave correctly. It also minimizes failures so they can be easily converted to regression tests.

This commit also makes some whitespace and formatting adjustments.

Fixes #835. I have verified that the tests now pass under Java 21.

This commit makes major alterations to `BufferedRandomAccessFile` to bring it in line with the general `RandomAccessFile` contract: - Invariant V1 has been strengthened to include a contract for the `length` field. - Invariants V2-V5 have been altered to account for cases where `curr > length`, which the `RandomAccessFile` contract allows. - Additionally, invariant V5 has been adjusted to clarify that `dirty=true` does not imply that there are unflushed bytes. The implication is only one-way: if there are unflushed bytes, then `dirty=true`. - The `hi` and `maxHi` fields have been removed. These largely duplicate information that can be quickly recomputed, meaning they add additional complexity to the class invariants. Without those fields, invariant V6 can be removed. - The `close()` method now closes the underlying file descriptor even if `flush()` throws an exception. Due to the complexity of this change, I have added a TLA+ specification for `BufferedRandomAccessFile` to enable model checking of its design. To ensure that the code is correct, this commit also adds a substantial amount of test infrastructure: - Several new regression tests have been added. These are a little cryptic because they were randomly generated, but they do reflect true behaviors of the `RandomAccessFile` contract. - A fuzz test has been added. This test ensures that many sequences of random operations behave correctly. It also minimizes failures so they can be easily converted to regression tests. This commit also makes some whitespace and formatting adjustments. Fixes #835. I have verified that the tests now pass under Java 21. Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

lemmy · 2024-04-10T22:00:14Z

Impressive!

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

IamfromSpace · 2024-04-17T02:37:17Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

+ArraySlice(a, startInclusive, endExclusive) == [elems |-> SubSeq(a.elems, startInclusive + 1, endExclusive)]
+ArrayConcat(a1, a2) == [elems |-> a1.elems \o a2.elems]
+
+WriteToFile(file, offset, data_to_write) ==


It is interesting to me that most of these functions are about Arrays. Even WriteToFile and ExtendFile are relatively easy make generic array operations, maybe OverlapAt and TruncateOrExtendWith, where both take a placeholder as an argument. It may simply be worth making this an Array module, and then Common stays more focused on the spec itself WriteToFile(file, offset, data_to_write) == Array.Overlap(file, data_to_write, offset, ArbitarySymbol) (I have no idea why I felt compelled to reverse arguments).

I suppose, otherwise I was going to suggest that we add brief documentation on what the semantics of WriteToFile and ExtendFile were, but in many ways something like ExtendFile == Array.TurncateOrExtendWith(file, new_length, ArbitrarySymbol) does this for you.

I am probably overthinking these utility functions, but it struck me that maybe the most intuitive implementation involves slice & concat.

TruncateOrExtend(a, T, len) == IF len > ArrayLen(a) THEN ArrayConcat(a, Array(T, len - ArrayLen(a))) ELSE ArraySlice(a, 0, len) OverlayAt(a1, a2, T, i) == ArrayConcat(TruncateOrExtend(a2, T, i), a2)

Should any of those utility operators be contributed as an Array module to our CommunityModules?

I added some simple documentation to these operators.

but it struck me that maybe the most intuitive implementation involves slice & concat

I agree that the current definitions are a little ugly. I wrote them after one or two failed attempts to come up with simpler definitions.

The ones you've given are certainly cleaner than my early attempts, and I will adopt your definition for TruncateOrExtend. But I might keep the definition I wrote for WriteToFile. The equivalent definition in terms of other operators is actually fairly cryptic (your definition for OverlayAt needs to include the portion of the contents after the written data if the contents were not written at the end of the array):

WriteToFile(file, offset, data_to_write) == ArrayConcat(ArrayConcat( TruncateOrExtendFile(file, offset), data_to_write), ArraySlice(file, offset + ArrayLen(data_to_write), ArrayLen(file)))

Should any of those utility operators be contributed as an Array module to our CommunityModules?

Yes! I'll look into doing that soon.

your definition for OverlayAt needs to include the portion of the contents after the written data if the contents were not written at the end of the array

Whoops! Easy to make things simpler if you forget the critical parts 😆

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

IamfromSpace · 2024-04-18T02:05:25Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

@@ -0,0 +1,384 @@
+\* Copyright (c) 2024, Oracle and/or its affiliates.
+
+----------------------- MODULE BufferedRandomAccessFile -----------------------


Just wanted to say that I have absolutely found it useful to have a spec of the class for review, and especially as a refinement.

This spec could also be referenced at tlaplus/examples

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

This commit also updates the specification to model EXACTLY what the `RandomAccessFile` docs say about the interaction between `setLength` and the file pointer. Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

Calvin-L · 2024-04-23T21:37:00Z

Thanks a ton @IamfromSpace for the feedback!

I think I've addressed all the major comments so far.

IamfromSpace · 2024-04-24T01:36:09Z

These updates look great! For the sake of thoroughness I still have a few comments coming through on tests, but nothing of real concern. Just wrapped up the unit tests, but it's taking me a bit longer to digest the fuzzing file :)

tlatools/org.lamport.tlatools/test/tlc2/util/BufferedRandomAccessFileTest.java

tlatools/org.lamport.tlatools/test/tlc2/util/BufferedRandomAccessFileFuzzTest.java

IamfromSpace

Alrighty, that's 100% everything from me :) I think there's some test comments that are worth a look over, but there's certainly no deal-breakers in there, so I've got the approval on there now. Not sure if the repo is configured to reset on changes or not, but if so, I'll keep it active for any commits that come in.

Nice work on this!

Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

In particular, "legal" is now "well-defined". Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

lemmy · 2024-05-02T00:37:27Z

I won't have time to review this PR anytime soon and I don't want to be a bottleneck. Given @IamfromSpace's review, I feel confident about this PR. @Calvin-L, feel free to merge or consider inviting additional reviewers at the TLA+ Community Meeting in two weeks. Reviewing the spec offers a great opportunity for those unfamiliar with or not fond of Java to contribute to the TLA+ tools.

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

lemmy · 2024-06-03T20:57:07Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

+        // But do not be tempted!  That call incurs a performance penalty WITHOUT OFFERING A STRONGER CONTRACT!  There
+        // are still cases where we rely on "super" to extend the file---meaning we inherit the weakest possible
+        // contract from RandomAccessFile.  In particular, this happens when extending the file using setLength() or
+        // when the client makes a giant forward seek followed by a write.


This comment could imply a possible completeness bug in DiskFPSet s.t. the arbitrary data matches a fingerprint that is probed. This particular bug would vanish if the content of the buffer was zeroed.

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

lemmy · 2024-06-03T23:03:40Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

+\* is generally restored by calling `seeek(curr)`.  But, that behavior is
+\* difficult to model in straight TLA+ because each method may modify variables
+\* multiple times.  So instead, this spec treats Inv2 as a precondition for the
+\* methods and verifies that it is indeed restored by `Seek(pos)`, which should


Inv2CanAlwaysBeRestored suggests that this should mention FlushBuffer in addition to Seek.

lemmy · 2024-06-03T23:22:38Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

+
+\* Inv2 is a precondition for many actions; it should always be possible to
+\* restore Inv2.
+Inv2CanAlwaysBeRestored ==


I translated the prose comment into the following property:

Inv2CanAlwaysBeRestoredPart2 == [][IF dirty THEN FlushBuffer => Inv2' ELSE Seek(curr) => Inv2']_vars

, which TLC says is violated by the following trace:

Finished computing initial states: 1 distinct state generated at 2024-06-03 15:53:42. Action property Inv2CanAlwaysBeRestored is violated. The behavior up to this point is: 1: <Initial predicate> /\ BuffSz = 2 /\ MaxOffset = 4 /\ dirty = FALSE /\ length = 0 /\ curr = 0 /\ lo = 0 /\ buff = [elems |-> <<ArbitrarySymbol, ArbitrarySymbol>>] /\ diskPos = 0 /\ file_content = [elems |-> <<>>] /\ file_pointer = 0 /\ abstract_contents = [elems |-> <<>>] 2: <WriteAtMost([elems |-> <<A, A>>]) line 193, col 5 to line 202, col 60 of module BufferedRandomAccessFile> /\ BuffSz = 2 /\ MaxOffset = 4 /\ dirty = TRUE /\ length = 2 /\ curr = 2 /\ lo = 0 /\ buff = [elems |-> <<A, A>>] /\ diskPos = 0 /\ file_content = [elems |-> <<>>] /\ file_pointer = 0 /\ abstract_contents = [elems |-> <<A, A>>] 3: <FlushBuffer line 109, col 5 to line 119, col 43 of module BufferedRandomAccessFile> /\ BuffSz = 2 /\ MaxOffset = 4 /\ dirty = FALSE /\ length = 2 /\ curr = 2 /\ lo = 0 /\ buff = [elems |-> <<A, A>>] /\ diskPos = 2 /\ file_content = [elems |-> <<A, A>>] /\ file_pointer = 2 /\ abstract_contents = [elems |-> <<A, A>>]

Revisiting https://github.com/tlaplus/tlaplus/pull/907/files#r1625145343, this property, perhaps, too strong. This hypothesis is supported because the following holds:

Inv2CanAlwaysBeRestoredPart2 == [][Seek(curr) => Inv2']_vars

This is a fantastic point. I will look into strengthening this property on Monday. Perhaps it can be fully specified using \cdot to compose FillBuffer and Seek(curr).

lemmy · 2024-06-03T23:46:25Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

+WriteToFile(file, offset, data_to_write) ==
+    LET file_len == ArrayLen(file) IN
+    LET data_len == ArrayLen(data_to_write) IN
+    LET length == Max(file_len, offset + data_len) IN


Nitpick and you probably know this, but I prefer a single LET/IN:

LET file_len == ArrayLen(file) data_len == ArrayLen(data_to_write) length == Max(file_len, offset + data_len) IN

lemmy · 2024-06-04T03:35:25Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

+    /\ numReadableWithoutSeeking >= 0
+    /\ LET numToRead == Min(ArrayLen(data), numReadableWithoutSeeking) IN
+       LET buffOff == curr - lo IN
+        /\ data = ArraySlice(buff, buffOff, buffOff + numToRead)


Neither Read nor Read1 are enabled if the buffer contains ArbitrarySymbol, i.e., the following invariant holds:

DebugReadsWithArbitrarySymbol == \E i \in 1..Len(buff.elems): buff.elems[i] = ArbitrarySymbol => /\ \A data \in UNION{ Array(Symbols, len): len \in 1..MaxOffset }: ~ENABLED Read(data) /\ \A symbol \in Symbols: ~ENABLED Read1(symbol)

Why this restriction, given that BufferedRandomAccessFile.java does not impose such a restriction?

You are right; I believe that restriction can and should be relaxed. It was not intentional.

@lemmy There is an associativity problem in your definition; I think it should be

DebugReadsWithArbitrarySymbol == (\E i \in 1..Len(buff.elems): buff.elems[i] = ArbitrarySymbol) => \* NOTE parens around \E expression /\ ...

Using that modified definition shows that Read actually is enabled when the read does not cover ArbitrarySymbol, even if there are ArbitrarySymbols in the buffer.

Even so, I do think it makes sense to allow reads of ArbitrarySymbol in the next-relation.

lemmy · 2024-06-04T04:09:15Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla

+BufferedIndexes == lo .. (Min(lo + BuffSz, length) - 1)
+
+Inv1 ==
+    /\ diskPos = file_pointer


This differs from V1 in BufferedRandomAccessFile.java.

FedericoPonzi

Very good job! I haven't got to the end yet, I've found a read after close issue and add some nits.

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

FedericoPonzi · 2024-06-06T17:52:20Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

-      return new BigInteger(b);
+        // If we read past the buffer, restore our internal invariants
+        if (this.curr >= this.lo + BuffSz) {
+            seeek(this.curr);


seek in this case would be better as the return value is not used

I am inclined to factor this out into a helper method with a name like restoreInv2, since an identical check and seeek call appears other places as well. I will make that change soon. (When I do, I will probably keep the call to seeek and not seek, since it avoids an extra call stack entry.)

FedericoPonzi · 2024-06-06T18:20:49Z

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java

-                System.arraycopy(availBuffs, 0, newBuffs, 0, numAvailBuffs);
-                availBuffs = newBuffs;
+        try {
+            // Assert.check(!this.closed);


currently, the implementation allows read after close. We close the underlying file, give back the buffer to the pool but the buffer is not nullified, and subsequent reads that don't access the file will succeed.

Sample test adopted from the testRead (see last three rows)

@Test public void testRead() throws IOException { final File tmpFile = File.createTempFile("BufferedRandomAccessFileTest_testRead", ".bin"); tmpFile.deleteOnExit(); final java.io.RandomAccessFile raf = new BufferedRandomAccessFile(tmpFile, "rw"); for (long i = 0L; i < BufferedRandomAccessFile.BuffSz / 8; i++) { raf.writeLong(i); } raf.seek(0); for (long i = 0l; i < BufferedRandomAccessFile.BuffSz / 8; i++) { assertEquals(i, raf.readLong()); } raf.seek(0); //seek back to zero raf.close(); // closed assertEquals(0, raf.readLong()); // still works }

This could be a problem in case the buffer is reassigned to a different BRAF. To reduce the state, it would be sufficient to call "isClosed" on the underlinying RAF. Because there is no isClosed method, getFilePointer could be used for its side effect of failing if the underlying object is closed.

I have mixed feelings about this.

the implementation allows read after close

Yes and no. The implementation does not throw an exception if you do a read after close, so in that sense it "allows" read after close. But, the implementation also doesn't specify behavior if you read after close, so in that sense it "disallows" read after close. In other words, clients are responsible for ensuring that they do not interact with closed files.

In this case, I would peronsally rather push responsibility to the clients---but it's true that the official RandomAccessFile contract does require IOException to be thrown for some read and write methods if the file is closed (but weirdly, not all of them). In your specific example there is no specified behavior, and therefore the current behavior ("return an arbitrary long without throwing an exception") is legal. But if you replace the final readLong() with read(byte[]), it is indeed a demonstration of a bug, since that method requires an exception if the file is closed.

Also, because you called this out I noticed that this close() method is not idempotent, and it must be since this class implements Closeable. A duplicated close() call will put buff into the availBuffs pool a second time, which could be catastrophic.

I think all this means that there is a strong case for reinstating the commented-out closed field, which would make it very easy to fix these problems. I will look into doing this on Monday.

Aside: it is also irksome to me that the official contract for RandomAccessFile is to throw IOException if the file is closed, when IllegalStateException exists and was designed for exactly this purpose:

Signals that a method has been invoked at an illegal or inappropriate time.

But, to my surprise, RandomAccessFile is older than IllegalStateException, so I suppose it makes sense.

@FedericoPonzi @lemmy On second thought, I do not plan to address this comment in this PR. There is no new bug here, so I think it makes sense to address this in a separate PR.

Signed-off-by: Calvin Loncaric <calvin.loncaric@oracle.com>

Calvin-L force-pushed the cal-gh835-bufferedrandomaccessfile branch from cd7cf92 to 297c4c8 Compare April 10, 2024 21:54

lemmy added bug error, glitch, fault, flaw, ... Tools The command line tools - TLC, SANY, ... labels Apr 10, 2024

ahelwer reviewed Apr 12, 2024

View reviewed changes

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla Show resolved Hide resolved

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla Show resolved Hide resolved

IamfromSpace reviewed Apr 17, 2024

View reviewed changes

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla Show resolved Hide resolved

IamfromSpace reviewed Apr 17, 2024

View reviewed changes

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla Outdated Show resolved Hide resolved

IamfromSpace reviewed Apr 17, 2024

View reviewed changes

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.tla Outdated Show resolved Hide resolved

IamfromSpace reviewed Apr 18, 2024

View reviewed changes

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java Show resolved Hide resolved

IamfromSpace reviewed Apr 18, 2024

View reviewed changes

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java Outdated Show resolved Hide resolved

IamfromSpace reviewed Apr 18, 2024

View reviewed changes

tlatools/org.lamport.tlatools/src/tlc2/util/BufferedRandomAccessFile.java Show resolved Hide resolved

IamfromSpace reviewed Apr 18, 2024

View reviewed changes

IamfromSpace reviewed Apr 19, 2024

View reviewed changes