Added string support and first class characters #685

zwergziege · 2021-11-03T09:06:39Z

Closes #512 and related issues.

Added several todos where I'm uncertain how to implement functions correctly and may raise questions regarding the implementation of other value types

…StringValue to Fcn / Rcd / Tuple

zwergziege · 2021-11-03T09:14:14Z

tlatools/org.lamport.tlatools/src/tlc2/value/impl/CharValue.java

+	  }
+
+	@Override
+	public final boolean mutates() { return false; }


this is according to the doc (it normalization and fingerprinting don't mutate the object) but not in line with the e.g. IntValues...

Adding a new basic data type to TLC introduces problems, a serious one being to modify the fingerprinting code to maintain the extremely small probability of collisions. This is not trivial. The same fingerprinting algorithm was used inside a compiler once. A compiler bug was traced to a collision of fingerprints that should have had an astronomically small probability. It was discovered that this was the result of a bug in the algorithm's implementation.

One possibility might be to make characters typed model values, using as type a character that is not currently allowed.

I have no experience in this matter, but I fail to see why characters can not be handled in a similar way as integers since both are essentially numbers. In fact, it seems to me that your suggestion would have a very similar result, except for substituting the char kind byte with the model value kind byte and the character value with the unique id of the character.

I do not understand the fingerprinting algorithm. But I know that some smart people (researchers at one of the world's leading research labs), working from a description of the algorithm, produced an incorrect program. This tells me that modifying code that implements the algorithm, without fully understanding the code or the algorithm, is risky.

Fingerprinting is tricky. You're mapping numbers, sets of number, functions on sets of numbers, etc. all into strings of bits. For any single one of those kinds of objects, it's not too hard to find some mapping that has a small probability of collisions. But how do you know that there is a small probability of a collision between two different kinds of value? It's not easy. Saying characters are like numbers so you just treat them like numbers sounds like a recipe for producing collisions between characters and numbers.

As far as I can see, the code uses a one-way compression function that produces a 64bit tag from a tag and a byte (FP64.Extend). To address the problem you mentioned, every class first extends the fingerprint with a byte unique to the class, followed by the data. This leads to different sequences of bytes fed to the compression function even if values of different types share the same binary representation (at least for primitives/types of constant length).

Speaking of collisions: As far as I can tell the code of FP64.Extend treats chars as a single byte. Strings that contain characters outside the range 0..2^8-1 are probably rare but would result in colliding fingerprints.

That sounds good. I believe the problem solved by the general fingerprinting algorithm involved fingerprinting values from an unbounded set of types. With values from a fixed set of types, the problem does seem simple.

Thanks for pointing out the problem that can arise if TLA+ is extended to allow non-ASCII characters in strings. It would be nice if it could be documented somewhere in case that extension is ever implemented, but I don't know how to add documentation for code that no one has begun writing.

zwergziege · 2021-11-03T09:28:07Z

tlatools/org.lamport.tlatools/src/tlc2/value/impl/CharValue.java

+
+	@Override
+	final boolean assignable(Value val) {    
+		try {


This method seems to be unused.

zwergziege · 2021-11-03T09:41:38Z

On a general note: When I wrote this back in July I noticed that the code base of the different value types is rather convoluted and contains a lot of code duplication. Also, it seems as if somebody wanted to extract interfaces but stopped in the middle of the refactoring process. Is a continuation of this process/clean-up desired? If so, I could probably spend some time on it.

xxyzzn · 2021-11-03T09:53:41Z

TLC was originally written by one person in an amazingly short time. As a result, the code was not well planned. Any improvements to it are most welcome. However, the poor structuring makes it very hard to make any changes without introducing bugs. So, as with doctors, the prime commandment for anyone making changes is: Do no harm. This means that anyone who makes changes must commit to doing a lot of thinking and testing. There are engineers who depend on TLC for designing important systems.

zwergziege · 2021-11-03T13:32:59Z

The isAtom() method of the IValue interface should probably be changed to include the new char value type (and exclude strings?). The corresponding code appears to be dead though.

xxyzzn · 2021-11-03T13:40:16Z

Strings should be included in the elementary data types of TLC. Only a tiny fraction of specs will ever care that strings are made up of characters, and you should not slow down the evaluation of operations on strings in all specs just for the benefit of that tiny fraction. That means that whenever TLC represents a value in terms of elementary data types, it will have to check every sequence to see if it's a string. (TLC already has to check every function to see if it's a sequence.)

zwergziege · 2021-11-03T16:30:55Z

I don't think that the added functionality would slow down existing specs since basically all new code paths would previously have resulted in errors if I'm not mistaken.

xxyzzn · 2021-11-03T17:39:37Z

I just remembered that, because string handling in Java was slow at the time, TLC creates a unique ID for each string and uses it to check for equality of strings. I don't remember if those IDs are assigned by the parser, but new ones are probably produced if strings are dynamically created, which they can be. Anyway, you need to be aware of the use of those IDs and make sure things work properly when creating new strings in previously impossible ways.

If you remove strings as elementary data types, this will change fingerprinting for all specs. I don't know if that will significantly affect performance.

lemmy · 2021-11-03T17:56:01Z

The corresponding class is util.UniqueString. Related: Care must be taken because util.InternTable is guarded by a global lock, severely limiting concurrency if updated during model-checking.

Overall, it would be good to get rid of UniqueString because modern Java no longer has problems with strings. However, UniqueString appears everywhere in TLC. Thus, such refactoring has super high-risk.

zwergziege · 2021-11-04T10:50:26Z

I just remembered that, because string handling in Java was slow at the time, TLC creates a unique ID for each string and uses it to check for equality of strings.

Thanks for pointing that out. This is indeed a problem because the unique id is basically the index in the internal table. Thus in my implementation comparing strings depends on whether are not they are represented as StringValue. A potential solution would be to compare on a character basis even if both values are StringValues and accept the entailed slow-down.

That means that whenever TLC represents a value in terms of elementary data types, it will have to check every sequence to see if it's a string. (TLC already has to check every function to see if it's a sequence.)

If you remove strings as elementary data types, this will change fingerprinting for all specs. I don't know if that will significantly affect performance.

I assume that every TLA+ value must result in the same fingerprint regardless of its representation? In this case fingerprinting would indeed have to be adapted in order to guarantee accurate fingerprints for specs using the new feature, leading to a slow down for old specs. Instead of checking each sequence for string-ness a better option would probably be to treat StringValues as if they were FcnRcdValues when fingerprinting. For some reason I do not comprehend so far the fingerprinting method of StringValue does not use the unique id but instead feeds the whole string to the compression function thus compressing one byte per character. Using the proposed method should require compressing 7 bytes per character (1+4 for the index and 1+1 for the character). Noteworthy: The StringValue fingerprinting code is duplicated in RecordValue.fingerprint (any other potential candidates?) which will naturally be slowed down as well.

Should I implement what I outlined above (i.e. is it worth the cost)? Do you see any other issues?

Overall, it would be good to get rid of UniqueString because modern Java no longer has problems with strings. However, UniqueString appears everywhere in TLC. Thus, such refactoring has super high-risk.

Getting rid of UniqueString seems to be a giant refactoring. The class even contains information on state variables and operators. Removing it from StringValue and other Value classes seems possible at first glance.

Tangential question: Fingerprinting FcnLambdaValues with infinite domains is impossible, right? I see they are converted to FcnRcdValues. In this process, the size of the domains (multi-argument functions) is queried. If one of those is a UserValue (which afaik are the only value type that can be inherently infinite) an exception is triggered.

xxyzzn · 2021-11-04T14:15:41Z

A potential solution would be to compare on a character basis even if
both values are StringValues and accept the entailed slow-down.

Anything that significantly slows down the evaluation of specs that do
not use the new feature is unacceptable. The only way this would be
acceptable if there's a TLC option to turn on the feature, so the
slowdown occurs only for specs that use the new feature. Instead of
the option, a standard module could be added with useful operators
that depend on the feature, and the user would have to extend that
module when using the feature.

I assume that every TLA+ value must result in the same fingerprint
regardless of its representation?

The same value having different fingerprints slows down model
checking. It doesn't introduce any errors for safety checking, I'm
not sure about liveness checking.

Should I implement what I outlined above (i.e. is it worth the cost)?
Do you see any other issues?

My opinion has been that such a small percentage of TLA+ specs would
use full string processing that it's not worth slowing down other
specs to add it. I have seen nothing to change that opinion.

I do have one question: To do what you want to do with first-class
strings, do you have to use TLA+ strings? Suppose TLA+ were augmented
with a new elementary datatype char and the TLA+ syntax were augmented
so that xyz' is a sequence of three chars that equals xy' \o `z'. A
standard module could provide operators to map strings to char
sequences and vice-versa. Would there then be any reason why you
would want the feature you're implementing?

zwergziege · 2021-11-04T15:59:52Z

Maybe the following would be acceptable (I don't really like it though)?
When comparing strings use the per character approach, as this would be required when switching from UniqueString normal strings anyways.
When fingerprinting opt for ambiguous fingerprints, which should only occur in specs using the feature.

I do not depend on this feature. I simply tend to stumble upon corner cases. When I started out with TLA+ about a year ago I read "specifying systems" and set out to do things a certain way. It happened more than once that my ideas would not work because of some restrictions in TLC or TLAPM. All of it could be worked around, but especially concerning strings I saw #512 and thought that maybe I could contribute to it, especially since there seemed to be a plan on what to do. But I do see the downsides of it now.

I don't think that anyone would benefit from adding new syntax except if it were very visibly documented. One of my main issues with TLA and its tools is its lack of central documentation. If you want to look something up you often have to consult several documents and even then you might still get unexpected errors. I wanted to counteract this by making TLC support more of TLA+, but as the downsides seem to outweigh the benefits I think we can close this PR.

xxyzzn · 2021-11-04T16:59:41Z

Thank you for trying to help. I'm afraid one of the things that aren't properly documented is how best to help.

You wrote that a major problem for you has been the lack of central documentation. That is something we can certainly use help with. Neither Markus Kuppe nor I have the time to devote to it. I would be happy to advise on such a task. As is true of any effort to improve TLA+ or its tools, one should talk to us before implementing anything to make sure that it will be as useful as possible.

Leslie

zwergziege · 2021-11-08T08:57:54Z

I think it would take at least one person who has both a decent amount of time to spare and a clue of what they are doing to realize a truly unified documentation of TLA+ and its tools. I fear I fulfill neither of those requirements. It is one thing to spend two days patching together a feature that is ultimately rejected because of a lack of consideration and another to create an overarching compendium / reference. Anything less than that would just be one more resource to consider and thus counterproductive.
There is a dilemma here since anyone proficient enough in the ways of TLA+ has probably better things to do than writing yet another documentation. Maybe the community approach can achieve the goal given enough time if it is easy enough to contribute. This would still require someone with a good overview to set the course which probably requires a large initial effort.

Benjamin von Berg added 4 commits November 2, 2021 15:42

created new TLCValue type CharValue and added conversion methods for …

453effb

…StringValue to Fcn / Rcd / Tuple

more string stuff

0a3d7f3

Added test cases

e25c625

added todos for my uncertainties

f937d67

zwergziege mentioned this pull request Nov 3, 2021

application and DOMAIN of strings in TLC #512

Closed

zwergziege commented Nov 3, 2021

View reviewed changes

zwergziege mentioned this pull request Nov 30, 2021

Proposal to add Unicode support to the TLA+ language standard and tools tlaplus/rfcs#5

Closed

lemmy force-pushed the master branch from a37df47 to 9310ee7 Compare January 24, 2022 03:31

zwergziege mentioned this pull request Feb 17, 2022

WIP: Purge UniqueString #716

Open

lemmy force-pushed the master branch 2 times, most recently from bd71c07 to 0a543cc Compare April 6, 2022 05:46

lemmy force-pushed the master branch from 8a0da69 to 6dfec66 Compare December 20, 2022 22:52

lemmy force-pushed the master branch from 5e075ac to 6a73df7 Compare April 20, 2023 22:28

lemmy force-pushed the master branch from c6a8452 to ddec63c Compare August 25, 2023 00:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added string support and first class characters #685

Added string support and first class characters #685

zwergziege commented Nov 3, 2021

zwergziege Nov 3, 2021

xxyzzn Nov 3, 2021

zwergziege Nov 3, 2021

xxyzzn Nov 3, 2021

zwergziege Nov 3, 2021

xxyzzn Nov 3, 2021

zwergziege Nov 3, 2021 •

edited

zwergziege commented Nov 3, 2021

xxyzzn commented Nov 3, 2021

zwergziege commented Nov 3, 2021

xxyzzn commented Nov 3, 2021

zwergziege commented Nov 3, 2021 •

edited

xxyzzn commented Nov 3, 2021

lemmy commented Nov 3, 2021

zwergziege commented Nov 4, 2021

xxyzzn commented Nov 4, 2021

zwergziege commented Nov 4, 2021

xxyzzn commented Nov 4, 2021

zwergziege commented Nov 8, 2021

Added string support and first class characters #685

Are you sure you want to change the base?

Added string support and first class characters #685

Conversation

zwergziege commented Nov 3, 2021

zwergziege Nov 3, 2021

Choose a reason for hiding this comment

xxyzzn Nov 3, 2021

Choose a reason for hiding this comment

zwergziege Nov 3, 2021

Choose a reason for hiding this comment

xxyzzn Nov 3, 2021

Choose a reason for hiding this comment

zwergziege Nov 3, 2021

Choose a reason for hiding this comment

xxyzzn Nov 3, 2021

Choose a reason for hiding this comment

zwergziege Nov 3, 2021 • edited

Choose a reason for hiding this comment

zwergziege commented Nov 3, 2021

xxyzzn commented Nov 3, 2021

zwergziege commented Nov 3, 2021

xxyzzn commented Nov 3, 2021

zwergziege commented Nov 3, 2021 • edited

xxyzzn commented Nov 3, 2021

lemmy commented Nov 3, 2021

zwergziege commented Nov 4, 2021

xxyzzn commented Nov 4, 2021

zwergziege commented Nov 4, 2021

xxyzzn commented Nov 4, 2021

zwergziege commented Nov 8, 2021

zwergziege Nov 3, 2021 •

edited

zwergziege commented Nov 3, 2021 •

edited