Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added string support and first class characters #685

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

zwergziege
Copy link

Closes #512 and related issues.

Added several todos where I'm uncertain how to implement functions correctly and may raise questions regarding the implementation of other value types

}

@Override
public final boolean mutates() { return false; }
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is according to the doc (it normalization and fingerprinting don't mutate the object) but not in line with the e.g. IntValues...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a new basic data type to TLC introduces problems, a serious one being to modify the fingerprinting code to maintain the extremely small probability of collisions. This is not trivial. The same fingerprinting algorithm was used inside a compiler once. A compiler bug was traced to a collision of fingerprints that should have had an astronomically small probability. It was discovered that this was the result of a bug in the algorithm's implementation.

One possibility might be to make characters typed model values, using as type a character that is not currently allowed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no experience in this matter, but I fail to see why characters can not be handled in a similar way as integers since both are essentially numbers. In fact, it seems to me that your suggestion would have a very similar result, except for substituting the char kind byte with the model value kind byte and the character value with the unique id of the character.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the fingerprinting algorithm. But I know that some smart people (researchers at one of the world's leading research labs), working from a description of the algorithm, produced an incorrect program. This tells me that modifying code that implements the algorithm, without fully understanding the code or the algorithm, is risky.

Fingerprinting is tricky. You're mapping numbers, sets of number, functions on sets of numbers, etc. all into strings of bits. For any single one of those kinds of objects, it's not too hard to find some mapping that has a small probability of collisions. But how do you know that there is a small probability of a collision between two different kinds of value? It's not easy. Saying characters are like numbers so you just treat them like numbers sounds like a recipe for producing collisions between characters and numbers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can see, the code uses a one-way compression function that produces a 64bit tag from a tag and a byte (FP64.Extend). To address the problem you mentioned, every class first extends the fingerprint with a byte unique to the class, followed by the data. This leads to different sequences of bytes fed to the compression function even if values of different types share the same binary representation (at least for primitives/types of constant length).

Speaking of collisions: As far as I can tell the code of FP64.Extend treats chars as a single byte. Strings that contain characters outside the range 0..2^8-1 are probably rare but would result in colliding fingerprints.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good. I believe the problem solved by the general fingerprinting algorithm involved fingerprinting values from an unbounded set of types. With values from a fixed set of types, the problem does seem simple.

Thanks for pointing out the problem that can arise if TLA+ is extended to allow non-ASCII characters in strings. It would be nice if it could be documented somewhere in case that extension is ever implemented, but I don't know how to add documentation for code that no one has begun writing.


@Override
final boolean assignable(Value val) {
try {
Copy link
Author

@zwergziege zwergziege Nov 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method seems to be unused.

@zwergziege
Copy link
Author

On a general note: When I wrote this back in July I noticed that the code base of the different value types is rather convoluted and contains a lot of code duplication. Also, it seems as if somebody wanted to extract interfaces but stopped in the middle of the refactoring process. Is a continuation of this process/clean-up desired? If so, I could probably spend some time on it.

@xxyzzn
Copy link

xxyzzn commented Nov 3, 2021

TLC was originally written by one person in an amazingly short time. As a result, the code was not well planned. Any improvements to it are most welcome. However, the poor structuring makes it very hard to make any changes without introducing bugs. So, as with doctors, the prime commandment for anyone making changes is: Do no harm. This means that anyone who makes changes must commit to doing a lot of thinking and testing. There are engineers who depend on TLC for designing important systems.

@zwergziege
Copy link
Author

The isAtom() method of the IValue interface should probably be changed to include the new char value type (and exclude strings?). The corresponding code appears to be dead though.

@xxyzzn
Copy link

xxyzzn commented Nov 3, 2021

Strings should be included in the elementary data types of TLC. Only a tiny fraction of specs will ever care that strings are made up of characters, and you should not slow down the evaluation of operations on strings in all specs just for the benefit of that tiny fraction. That means that whenever TLC represents a value in terms of elementary data types, it will have to check every sequence to see if it's a string. (TLC already has to check every function to see if it's a sequence.)

@zwergziege
Copy link
Author

zwergziege commented Nov 3, 2021

I don't think that the added functionality would slow down existing specs since basically all new code paths would previously have resulted in errors if I'm not mistaken.

@xxyzzn
Copy link

xxyzzn commented Nov 3, 2021

I just remembered that, because string handling in Java was slow at the time, TLC creates a unique ID for each string and uses it to check for equality of strings. I don't remember if those IDs are assigned by the parser, but new ones are probably produced if strings are dynamically created, which they can be. Anyway, you need to be aware of the use of those IDs and make sure things work properly when creating new strings in previously impossible ways.

If you remove strings as elementary data types, this will change fingerprinting for all specs. I don't know if that will significantly affect performance.

@lemmy
Copy link
Member

lemmy commented Nov 3, 2021

The corresponding class is util.UniqueString. Related: Care must be taken because util.InternTable is guarded by a global lock, severely limiting concurrency if updated during model-checking.

Overall, it would be good to get rid of UniqueString because modern Java no longer has problems with strings. However, UniqueString appears everywhere in TLC. Thus, such refactoring has super high-risk.

@zwergziege
Copy link
Author

I just remembered that, because string handling in Java was slow at the time, TLC creates a unique ID for each string and uses it to check for equality of strings.

Thanks for pointing that out. This is indeed a problem because the unique id is basically the index in the internal table. Thus in my implementation comparing strings depends on whether are not they are represented as StringValue. A potential solution would be to compare on a character basis even if both values are StringValues and accept the entailed slow-down.

That means that whenever TLC represents a value in terms of elementary data types, it will have to check every sequence to see if it's a string. (TLC already has to check every function to see if it's a sequence.)

If you remove strings as elementary data types, this will change fingerprinting for all specs. I don't know if that will significantly affect performance.

I assume that every TLA+ value must result in the same fingerprint regardless of its representation? In this case fingerprinting would indeed have to be adapted in order to guarantee accurate fingerprints for specs using the new feature, leading to a slow down for old specs. Instead of checking each sequence for string-ness a better option would probably be to treat StringValues as if they were FcnRcdValues when fingerprinting. For some reason I do not comprehend so far the fingerprinting method of StringValue does not use the unique id but instead feeds the whole string to the compression function thus compressing one byte per character. Using the proposed method should require compressing 7 bytes per character (1+4 for the index and 1+1 for the character). Noteworthy: The StringValue fingerprinting code is duplicated in RecordValue.fingerprint (any other potential candidates?) which will naturally be slowed down as well.

Should I implement what I outlined above (i.e. is it worth the cost)? Do you see any other issues?

Overall, it would be good to get rid of UniqueString because modern Java no longer has problems with strings. However, UniqueString appears everywhere in TLC. Thus, such refactoring has super high-risk.

Getting rid of UniqueString seems to be a giant refactoring. The class even contains information on state variables and operators. Removing it from StringValue and other Value classes seems possible at first glance.

Tangential question: Fingerprinting FcnLambdaValues with infinite domains is impossible, right? I see they are converted to FcnRcdValues. In this process, the size of the domains (multi-argument functions) is queried. If one of those is a UserValue (which afaik are the only value type that can be inherently infinite) an exception is triggered.

@xxyzzn
Copy link

xxyzzn commented Nov 4, 2021

A potential solution would be to compare on a character basis even if
both values are StringValues and accept the entailed slow-down.

Anything that significantly slows down the evaluation of specs that do
not use the new feature is unacceptable. The only way this would be
acceptable if there's a TLC option to turn on the feature, so the
slowdown occurs only for specs that use the new feature. Instead of
the option, a standard module could be added with useful operators
that depend on the feature, and the user would have to extend that
module when using the feature.

I assume that every TLA+ value must result in the same fingerprint
regardless of its representation?

The same value having different fingerprints slows down model
checking. It doesn't introduce any errors for safety checking, I'm
not sure about liveness checking.

Should I implement what I outlined above (i.e. is it worth the cost)?
Do you see any other issues?

My opinion has been that such a small percentage of TLA+ specs would
use full string processing that it's not worth slowing down other
specs to add it. I have seen nothing to change that opinion.

I do have one question: To do what you want to do with first-class
strings, do you have to use TLA+ strings? Suppose TLA+ were augmented
with a new elementary datatype char and the TLA+ syntax were augmented
so that xyz' is a sequence of three chars that equals xy' \o `z'. A
standard module could provide operators to map strings to char
sequences and vice-versa. Would there then be any reason why you
would want the feature you're implementing?

@zwergziege
Copy link
Author

Maybe the following would be acceptable (I don't really like it though)?
When comparing strings use the per character approach, as this would be required when switching from UniqueString normal strings anyways.
When fingerprinting opt for ambiguous fingerprints, which should only occur in specs using the feature.

I do not depend on this feature. I simply tend to stumble upon corner cases. When I started out with TLA+ about a year ago I read "specifying systems" and set out to do things a certain way. It happened more than once that my ideas would not work because of some restrictions in TLC or TLAPM. All of it could be worked around, but especially concerning strings I saw #512 and thought that maybe I could contribute to it, especially since there seemed to be a plan on what to do. But I do see the downsides of it now.

I don't think that anyone would benefit from adding new syntax except if it were very visibly documented. One of my main issues with TLA and its tools is its lack of central documentation. If you want to look something up you often have to consult several documents and even then you might still get unexpected errors. I wanted to counteract this by making TLC support more of TLA+, but as the downsides seem to outweigh the benefits I think we can close this PR.

@xxyzzn
Copy link

xxyzzn commented Nov 4, 2021

Thank you for trying to help. I'm afraid one of the things that aren't properly documented is how best to help.

You wrote that a major problem for you has been the lack of central documentation. That is something we can certainly use help with. Neither Markus Kuppe nor I have the time to devote to it. I would be happy to advise on such a task. As is true of any effort to improve TLA+ or its tools, one should talk to us before implementing anything to make sure that it will be as useful as possible.

Leslie

@zwergziege
Copy link
Author

I think it would take at least one person who has both a decent amount of time to spare and a clue of what they are doing to realize a truly unified documentation of TLA+ and its tools. I fear I fulfill neither of those requirements. It is one thing to spend two days patching together a feature that is ultimately rejected because of a lack of consideration and another to create an overarching compendium / reference. Anything less than that would just be one more resource to consider and thus counterproductive.
There is a dilemma here since anyone proficient enough in the ways of TLA+ has probably better things to do than writing yet another documentation. Maybe the community approach can achieve the goal given enough time if it is easy enough to contribute. This would still require someone with a good overview to set the course which probably requires a large initial effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

application and DOMAIN of strings in TLC
3 participants