Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clarify string descriptions #875

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

bendyarm
Copy link
Contributor

Restructure the list of disallowed code points so the control characters that are disallowed
in all types of strings are listed at the beginning, and just provide the differences for the various types.
Clarify that backslash and quotation mark can occur literally, but only as part of an escape sequence.
Also put newlines in a couple of other places where the lines exceeded 80 characters.

@marzer
Copy link
Contributor

marzer commented Feb 11, 2022

I don't believe these changes make the spec any any clearer. In fact I'd argue it's a clear regression in clarity; aggregating the string information for all strings and listing exceptions requires the reader to do the heavy lifting. Repeatedly listing it for each string type is redundant, yes, but expects less cognitive load from the reader and fits the flow of the document better.

toml.md Outdated
Comment on lines 264 to 270
All strings must contain only valid UTF-8 encoded characters as is the case for
the TOML document as a whole. Certain control characters are not allowed to
occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E
to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in
literal strings or multi-line literal strings, those control characters can be
described with escapes as specified below. Additional restrictions are described
below.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do persist with this change, then I'd simplify this paragraph. There's just too much here.

Suggested change
All strings must contain only valid UTF-8 encoded characters as is the case for
the TOML document as a whole. Certain control characters are not allowed to
occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E
to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in
literal strings or multi-line literal strings, those control characters can be
described with escapes as specified below. Additional restrictions are described
below.
Strings must contain only valid UTF-8 encoded characters. Certain control characters are not allowed to occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, and U+007F.

The point about the basic strings supporting escaped control characters is already covered in the "basic strings" section.

(An argument can be made that the list of disallowed characters should be represented as an actual bullet-point list, though that's a matter of taste, and beyond the scope of this PR since it wasn't that way to begin with.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think moving the "Any Unicode character may be used except [..]" one paragraph up makes sense. Now it just looks like it applies only to basic strings, rather than all strings.

Can then also remove the same text for multi-line strings and "Control characters other than tab are not permitted in a literal string" at the end of the "Multi-line literal strings" section.

I'd write it as something like:

There are four ways to express strings: basic, multi-line basic, literal, and multi-line literal. All strings must be encoded as valid UTF-8, and can contain any codepoint except control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F). Multi-line strings can also contain newlines (U+000A) and carriage returns (U+000D).

This way you have "what bytes/characters can be in a string?" in a single concise paragraph.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arp242 I like that wording.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arp242 Thanks, that is an improvement. Putting those details at the beginning rather than at the end of each section makes all the sections easier to understand. I have also reworded the last paragraph in strings to make it clear it is not a another part of the spec but just advice (paragraph starting "Because most control characters are not permitted...").

@ChristianSi
Copy link
Contributor

I would avoid the phrase "All strings must be encoded as valid UTF-8" since it suggests that strings can be encoded independently of the rest of a TOML document, which of course is not the case. The whole document is encoded as UTF-8; so, when looking at strings, we can only see them as series of Unicode codepoints, not of bytes. So, instead of

"All strings must be encoded as valid UTF-8, and can contain any codepoint except ..."

I'd propose

"Strings can contain any valid Unicode codepoint except ..."

@bendyarm
Copy link
Contributor Author

@ChristianSi Good idea! I also simplified the language to avoid the double negative, clarifying the situation with tab.

This is what the paragraph looks like in this PR now:

There are four ways to express strings: basic, multi-line basic, literal, and
multi-line literal. Strings can contain any valid Unicode codepoint except the
following control characters: U+0000 to U+0008, U+000A to U+001F, and
U+007F. Note that tab (U+0009) is allowed. Multi-line strings can also contain
newlines (U+000A) and carriage returns (U+000D).

@arp242 What do you think about this change to your rewording?
@abravalheri Does this rewording fix the issue about tabs implicit in PR #878 ?

multi-line literal. Strings can contain any valid Unicode codepoint except the
following control characters: U+0000 to U+0008, U+000A to U+001F, and
U+007F. Note that tab (U+0009) is allowed. Multi-line strings can also contain
newlines (U+000A) and carriage returns (U+000D).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that saying that U+000A and U+000D are not allowed first1 and then adding an exception for multi-line strings is kind of a double negative (an exception of the previous exception)...

I would recommend restricting the code point ranges/enumeration to the ones that are allowed in all types of strings.

Then I would add a second (separated) statement specifically saying that "basic" and "literal" strings (single-line) don't allow newlines/carriage returns.

For example, something like:

Strings can contain any valid Unicode codepoint except the following control characters:
U+0000 to U+0008, U+000B, U+000C, U+000E, U+001F, and U+007F.
Note that tab (U+0009) is allowed.
Newlines (U+000A) and carriage returns (U+000D) are allowed in multi-line strings
but forbidden in basic and literal strings.

Footnotes

  1. U+000A and U+000D are elements of the previously mentioned character ranges/enumeration

@abravalheri
Copy link

Does this rewording fix the issue about tabs implicit in PR #878 ?

Yes! Thank you @bendyarm, I will close that PR in favour of this one.

Comment on lines -408 to +410
Control characters other than tab are not permitted in a literal string. Thus,
for binary data, it is recommended that you use Base64 or another suitable ASCII
or UTF-8 encoding. The handling of that encoding will be application-specific.
Because most control characters are not permitted even in literal and multi-line
literal strings, these literal strings are not suited for representing blobs of
binary data. It is recommended that you use Base64 or another suitable ASCII or
UTF-8 encoding. The handling of that encoding will be application-specific.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an alternative paragraph that expresses these same sentiments in #929.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants