clarify string descriptions #875

bendyarm · 2022-02-11T00:33:08Z

Restructure the list of disallowed code points so the control characters that are disallowed
in all types of strings are listed at the beginning, and just provide the differences for the various types.
Clarify that backslash and quotation mark can occur literally, but only as part of an escape sequence.
Also put newlines in a couple of other places where the lines exceeded 80 characters.

marzer · 2022-02-11T10:22:37Z

I don't believe these changes make the spec any clearer. In fact, I'd argue it's a clear regression in clarity: aggregating the string information for all strings and then listing exceptions requires the reader to do the heavy lifting. Repeatedly listing it for each string type is redundant, yes, but it places less cognitive load on the reader and fits the flow of the document better.

marzer · 2022-02-11T11:20:22Z

toml.md

+All strings must contain only valid UTF-8 encoded characters as is the case for
+the TOML document as a whole.  Certain control characters are not allowed to
+occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E
+to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in
+literal strings or multi-line literal strings, those control characters can be
+described with escapes as specified below. Additional restrictions are described
+below.


If you do persist with this change, then I'd simplify this paragraph. There's just too much here.

Suggested change

-All strings must contain only valid UTF-8 encoded characters as is the case for
-the TOML document as a whole. Certain control characters are not allowed to
-occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E
-to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in
-literal strings or multi-line literal strings, those control characters can be
-described with escapes as specified below. Additional restrictions are described
-below.
+Strings must contain only valid UTF-8 encoded characters. Certain control characters are not allowed to occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, and U+007F.

The point about the basic strings supporting escaped control characters is already covered in the "basic strings" section.

(An argument can be made that the list of disallowed characters should be represented as an actual bullet-point list, though that's a matter of taste, and beyond the scope of this PR since it wasn't that way to begin with.)

arp242 · 2022-02-11

I think moving the "Any Unicode character may be used except [..]" paragraph one paragraph up makes sense. As it stands, it looks like it applies only to basic strings, rather than to all strings.

Can then also remove the same text for multi-line strings and "Control characters other than tab are not permitted in a literal string" at the end of the "Multi-line literal strings" section.

I'd write it as something like:

There are four ways to express strings: basic, multi-line basic, literal, and multi-line literal. All strings must be encoded as valid UTF-8, and can contain any codepoint except control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F). Multi-line strings can also contain newlines (U+000A) and carriage returns (U+000D).

This way you have "what bytes/characters can be in a string?" in a single concise paragraph.

marzer · 2022-02-11

@arp242 I like that wording.

bendyarm · 2022-02-12

@arp242 Thanks, that is an improvement. Putting those details at the beginning rather than at the end of each section makes all the sections easier to understand. I have also reworded the last paragraph in strings to make it clear that it is not another part of the spec but just advice (the paragraph starting "Because most control characters are not permitted...").

ChristianSi · 2022-02-13T14:44:26Z

I would avoid the phrase "All strings must be encoded as valid UTF-8" since it suggests that strings can be encoded independently of the rest of a TOML document, which of course is not the case. The whole document is encoded as UTF-8; so, when looking at strings, we can only see them as series of Unicode codepoints, not of bytes. So, instead of

"All strings must be encoded as valid UTF-8, and can contain any codepoint except ..."

I'd propose

"Strings can contain any valid Unicode codepoint except ..."

bendyarm · 2022-02-14T03:41:29Z

@ChristianSi Good idea! I also simplified the language to avoid the double negative, clarifying the situation with tab.

This is what the paragraph looks like in this PR now:

There are four ways to express strings: basic, multi-line basic, literal, and
multi-line literal. Strings can contain any valid Unicode codepoint except the
following control characters: U+0000 to U+0008, U+000A to U+001F, and
U+007F. Note that tab (U+0009) is allowed. Multi-line strings can also contain
newlines (U+000A) and carriage returns (U+000D).
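
As a rough illustration of how this paragraph reads as a rule, here is a minimal Python sketch. The function names are hypothetical and not part of the TOML spec or of any parser; the sketch covers only the control-character restriction on literal occurrences (not escape sequences or other string rules).

```python
def is_allowed_in_string(cp: int, multiline: bool) -> bool:
    """Return True if code point `cp` may appear literally in a string.

    Assumption: only the control-character rule from the paragraph is
    modelled; escapes and delimiter rules are out of scope.
    """
    # Multi-line strings additionally admit newline and carriage return.
    if multiline and cp in (0x0A, 0x0D):
        return True
    # Disallowed control characters: U+0000 to U+0008, U+000A to U+001F, U+007F.
    if 0x00 <= cp <= 0x08 or 0x0A <= cp <= 0x1F or cp == 0x7F:
        return False
    # Everything else, including tab (U+0009), is allowed.
    return True


def first_disallowed(text: str, multiline: bool = False):
    """Return (index, code point) of the first disallowed character, or None."""
    for i, ch in enumerate(text):
        if not is_allowed_in_string(ord(ch), multiline):
            return i, ord(ch)
    return None
```

With this sketch, a tab passes in every string type, while a newline is rejected unless `multiline=True` is passed.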

@arp242 What do you think about this change to your rewording?
@abravalheri Does this rewording fix the issue about tabs implicit in PR #878?

abravalheri · 2022-02-24T11:04:34Z

toml.md

+multi-line literal. Strings can contain any valid Unicode codepoint except the
+following control characters: U+0000 to U+0008, U+000A to U+001F, and
+U+007F. Note that tab (U+0009) is allowed. Multi-line strings can also contain
+newlines (U+000A) and carriage returns (U+000D).


I think that saying that U+000A and U+000D are not allowed first¹ and then adding an exception for multi-line strings is kind of a double negative (an exception of the previous exception)...

I would recommend restricting the code point ranges/enumeration to the ones that are allowed in all types of strings.

Then I would add a second (separated) statement specifically saying that "basic" and "literal" strings (single-line) don't allow newlines/carriage returns.

For example, something like:

Strings can contain any valid Unicode codepoint except the following control characters: U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, and U+007F. Note that tab (U+0009) is allowed. Newlines (U+000A) and carriage returns (U+000D) are allowed in multi-line strings but forbidden in basic and literal strings.
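
To check that this two-rule phrasing describes the same set of characters as the "exception of an exception" phrasing, one can compare the two mechanically. The sketch below is only an illustration: the function names are made up, and it assumes the intended range in the proposal is U+000E to U+001F, matching the ranges quoted earlier in this thread.

```python
def allowed_exception_style(cp: int, multiline: bool) -> bool:
    # Earlier wording: forbid U+0000-U+0008, U+000A-U+001F and U+007F,
    # then re-allow LF and CR for multi-line strings.
    if multiline and cp in (0x0A, 0x0D):
        return True
    return not (cp <= 0x08 or 0x0A <= cp <= 0x1F or cp == 0x7F)


def allowed_two_rule_style(cp: int, multiline: bool) -> bool:
    # Rule 1: characters forbidden in every string type.
    if cp <= 0x08 or cp in (0x0B, 0x0C) or 0x0E <= cp <= 0x1F or cp == 0x7F:
        return False
    # Rule 2: LF and CR are allowed only in multi-line strings.
    if cp in (0x0A, 0x0D):
        return multiline
    return True


# Both phrasings only restrict code points below U+0080, so checking the
# first 256 code points is enough to compare them.
mismatches = [
    (cp, multiline)
    for cp in range(0x100)
    for multiline in (False, True)
    if allowed_exception_style(cp, multiline) != allowed_two_rule_style(cp, multiline)
]
```

An empty `mismatches` list means the two wordings differ only in presentation, not in which characters they permit.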

¹ U+000A and U+000D are elements of the previously mentioned character ranges/enumeration.

abravalheri · 2022-02-24T11:07:38Z

> Does this rewording fix the issue about tabs implicit in PR #878?

Yes! Thank you @bendyarm, I will close that PR in favour of this one.

eksortso · 2022-11-10T22:46:50Z

toml.md

-Control characters other than tab are not permitted in a literal string. Thus,
-for binary data, it is recommended that you use Base64 or another suitable ASCII
-or UTF-8 encoding. The handling of that encoding will be application-specific.
+Because most control characters are not permitted even in literal and multi-line
+literal strings, these literal strings are not suited for representing blobs of
+binary data.  It is recommended that you use Base64 or another suitable ASCII or
+UTF-8 encoding. The handling of that encoding will be application-specific.


We have an alternative paragraph that expresses these same sentiments in #929.
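
For readers unfamiliar with the recommendation in the paragraph under review, here is a small Python sketch of what "use Base64" could look like in practice. The key name `blob` and the sample bytes are made up for illustration; how an application decodes the value is left to the application, as the paragraph says.

```python
import base64

# Hypothetical binary payload containing bytes that could not appear
# literally in any TOML string (NUL, U+0001, DEL) or as valid UTF-8 (0xFF).
raw = bytes([0x00, 0x01, 0x7F, 0xFF])

# Base64 yields plain ASCII text, which is safe inside a basic string.
encoded = base64.b64encode(raw).decode("ascii")
toml_line = f'blob = "{encoded}"'

# The consuming application reverses the encoding after parsing the document.
decoded = base64.b64decode(encoded)
```

The Base64 alphabet contains no control characters, quotation marks, or backslashes, so the encoded value needs no escaping in a basic string.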

Commit 7023cb1: clarify string descriptions

marzer suggested changes Feb 11, 2022

Commit 4d9eba6: make string spec wording more concise while keeping precision

Commit fa22931: reword four ways of strings

ChristianSi approved these changes Feb 15, 2022

abravalheri reviewed Feb 24, 2022

abravalheri mentioned this pull request Feb 24, 2022: Clarify tab can be escaped (or not) in basic strings #878 (Closed)

ChristianSi mentioned this pull request Oct 28, 2022: TOML 1.1.0 #928 (Open)

eksortso reviewed Nov 10, 2022
