Excluding LF in 'allowed-comment-char' is confusing #995

arp242 · 2023-09-30T00:10:27Z

I'm adding a test to toml-test for "Relax comment parsing; most control characters are again permitted.", and implementing it in my parser, and I'm rather confused why 0x0a (LF) is excluded. You need to support this if you want CRLF, no?

There's a long thread in #924 about this, but to be honest I stopped reading after a few comments – people shouldn't need to read entire threads to understand the specification.

I think the intent is that "CR" and "CRLF" are allowed, but "LF" isn't? That is what it does, I think, but it's really not very obvious IMHO.

At the very least there should be a comment in the ABNF, and the phrasing in toml.md ("that could cause problems during editing or processing: U+0000, and U+000A to U+000D") could also be improved. Now it's too easy to be misinterpreted as "only CR newlines are allowed and LF is forbidden, full stop", which isn't what was intended.
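To make the question concrete, here is a minimal sketch of how a parser might apply the rule as phrased in toml.md (excluded: U+0000 and U+000A to U+000D). The names `is_allowed_comment_char`, `scan_comment`, and `at_newline_or_eof` are hypothetical and not taken from any real TOML implementation, and the sketch assumes that every other code point is allowed. The point it illustrates is that excluding a character from the comment body only stops the scan; whether the document is valid depends on what follows.

```python
# Hypothetical sketch, not from any TOML implementation.
# Assumption: only U+0000 and U+000A..U+000D are excluded from comments,
# as in the toml.md phrasing quoted above; everything else is accepted.
EXCLUDED_IN_COMMENT = {0x00} | set(range(0x0A, 0x0E))  # 0x00, 0x0A..0x0D


def is_allowed_comment_char(ch: str) -> bool:
    """Return True if `ch` may appear inside a comment body."""
    return ord(ch) not in EXCLUDED_IN_COMMENT


def scan_comment(text: str, start: int) -> int:
    """Return the index of the first character after the comment body.

    `text[start]` must be the '#' that opens the comment.
    """
    if text[start] != "#":
        raise ValueError("not at the start of a comment")
    i = start + 1
    while i < len(text) and is_allowed_comment_char(text[i]):
        i += 1
    return i


def at_newline_or_eof(text: str, i: int) -> bool:
    """Return True if text[i:] is empty or begins with LF or CRLF."""
    return i == len(text) or text.startswith("\n", i) or text.startswith("\r\n", i)


crlf_doc = "# hi\r\nx = 1"
bare_cr_doc = "# hi\rx = 1"

end = scan_comment(crlf_doc, 0)
print(end, at_newline_or_eof(crlf_doc, end))        # comment stops at CR; CRLF follows

end = scan_comment(bare_cr_doc, 0)
print(end, at_newline_or_eof(bare_cr_doc, end))     # comment stops at CR; no valid newline
```

In this sketch a CRLF line ending still works even though neither CR nor LF is an allowed comment character, while a CR that is not followed by LF leaves the parser at a position that is neither a newline nor end of input.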

arp242 · 2023-09-30T00:19:47Z

Actually 0x0d (CR) is also excluded ... this is even more confusing. What's the point of excluding that?

arp242 · 2023-09-30T00:32:22Z

And a whole bunch of tests fail because now the allowed character set for strings and comments differ, and there was always the assumption those were the same.

I really wish we'd have "fix toml-test" as a requirement for any change.

None of the hypotheticals are even solved, because you still need to parse by character and exclude 0x00, 0x0b, 0x0c; you can still invalidate documents with a control character.

All for a problem no one actually reported. IMHO it's a completely pointless change that doesn't change anything for anyone, doesn't solve any practical problem, and is just ivory tower wankery and arguing for the sake of arguing.

Literally no one using TOML will ever notice this change was made.

But I'm the one mucking about with toml-test and dealing with the churn (16 failing tests...)

And all the other implementations will need to be updated too.

So great, thanks.

This reverts commit ab74958.

I'm a simple guy. Someone reports a problem, I fix it. No one reports a problem? There is nothing to fix so I go drink beer. No one really reported this as a problem, so there isn't anything to fix. But it *does* introduce entirely needless churn for all TOML implementations.

Do we need to forbid *anything* in comments? Probably not. In strings we probably only need to forbid \x00. But at least before it was consistent with strings, and more importantly, what everyone wrote code for, which is tested, and already works.

And [none of the hypotheticals](toml-lang#567 (comment)) on why this is "needed" are practical issues people reported, and most aren't even fixed: a comment can still invalidate the file, you must still parse each character in a comment as some are still forbidden, the performance benefits are so close to zero they might as well be zero, and you still can't "dump whatever you like" in comments.

So it doesn't *actually* change anything, it just changes "disallow this set of control characters" to ... another (smaller) set. That's not really a substantial change.

The only (minor) real-world issue that was reported (from the person doing the Java implementation) was that "it's substantially more complicated to parse out control characters in comments and raise an error, and this kind of strictness provides no real advantage to users". And that's not addressed at all with this.

---

And while I'm at it, let me have a complaint about how this was merged:

1. Two people, both of whom actually maintain implementations, say they don't like this change.
2. This is basically ignored.
3. Three people continue writing a fairly large number of extensive comments, so anyone who wasn't already interested in this change unsubscribes and/or goes 🤷
4. "Consensus".

Sometimes I feel TOML attracts people who like to argue things from a mile-high ivory tower with abstract arguments that have only superficial bearing on actual pragmatic reality.

Fixes toml-lang#995

ChristianSi · 2023-10-01T15:08:35Z

> I think the intent is that "CR" and "CRLF" are allowed, but "LF" isn't?

No, LF and CRLF are not allowed in a comment; they end the line and hence the comment.

As for why CR, though not a valid line terminator on its own, is not allowed in comments, I think this comment explains it best: #924 (comment).


ChristianSi · 2023-10-01T15:12:56Z

As for being as permissive as possible in comments and not erroring out on users because of harmless control characters in comments, I think that's a good thing and certainly not something we should apologize for.

arp242 · 2023-10-01T15:13:49Z

Yeah, but the way it's phrased now is confusing, IMHO. Or at least: I was confused by it, but maybe that's just on me 🙃

Anyway, I just so happened to submit a PR to revert this; while this is entirely a fixable issue as such, I don't see the point in spending any time on this as I don't see how it fixes anything.

ChristianSi · 2023-10-01T15:15:03Z

Hmm, if the language is confusing, suggestions to improve it are certainly welcome!

ChristianSi · 2023-10-01T15:18:54Z

As for the "doesn't fix anything", I don't think that's true? Consider the case where somebody includes a harmless control char (say an ASCII bell) in a comment, and the TOML parser balks with "invalid document, can't process!" That's unexpected and not particularly user-friendly, I'd say, considering that comments are supposed to be ignored anyway.

So I'd say doing away with such error messages as much as possible is a net win, isn't it?

A bare CR is probably the only control char that's not "harmless", since it could possibly lead to the user seeing a different document than the parser, which is why the special treatment for it makes sense (and was adopted).

arp242 · 2023-10-01T15:35:34Z

CR was already forbidden, so this PR didn't change anything about that:

> Control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F) are not permitted in comments.
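For reference, the difference between the two rule sets can be sketched in a few lines. This is only an illustration: `forbidden_old` encodes the sentence quoted above, `forbidden_relaxed` encodes the toml.md phrasing cited earlier in this thread (U+0000, and U+000A to U+000D), and both function names are hypothetical.

```python
# Hypothetical sketch comparing the two rule sets discussed in this thread.
# Only ASCII code points are considered; non-ASCII handling is out of scope.

def forbidden_old(cp: int) -> bool:
    """Old rule: U+0000 to U+0008, U+000A to U+001F, and U+007F are forbidden."""
    return cp <= 0x08 or 0x0A <= cp <= 0x1F or cp == 0x7F


def forbidden_relaxed(cp: int) -> bool:
    """Relaxed rule: only U+0000 and U+000A to U+000D are forbidden."""
    return cp == 0x00 or 0x0A <= cp <= 0x0D


# Code points that were rejected before and are accepted under the relaxed rule.
newly_allowed = [cp for cp in range(0x80)
                 if forbidden_old(cp) and not forbidden_relaxed(cp)]

print([f"U+{cp:04X}" for cp in newly_allowed])
print("BEL newly allowed:", 0x07 in newly_allowed)
print("LF still forbidden:", forbidden_relaxed(0x0A))
print("CR still forbidden:", forbidden_relaxed(0x0D))
```

Under these assumptions the relaxed rule admits U+0001 to U+0008, U+000E to U+001F, and U+007F, while LF and CR are rejected by both versions.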

ChristianSi · 2023-10-01T16:51:43Z

It didn't change that, and quite wisely so, I'd say, but it did allow the other control characters in, thus making TOML more robust and user-friendly.

arp242 · 2023-10-01T18:42:26Z

> thus making TOML more robust and user-friendly.

Who is running into problems with this? Have people reported problems?

arp242 linked a pull request Oct 1, 2023 that will close this issue

Revert "Permit more control characters in comments (#924)" #996

Open

pradyunsg added this to the 1.1.0-rc0 milestone Apr 7, 2024

pradyunsg added the clarification label Apr 7, 2024
