New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document: don't allow characters with unicode property Bidi_Class in source files #10197
Conversation
Welcome!
What you've done seems to me like the good choice under the circumstances. References to the long names are few. If we need to grow the names again in the future we could always reconsider at that point.
Yes
I'd say the PR would be mergeable either way. If you do decide to include a link and/or explanatory wording, it can be brief, as the topic is well covered elsewhere. (But note that from the spec, we don't link to particular scala/scala or scala/bug issues or PRs.)
Though we do accept PRs against the 2.12 spec, we don't really care about the 2.12 spec anymore. The 2.13 spec is the one we actually care about. And the 2.13 spec should not include historical information about earlier versions; that would be considered clutter. If you care to take the time to adjust both the 2.12 and 2.13 versions separately, fine, but we would also be perfectly content if you simply targeted the 2.13.x branch only. |
14c1934
to
2c59f07
Compare
I adjusted the PR for 2.13. Please adjust the milestone tag, etc. |
2e8382c
to
88ae672
Compare
spec/01-lexical-syntax.md
Outdated
@@ -25,6 +25,9 @@ classes (Unicode general category given in parentheses): | |||
1. Operator characters. These consist of all printable ASCII characters | |||
(`\u0020` - `\u007E`) that are in none of the sets above, mathematical | |||
symbols (`Sm`) and other symbols (`So`). | |||
1. [Bidirectional explicit formatting](https://www.unicode.org/reports/tr9/#Bidirectional_Character_Types) | |||
characters. The nine characters `\u202a - \u202e` and `\u2066 - \u2069`, | |||
inclusive. These are forbidden from appearing in character and string literals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not correct, the characters are not allowed at all in source files. In string / character literals, the characters can be included as unicode escapes, but not literally.
In 2.12 (just for the record), the characters are also not allowed at all. The difference is that in unicode escapes (of these forbidden characters) are allowed everywhere (not only in character / string literals) in 2.12.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got this wrong; the commit to 2.13 was different from the one to 2.12 and I thought 2.13 was more permissive, but it's the other way around. Fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed this. It now seems a bit odd that this is in a list prefaced by saying "To construct tokens...". Do you think it should be moved to the first line of the section ("Scala source code consists of Unicode text" [and these characters are allowed])?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I think the spec only needs to be changed in the first section. Something like this:
The nine Bidirectional explicit formatting characters
\u202a - \u202e
and\u2066 - \u2069
(inclusive) are forbidden from appearing in source files. Note that they can be represented using unicode escapes in string and character literals.
Paragraphs or sentences starting with "note" in the spec are clarifications that don't introduce new information. I think that's all that's needed here, no changes are needed in character / string literal sections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then do you think the other file (13-syntax-summary.md) doesn't need to be changed at all, and adding this note in this file is enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me that would be enough, but maybe others would prefer also a not in the syntax summary file? Things are not very precise already, for example stringElement ::= charNoDoubleQuoteOrNewline | escapeSeq
but charNoDoubleQuoteOrNewline
is not actually defined in the spec as far as I can tell...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I noted the implicitly-defined xxxOrYyy constructions in the syntax summary (there are several of them) in a previous comment.
FWIW I would prefer a note in the syntax summary; that's where I looked first (and I was surprised that the summary is not an exact subset of the full grammar's productions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a new commit that makes only that change to both files and does not change any grammar productions. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me that looks good, thank you!
2abe09c
to
4501c90
Compare
…source files Update documentation with changes made in scala#10017
4501c90
to
d874f79
Compare
Thanks, I agree it didn't feel right as a production issue. I recall that the deceptively simple "Scala source code consists of Unicode text." was subject to long discussions. BTW, I just noticed that JLS uses |
Are there any other Unicode codepoints that are not allowed in Scala source code? Or did the discussions agree on allowing all of Unicode, and then this bidi issue came up? |
That change wasn't so long ago (mid-pandemic): b124a54 I only remember that sjrd stepped in with clarifying wording. Yes, all of Unicode, but then this security issue came up, as the spirits do on Halloween. 🎃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks you @danarmak
Update documentation with changes made in #10017. Previously discussed on scala-contributors.
This is my first PR and there are some things I'm not sure about: