Accept supplementary Unicode characters in identifiers #9687

som-snytt · 2021-07-04T10:04:52Z

Previously the lexer only accepted identifiers drawn from the basic multilingual plane.

build.sbt

src/compiler/scala/tools/nsc/ast/parser/Scanners.scala

martijnhoekstra · 2021-07-08T08:58:42Z

If I read this correctly, the biggest change is in plain identifiers, which now allow non-BMP characters if allowed by the identifier grammar. Strings literals, interpolations, back quoted identifiers, char literals and other literals are all unchanged: even where the spec didn't allow supplementary characters, the implementation did.

Changed:

Plain identifiers:
- non-BMP characters are allowed (if allowed by the identifier grammar) (new in this PR)
- Invalid codepoints or code sequences are not allowed in any way, shape or form (unchanged)
~~Identifiers in string interpolations~~
- but maybe that is wrong! ✅
Chars literals:
- non-BMP characters are consumed but not allowed (they don't fit)

Unchanged:

Strings:
- non-BMP characters are accepted (no change)
- Invalid codepoints are allowed (no change)
- Invalid code sequences (unpaired/mismatched surrogates) are allowed if they're written as Unicode escapes (no change)
- invalid code sequences (unpaired/mismatched surrogates) are allowed if they somehow made it to the scanner (no change). Unsure how/if that's possible in different source implementations.
Interpolations: same as strings, but without interpreting escapes (which is up to the interpolator) (no change)
Chars literals:
- Invalid code sequences (unpaired surrogates, other non-characters like U+FEFF) are allowed as unicode escapes
- The same invalid code sequences would be allowed if they make it to the scanner, which may be impossible
Other literals:
- moot, it's not like, say, integer literals would accept any non-BMP characters anyway.

How are back-quoted identifiers handled here? What do they accept and does that break on any backend? -- edit: I checked this, they are properly encoded.

Implementation wise it's really nice.

This change is not yet reflected in the spec. Maybe spec work can be done separately from this PR? That's up to the lightbend team to decide I think. Same for scalameta and scala 3 parity.

lrytz · 2021-07-08T13:53:33Z

Thanks for the overview @martijnhoekstra

From what I can tell this LGTM, it aligns with Java. Agree that it's important to align Scala 3, scalameta, ping IntelliJ.

This change is not yet reflected in the spec.

Are you sure? Isn't this PR fixing the implementation to match the spec?

SethTisue · 2021-07-08T14:38:52Z

@som-snytt are you interested in forward-porting to Scala 3?

martijnhoekstra · 2021-07-08T15:09:42Z

Are you sure? Isn't this PR fixing the implementation to match the spec?

Yes, on the very first line :)

Scala programs are written using the Unicode Basic Multilingual Plane (BMP) character set; Unicode supplementary characters are not presently supported.

Which could be outright deleted or maybe replaced with "Scala programs are written using the Unicode character set"

scalac has been more lenient than the spec mandates in it's input for as long as I've known and accepted non-BMP characters in strings for example.

class AlwaysHasBeen {
  final val `𐐀` = "𐐀 has always been allowed"
  final val always: "𐐀 has always been allowed" = "\ud801\udc00 has always been allowed"
}

was already accepted (and still is).

martijnhoekstra · 2021-07-08T15:23:55Z

scala also accepts (and has accepted) surrogate characters in UTF-8. UTF-8 has no surrogate characters, and these code sequences are not "real" Unicode. So if you're a stickler for both the scala spec and the Unicode spec, you can argue that the spec is incorrect because of that, and in many places where it says it accepts an Unicode character, it actually accepts a codepoint that doesn't encode an unicode Character.

The problem is there isn't anyone you can argue that to, because by the time you're halfway to the point you're trying to make, their eyes have long glazed over, which I suppose is completely justified.

som-snytt · 2021-07-08T15:58:52Z

@martijnhoekstra That's right, anything going through getLitChar just takes a char as data. I didn't even need that residual call to isSupplementary there, so I will remove it and squash those two commits after your previous review. (To recap: my first pass disallowed "orphan surrogates" in literals.)

I'll also add test for string interpolation with supplementary char in constant part; if parser handles the interpolated identifier, then it just vacuums up the constants.

I'll update the spec.

@SethTisue yes, I will forward port, now that Martijn has explained to me what I did. Also, I just right-clicked to add Martijn to the dictionary and get rid of the red squiggle.

som-snytt · 2021-07-08T16:06:26Z

To the last mindbending point, all this is an artifact of JVM support. If we were ingesting codepoints, the only check would be that we read a valid codepoint, as we must do when assembling a supplementary character from surrogates.

I was going to tweet that day that I tried to think of "high surrogate" and came up with "high suffragette".

On "non-Unicode" or "broken" characters, I'm still wondering if it's worth accepting non-chars in unicode escape form only. I will follow up later if that is deemed useful.

martijnhoekstra · 2021-07-08T17:18:54Z

Accepting broken characters in source form is pretty hard: by the time you go from a source encoding to an Array[Char], the encoder should have already rejected the bad sequences.

src/compiler/scala/tools/nsc/ast/parser/Scanners.scala

som-snytt · 2021-07-08T21:29:27Z

By broken char, I meant run/t1406b, where we read a high surrogate only (not a unicode escape). As you point out, it's just a non-character, so is it more special than other non-character characters?

som-snytt · 2021-07-09T00:45:13Z

Touched up the spec. The first sentence really needs to grab the reader.

Noticed that it would be nice to support varid starting with supplementary char, as that is defined in terms of letter and lower. As noted in the review, supplementary chars are encoded, and it's probably excessive work to change that, though not necessarily undesirable. The minimal change is for isVariableName to decode if necessary.

som-snytt · 2021-07-09T01:15:06Z

Retouched up the spec.

spec/01-lexical-syntax.md

som-snytt · 2021-07-09T09:20:52Z

"Supplimentary" is when you try to be nice and supportive, but also suave and supple.

Edit: suppliments are the complimentary supplements they give out on race day.

spec/01-lexical-syntax.md

SethTisue · 2021-07-22T20:47:53Z

I suggest we hold off on merging this until the Scala 3 forward-port has landed. (I don't like to put the Scala 3 folks on the hook for something without their consent, and also the forward-port might produce additional useful review feedback.)

som-snytt · 2021-07-22T20:54:06Z

OK, I'll go ahead and attempt to forward port. I think the metaphor for "on the hook" here is "catch and release". You put the immature feature back into the ecosystem until it is big enough for frying.

Also accept supplementary characters in simple identifiers in string interpolation, and as leading characters of varids. Reject a supplementary character as a char literal. Clarify Unicode support in the spec. Clarify letter and special operator symbols in precedence table.

som-snytt · 2021-07-23T04:08:35Z

Had to move a dotty test out of the way, born of that necessity was

tests/neg/sorrygates.scala

SethTisue · 2021-08-02T23:57:20Z

seems clear the Scala 3 forward port will land, so...

SethTisue · 2021-08-02T23:57:52Z

Thank you @som-snytt! Let a thousand monstrous, bizarre identifier names bloom! 👹

unkarjedy · 2021-11-02T16:01:15Z

IntelliJ issue https://youtrack.jetbrains.com/issue/SCL-19655

som-snytt · 2021-11-04T14:56:52Z

adriaanm or retronym would have said but now you'll have to notify the IDEs, which used to mean the eclipse plugin first of all. tpolecat already complained that this repo does not sync with its tooling consumers on any level. This was my fault, as I am lazy and not on a first name basis with any maintainers.

Edit: "seems clear the Scala 3 forward port will land, so..." or maybe not?

martijnhoekstra · 2021-11-04T16:22:12Z

Everyone wants to be on a first name basis with you, Som. I knew a guy named Chip, he had a cat named Munk. Anyway, which first names should we have pinged?

SethTisue · 2022-11-02T16:44:57Z

followup PR landed in 2.13.9: #9805

scala-jenkins added this to the 2.13.7 milestone Jul 4, 2021

som-snytt force-pushed the issue/1406 branch 2 times, most recently from 74de6d5 to e632844 Compare July 4, 2021 21:00

som-snytt commented Jul 4, 2021

View reviewed changes

build.sbt Outdated Show resolved Hide resolved

martijnhoekstra reviewed Jul 5, 2021

View reviewed changes

src/compiler/scala/tools/nsc/ast/parser/Scanners.scala Outdated Show resolved Hide resolved

som-snytt force-pushed the issue/1406 branch 3 times, most recently from cf8c066 to 468dd35 Compare July 6, 2021 19:09

dwijnand requested a review from martijnhoekstra July 7, 2021 16:13

SethTisue added the release-notes worth highlighting in next release notes label Jul 7, 2021

martijnhoekstra approved these changes Jul 8, 2021

View reviewed changes

src/compiler/scala/tools/nsc/ast/parser/Scanners.scala Outdated Show resolved Hide resolved

src/compiler/scala/tools/nsc/ast/parser/Scanners.scala Show resolved Hide resolved

som-snytt force-pushed the issue/1406 branch from 468dd35 to 5528f53 Compare July 8, 2021 20:24

som-snytt commented Jul 8, 2021

View reviewed changes

src/compiler/scala/tools/nsc/ast/parser/Scanners.scala Show resolved Hide resolved

som-snytt force-pushed the issue/1406 branch from 5528f53 to 6c1bb5b Compare July 8, 2021 21:21

sjrd reviewed Jul 9, 2021

View reviewed changes

spec/01-lexical-syntax.md Outdated Show resolved Hide resolved

som-snytt force-pushed the issue/1406 branch from de95577 to 28efec2 Compare July 9, 2021 07:57

lrytz reviewed Jul 9, 2021

View reviewed changes

spec/01-lexical-syntax.md Outdated Show resolved Hide resolved

magnolia-k mentioned this pull request Jul 11, 2021

Scala3 should support Unicode supplementary characters scala/scala3#13045

Closed

som-snytt force-pushed the issue/1406 branch from 28efec2 to b124a54 Compare July 22, 2021 21:08

som-snytt mentioned this pull request Jul 23, 2021

Accept supplementary characters scala/scala3#13136

Merged

SethTisue changed the title ~~Accept supplementary characters~~ Accept supplementary Unicode characters in source code Aug 2, 2021

SethTisue merged commit 3f06ac7 into scala:2.13.x Aug 2, 2021

som-snytt deleted the issue/1406 branch August 3, 2021 02:00

SethTisue changed the title ~~Accept supplementary Unicode characters in source code~~ Accept supplementary Unicode characters in identifiers Oct 29, 2021

unkarjedy mentioned this pull request Nov 4, 2021

supplementary characters from Unicode_So category are not parsed in operator identifier scala/bug#12482

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accept supplementary Unicode characters in identifiers #9687

Accept supplementary Unicode characters in identifiers #9687

som-snytt commented Jul 4, 2021 •

edited by SethTisue

martijnhoekstra commented Jul 8, 2021 •

edited by som-snytt

lrytz commented Jul 8, 2021

SethTisue commented Jul 8, 2021

martijnhoekstra commented Jul 8, 2021

martijnhoekstra commented Jul 8, 2021

som-snytt commented Jul 8, 2021

som-snytt commented Jul 8, 2021 •

edited

martijnhoekstra commented Jul 8, 2021

som-snytt commented Jul 8, 2021

som-snytt commented Jul 9, 2021

som-snytt commented Jul 9, 2021

som-snytt commented Jul 9, 2021 •

edited

SethTisue commented Jul 22, 2021

som-snytt commented Jul 22, 2021

som-snytt commented Jul 23, 2021

SethTisue commented Aug 2, 2021

SethTisue commented Aug 2, 2021 •

edited

unkarjedy commented Nov 2, 2021

som-snytt commented Nov 4, 2021 •

edited

martijnhoekstra commented Nov 4, 2021

SethTisue commented Nov 2, 2022

Accept supplementary Unicode characters in identifiers #9687

Accept supplementary Unicode characters in identifiers #9687

Conversation

som-snytt commented Jul 4, 2021 • edited by SethTisue

martijnhoekstra commented Jul 8, 2021 • edited by som-snytt

lrytz commented Jul 8, 2021

SethTisue commented Jul 8, 2021

martijnhoekstra commented Jul 8, 2021

martijnhoekstra commented Jul 8, 2021

som-snytt commented Jul 8, 2021

som-snytt commented Jul 8, 2021 • edited

martijnhoekstra commented Jul 8, 2021

som-snytt commented Jul 8, 2021

som-snytt commented Jul 9, 2021

som-snytt commented Jul 9, 2021

som-snytt commented Jul 9, 2021 • edited

SethTisue commented Jul 22, 2021

som-snytt commented Jul 22, 2021

som-snytt commented Jul 23, 2021

SethTisue commented Aug 2, 2021

SethTisue commented Aug 2, 2021 • edited

unkarjedy commented Nov 2, 2021

som-snytt commented Nov 4, 2021 • edited

martijnhoekstra commented Nov 4, 2021

SethTisue commented Nov 2, 2022

som-snytt commented Jul 4, 2021 •

edited by SethTisue

martijnhoekstra commented Jul 8, 2021 •

edited by som-snytt

som-snytt commented Jul 8, 2021 •

edited

som-snytt commented Jul 9, 2021 •

edited

SethTisue commented Aug 2, 2021 •

edited

som-snytt commented Nov 4, 2021 •

edited