Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept supplementary Unicode characters in identifiers #9687

Merged
merged 1 commit into from Aug 2, 2021

Conversation

som-snytt
Copy link
Contributor

@som-snytt som-snytt commented Jul 4, 2021

Previously the lexer only accepted identifiers drawn from the basic multilingual plane.

Fixes scala/bug#1406

@scala-jenkins scala-jenkins added this to the 2.13.7 milestone Jul 4, 2021
@som-snytt som-snytt force-pushed the issue/1406 branch 2 times, most recently from 74de6d5 to e632844 Compare July 4, 2021 21:00
build.sbt Outdated Show resolved Hide resolved
@som-snytt som-snytt force-pushed the issue/1406 branch 3 times, most recently from cf8c066 to 468dd35 Compare July 6, 2021 19:09
@SethTisue SethTisue added the release-notes worth highlighting in next release notes label Jul 7, 2021
@martijnhoekstra
Copy link
Contributor

martijnhoekstra commented Jul 8, 2021

If I read this correctly, the biggest change is in plain identifiers, which now allow non-BMP characters if allowed by the identifier grammar. Strings literals, interpolations, back quoted identifiers, char literals and other literals are all unchanged: even where the spec didn't allow supplementary characters, the implementation did.

Changed:

  • Plain identifiers:
    • non-BMP characters are allowed (if allowed by the identifier grammar) (new in this PR)
    • Invalid codepoints or code sequences are not allowed in any way, shape or form (unchanged)
  • Identifiers in string interpolations
    • but maybe that is wrong! ✅
  • Chars literals:
    • non-BMP characters are consumed but not allowed (they don't fit)

Unchanged:

  • Strings:

    • non-BMP characters are accepted (no change)
    • Invalid codepoints are allowed (no change)
    • Invalid code sequences (unpaired/mismatched surrogates) are allowed if they're written as Unicode escapes (no change)
    • invalid code sequences (unpaired/mismatched surrogates) are allowed if they somehow made it to the scanner (no change). Unsure how/if that's possible in different source implementations.
  • Interpolations: same as strings, but without interpreting escapes (which is up to the interpolator) (no change)

  • Chars literals:

    • Invalid code sequences (unpaired surrogates, other non-characters like U+FEFF) are allowed as unicode escapes
    • The same invalid code sequences would be allowed if they make it to the scanner, which may be impossible
  • Other literals:

    • moot, it's not like, say, integer literals would accept any non-BMP characters anyway.

How are back-quoted identifiers handled here? What do they accept and does that break on any backend? -- edit: I checked this, they are properly encoded.

Implementation wise it's really nice.

This change is not yet reflected in the spec. Maybe spec work can be done separately from this PR? That's up to the lightbend team to decide I think. Same for scalameta and scala 3 parity.

@lrytz
Copy link
Member

lrytz commented Jul 8, 2021

Thanks for the overview @martijnhoekstra

From what I can tell this LGTM, it aligns with Java. Agree that it's important to align Scala 3, scalameta, ping IntelliJ.

This change is not yet reflected in the spec.

Are you sure? Isn't this PR fixing the implementation to match the spec?

@SethTisue
Copy link
Member

@som-snytt are you interested in forward-porting to Scala 3?

@martijnhoekstra
Copy link
Contributor

Are you sure? Isn't this PR fixing the implementation to match the spec?

Yes, on the very first line :)

Scala programs are written using the Unicode Basic Multilingual Plane (BMP) character set; Unicode supplementary characters are not presently supported.

Which could be outright deleted or maybe replaced with "Scala programs are written using the Unicode character set"

scalac has been more lenient than the spec mandates in it's input for as long as I've known and accepted non-BMP characters in strings for example.

class AlwaysHasBeen {
  final val `𐐀` = "𐐀 has always been allowed"
  final val always: "𐐀 has always been allowed" = "\ud801\udc00 has always been allowed"
}

was already accepted (and still is).

@martijnhoekstra
Copy link
Contributor

scala also accepts (and has accepted) surrogate characters in UTF-8. UTF-8 has no surrogate characters, and these code sequences are not "real" Unicode. So if you're a stickler for both the scala spec and the Unicode spec, you can argue that the spec is incorrect because of that, and in many places where it says it accepts an Unicode character, it actually accepts a codepoint that doesn't encode an unicode Character.

The problem is there isn't anyone you can argue that to, because by the time you're halfway to the point you're trying to make, their eyes have long glazed over, which I suppose is completely justified.

@som-snytt
Copy link
Contributor Author

@martijnhoekstra That's right, anything going through getLitChar just takes a char as data. I didn't even need that residual call to isSupplementary there, so I will remove it and squash those two commits after your previous review. (To recap: my first pass disallowed "orphan surrogates" in literals.)

I'll also add test for string interpolation with supplementary char in constant part; if parser handles the interpolated identifier, then it just vacuums up the constants.

I'll update the spec.

@SethTisue yes, I will forward port, now that Martijn has explained to me what I did. Also, I just right-clicked to add Martijn to the dictionary and get rid of the red squiggle.

@som-snytt
Copy link
Contributor Author

som-snytt commented Jul 8, 2021

To the last mindbending point, all this is an artifact of JVM support. If we were ingesting codepoints, the only check would be that we read a valid codepoint, as we must do when assembling a supplementary character from surrogates.

I was going to tweet that day that I tried to think of "high surrogate" and came up with "high suffragette".

On "non-Unicode" or "broken" characters, I'm still wondering if it's worth accepting non-chars in unicode escape form only. I will follow up later if that is deemed useful.

@martijnhoekstra
Copy link
Contributor

Accepting broken characters in source form is pretty hard: by the time you go from a source encoding to an Array[Char], the encoder should have already rejected the bad sequences.

@som-snytt
Copy link
Contributor Author

By broken char, I meant run/t1406b, where we read a high surrogate only (not a unicode escape). As you point out, it's just a non-character, so is it more special than other non-character characters?

@som-snytt
Copy link
Contributor Author

Touched up the spec. The first sentence really needs to grab the reader.

Noticed that it would be nice to support varid starting with supplementary char, as that is defined in terms of letter and lower. As noted in the review, supplementary chars are encoded, and it's probably excessive work to change that, though not necessarily undesirable. The minimal change is for isVariableName to decode if necessary.

@som-snytt
Copy link
Contributor Author

Retouched up the spec.

spec/01-lexical-syntax.md Outdated Show resolved Hide resolved
@som-snytt
Copy link
Contributor Author

som-snytt commented Jul 9, 2021

"Supplimentary" is when you try to be nice and supportive, but also suave and supple.

Edit: suppliments are the complimentary supplements they give out on race day.

spec/01-lexical-syntax.md Outdated Show resolved Hide resolved
@SethTisue
Copy link
Member

I suggest we hold off on merging this until the Scala 3 forward-port has landed. (I don't like to put the Scala 3 folks on the hook for something without their consent, and also the forward-port might produce additional useful review feedback.)

@som-snytt
Copy link
Contributor Author

OK, I'll go ahead and attempt to forward port. I think the metaphor for "on the hook" here is "catch and release". You put the immature feature back into the ecosystem until it is big enough for frying.

Also accept supplementary characters in simple identifiers
in string interpolation, and as leading characters of varids.

Reject a supplementary character as a char literal.

Clarify Unicode support in the spec.

Clarify letter and special operator symbols in precedence table.
@som-snytt
Copy link
Contributor Author

Had to move a dotty test out of the way, born of that necessity was

tests/neg/sorrygates.scala

@SethTisue SethTisue changed the title Accept supplementary characters Accept supplementary Unicode characters in source code Aug 2, 2021
@SethTisue
Copy link
Member

seems clear the Scala 3 forward port will land, so...

@SethTisue SethTisue merged commit 3f06ac7 into scala:2.13.x Aug 2, 2021
@SethTisue
Copy link
Member

SethTisue commented Aug 2, 2021

Thank you @som-snytt! Let a thousand monstrous, bizarre identifier names bloom! 👹

@som-snytt som-snytt deleted the issue/1406 branch August 3, 2021 02:00
@SethTisue SethTisue changed the title Accept supplementary Unicode characters in source code Accept supplementary Unicode characters in identifiers Oct 29, 2021
@unkarjedy
Copy link
Contributor

IntelliJ issue https://youtrack.jetbrains.com/issue/SCL-19655

@som-snytt
Copy link
Contributor Author

som-snytt commented Nov 4, 2021

adriaanm or retronym would have said but now you'll have to notify the IDEs, which used to mean the eclipse plugin first of all. tpolecat already complained that this repo does not sync with its tooling consumers on any level. This was my fault, as I am lazy and not on a first name basis with any maintainers.

Edit: "seems clear the Scala 3 forward port will land, so..." or maybe not?

@martijnhoekstra
Copy link
Contributor

Everyone wants to be on a first name basis with you, Som. I knew a guy named Chip, he had a cat named Munk. Anyway, which first names should we have pinged?

@SethTisue
Copy link
Member

followup PR landed in 2.13.9: #9805

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes worth highlighting in next release notes
Projects
None yet
7 participants