New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Accept supplementary Unicode characters in identifiers #9687
Conversation
74de6d5
to
e632844
Compare
cf8c066
to
468dd35
Compare
If I read this correctly, the biggest change is in plain identifiers, which now allow non-BMP characters if allowed by the identifier grammar. Strings literals, interpolations, back quoted identifiers, char literals and other literals are all unchanged: even where the spec didn't allow supplementary characters, the implementation did. Changed:
Unchanged:
How are back-quoted identifiers handled here? What do they accept and does that break on any backend? -- edit: I checked this, they are properly encoded. Implementation wise it's really nice. This change is not yet reflected in the spec. Maybe spec work can be done separately from this PR? That's up to the lightbend team to decide I think. Same for scalameta and scala 3 parity. |
Thanks for the overview @martijnhoekstra From what I can tell this LGTM, it aligns with Java. Agree that it's important to align Scala 3, scalameta, ping IntelliJ.
Are you sure? Isn't this PR fixing the implementation to match the spec? |
@som-snytt are you interested in forward-porting to Scala 3? |
Yes, on the very first line :)
Which could be outright deleted or maybe replaced with "Scala programs are written using the Unicode character set" scalac has been more lenient than the spec mandates in it's input for as long as I've known and accepted non-BMP characters in strings for example.
was already accepted (and still is). |
scala also accepts (and has accepted) surrogate characters in UTF-8. UTF-8 has no surrogate characters, and these code sequences are not "real" Unicode. So if you're a stickler for both the scala spec and the Unicode spec, you can argue that the spec is incorrect because of that, and in many places where it says it accepts an Unicode character, it actually accepts a codepoint that doesn't encode an unicode Character. The problem is there isn't anyone you can argue that to, because by the time you're halfway to the point you're trying to make, their eyes have long glazed over, which I suppose is completely justified. |
@martijnhoekstra That's right, anything going through I'll also add test for string interpolation with supplementary char in constant part; if parser handles the interpolated identifier, then it just vacuums up the constants. I'll update the spec. @SethTisue yes, I will forward port, now that Martijn has explained to me what I did. Also, I just right-clicked to add Martijn to the dictionary and get rid of the red squiggle. |
To the last mindbending point, all this is an artifact of JVM support. If we were ingesting codepoints, the only check would be that we read a valid codepoint, as we must do when assembling a supplementary character from surrogates. I was going to tweet that day that I tried to think of "high surrogate" and came up with "high suffragette". On "non-Unicode" or "broken" characters, I'm still wondering if it's worth accepting non-chars in unicode escape form only. I will follow up later if that is deemed useful. |
Accepting broken characters in source form is pretty hard: by the time you go from a source encoding to an Array[Char], the encoder should have already rejected the bad sequences. |
By broken char, I meant run/t1406b, where we read a high surrogate only (not a unicode escape). As you point out, it's just a non-character, so is it more special than other non-character characters? |
Touched up the spec. The first sentence really needs to grab the reader. Noticed that it would be nice to support varid starting with supplementary char, as that is defined in terms of letter and lower. As noted in the review, supplementary chars are encoded, and it's probably excessive work to change that, though not necessarily undesirable. The minimal change is for isVariableName to decode if necessary. |
Retouched up the spec. |
"Supplimentary" is when you try to be nice and supportive, but also suave and supple. Edit: suppliments are the complimentary supplements they give out on race day. |
I suggest we hold off on merging this until the Scala 3 forward-port has landed. (I don't like to put the Scala 3 folks on the hook for something without their consent, and also the forward-port might produce additional useful review feedback.) |
OK, I'll go ahead and attempt to forward port. I think the metaphor for "on the hook" here is "catch and release". You put the immature feature back into the ecosystem until it is big enough for frying. |
Also accept supplementary characters in simple identifiers in string interpolation, and as leading characters of varids. Reject a supplementary character as a char literal. Clarify Unicode support in the spec. Clarify letter and special operator symbols in precedence table.
Had to move a dotty test out of the way, born of that necessity was
|
seems clear the Scala 3 forward port will land, so... |
Thank you @som-snytt! Let a thousand monstrous, bizarre identifier names bloom! 👹 |
IntelliJ issue https://youtrack.jetbrains.com/issue/SCL-19655 |
adriaanm or retronym would have said but now you'll have to notify the IDEs, which used to mean the eclipse plugin first of all. tpolecat already complained that this repo does not sync with its tooling consumers on any level. This was my fault, as I am lazy and not on a first name basis with any maintainers. Edit: "seems clear the Scala 3 forward port will land, so..." or maybe not? |
Everyone wants to be on a first name basis with you, Som. I knew a guy named Chip, he had a cat named Munk. Anyway, which first names should we have pinged? |
followup PR landed in 2.13.9: #9805 |
Previously the lexer only accepted identifiers drawn from the basic multilingual plane.
Fixes scala/bug#1406