Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode escapes are now ordinary escape sequences (not processed early) #8282

Merged
merged 8 commits into from Mar 3, 2020

Conversation

martijnhoekstra
Copy link
Contributor

@martijnhoekstra martijnhoekstra commented Jul 30, 2019

Unicode escapes are handled like any other escape, in any context a normal escape could also work.

That means Unicode escapes get expanded in

  • Char literals
  • String literals
  • s and f interpolations
  • back quoted identifiers

Examples:

'\u0061' == 'a'
"\u0061a" == "aa"
s"\u0061$a" == "aa"
f"\u0061$a" == "aa"
val `snowy_\u2603\u2603` = 17
snowy_☃☃ == 17

You can now use a Unicode escape sequence to represent a single quote in a single quoted s or f interpolation, in other words, s"\u0022foo\u0022" is now a valid way to represent "\"foo\""

Triple quoted strings and raw interpolations don't allow normal escapes, but have allowed unicode escapes, since those were handled earlier, in the scanner. In this PR doing so emits a deprecation warning, unless under -xSource:2.14, where the Unicode escapes are simply unprocessed

Examples:

"""triple \u0061""" == "triple a" // emits deprecation warning
raw"raw \u0061" == "raw a" // emits deprecation warning

Using Unicode escapes anywhere else will no longer cause them to be processed. In all those cases the alternative is to use the literal character represented by the escape sequence.

This includes interpolations. If your custom interpolation needs to process unicode escapes, it should do so from within the interpolator rather than relying on the parser to process these escapes.

example

mycustominterpolator"not \u0061n escape"
//interpolator will see the literal sequence \u0061

Examples that no longer compile (with no deprecation):

val \u0061 = 'a'
va\u0072 x = 7

Fixes scala/bug#3220

(Curious why it ever worked the old way? Because Java did it, as explained at https://www.sitepoint.com/java-unicode-mysterious-compile-error/)

@scala-jenkins scala-jenkins added this to the 2.13.1 milestone Jul 30, 2019
@martijnhoekstra martijnhoekstra force-pushed the lostEscape branch 2 times, most recently from 220fc45 to 8b06af7 Compare August 2, 2019 08:51
@martijnhoekstra
Copy link
Contributor Author

martijnhoekstra commented Aug 2, 2019

Regarding -X-no-uescape, this PR gives it the axe (-X?) in the spirit of scala/scala-dev#430

The setting was passed to JavaByteArrayReader, but not escaping Unicode escapes in java source was never valid in the first place per jls 3.3.

Removing that parameter outright is both source and binary breaking though, and probably not worth the deprecation cycle (though I'm happy to accommodate whatever if greater minds think unlike)

@martijnhoekstra
Copy link
Contributor Author

Linking scala/bug#3220 which this closes.

@SethTisue could you kick off a community build for this branch?

@SethTisue
Copy link
Member

SethTisue commented Aug 3, 2019

testing 2.13.1-bin-19737c1-SNAPSHOT: https://scala-ci.typesafe.com/job/scala-2.13.x-integrate-community-build/2432/

@SethTisue
Copy link
Member

[scala-parser-combinators] [error] scala.StringContext$InvalidEscapeException: invalid escape '\[' not one of [\b, \t, \n, \f, \r, \, \", \', \uxxxx] at index 37 in "([^"\x00-\x1F\x7F\\]|\\[\\'"bfnrt]|\\u[a-fA-F0-9]{4})*". Use \\ for literal \.
[fast-string-interpolator] [info] - should don't compile in case of escaping error *** FAILED *** (11 milliseconds)
[fast-string-interpolator] [info]   "Expected no compiler error, but got the following type error: "invalid escape '\d' not one of [\b, \t, \n, \f, \r, \, \", \', \uxxxx] at index 0 in "\d". Use \\ for literal \.", for code:  fs"\d" " did not contain "invalid escape '\d' not one of [\b, \t, \n, \f, \r, \\, \", \'] at index 0 in "\d". Use \\ for literal \." (FastStringInterpolatorSpec.scala:34)

@martijnhoekstra martijnhoekstra force-pushed the lostEscape branch 4 times, most recently from f5c2800 to dbb8483 Compare August 6, 2019 18:59
@martijnhoekstra
Copy link
Contributor Author

@SethTisue could you start another community build? The parser combinators failures are hopefully fixed now (and the other one was spurious)

@SethTisue
Copy link
Member

queued: https://scala-ci.typesafe.com/job/scala-2.13.x-integrate-community-build/2439/

@martijnhoekstra
Copy link
Contributor Author

I'm seeing a lot of community build stuff going on, but I can't figure out where, if anywhere, this branch is running now. Could you take a look and squeeze one in if you're not too swamped (or the servers) with the build itself?

@SethTisue
Copy link
Member

oops, I shouldn't have canceled 2439, I got confused as well. (Jenkins doesn't let you attach a description to a run until the run actually starts)

queued: https://scala-ci.typesafe.com/job/scala-2.13.x-integrate-community-build/2449/

@SethTisue
Copy link
Member

consider that a green run — the handful of failures are unrelated to your changes

@martijnhoekstra
Copy link
Contributor Author

martijnhoekstra commented Aug 9, 2019

No longer WIP. Not sure whom to request for review, there is a bit of spec, a bit of parser, a bit of interpolation.

@martijnhoekstra martijnhoekstra changed the title WIP: Un-special-case unicode escapes Un-special-case unicode escapes Aug 9, 2019
@martijnhoekstra martijnhoekstra force-pushed the lostEscape branch 3 times, most recently from 711a638 to cf783ec Compare August 9, 2019 15:03
@martijnhoekstra
Copy link
Contributor Author

martijnhoekstra commented Aug 17, 2019 via email

@martijnhoekstra
Copy link
Contributor Author

Rebased for conflicts on mima exceptions.

@volth see a83a517 for a tes of that behaviour -- it works.

@martijnhoekstra
Copy link
Contributor Author

/rebuild

@martijnhoekstra
Copy link
Contributor Author

rebased again for mima exceptions.

@szeiger
Copy link
Member

szeiger commented Sep 3, 2019

What's the state of this in Dotty? I'd almost call this an uncontroversial bug fix that could probably be done without a SIP but we should at least have Dotty on board with this change. Rescheduling for 2.13.2 for now.

@szeiger szeiger removed this from the 2.13.1 milestone Sep 3, 2019
@SethTisue SethTisue modified the milestones: 2.13.3, 2.13.2 Feb 25, 2020
@SethTisue
Copy link
Member

SethTisue commented Feb 25, 2020

I've put this back on the 2.13.2 milestone, because it seems nearly (entirely?) ready for merge, and based on Lukas's feedback so far.

(it will need squashing before final merge, of course)

@SethTisue SethTisue added the release-notes worth highlighting in next release notes label Feb 25, 2020
@lrytz
Copy link
Member

lrytz commented Mar 2, 2020

I support merging this for 2.13.2. To summarize the breaking changes:

  • Interpolations outside strings, char literals and backquoted identifiers are no longer supported. val \u0061 = 'a' will no longer compile, there's no deprecation.
  • Some breaking changes in corner cases because unicode escapes are expanded at a later stage. For example, "\u0022" used to be """, now it's "\"". This enables a new workaround for \" escape does not work with string interpolation bug#6476.

There's currently no migration tool (?) - would it be hard to write a scalafix rule? Can we include a one-liner in the release notes that fixes arbitrary code bases?

@martijnhoekstra
Copy link
Contributor Author

There is a third and somewhat insidious (source, but not binary) breaking change: If you have a custom interpolator, unicode is no longer escaped in the interpolator.

I don't know about a scalafix. Looking at https://astexplorer.net/#/gist/ec56167ffafb20cbd8d68f24a37043a9/a34662ddd426356d811d29103e557e608c0e9f14 it is plausible it's not so easy to detect with scalameta.

@SethTisue
Copy link
Member

SethTisue commented Mar 3, 2020

as I said at the team meeting today, I'm comfortable with this change for 2.13.2 despite the string interpolator thing, because:

  • we're epoch-major-minor not major-minor-point. a change like this is totally fair game IMO
  • grepping a codebase for unicode escapes is easy, we can put a command in the release notes
  • I strongly suspect most codebases have few or no such escapes, and manually reviewing them is little burden, so a scalafix seems like a would-maybe-be-nice to me rather than something we should feel obligated to provide
  • this is already a biggish release with a lot of potentially breaking changes, it's not like including this will make the difference between "no-brainer upgrade" and "upgrade, but be careful", it's already the latter

@lrytz lrytz merged commit ee8c1ef into scala:2.13.x Mar 3, 2020
@lrytz
Copy link
Member

lrytz commented Mar 3, 2020

🎉 epochal, scala/bug#3220 was submitted on Mar 27, 2010.

@dwijnand
Copy link
Member

dwijnand commented Mar 3, 2020

  • Interpolations outside strings, char literals and backquoted identifiers are no longer supported. val \u0061 = 'a' will no longer compile, there's no deprecation.

Just catching up, thank god sbt is 2.12 only 😄 https://github.com/sbt/sbt/blob/0b12862caf63d1c23449391be6329d947b474157/main-settings/src/main/scala/sbt/std/InputWrapper.scala#L26-L31

@martijnhoekstra
Copy link
Contributor Author

martijnhoekstra commented Mar 3, 2020

The lines you link wouldn't be a problem (they're in strings), but the methods themselves should then be literal or backquoted in this PR -- either

def `wrapTask_\u2603\u2603` =

or

def wrapTask_☃☃ = 

That wouldn't be that terrible a migration now, right?

@smarter
Copy link
Member

smarter commented Mar 3, 2020

We upgrade the scala-library we depend on every time there's a release, so once this is part of a release the compiler changes can be done in dotty.

BTW, you could also already open a PR against dotty that uses a nightly build of scala-library, so we can review it in time for 2.13.2.

@Jasper-M
Copy link
Member

Jasper-M commented Mar 4, 2020

we're epoch-major-minor not major-minor-point.

Strictly speaking, isn't Scala epoch-major-point?

@SethTisue
Copy link
Member

SethTisue commented Mar 4, 2020

Strictly speaking, isn't Scala epoch-major-point?

Good question!

If you mean the standard library and the ABI and nothing else, then yes.

But considering the releases as wholes, no. We often add features in minor releases. So that already makes them "minor" and not "point".

Admittedly, it is rarer that we break backwards source compatibility, as we are doing here. It's not something we should be cavalier about, for sure, but I've already listed some mitigating factors, and I'll add one more: there is no longer any 2.14 release to defer these sorts of changes to, so I think that argues for opening the door a little wider on allowable 2.13.x changes. This series will be with us for a long time.

(If this is turning into a broader discussion about the 2.13.x release philosophy, we should probably move to contributors.scala-lang.org. But if there are further comments that still relate directly to this change in particular, that's fine here.)

@SethTisue SethTisue changed the title Un-special-case unicode escapes Unicode escapes are now ordinary escape sequences (not processed early) Mar 6, 2020
@SethTisue
Copy link
Member

@martijnhoekstra can I ask you to revise the PR description so it contains all of the need-to-know information for users in one place? in particular, the info on breaking changes that came out in later comments

@martijnhoekstra
Copy link
Contributor Author

Sure!

isomarcte added a commit to isomarcte/metals that referenced this pull request Dec 20, 2020
The regular expressions for `CodeBlockStartRegex` and `CodeBlockEndRegex` both contain two instances of the Unicode escape `\u000E`. This is the "Shift In" character. I expect that it was inserted as part of a copy/paste error.

Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the `<pre>` tag. For example,

```scala
import scala.util.matching._

object Main {

  val doc0: String =
    """
    | /** A foo is a bar, for example.
    |   *
    |   * {{{
    |   * val foo: String = "bar"
    |   * }}}
    |   *
    |   * <pre>
    |   * val bar: String = "baz
    |   * </pre>
    |   */""".stripMargin

  val CodeBlockStartRegex =
    new Regex("""(.*?)((?:\{\{\{)|(?:\u000E<pre(?: [^>]*)?>\u000E))(.*)""")

  val CodeBlockStartRegex0 =
    new Regex("""(.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)""")

  def matchInfo(regex: Regex, value: CharSequence): Unit = {
    println(s"\nTarget: ${value}")
    println(s"Regex: ${regex}")
    val matches: List[Regex.Match] = regex.findAllMatchIn(value).toList
    println(s"Match Count: ${matches.size}")
    println(s"Matches: ${matches}")
  }

  def main(args: Array[String]): Unit = {
    matchInfo(CodeBlockStartRegex, doc0)
    matchInfo(CodeBlockStartRegex0, doc0)
  }
}
```

When run with 2.13.4 yields this result,

```shell
warning: 1 deprecation (since 2.13.2); re-run with -deprecation for details
1 warning
Picked up JAVA_TOOL_OPTIONS: -Dsbt.supershell=false

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:�<pre(?: [^>]*)?>�))(.*)
Match Count: 1
Matches: List(   * {{{)

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)
Match Count: 2
Matches: List(   * {{{,    * <pre>)
```

Note how the first output only found one match, the `{{{` based one, but the second one found both.

Finally, a small test was added to ensure that the change does not break comment parsing.
isomarcte added a commit to isomarcte/metals that referenced this pull request Dec 20, 2020
The regular expressions for `CodeBlockStartRegex` and `CodeBlockEndRegex` both contain two instances of the Unicode escape `\u000E`. This is the "Shift In" character. I expect that it was inserted as part of a copy/paste error.

Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the `<pre>` tag. For example,

```scala
import scala.util.matching._

object Main {

  val doc0: String =
    """
    | /** A foo is a bar, for example.
    |   *
    |   * {{{
    |   * val foo: String = "bar"
    |   * }}}
    |   *
    |   * <pre>
    |   * val bar: String = "baz
    |   * </pre>
    |   */""".stripMargin

  val CodeBlockStartRegex =
    new Regex("""(.*?)((?:\{\{\{)|(?:\u000E<pre(?: [^>]*)?>\u000E))(.*)""")

  val CodeBlockStartRegex0 =
    new Regex("""(.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)""")

  def matchInfo(regex: Regex, value: CharSequence): Unit = {
    println(s"\nTarget: ${value}")
    println(s"Regex: ${regex}")
    val matches: List[Regex.Match] = regex.findAllMatchIn(value).toList
    println(s"Match Count: ${matches.size}")
    println(s"Matches: ${matches}")
  }

  def main(args: Array[String]): Unit = {
    matchInfo(CodeBlockStartRegex, doc0)
    matchInfo(CodeBlockStartRegex0, doc0)
  }
}
```

When run with 2.13.4 yields this result,

```shell
warning: 1 deprecation (since 2.13.2); re-run with -deprecation for details
1 warning
Picked up JAVA_TOOL_OPTIONS: -Dsbt.supershell=false

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:�<pre(?: [^>]*)?>�))(.*)
Match Count: 1
Matches: List(   * {{{)

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)
Match Count: 2
Matches: List(   * {{{,    * <pre>)
```

Note how the first output only found one match, the `{{{` based one, but the second one found both.

Finally, a small test was added to ensure that the change does not break comment parsing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes worth highlighting in next release notes
Projects
None yet
8 participants