Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336

isomarcte · 2020-12-20T17:47:23Z

The regular expressions for CodeBlockStartRegex and CodeBlockEndRegex both contain two instances of the Unicode escape \u000E. This is the "Shift Out" character. I expect that it was inserted as part of a copy/paste error.

Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the <pre> tag. For example,

import scala.util.matching._

object Main {

  val doc0: String =
    """
    | /** A foo is a bar, for example.
    |   *
    |   * {{{
    |   * val foo: String = "bar"
    |   * }}}
    |   *
    |   * <pre>
    |   * val bar: String = "baz
    |   * </pre>
    |   */""".stripMargin

  val CodeBlockStartRegex =
    new Regex("""(.*?)((?:\{\{\{)|(?:\u000E<pre(?: [^>]*)?>\u000E))(.*)""")

  val CodeBlockStartRegex0 =
    new Regex("""(.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)""")

  def matchInfo(regex: Regex, value: CharSequence): Unit = {
    println(s"\nTarget: ${value}")
    println(s"Regex: ${regex}")
    val matches: List[Regex.Match] = regex.findAllMatchIn(value).toList
    println(s"Match Count: ${matches.size}")
    println(s"Matches: ${matches}")
  }

  def main(args: Array[String]): Unit = {
    matchInfo(CodeBlockStartRegex, doc0)
    matchInfo(CodeBlockStartRegex0, doc0)
  }
}

When run with 2.13.4 yields this result,

warning: 1 deprecation (since 2.13.2); re-run with -deprecation for details
1 warning
Picked up JAVA_TOOL_OPTIONS: -Dsbt.supershell=false

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:�<pre(?: [^>]*)?>�))(.*)
Match Count: 1
Matches: List(   * {{{)

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)
Match Count: 2
Matches: List(   * {{{,    * <pre>)

Note how the first output only found one match, the {{{ based one, but the second one found both.

Finally, a small test was added to ensure that the change does not break comment parsing.

The regular expressions for `CodeBlockStartRegex` and `CodeBlockEndRegex` both contain two instances of the Unicode escape `\u000E`. This is the "Shift In" character. I expect that it was inserted as part of a copy/paste error. Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the `<pre>` tag. For example, ```scala import scala.util.matching._ object Main { val doc0: String = """ | /** A foo is a bar, for example. | * | * {{{ | * val foo: String = "bar" | * }}} | * | * <pre> | * val bar: String = "baz | * </pre> | */""".stripMargin val CodeBlockStartRegex = new Regex("""(.*?)((?:\{\{\{)|(?:\u000E<pre(?: [^>]*)?>\u000E))(.*)""") val CodeBlockStartRegex0 = new Regex("""(.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)""") def matchInfo(regex: Regex, value: CharSequence): Unit = { println(s"\nTarget: ${value}") println(s"Regex: ${regex}") val matches: List[Regex.Match] = regex.findAllMatchIn(value).toList println(s"Match Count: ${matches.size}") println(s"Matches: ${matches}") } def main(args: Array[String]): Unit = { matchInfo(CodeBlockStartRegex, doc0) matchInfo(CodeBlockStartRegex0, doc0) } } ``` When run with 2.13.4 yields this result, ```shell warning: 1 deprecation (since 2.13.2); re-run with -deprecation for details 1 warning Picked up JAVA_TOOL_OPTIONS: -Dsbt.supershell=false Target: /** A foo is a bar, for example. * * {{{ * val foo: String = "bar" * }}} * * <pre> * val bar: String = "baz * </pre> */ Regex: (.*?)((?:\{\{\{)|(?:�<pre(?: [^>]*)?>�))(.*) Match Count: 1 Matches: List( * {{{) Target: /** A foo is a bar, for example. * * {{{ * val foo: String = "bar" * }}} * * <pre> * val bar: String = "baz * </pre> */ Regex: (.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*) Match Count: 2 Matches: List( * {{{, * <pre>) ``` Note how the first output only found one match, the `{{{` based one, but the second one found both. Finally, a small test was added to ensure that the change does not break comment parsing.

SethTisue · 2020-12-20T18:46:07Z

just removing the control characters didn't work over at scala/scala#9359 , it turned out they were there for a reason -- but maybe the context here is somehow different, idk

tgodzik

@olafurpg do you remember why that unicode character could have been added?

I think this should be good to merge, probably it would be best to use the scalameta scaladoc parser itself and remove the one here, but that would require much more work than this.

tgodzik · 2020-12-21T11:17:31Z

@olafurpg do you remember why that unicode character could have been added?

I think this should be good to merge, probably it would be best to use the scalameta scaladoc parser itself and remove the one here, but that would require much more work than this.

It looks like it was copied from scala/scala, I am up for removing it, I don't really understand what that doesn, but I feel like this should not be needed in Metals.

olafurpg · 2020-12-21T12:08:58Z

The upstream Scalameta Scaladoc parser is less accurate than the one in Metals so I would keep it unchanged. I think the code was copied verbatim from scala/scala and the special unicode character is still present in the latest version of the parser (see here https://github.com/scala/scala/blob/0f64691c952bddc37e3196fd8ef7f7cd23214ccb/src/scaladoc/scala/tools/nsc/doc/base/CommentFactoryBase.scala#L169-L186). @isomarcte do you think it's possible to keep the special character but interpolate it similar to how the scala/scala implementation does? I think it's important that we avoid diverging from the upstream scaladoc parser.

isomarcte · 2020-12-21T15:56:01Z

@olafurpg @tgodzik @SethTisue

@isomarcte do you think it's possible to keep the special character but interpolate it similar to how the scala/scala implementation does?

Yeah, but I want to dig in a bit and understand what this character is being used for upstream, no matter what we do here. I'll get back to everyone here shortly.

Thanks for the reviews!

ckipp01 · 2021-07-05T08:28:45Z

I'm going to go ahead and close this seeing that it's pretty stale. @isomarcte if you plan on digging back into this, feel free to just ping me and I can reopen.

tgodzik reviewed Dec 21, 2020

View reviewed changes

tgodzik approved these changes Dec 21, 2020

View reviewed changes

olafurpg mentioned this pull request Dec 21, 2020

MethodNotFoundError in worksheet when using latest Scalameta release #2337

Closed

ckipp01 marked this pull request as draft January 22, 2021 18:20

ckipp01 closed this Jul 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336

Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336

isomarcte commented Dec 20, 2020 •

edited

SethTisue commented Dec 20, 2020

tgodzik left a comment

tgodzik commented Dec 21, 2020

olafurpg commented Dec 21, 2020

isomarcte commented Dec 21, 2020

ckipp01 commented Jul 5, 2021

Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336

Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336

Conversation

isomarcte commented Dec 20, 2020 • edited

SethTisue commented Dec 20, 2020

tgodzik left a comment

Choose a reason for hiding this comment

tgodzik commented Dec 21, 2020

olafurpg commented Dec 21, 2020

isomarcte commented Dec 21, 2020

ckipp01 commented Jul 5, 2021

isomarcte commented Dec 20, 2020 •

edited