Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove The Unicode Escape \u000E In Scaladoc Code Comment Parsing #2336

Closed

Conversation

isomarcte
Copy link
Contributor

@isomarcte isomarcte commented Dec 20, 2020

The regular expressions for CodeBlockStartRegex and CodeBlockEndRegex both contain two instances of the Unicode escape \u000E. This is the "Shift Out" character. I expect that it was inserted as part of a copy/paste error.

Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the <pre> tag. For example,

import scala.util.matching._

object Main {

  val doc0: String =
    """
    | /** A foo is a bar, for example.
    |   *
    |   * {{{
    |   * val foo: String = "bar"
    |   * }}}
    |   *
    |   * <pre>
    |   * val bar: String = "baz
    |   * </pre>
    |   */""".stripMargin

  val CodeBlockStartRegex =
    new Regex("""(.*?)((?:\{\{\{)|(?:\u000E<pre(?: [^>]*)?>\u000E))(.*)""")

  val CodeBlockStartRegex0 =
    new Regex("""(.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)""")

  def matchInfo(regex: Regex, value: CharSequence): Unit = {
    println(s"\nTarget: ${value}")
    println(s"Regex: ${regex}")
    val matches: List[Regex.Match] = regex.findAllMatchIn(value).toList
    println(s"Match Count: ${matches.size}")
    println(s"Matches: ${matches}")
  }

  def main(args: Array[String]): Unit = {
    matchInfo(CodeBlockStartRegex, doc0)
    matchInfo(CodeBlockStartRegex0, doc0)
  }
}

When run with 2.13.4 yields this result,

warning: 1 deprecation (since 2.13.2); re-run with -deprecation for details
1 warning
Picked up JAVA_TOOL_OPTIONS: -Dsbt.supershell=false

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:�<pre(?: [^>]*)?>�))(.*)
Match Count: 1
Matches: List(   * {{{)

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)
Match Count: 2
Matches: List(   * {{{,    * <pre>)

Note how the first output only found one match, the {{{ based one, but the second one found both.

Finally, a small test was added to ensure that the change does not break comment parsing.

The regular expressions for `CodeBlockStartRegex` and `CodeBlockEndRegex` both contain two instances of the Unicode escape `\u000E`. This is the "Shift In" character. I expect that it was inserted as part of a copy/paste error.

Unicode escapes in triple quote strings are deprecated as of 2.13.2 (scala/scala#8282). Further, this character actually makes the regular expression invalid if it is interpreted. This isn't a big deal right now, as it appears to be ignored on Scala 2.12.x, but on Scala 2.13.x this will cause the regular expressions to fail for Scaladoc using the `<pre>` tag. For example,

```scala
import scala.util.matching._

object Main {

  val doc0: String =
    """
    | /** A foo is a bar, for example.
    |   *
    |   * {{{
    |   * val foo: String = "bar"
    |   * }}}
    |   *
    |   * <pre>
    |   * val bar: String = "baz
    |   * </pre>
    |   */""".stripMargin

  val CodeBlockStartRegex =
    new Regex("""(.*?)((?:\{\{\{)|(?:\u000E<pre(?: [^>]*)?>\u000E))(.*)""")

  val CodeBlockStartRegex0 =
    new Regex("""(.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)""")

  def matchInfo(regex: Regex, value: CharSequence): Unit = {
    println(s"\nTarget: ${value}")
    println(s"Regex: ${regex}")
    val matches: List[Regex.Match] = regex.findAllMatchIn(value).toList
    println(s"Match Count: ${matches.size}")
    println(s"Matches: ${matches}")
  }

  def main(args: Array[String]): Unit = {
    matchInfo(CodeBlockStartRegex, doc0)
    matchInfo(CodeBlockStartRegex0, doc0)
  }
}
```

When run with 2.13.4 yields this result,

```shell
warning: 1 deprecation (since 2.13.2); re-run with -deprecation for details
1 warning
Picked up JAVA_TOOL_OPTIONS: -Dsbt.supershell=false

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:�<pre(?: [^>]*)?>�))(.*)
Match Count: 1
Matches: List(   * {{{)

Target:
 /** A foo is a bar, for example.
   *
   * {{{
   * val foo: String = "bar"
   * }}}
   *
   * <pre>
   * val bar: String = "baz
   * </pre>
   */
Regex: (.*?)((?:\{\{\{)|(?:<pre(?: [^>]*)?>))(.*)
Match Count: 2
Matches: List(   * {{{,    * <pre>)
```

Note how the first output only found one match, the `{{{` based one, but the second one found both.

Finally, a small test was added to ensure that the change does not break comment parsing.
@SethTisue
Copy link
Contributor

just removing the control characters didn't work over at scala/scala#9359 , it turned out they were there for a reason -- but maybe the context here is somehow different, idk

Copy link
Contributor

@tgodzik tgodzik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olafurpg do you remember why that unicode character could have been added?

I think this should be good to merge, probably it would be best to use the scalameta scaladoc parser itself and remove the one here, but that would require much more work than this.

@tgodzik
Copy link
Contributor

tgodzik commented Dec 21, 2020

@olafurpg do you remember why that unicode character could have been added?

I think this should be good to merge, probably it would be best to use the scalameta scaladoc parser itself and remove the one here, but that would require much more work than this.

It looks like it was copied from scala/scala, I am up for removing it, I don't really understand what that doesn, but I feel like this should not be needed in Metals.

@olafurpg
Copy link
Member

The upstream Scalameta Scaladoc parser is less accurate than the one in Metals so I would keep it unchanged. I think the code was copied verbatim from scala/scala and the special unicode character is still present in the latest version of the parser (see here https://github.com/scala/scala/blob/0f64691c952bddc37e3196fd8ef7f7cd23214ccb/src/scaladoc/scala/tools/nsc/doc/base/CommentFactoryBase.scala#L169-L186). @isomarcte do you think it's possible to keep the special character but interpolate it similar to how the scala/scala implementation does? I think it's important that we avoid diverging from the upstream scaladoc parser.

@isomarcte
Copy link
Contributor Author

@olafurpg @tgodzik @SethTisue

@isomarcte do you think it's possible to keep the special character but interpolate it similar to how the scala/scala implementation does?

Yeah, but I want to dig in a bit and understand what this character is being used for upstream, no matter what we do here. I'll get back to everyone here shortly.

Thanks for the reviews!

@ckipp01 ckipp01 marked this pull request as draft January 22, 2021 18:20
@ckipp01
Copy link
Member

ckipp01 commented Jul 5, 2021

I'm going to go ahead and close this seeing that it's pretty stale. @isomarcte if you plan on digging back into this, feel free to just ping me and I can reopen.

@ckipp01 ckipp01 closed this Jul 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants