Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scalac should remove comments before translating unicode escapes #3220

Closed
scabug opened this issue Mar 27, 2010 · 27 comments · Fixed by scala/scala#8282
Closed

Scalac should remove comments before translating unicode escapes #3220

scabug opened this issue Mar 27, 2010 · 27 comments · Fixed by scala/scala#8282
Assignees
Milestone

Comments

@scabug
Copy link

scabug commented Mar 27, 2010

Consider this code:

object LinePrinter {
  def main(args: Array[String]){
    // Note: \u000A is Unicode representation of linefeed (LF)
    val c: Char = 0x000A
    print(c)
  }
}

This code does not compile, giving the following error method:

LinePrinter.scala:3: error: not found: value is
  // Note: \u000A is Unicode representation of linefeed (LF)
                  ^
one error found

The problem is that the compiler transforms unicode escapes before removing the comments, leading to the following code given to the next compile phase:

object LinePrinter {
  def main(args: Array[String]){
    // Note:
 is Unicode representation of linefeed (LF)
    val c: Char = 0x000A
    print(c)
  }
}

This feature/bug is described in the Java book "Java Puzzlers" (see the sampler, it is Puzzle 2) as "thoroughly confusing". Corresponding Java spec is [JLS 3.4].

I would like to suggest removing lines of comments before converting escapes, because how it works now is highly surprising and unexpected.

@scabug
Copy link
Author

scabug commented Mar 27, 2010

Imported From: https://issues.scala-lang.org/browse/SI-3220?orig=1
Reporter: @soc

@scabug
Copy link
Author

scabug commented Mar 30, 2010

@ingoem said:
Thanks for reporting, but we don't try to improve on these basic Java specifications.

@scabug
Copy link
Author

scabug commented Mar 30, 2010

@soc said:
Hi imaier,

sorry for reopening, but after re-reading the spec, I'm quite sure that the procedure suggested is well within the bounds of the specification.
Imho the removal of the //-comments especially fulfills the rule that:

This transformed version is equally acceptable to a compiler for the Java programming language ("Java compiler") and represents the exact same program.

The second half of [JLS 2.2] reinforces my impression that it is not intended that parts of the comments should survive their removal before the program source enters the next phase, where only syntactic grammar is allowed:

These input elements, with white space (�3.6)  and comments (�3.7)  discarded, form the terminal symbols for the syntactic grammar for the Java programming language and are called tokens (�3.5).

To wrap up my arguments, I think that is a good idea to remove comments before entering the next phase of compilation because:

  • It stays within the Java language specification
  • A comment should not be able to change the behavior of a program
  • It behaves like detailed in the ECMA-334 spec (tested with Mono) which is the "other platform" Scala wants to run on
  • The removal is sensible, because it reduces unexpected side-effects
  • The 'feature' never had - to my knowledge - any support of an IDE, a text editor nor a syntax highlighter
  • I never have seen this behavior in any reasonable use, except to present this behavior to a unsuspecting audience of course :-)

An additional problem with the current situation is that Scala will not even be syntax-compatible between the JVM- and the .NET-platform.

If, in your opinion, the current behavior is the only valid one according to the JLS, it won't be possible to both fulfill the JLS and the ECMA standards. Having to choose between which spec to slightly 'adjust', I would suggest choosing the one which is more sensible (the ECMA one imho, which discards the comments).

Eventually I want to point out that this is not a VM spec which we closely have to follow, but the spec of the Java Language, so we should have a bit more freedom to improve upon it in the Scala language.

I would be really happy if you would kindly reconsider your decision based on the additional arguments and thoughts I tried to provide on that matter.

Thanks!

@scabug
Copy link
Author

scabug commented Mar 31, 2010

@soc said:
After speaking to other people I have to disclose that I'm in the minority with my stance that this is can be solved completely JLS-compatible. :-)

But the fundamental problem persists:

As Scala wants to support multiple platforms (JVM, .NET) it might be necessary to abstract a tiny bit from the specific platform behavior.

It would be wise to choose a model which makes sure that a JVM-compatible source file is a .NET-compatible source file, too.

In my opinion it is more consistent and understandable to remove the comments before converting the the Unicode escapes to raw Unicode.

Personal opinion: I think that Java's behavior is an oversight, I can't imagine anyone saying "let's to it this way" on purpose. :-)

@scabug
Copy link
Author

scabug commented Apr 7, 2010

@dubochet said:
This is a case where we don't try to improve over Java. It may be imperfect, and may be incoherent with c#, but being coherent with Java has a value, and we don't believe at this point that another solution would be better.

@scabug scabug closed this as completed May 18, 2011
@scabug
Copy link
Author

scabug commented Jul 12, 2015

@som-snytt said:
Since string literals tolerate the unicode escape, so should comments.

scala> "\u000A"  // \u000A is LF

Somehow, Scala progressed over Java.

@SethTisue
Copy link
Member

this has been raised again at https://ashannon.us/articles/scala-comment-exec.html

I think if we went around on this again in 2018, the outcome might be different than the 2010 outcome. personally, I'd support a deprecate-then-remove on this.

@som-snytt
Copy link

I didn't read the long parts, but \\ is the dual of //

trait T {
  def s = "\u000A"    // \\u000A is the bomb
}

@martijnhoekstra
Copy link

martijnhoekstra commented May 11, 2018

What's the process here @SethTisue? SIP? It needs a spec change and I guess a deprecation cycle.

@Ichoran
Copy link

Ichoran commented May 11, 2018

I think it should just be done away with, with no deprecation, and only a migration tool.

As if comments weren't enough of a problem, there are a variety of other surprising behaviors and undesirable possibilities.

Problems include:

First, the feature enables difficult-to-understand variants of common code patterns:

val sq = (x: Int) \u21D2 x * x
val i\u006ec = (x: Int) => x + 1
va\u0072 x = 7  // Am I mutable?

None of these add any value at all. If they are there, it's probably a mistake.

Secondly, we have the comment issue, as previously addressed:

// Maybe you should try \u000a println("Ha ha, you already tried it!")

Thirdly, string literals have rather non-intuitive behavior (meaning that they're not really literal):

scala> """\n"""
res11: String = \n

scala> """\u000A"""
res12: String =
"
"scala> """\\u000A"""
res13: String = \\u000A

scala> """\\\u000A"""
res14: String =
"\\
"

Note that this behavior means you can't actually write the text of a unicode escape into a raw string in any straightforward way, which is especially weird since you'd expect it to be easiest there.

Fourth, unlike decades ago, UTF-8 is ubiquitous. You can just insert the character you care about in the code if you need it, greatly assisting readability.

// Which is better?
val jérôme = 7
val j\u00e9r\u00f4me = 7

Fifth, there is no reason why late interpolation can't insert as many non-ASCII unicode characters as your heart desires into strings, characters, and backticked identifier names, and any existing instances should be able to be automatically translated.

So I don't see any reason to keep it as it is for any cycles at all. It facilitates unclear code, presents difficulties for catching malicious code, creates irregularities in raw string processing, is useless in its full generality, and can be automatically converted anywhere UTF-8 or UTF-16 are allowed (which is approximately everywhere--and a compiler flag could be used to rescue people who are irreversibly attached to Windows-1252).

@SethTisue
Copy link
Member

SethTisue commented May 11, 2018

What's the process here @SethTisue? SIP? It needs a spec change

When something is a lot of work to implement, then involving the SIP committee first can save work, if something is just going to end up being rejected. But waiting for the committee also costs time. Anyway, this is really minor stuff around the edges of the language.

This one probably isn't much work to implement, so I'd suggest that someone simply submit a pull request (and yes, best if it includes a spec change). If the pull request becomes contentious, a SIP might eventually be asked for. Often that doesn't happen and the PR can just be merged.

Even if a SIP does end up being asked for, it's still good to have the pull request in hand, because every SIP ultimately needs an implementation, and anyway having a green pull request to refer to makes any discussion so much more grounded and concrete.

@SethTisue SethTisue reopened this May 11, 2018
@SethTisue
Copy link
Member

SethTisue commented May 11, 2018

Calling attention to the issue on https://contributors.scala-lang.org can help rally support. Or unexpected opposition, but if that's coming, you might as well find out :-)

@SethTisue SethTisue added this to the Backlog milestone May 11, 2018
@SethTisue
Copy link
Member

fyi @adamdecaf

@adamdecaf
Copy link

Thanks for bumping the bug related to this. I couldn't quite find one. As @Ichoran pointed out unicode escapes often cause more problems than they fix and as we've seen can lead to problematic code.

I'll help where I can to get this pushed through. Just let me know!

@martijnhoekstra
Copy link

Calling attention to the issue on https://contributors.scala-lang.org can help rally support. Or unexpected opposition, but if that's coming, you might as well find out :-)

https://contributors.scala-lang.org/t/escaping-unicode/1894

@SethTisue SethTisue modified the milestones: 2.13.0-RC1, 2.13.1 Feb 20, 2019
@SethTisue
Copy link
Member

@martijnhoekstra I think this can go forward if you're still interested

(if you don't plan to tackle it, also let us know, since perhaps someone else would step up)

@ashawley
Copy link
Member

I ran in to this and added some tests to document the status quo scala/scala#6501

I didn't know about this ticket. I guess the files should be renamed to t3220, since u000a is cryptic.

@martijnhoekstra
Copy link

Oh dear, I missed the ping on this one. I'll get back on this

@martijnhoekstra
Copy link

Status update: Needs to be ported to dotty too, to prevent progression regressing again when dotty becomes scala 3.

@szeiger szeiger modified the milestones: 2.13.1, 2.13.2 Sep 9, 2019
@som-snytt
Copy link

since u000a is cryptic.

It's u for unit test. Let's rename all the tests from t to u. And convert decimal bug numbers to hex, which will make them much easier to read. PR soon.

I'll also look for any old tests named bugNNNN.

@ashawley
Copy link
Member

ashawley commented Sep 9, 2019

Needs to be ported to dotty too, to prevent progression regressing again when dotty becomes scala 3.

There's isn't a broader initiative to update Dotty with Scala 2.13 tests? I know they just updated the collections in Dotty with what's in 2.13.0. Hopefully, the tests will get refreshed as well at some point.

@ashawley
Copy link
Member

ashawley commented Sep 9, 2019

Let's rename all the tests from t to u. And convert decimal bug numbers to hex, which will make them much easier to read.

I agree. This is what the people want.

@martijnhoekstra
Copy link

@ashawley it's the fix that needs to be ported to dotty, not the test.

Well, also the tests, I suppose.

@som-snytt
Copy link

smarter recently made a plea for ongoing syntax tweaks in Scala 2 to get fixes in dotty. We might consider an umbrella ticket in dotty to track these differences?

@martijnhoekstra
Copy link

Tracking them is nice, but someone still needs to implement. A situation where migrating to scala 3 means regressing in recent changes in scala 2 makes for a terrible migration story. Being able to point at an umbrella ticket doesn't change that.

Dotty being the ruggedly handsome swashbuckling research compiler making bold forays into uncharted territory may need a lot of convincing that these kinds of boring syntax changes are worthy of it's time.

@SethTisue SethTisue modified the milestones: 2.13.2, Backlog Feb 6, 2020
@lrytz lrytz modified the milestones: Backlog, 2.13.2 Mar 3, 2020
@eed3si9n
Copy link
Member

Did we make

scala> def \u0061 = "foo"
a: String

illegal in a patch release? Until 2.13.2, this was the second paragraph of the Scala Language Spec:

In Scala mode, Unicode escapes are replaced by the corresponding Unicode character with the given hexadecimal code.

@SethTisue
Copy link
Member

SethTisue commented Jan 14, 2021

@eed3si9n yes, scala/scala#8282 landed in 2.13.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants