Scalac should remove comments before translating unicode escapes #3220

scabug · 2010-03-27T18:36:24Z

Consider this code:

object LinePrinter {
  def main(args: Array[String]){
    // Note: \u000A is Unicode representation of linefeed (LF)
    val c: Char = 0x000A
    print(c)
  }
}

This code does not compile, giving the following error method:

LinePrinter.scala:3: error: not found: value is
  // Note: \u000A is Unicode representation of linefeed (LF)
                  ^
one error found

The problem is that the compiler transforms unicode escapes before removing the comments, leading to the following code given to the next compile phase:

object LinePrinter {
  def main(args: Array[String]){
    // Note:
 is Unicode representation of linefeed (LF)
    val c: Char = 0x000A
    print(c)
  }
}

This feature/bug is described in the Java book "Java Puzzlers" (see the sampler, it is Puzzle 2) as "thoroughly confusing". Corresponding Java spec is [JLS 3.4].

I would like to suggest removing lines of comments before converting escapes, because how it works now is highly surprising and unexpected.

scabug · 2010-03-27T18:36:24Z

Imported From: https://issues.scala-lang.org/browse/SI-3220?orig=1
Reporter: @soc

scabug · 2010-03-30T11:23:48Z

@ingoem said:
Thanks for reporting, but we don't try to improve on these basic Java specifications.

scabug · 2010-03-30T19:36:02Z

@soc said:
Hi imaier,

sorry for reopening, but after re-reading the spec, I'm quite sure that the procedure suggested is well within the bounds of the specification.
Imho the removal of the //-comments especially fulfills the rule that:

This transformed version is equally acceptable to a compiler for the Java programming language ("Java compiler") and represents the exact same program.

The second half of [JLS 2.2] reinforces my impression that it is not intended that parts of the comments should survive their removal before the program source enters the next phase, where only syntactic grammar is allowed:

These input elements, with white space (�3.6)  and comments (�3.7)  discarded, form the terminal symbols for the syntactic grammar for the Java programming language and are called tokens (�3.5).

To wrap up my arguments, I think that is a good idea to remove comments before entering the next phase of compilation because:

It stays within the Java language specification
A comment should not be able to change the behavior of a program
It behaves like detailed in the ECMA-334 spec (tested with Mono) which is the "other platform" Scala wants to run on
The removal is sensible, because it reduces unexpected side-effects
The 'feature' never had - to my knowledge - any support of an IDE, a text editor nor a syntax highlighter
I never have seen this behavior in any reasonable use, except to present this behavior to a unsuspecting audience of course :-)

An additional problem with the current situation is that Scala will not even be syntax-compatible between the JVM- and the .NET-platform.

If, in your opinion, the current behavior is the only valid one according to the JLS, it won't be possible to both fulfill the JLS and the ECMA standards. Having to choose between which spec to slightly 'adjust', I would suggest choosing the one which is more sensible (the ECMA one imho, which discards the comments).

Eventually I want to point out that this is not a VM spec which we closely have to follow, but the spec of the Java Language, so we should have a bit more freedom to improve upon it in the Scala language.

I would be really happy if you would kindly reconsider your decision based on the additional arguments and thoughts I tried to provide on that matter.

Thanks!

scabug · 2010-03-31T14:32:11Z

@soc said:
After speaking to other people I have to disclose that I'm in the minority with my stance that this is can be solved completely JLS-compatible. :-)

But the fundamental problem persists:

As Scala wants to support multiple platforms (JVM, .NET) it might be necessary to abstract a tiny bit from the specific platform behavior.

It would be wise to choose a model which makes sure that a JVM-compatible source file is a .NET-compatible source file, too.

In my opinion it is more consistent and understandable to remove the comments before converting the the Unicode escapes to raw Unicode.

Personal opinion: I think that Java's behavior is an oversight, I can't imagine anyone saying "let's to it this way" on purpose. :-)

scabug · 2010-04-07T13:17:47Z

@dubochet said:
This is a case where we don't try to improve over Java. It may be imperfect, and may be incoherent with c#, but being coherent with Java has a value, and we don't believe at this point that another solution would be better.

scabug · 2015-07-12T15:09:56Z

@som-snytt said:
Since string literals tolerate the unicode escape, so should comments.

scala> "\u000A"  // \u000A is LF

Somehow, Scala progressed over Java.

SethTisue · 2018-05-11T19:26:43Z

this has been raised again at https://ashannon.us/articles/scala-comment-exec.html

I think if we went around on this again in 2018, the outcome might be different than the 2010 outcome. personally, I'd support a deprecate-then-remove on this.

som-snytt · 2018-05-11T19:58:30Z

I didn't read the long parts, but \\ is the dual of //

trait T {
  def s = "\u000A"    // \\u000A is the bomb
}

martijnhoekstra · 2018-05-11T20:18:53Z

What's the process here @SethTisue? SIP? It needs a spec change and I guess a deprecation cycle.

Ichoran · 2018-05-11T20:27:38Z

I think it should just be done away with, with no deprecation, and only a migration tool.

As if comments weren't enough of a problem, there are a variety of other surprising behaviors and undesirable possibilities.

Problems include:

First, the feature enables difficult-to-understand variants of common code patterns:

val sq = (x: Int) \u21D2 x * x
val i\u006ec = (x: Int) => x + 1
va\u0072 x = 7  // Am I mutable?

None of these add any value at all. If they are there, it's probably a mistake.

Secondly, we have the comment issue, as previously addressed:

// Maybe you should try \u000a println("Ha ha, you already tried it!")

Thirdly, string literals have rather non-intuitive behavior (meaning that they're not really literal):

scala> """\n"""
res11: String = \n

scala> """\u000A"""
res12: String =
"
"scala> """\\u000A"""
res13: String = \\u000A

scala> """\\\u000A"""
res14: String =
"\\
"

Note that this behavior means you can't actually write the text of a unicode escape into a raw string in any straightforward way, which is especially weird since you'd expect it to be easiest there.

Fourth, unlike decades ago, UTF-8 is ubiquitous. You can just insert the character you care about in the code if you need it, greatly assisting readability.

// Which is better?
val jérôme = 7
val j\u00e9r\u00f4me = 7

Fifth, there is no reason why late interpolation can't insert as many non-ASCII unicode characters as your heart desires into strings, characters, and backticked identifier names, and any existing instances should be able to be automatically translated.

So I don't see any reason to keep it as it is for any cycles at all. It facilitates unclear code, presents difficulties for catching malicious code, creates irregularities in raw string processing, is useless in its full generality, and can be automatically converted anywhere UTF-8 or UTF-16 are allowed (which is approximately everywhere--and a compiler flag could be used to rescue people who are irreversibly attached to Windows-1252).

SethTisue · 2018-05-11T21:16:57Z

What's the process here @SethTisue? SIP? It needs a spec change

When something is a lot of work to implement, then involving the SIP committee first can save work, if something is just going to end up being rejected. But waiting for the committee also costs time. Anyway, this is really minor stuff around the edges of the language.

This one probably isn't much work to implement, so I'd suggest that someone simply submit a pull request (and yes, best if it includes a spec change). If the pull request becomes contentious, a SIP might eventually be asked for. Often that doesn't happen and the PR can just be merged.

Even if a SIP does end up being asked for, it's still good to have the pull request in hand, because every SIP ultimately needs an implementation, and anyway having a green pull request to refer to makes any discussion so much more grounded and concrete.

SethTisue · 2018-05-11T21:19:00Z

Calling attention to the issue on https://contributors.scala-lang.org can help rally support. Or unexpected opposition, but if that's coming, you might as well find out :-)

SethTisue · 2018-05-11T21:26:08Z

fyi @adamdecaf

adamdecaf · 2018-05-12T20:50:18Z

Thanks for bumping the bug related to this. I couldn't quite find one. As @Ichoran pointed out unicode escapes often cause more problems than they fix and as we've seen can lead to problematic code.

I'll help where I can to get this pushed through. Just let me know!

martijnhoekstra · 2018-05-14T12:43:00Z

Calling attention to the issue on https://contributors.scala-lang.org can help rally support. Or unexpected opposition, but if that's coming, you might as well find out :-)

https://contributors.scala-lang.org/t/escaping-unicode/1894

SethTisue · 2019-02-20T21:58:38Z

@martijnhoekstra I think this can go forward if you're still interested

(if you don't plan to tackle it, also let us know, since perhaps someone else would step up)

ashawley · 2019-02-20T22:24:08Z

I ran in to this and added some tests to document the status quo scala/scala#6501

I didn't know about this ticket. I guess the files should be renamed to t3220, since u000a is cryptic.

martijnhoekstra · 2019-07-26T11:18:35Z

Oh dear, I missed the ping on this one. I'll get back on this

martijnhoekstra · 2019-09-09T12:24:28Z

Status update: Needs to be ported to dotty too, to prevent progression regressing again when dotty becomes scala 3.

som-snytt · 2019-09-09T18:43:40Z

since u000a is cryptic.

It's u for unit test. Let's rename all the tests from t to u. And convert decimal bug numbers to hex, which will make them much easier to read. PR soon.

I'll also look for any old tests named bugNNNN.

ashawley · 2019-09-09T18:48:06Z

Needs to be ported to dotty too, to prevent progression regressing again when dotty becomes scala 3.

There's isn't a broader initiative to update Dotty with Scala 2.13 tests? I know they just updated the collections in Dotty with what's in 2.13.0. Hopefully, the tests will get refreshed as well at some point.

ashawley · 2019-09-09T18:49:08Z

Let's rename all the tests from t to u. And convert decimal bug numbers to hex, which will make them much easier to read.

I agree. This is what the people want.

martijnhoekstra · 2019-09-09T19:07:44Z

@ashawley it's the fix that needs to be ported to dotty, not the test.

Well, also the tests, I suppose.

som-snytt · 2019-09-09T19:25:21Z

smarter recently made a plea for ongoing syntax tweaks in Scala 2 to get fixes in dotty. We might consider an umbrella ticket in dotty to track these differences?

martijnhoekstra · 2019-09-09T20:06:21Z

Tracking them is nice, but someone still needs to implement. A situation where migrating to scala 3 means regressing in recent changes in scala 2 makes for a terrible migration story. Being able to point at an umbrella ticket doesn't change that.

Dotty being the ruggedly handsome swashbuckling research compiler making bold forays into uncharted territory may need a lot of convincing that these kinds of boring syntax changes are worthy of it's time.

eed3si9n · 2021-01-14T19:39:53Z

Did we make

scala> def \u0061 = "foo"
a: String

illegal in a patch release? Until 2.13.2, this was the second paragraph of the Scala Language Spec:

In Scala mode, Unicode escapes are replaced by the corresponding Unicode character with the given hexadecimal code.

SethTisue · 2021-01-14T19:41:12Z

@eed3si9n yes, scala/scala#8282 landed in 2.13.2

scabug closed this as completed May 18, 2011

scabug added enhancement won't fix labels Apr 7, 2017

This was referenced Apr 7, 2017

Unicode escapes in single line comments trigger error #3677

Closed

Unicode literal syntax thwarts common use cases for triple-quotes #4706

Closed

Illegal character and unclosed string literal *when a string is commented out only* #7702

Closed

SethTisue mentioned this issue May 11, 2018

REPL does not consistently process unicode characters in comments. #10873

Closed

SethTisue reopened this May 11, 2018

SethTisue removed the won't fix label May 11, 2018

SethTisue added this to the Backlog milestone May 11, 2018

martijnhoekstra mentioned this issue May 17, 2018

unspecialcase unicode escapes [WIP] scala/scala#6661

Closed

SethTisue assigned martijnhoekstra May 17, 2018

SethTisue modified the milestones: Backlog, 2.13.0-RC1 Nov 8, 2018

SethTisue added the has PR label Nov 8, 2018

SethTisue modified the milestones: 2.13.0-RC1, 2.13.1 Feb 20, 2019

SethTisue mentioned this issue Feb 20, 2019

Add partest for Unicode newline scala/scala#6501

Merged

martijnhoekstra mentioned this issue Aug 2, 2019

Unicode escapes are now ordinary escape sequences (not processed early) scala/scala#8282

Merged

szeiger modified the milestones: 2.13.1, 2.13.2 Sep 9, 2019

SethTisue modified the milestones: 2.13.2, Backlog Feb 6, 2020

lrytz closed this as completed in scala/scala#8282 Mar 3, 2020

lrytz modified the milestones: Backlog, 2.13.2 Mar 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scalac should remove comments before translating unicode escapes #3220

Scalac should remove comments before translating unicode escapes #3220

scabug commented Mar 27, 2010

scabug commented Mar 27, 2010

scabug commented Mar 30, 2010

scabug commented Mar 30, 2010

scabug commented Mar 31, 2010

scabug commented Apr 7, 2010

scabug commented Jul 12, 2015

SethTisue commented May 11, 2018

som-snytt commented May 11, 2018

martijnhoekstra commented May 11, 2018 •

edited by SethTisue

Ichoran commented May 11, 2018

SethTisue commented May 11, 2018 •

edited

SethTisue commented May 11, 2018 •

edited

SethTisue commented May 11, 2018

adamdecaf commented May 12, 2018

martijnhoekstra commented May 14, 2018

SethTisue commented Feb 20, 2019

ashawley commented Feb 20, 2019

martijnhoekstra commented Jul 26, 2019

martijnhoekstra commented Sep 9, 2019

som-snytt commented Sep 9, 2019

ashawley commented Sep 9, 2019

ashawley commented Sep 9, 2019

martijnhoekstra commented Sep 9, 2019

som-snytt commented Sep 9, 2019

martijnhoekstra commented Sep 9, 2019

eed3si9n commented Jan 14, 2021

SethTisue commented Jan 14, 2021 •

edited

Scalac should remove comments before translating unicode escapes #3220

Scalac should remove comments before translating unicode escapes #3220

Comments

scabug commented Mar 27, 2010

scabug commented Mar 27, 2010

scabug commented Mar 30, 2010

scabug commented Mar 30, 2010

scabug commented Mar 31, 2010

scabug commented Apr 7, 2010

scabug commented Jul 12, 2015

SethTisue commented May 11, 2018

som-snytt commented May 11, 2018

martijnhoekstra commented May 11, 2018 • edited by SethTisue

Ichoran commented May 11, 2018

SethTisue commented May 11, 2018 • edited

SethTisue commented May 11, 2018 • edited

SethTisue commented May 11, 2018

adamdecaf commented May 12, 2018

martijnhoekstra commented May 14, 2018

SethTisue commented Feb 20, 2019

ashawley commented Feb 20, 2019

martijnhoekstra commented Jul 26, 2019

martijnhoekstra commented Sep 9, 2019

som-snytt commented Sep 9, 2019

ashawley commented Sep 9, 2019

ashawley commented Sep 9, 2019

martijnhoekstra commented Sep 9, 2019

som-snytt commented Sep 9, 2019

martijnhoekstra commented Sep 9, 2019

eed3si9n commented Jan 14, 2021

SethTisue commented Jan 14, 2021 • edited

martijnhoekstra commented May 11, 2018 •

edited by SethTisue

SethTisue commented May 11, 2018 •

edited

SethTisue commented May 11, 2018 •

edited

SethTisue commented Jan 14, 2021 •

edited