Unicode code-point escape identifiers #92

getify · 2015-06-15T20:35:56Z

var \u{20BB7} = 42;

seems in most ways equivalent to:

var 𠮷 = 42;

IIUC, the tree (at least as I see it with acorn) will take the former of these two and represent it as if it'd originally been the latter, even in the raw representation. Is that correct?

Unfortunately, it is possible to have an engine that supports the latter and not the former (I have one installed right now: Chrome 43). And therein lies my problem. I am trying to parse an ES6 file to see if it uses the Unicode code-point escape form (the former) for the identifier, because that requires a different test than the symbol form itself (the latter).

Am I understanding this correctly? Is there no way via the estree format to tell the difference or to determine if the former was used? Even a flag on the Identifier node to indicate it was originally in the escaped form would be helpful. Is that possible?

On a similar note, if a tool wanted to parse a program and then recreate exactly as-written without changing this identifier, how could you go back to the former from the latter represented in the tree?


nzakas · 2015-06-15T20:53:10Z

AFAIK, you are correct. Keep in mind that an identifier can have more than one Unicode code point escaped character, so the only possible flag would be to say, "somewhere in this identifier, there was at least one extended escape sequence," which also isn't enough information to get back to the raw representation.

> On a similar note, if a tool wanted to parse a program and then recreate exactly as-written without changing this identifier, how could you go back to the former from the latter represented in the tree?

I don't think that is a goal of ESTree. Rather, you can return a representation of the AST as code, but not necessarily the representation from which the AST was generated. Since you could use either the actual character or the escape sequence, it would be up to your serializer to evaluate the identifier and determine how best to represent it in the output.
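As a rough illustration of leaving that choice to the serializer, here is a minimal sketch. `printIdentifierName` and its `escapeNonAscii` option are hypothetical names made up for this example, not part of ESTree or of any existing code generator.

```javascript
// Hypothetical serializer helper: emit an identifier name either verbatim
// or with every non-ASCII code point rewritten as a \u{...} escape.
function printIdentifierName(name, { escapeNonAscii = false } = {}) {
  if (!escapeNonAscii) return name;
  let out = "";
  // for...of iterates by code point, so astral characters stay in one piece.
  for (const ch of name) {
    const cp = ch.codePointAt(0);
    out += cp > 0x7f ? `\\u{${cp.toString(16).toUpperCase()}}` : ch;
  }
  return out;
}

console.log(printIdentifierName("𠮷"));                           // 𠮷
console.log(printIdentifierName("𠮷", { escapeNonAscii: true })); // \u{20BB7}
```

Either output names the same identifier, so the tree alone cannot say which spelling the author used; the serializer simply picks one by policy.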

gibson042 · 2015-06-15T20:55:48Z

Isn't this a specialized subset of #41, to be addressed by a CST plan? After all, var C_DEAD = 0xBEEF and var C_\u0044\u0045\u0041\u0044 = 48879 yield identical ASTs.
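To make the "identical ASTs" point concrete, here is a small sketch that resolves identifier escapes the way a parser would before storing the name. `decodeIdentifierEscapes` is a hypothetical helper written for this example, and it does not validate that the result is a legal identifier.

```javascript
// Hypothetical helper: resolve \uXXXX and \u{X...} escapes in identifier source text.
function decodeIdentifierEscapes(raw) {
  return raw.replace(
    /\\u(?:\{([0-9a-fA-F]+)\}|([0-9a-fA-F]{4}))/g,
    (match, braced, fixed) => String.fromCodePoint(parseInt(braced || fixed, 16))
  );
}

// String.raw keeps the backslash sequences as literal text.
const escapedName = String.raw`C_\u0044\u0045\u0041\u0044`;

console.log(decodeIdentifierEscapes(escapedName) === "C_DEAD"); // true
console.log(0xBEEF === 48879);                                  // true
```

Both the name and the numeric value normalize to the same thing, which is why the spelling difference is invisible once only the abstract tree is kept.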

getify · 2015-06-15T21:07:21Z

@gibson042

I suppose it is. I was just trying to understand why I can get this out of acorn from '\u{20BB7}':

{
  "start": 0,
  "value": "𠮷",
  "raw": "'\\u{20BB7}'",
  "type": "Literal",
  "end": 11
}

But from \u{20BB7}, I get:

{
  "start": 0,
  "name": "𠮷",
  "type": "Identifier",
  "end": 9
}

Seems like a strange/inconsistent limitation. If CST is my only option here, just adds more weight to why I really want to figure that out.
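One possible workaround, sketched here under the assumption that the original source string is still available: the `start` and `end` offsets on the Identifier node shown above can be used to slice the source and inspect how the name was spelled. The node below is copied by hand from that output rather than produced by a parser call, and `rawIdentifierText` and `usesCodePointEscape` are hypothetical helper names.

```javascript
// Source text as written; String.raw keeps the backslash sequence intact.
const source = String.raw`\u{20BB7}`;

// Identifier node copied from the acorn output quoted above.
const node = { type: "Identifier", start: 0, end: 9, name: "𠮷" };

// Hypothetical helper: recover the identifier's original spelling by offset.
function rawIdentifierText(src, id) {
  return src.slice(id.start, id.end);
}

// Hypothetical helper: true if the identifier was written with a \u{...} escape.
function usesCodePointEscape(src, id) {
  return /\\u\{[0-9a-fA-F]+\}/.test(rawIdentifierText(src, id));
}

console.log(rawIdentifierText(source, node));   // \u{20BB7}
console.log(usesCodePointEscape(source, node)); // true
```

This only works while the source and the tree stay in sync; once nodes are moved or synthesized, the offsets no longer point at meaningful text, which is where a CST would be needed.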

gibson042 · 2015-06-15T21:14:07Z

> I can get this out of acorn from '\u{20BB7}': …
> But from \u{20BB7}, I get: …

I would characterize "raw" as a Literal-only sneak preview of the benefits from going beyond abstract syntax.

> If CST is my only option here, just adds more weight to why I really want to figure that out.

Indeed.

mikesherov added the CST label Jun 24, 2015

gibson042 mentioned this issue Jul 9, 2015

Concrete Syntax in tree #41

Open
