Normalization of line-endings in template literals #90

getify · 2015-06-05T01:01:51Z

The spec calls for special processing (normalization) on line-endings in template literals:

http://people.mozilla.org/~jorendorff/es6-draft.html#sec-static-semantics-tv-s-and-trv-s

See specifically the note at the end of that section, which says:

<CR><LF> and <CR> LineTerminatorSequences are normalized to <LF> for both TV and TRV. An explicit EscapeSequence is needed to include a <CR> or <CR><LF> sequence.
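For anyone who wants to see the note in action, here is a minimal sketch that builds the source text programmatically, so the carriage returns are real code points in the program rather than escapes. It assumes a spec-compliant engine (Node.js, for example); `both` and `evaluate` are hypothetical helper names, not part of any API.

```javascript
// Hypothetical tag function: exposes the cooked (TV) and raw (TRV) strings
// of the first template element.
const both = (strings) => ({ cooked: strings[0], raw: strings.raw[0] });

// Hypothetical helper: evaluate a snippet of source text with `both` in scope.
const evaluate = (sourceText) =>
  new Function("both", "return " + sourceText)(both);

// Here "\r\n" and "\r" become actual <CR><LF> and <CR> code points
// inside the template literal's source text.
const literalBreaks = evaluate("both`a\r\nb\rc`");

// Here the source text contains a backslash followed by "r",
// i.e. an explicit EscapeSequence.
const escapedBreak = evaluate("both`a\\rb`");

console.log(JSON.stringify(literalBreaks)); // both strings contain only <LF>
console.log(JSON.stringify(escapedBreak));  // cooked has <CR>, raw keeps the escape text
```

The first case shows the normalization applying to both TV and TRV; the second shows that an explicit escape is the only way to get a <CR> into the cooked value.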

So, I have several questions relating to how (if at all?) the AST spec deals with this:

1. Does the parser do all this normalization before creating the AST, or does the AST need to preserve the actual information in the code so it's handled post-AST (like in interpretation/code-gen/etc)?
2. If the parser handles the normalization (changing occurrences of U+000D and U+000DU+000A to U+000A) before producing the tree, then should it do that for both the node value and the raw?

   My instinct would say that raw should preserve the original U+000D or U+000DU+000A sequences (pre-normalization). However, the spec says that the template literal's raw value is post-normalization, so perhaps the parser/AST should also normalize its raw? Will it be confusing if the AST raw property and the template literal raw property don't match?

   But that would mean that you couldn't completely faithfully recreate a JS file that had such line-endings mixed into its template literals. That seems like a bad thing.
3. The spec says that an actual \r or \r\n escape sequence in the string is not normalized, only the U+000D / U+000DU+000A values themselves. However, the human-readable representation of the AST (which is often JSON stringification) would represent a U+000D value from the code as \r. So how would you tell the difference? Would a \r actually show up as \\r instead?
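On the JSON question specifically, a small sketch of how `JSON.stringify` treats the two cases may help. The variable names are only for illustration; the point is that a real U+000D and the two-character escape text serialize differently, so they stay distinguishable in JSON output.

```javascript
// One code point: an actual U+000D in the string value.
const actualCR = "\r";

// Two code points: a backslash followed by "r" (the escape as written in source).
const escapeText = "\\r";

// JSON text for the real carriage return: quote, backslash, r, quote.
const jsonCR = JSON.stringify(actualCR);

// JSON text for the escape text: quote, backslash, backslash, r, quote.
const jsonEscape = JSON.stringify(escapeText);

console.log(jsonCR, jsonCR.length);
console.log(jsonEscape, jsonEscape.length);

// Parsing the JSON gives back the original strings, so nothing is lost.
const roundTripOk =
  JSON.parse(jsonCR) === actualCR && JSON.parse(jsonEscape) === escapeText;
```

So in a JSON dump, the extra backslash is what separates an escape sequence in the raw text from a real carriage return in the value.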

+@allenwb @RReverser

DanielRosenwasser · 2015-06-05T01:29:13Z

I notice the tree doesn't have cooked and raw - it has cooked and value. My instinct would be that the processed string is stored as cooked and the original string as value. Then you can get the raw string portions by processing the value on demand.

But like I said, that's instinct. I'm curious to hear whether this is the intention.

gibson042 · 2015-06-05T02:38:38Z

cooked and value (elsewhere raw) properties in the value of a TemplateElement should correspond respectively to cookedValue (TV; Template Value) and rawValue (TRV; Template Raw Value) in the template runtime algorithm, not to the literal code points of the program source (the so-called concrete syntax). Therefore, input like

// 9 characters after "::" (<CR>, <LF>, backslash, "u", "0", "0", "0", "D", <CR>)
// IOW, includes both a <CR><LF> LineTerminator and a naked <CR>
// ...plus an embedded Unicode escape for <CR>
`LineTerminator ::
\u000D
`

should parse to ESTree output like

value: {
    // raw value: 8 characters after "::" (<LF>, backslash, "u", "0", "0", "0", "D", <LF>)
    // (<CR><LF> → <LF>; naked <CR> → <LF>)
    value: "LineTerminator ::\n\\u000D\n",

    // cooked value: 3 characters after "::" (<LF>, <CR>, <LF>)
    // (<CR><LF> → <LF>; escape sequence → represented value; naked <CR> → <LF>)
    cooked: "LineTerminator ::\n\r\n"
}
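A rough sketch of how a parser might arrive at those two strings for this input, assuming it already has the text between the backticks. `normalizeLineTerminators` is a hypothetical helper name, and the cooked value is obtained here by simply evaluating the literal in the host engine, which a real parser would not do.

```javascript
// Hypothetical helper: <CR><LF> and lone <CR> both become <LF>,
// as the spec note describes for TRV.
function normalizeLineTerminators(text) {
  return text.replace(/\r\n?/g, "\n");
}

// Text between the backticks: <CR><LF>, the escape text \u000D, then a naked <CR>.
const sourceText = "LineTerminator ::\r\n\\u000D\r";

// Raw value: only line terminators change; the escape text is kept as written.
const rawValue = normalizeLineTerminators(sourceText);

// Cooked value: shortcut for illustration, letting the engine evaluate the literal.
const cookedValue = new Function("return `" + sourceText + "`")();

// Shape of the `value` object used in the example above.
const element = { value: { value: rawValue, cooked: cookedValue } };

const prefixLength = "LineTerminator ::".length;
console.log(rawValue.length - prefixLength);    // characters after "::" in the raw value
console.log(cookedValue.length - prefixLength); // characters after "::" in the cooked value
```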

getify · 2015-06-05T03:15:30Z

@gibson042

...not to the literal code points of the program source

So does that mean there's no way to literally reconstruct the exact source from the AST (if it included \r in a template literal)?

I know things like whitespace (and comments) are typically not kept in the AST, but I also have the parallel hopes that the AST will be extended to support concrete syntax, with the express goal of being able to keep absolutely everything. If the actual code points aren't going to be kept in the AST, this issue seems like another wrinkle that would need to be considered for concrete syntax preservation.

gibson042 · 2015-06-05T04:10:51Z

I have that same hope, but doing it right means representing program source in a way that is agnostic of AST node type—no special treatment for nodes that happen to represent Literal input elements. So for this issue in particular, the AST (i.e., cooked and value properties) should not differentiate the three embedded line-feed-equivalent LineTerminatorSequences (for that matter, it also shouldn't differentiate equivalent escape sequences like \\ and \x5c and \u005C from each other), but should differentiate embedded LineTerminatorSequences from their escape sequences.

In other words, value alone would be sufficient to reconstruct source equivalent (but not necessarily identical) to the actual input, but cooked would not be.
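To make "equivalent but not necessarily identical" concrete, here is a small sketch that regenerates a literal from the normalized raw string and compares it with the original. The helper `evalTemplate` and the sample literal are made up for illustration, and the regeneration step assumes a template with no substitutions.

```javascript
// Hypothetical helper: evaluate a template literal given as source text.
const evalTemplate = (sourceText) => new Function("return " + sourceText)();

// Original source: backtick, "a", <CR><LF>, "b", the escape text \x5C, "c", backtick.
const originalSource = "`a\r\nb\\x5Cc`";

// What `value` would hold: the text between the backticks, line terminators normalized.
const rawFromSource = originalSource.slice(1, -1).replace(/\r\n?/g, "\n");

// Code generation from `value` alone: wrap the raw string in backticks again.
const regeneratedSource = "`" + rawFromSource + "`";

// Same runtime string, different source text.
const sameMeaning = evalTemplate(originalSource) === evalTemplate(regeneratedSource);
const sameText = originalSource === regeneratedSource;

console.log(sameMeaning, sameText);
```

Going from cooked instead would lose the distinction between a line break and its escape, and between \x5C and \\, which is why cooked alone is not enough.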

RReverser · 2015-06-05T08:08:17Z

So does that mean there's no way to literally reconstruct the exact source from the AST (if it included \r in a template literal)?

There is - we still have ranges for that. In any case, I believe that, according to the spec, both the "raw" and "cooked" representations should be processed at least in the sense of \r\n -> \n and \r -> \n, so that any code working with them always sees just \n and nothing else (IIUC, this was the idea behind the line normalization).
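A sketch of the range-based approach, assuming the parser attaches a `range: [start, end]` pair of source offsets to each node (a common parser option, not something the AST spec itself defines). The node below is hand-built for illustration rather than produced by a real parser.

```javascript
// Full program text; the template literal contains a real <CR><LF>.
const programText = "x = `a\r\nb`;";

// Offsets of the text between the backticks.
const start = programText.indexOf("`") + 1;
const end = programText.lastIndexOf("`");

// Hand-built node: normalized strings in `value`, plus an assumed `range` property.
const quasi = {
  type: "TemplateElement",
  range: [start, end],
  value: { value: "a\nb", cooked: "a\nb" },
};

// The exact original text, <CR> included, comes from slicing the source by range.
const exactText = programText.slice(quasi.range[0], quasi.range[1]);

console.log(JSON.stringify(exactText));
console.log(JSON.stringify(quasi.value.value));
```

This keeps the node's strings normalized for consumers while still allowing byte-for-byte reconstruction, as long as the original source text is kept around.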

mikesherov added the CST label Jun 24, 2015

gibson042 mentioned this issue Jul 9, 2015

Concrete Syntax in tree #41

Open
