Normalization of line-endings in template literals #90

Open
getify opened this issue Jun 5, 2015 · 5 comments

@getify
Contributor

getify commented Jun 5, 2015

The spec calls for special processing (normalization) on line-endings in template literals:

http://people.mozilla.org/~jorendorff/es6-draft.html#sec-static-semantics-tv-s-and-trv-s

See specifically the note at the end of that section, which says:

<CR><LF> and <CR> LineTerminatorSequences are normalized to <LF> for both TV and TRV. An explicit EscapeSequence is needed to include a <CR> or <CR><LF> sequence.
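As a rough illustration of the note quoted above, the normalization can be sketched as a plain string transformation over the source text of a template's body. The helper name `normalizeLineTerminators` is made up for this sketch; it is not part of any parser or of the AST spec.

```javascript
// Hypothetical helper: applies the <CR><LF> -> <LF> and <CR> -> <LF>
// normalization to the source text between the backticks.
function normalizeLineTerminators(sourceSlice) {
  // Match <CR><LF> first (as one unit), otherwise a lone <CR>.
  return sourceSlice.replace(/\r\n?/g, "\n");
}

// Real line terminators in the source are normalized...
const crlf = normalizeLineTerminators("a\r\nb");
const bareCr = normalizeLineTerminators("a\rb");
const mixed = normalizeLineTerminators("a\r\r\nb");

// ...but an escape sequence (backslash followed by "r") is left alone,
// because it contains no U+000D code unit.
const escaped = normalizeLineTerminators("a\\rb");
```

Note that this only covers the line-terminator part of the raw value; producing the cooked value additionally requires decoding escape sequences.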

So, I have several questions relating to how (if at all?) the AST spec deals with this:

  1. Does the parser do all this normalization before creating the AST, or does the AST need to preserve the actual information in the code so it's handled post-AST (like in interpretation/code-gen/etc)?

  2. If the parser handles the normalization (changing occurrences of U+000D and U+000DU+000A to U+000A) before producing the tree, then should it do that for both the node value and the raw?

    My instinct would say that raw should preserve the original U+000D or U+000DU+000A sequences (pre-normalization). However, the spec says that the template literal's raw value is post-normalization, so perhaps the parser/AST should also normalize its raw? Will it be confusing if the AST raw property and the template literal raw property don't match?

    But that would mean that you couldn't completely faithfully recreate a JS file that had such line-endings mixed into its template literals. That seems like a bad thing.

  3. The spec says that an actual \r or \r\n escape sequence in the string is not normalized, only the U+000D / U+000DU+000A values themselves. However, the human-readable representation of the AST (which is often JSON stringification) would represent a U+000D value from the code as \r. So how would you tell the difference? Would a \r actually show up as \\r instead?
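For question 3, a small sketch of how JSON stringification treats the two cases may help. It only demonstrates JSON escaping; it assumes nothing about what the AST spec requires, and the property name `raw` is used purely for illustration.

```javascript
// A raw string holding the two-character escape sequence: backslash, "r".
const rawWithEscape = "\\r";
// A raw string holding one actual U+000D code unit (only possible if raw
// were kept pre-normalization).
const rawWithCarriageReturn = "\r";

const jsonEscape = JSON.stringify({ raw: rawWithEscape });
const jsonCr = JSON.stringify({ raw: rawWithCarriageReturn });

console.log(jsonEscape); // the backslash itself is escaped: \\r
console.log(jsonCr);     // the control character is written as: \r
```

So in the JSON text the two remain distinguishable: the escape sequence from the source appears with a doubled backslash, while an actual U+000D appears with a single one.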

+@allenwb @RReverser

@DanielRosenwasser

I notice the tree doesn't have cooked and raw - it has cooked and value. My instinct would be that the processed string is stored as cooked and the original string as value. Then you can get the raw string portions by processing the value on demand.

But like I said, that's instinct. I'm curious to hear whether this is the intention.

@gibson042

cooked and value (elsewhere raw) properties in the value of a TemplateElement should correspond respectively to cookedValue (TV; Template Value) and rawValue (TRV; Template Raw Value) in the template runtime algorithm, not to the literal code points of the program source (the so-called concrete syntax). Therefore, input like

// 9 characters after "::" (<CR>, <LF>, backslash, "u", "0", "0", "0", "D", <CR>)
// IOW, includes both a <CR><LF> LineTerminator and a naked <CR>
// ...plus an embedded Unicode escape for <CR>
`LineTerminator ::
\u000D
`

should parse to ESTree output like

value: {
    // raw value: 8 characters after "::" (<LF>, backslash, "u", "0", "0", "0", "D", <LF>)
    // (<CR><LF> → <LF>; naked <CR> → <LF>)
    value: "LineTerminator ::\n\\u000D\n",

    // cooked value: 3 characters after "::" (<LF>, <CR>, <LF>)
    // (<CR><LF> → <LF>; escape sequence → represented value; naked <CR> → <LF>)
    cooked: "LineTerminator ::\n\r\n"
}
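One way to check the expected values above against a JavaScript engine is to evaluate the same input through a tag function, which receives both the cooked strings and `strings.raw`. This is only a sketch: the helper name `templateValues` is invented here, it handles a single substitution-free template, and it mirrors the `value`/`cooked` naming used in this comment.

```javascript
// Hypothetical helper: evaluates `body` as the content of a tagged template
// and reports the engine's raw (TRV) and cooked (TV) strings.
function templateValues(body) {
  const tag = (strings) => ({ value: strings.raw[0], cooked: strings[0] });
  // The source text is built at runtime so it can contain real <CR> characters.
  return new Function("tag", "return tag`" + body + "`;")(tag);
}

// <CR><LF>, then the six characters of the escape \u000D, then a bare <CR>.
const body = "LineTerminator ::\r\n\\u000D\r";
const result = templateValues(body);

console.log(JSON.stringify(result.value));
console.log(JSON.stringify(result.cooked));
```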

@getify
Contributor Author

getify commented Jun 5, 2015

@gibson042

> ...not to the literal code points of the program source

So does that mean there's no way to literally reconstruct the exact source from the AST (if it included \r in a template literal)?

I know things like whitespace (and comments) are typically not kept in the AST, but I also have the parallel hopes that the AST will be extended to support concrete syntax, with the express goal of being able to keep absolutely everything. If the actual code points aren't going to be kept in the AST, this issue seems like another wrinkle that would need to be considered for concrete syntax preservation.

@gibson042

I have that same hope, but doing it right means representing program source in a way that is agnostic of AST node type—no special treatment for nodes that happen to represent Literal input elements. So for this issue in particular, the AST (i.e., cooked and value properties) should not differentiate the three embedded line-feed-equivalent LineTerminatorSequences (for that matter, it also shouldn't differentiate equivalent escape sequences like \\ and \x5c and \u005C from each other), but should differentiate embedded LineTerminatorSequences from their escape sequences.

In other words, value alone would be sufficient to reconstruct source equivalent (but not necessarily identical) to the actual input, but cooked would not be.
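A small sketch of that claim, assuming a single template with no substitutions: printing the raw value back between backticks yields source that evaluates to the same raw and cooked strings, while printing the cooked value does not. The helper `evaluateQuasi` is hypothetical and exists only for this demonstration.

```javascript
// Hypothetical helper: evaluates `body` as template content and returns the
// engine's raw string (as `value`) and cooked string.
function evaluateQuasi(body) {
  const tag = (strings) => ({ value: strings.raw[0], cooked: strings[0] });
  return new Function("tag", "return tag`" + body + "`;")(tag);
}

// Original source: "a", <CR><LF>, the escape \u000D, "b".
const original = evaluateQuasi("a\r\n\\u000Db");

// Regenerate source from the raw value: equivalent, though not identical
// (the <CR><LF> has become <LF>).
const fromValue = evaluateQuasi(original.value);

// Regenerate source from the cooked value: the decoded <CR> is now a real
// line terminator in the source, so it gets normalized to <LF>.
const fromCooked = evaluateQuasi(original.cooked);

console.log(fromValue.cooked === original.cooked);  // same meaning preserved
console.log(fromCooked.cooked === original.cooked); // meaning changed
```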

@RReverser
Member

> So does that mean there's no way to literally reconstruct the exact source from the AST (if it included \r in a template literal)?

There is - we still have ranges for that. In any case, I believe that according to the spec both the "raw" and "cooked" representations should be processed, at least in the sense of \r\n -> \n and \r -> \n, so that any code that works with them always sees just \n and nothing else (IIUC, this was the idea behind the line normalization).
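A minimal sketch of the range-based approach, assuming the node carries a `range: [start, end]` pair of offsets into the original source (an assumption here; such location data is typically an optional parser extension). The node below is built by hand, and `exactSourceOf` is a made-up helper name.

```javascript
// Hypothetical helper: recovers the exact original text for a node from the
// source string, using the node's assumed [start, end) range.
function exactSourceOf(source, node) {
  const [start, end] = node.range;
  return source.slice(start, end);
}

// Source with a real <CR><LF> inside the template literal.
const source = "x = `a\r\nb`;";

// Hand-built stand-in for a TemplateElement; its strings are normalized,
// but the range still points at the untouched source text.
const quasi = {
  type: "TemplateElement",
  tail: true,
  value: { raw: "a\nb", cooked: "a\nb" },
  range: [source.indexOf("`") + 1, source.lastIndexOf("`")],
};

const exact = exactSourceOf(source, quasi);
console.log(JSON.stringify(exact)); // still contains the original \r\n
```

This keeps the normalized strings on the node while leaving byte-exact reconstruction to the source text itself.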
