Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: Preserve Unicode Escape Sequences and make current behavior opt-in to avoid malformed output in some environments #485

Closed
wessberg opened this issue Oct 22, 2020 · 4 comments

Comments

@wessberg
Copy link

wessberg commented Oct 22, 2020

Please see this repro.
In it, you'll see an input file such as this:

export const unicodeWhitespaceCharacters = '\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF';

Running esbuild with no configuration on it results in the file output.js which produces a different declaration of
unicodeWhitespaceCharacters as can be seen in the repro.

Intuitively, I would expect a 1:1 mapping between the input and output in this particular case.

I've been trying to make sense of it, and I understand that this is related to an optimization you've written that rewrites unicode escape sequences. In this issue: #70, you write the following:

I'm deliberately using UTF-8 output to be as compact as possible since Unicode escape sequences are more verbose.

However, this may lead to Unterminated string constant exceptions in browser such as IE 11 - even with <meta charset="utf-8"> in the HTML document. On that note, it is not always possible for the website author to instruct IE to use UTF-8. There's a bunch of ways IE can choose to ignore that.

As I can see there have been a few similar issues like this one (#315, #112, #70), I would like to propose that unicode escape sequences are preserved as they are presented in the source code, and then the current behavior could be opt-in, for example as an advanced minification option.

I ran into this problem specifically while bundling core-js (the declaration from the example is located within core-js/internals/whitespaces.js with esbuild and testing it across browsers.

I don't currently have any way of working around this other than performing a string replacement on the bundle generated by esbuild to bring back the unicode escape sequences and updating the sourcemap mappings, which is slow and tedious, so I really hope for a better way around this :-)

@wessberg wessberg changed the title Files containing unicode characters may lead to malformed output Files containing unicode escape sequences may lead to malformed output Oct 22, 2020
@wessberg wessberg changed the title Files containing unicode escape sequences may lead to malformed output Suggestion: Preserve Unicode Escape Sequences and make current behavior opt-in to avoid malformed output in some environments Oct 22, 2020
@evanw
Copy link
Owner

evanw commented Oct 24, 2020

This is good to know. I think this increases the priority of having an ASCII-only output mode (#70).

I would like to propose that unicode escape sequences are preserved as they are presented in the source code

I don't think this sounds like the right solution. Either your environment supports UTF-8 or it doesn't. If it doesn't, you don't want unescaped UTF-8 in your output even if it was originally written that way in the source code. And if it does, I don't see why esbuild shouldn't take advantage of it. So I think esbuild should either always use UTF-8 in string literals, or always generate escape sequences for all non-ASCII characters in string literals.

the current behavior could be opt-in, for example as an advanced minification option

I don't think the default behavior should be to assume the target environment doesn't support UTF-8. Partially because IE is well on its way out, and because other commonly-used minifiers such as Terser and UglifyJS also have the same default behavior as esbuild here (assuming that UTF-8 is supported). Edit: and after a quick check it looks like Webpack and Parcel also do this by default, so most of the ecosystem behaves this way.

So I think the way to solve this is an opt-in ASCII-only output mode (issue #70). I would have done this a long time ago except this unfortunately isn't as simple as it sounds because it involves parsing regular expressions to strip UTF-8 from them too, and esbuild doesn't currently parse regular expressions at all. However, it's pretty trivial to move forward with an ASCII-only output option with the caveat that regular expressions containing non-ASCII characters will continue to be unescaped in the output for now. So right now I'm thinking of doing that.

@clydin
Copy link

clydin commented Oct 24, 2020

Something else to consider as well is that many browsers have an optimized scanning path for ASCII code (even when marked as UTF-8). Converting to UTF-8 may provide a decrease in file size but could end up causing increased execution time and cancel out any improvements in transfer time.

Chromium code reference: https://source.chromium.org/chromium/v8/v8.git/+/master:src/parsing/scanner-character-streams.cc;l=669?originalUrl=https:%2F%2Fcs.chromium.org%2F

@evanw
Copy link
Owner

evanw commented Oct 24, 2020

many browsers have an optimized scanning path for ASCII code

This is interesting. My first instinct was to discount this because surely download times massively dwarf parse times in the vast majority of cases, especially because scanning UTF-8 is very cache-friendly. But the chunk size is 512 which is kind of big. I also found https://v8.dev/blog/scanner which seems to indicate that parsing an ASCII string is 1.7x faster than parsing a UTF-8 string. There is also this quote at the end of the article:

As a developer you can further improve parsing performance by increasing the information density of your programs. The easiest way to do so is by minifying your source code, stripping out unnecessary whitespace, and to avoid non-ASCII identifiers where possible.

It turns out that esbuild also does this kind of optimization itself when mapping from byte offsets to column numbers for source maps. It was a noticeable speedup to avoid doing this for ASCII-only lines (chunking happens at the line level):

// Start the mapping if this character is non-ASCII
if c > 0x7F && columnsForNonASCII == nil {
columnByteOffset = i - lineByteOffset
byteOffsetToFirstNonASCII = int32(columnByteOffset)
columnsForNonASCII = []int32{}
}
// Update the per-byte column offsets
if columnsForNonASCII != nil {
for lineBytesSoFar := i - lineByteOffset; columnByteOffset <= lineBytesSoFar; columnByteOffset++ {
columnsForNonASCII = append(columnsForNonASCII, column)
}
}

I still think this isn't that strong a reason though, since these cases are relatively rare. A given chunk of JavaScript code will be ASCII the vast majority of the time because minifiers use ASCII identifiers for variable names.

To me a stronger reason is the convenience of not having to deal with UTF-8 encoding issues when using esbuild. Some of my goals for esbuild are a) to follow ecosystem conventions and b) to be easy to use without too much configuration. Those two goals are in conflict in this situation and I'm still trying to figure out what to do about it. It's why I haven't released this new flag yet.

@wessberg
Copy link
Author

wessberg commented Oct 24, 2020

@evanw

After a quick check it looks like Webpack and Parcel also do this by default, so most of the ecosystem behaves this way.

I did the same experiment as you and actually found out some really interesting things that are relevant to this conversation.

Here's the output of that unicodeWhitespaceCharacters constant from the original example across bundlers:

webpack (without minification)

"\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF";

webpack (with minification)

"\t\n\v\f\r                 \u2028\u2029\ufeff";

parcel (without minification)

"\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF";

parcel (with minification)

"\t\n\v\f\r                 \u2028\u2029\ufeff";

rollup

"\u0009\u000A\u000B\u000C\u000D\u0020\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000\u2028\u2029\uFEFF";

esbuild:
Now this one actually produces something quite different from the others:

"	\n\v\f\r                 \u2028\u2029"

I can't actually copy the whole thing over, because it looks like this in vscode:
Screen Shot 2020-10-25 at 12 07 32 AM

Apparently, for some reason \ufeff is lost in the output from esbuild, and it produces what appears to be an invalid character that can't be copied as seen in the image above. I'm guessing that it is that same invalid/incomplete character that is leading to a crash in IE 11. Also, there are other differences between Parcel, Webpack, and esbuild. The former two produces a \t that isn't present in the esbuild output.

So, I guess there's something there worth looking into.

But then another observation is that these bundlers actually preserve the unicode escape sequence unless they are instructed to apply minification. The explanation for that is quite simple: Both of them use terser for minification, and as you also mentioned yourself, terser is the one applying this transformation of the escape sequence as an optimization.

For me, I'm actually OK with this optimization being applied at all times, but the thing that prompted me to create this issue in the first place was seeing this specific piece of code leading to crashes in legacy browsers, and now it would seem the problem is more related to the production of an invalid/incomplete unicode character.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants