Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Javascript can represent invalid unicode strings #41

Open
kannanvijayan-zz opened this issue Jun 14, 2018 · 2 comments
Open

Javascript can represent invalid unicode strings #41

kannanvijayan-zz opened this issue Jun 14, 2018 · 2 comments

Comments

@kannanvijayan-zz
Copy link
Collaborator

It's possible for valid Javascript strings to be invalid unicode strings - arising out of the fact that JS strings are specced as arbitrary sequences of 16-bit words. This means that invalid UCS-2 sequences, for example \udc11 (which is a lone surrogate pair component) can show up in our string literals.

The binast encoding needs to handle this - we cannot assume that there is always a valid translation of a JS string to a UTF-8 string. This all relates to situations where 16-bit chars fall into the surrogate pair range.

My suggestion is the following: we translate the 16-bit word sequence as if it was a UTF-16 string. This means that when we see valid surrogate pair sequences, we translate those into unicode codepoints and re-encode as a UTF-8 sequence.

When we see surrogate pair values that occur in invalid circumstances, we encode those directly as codepoints. These 16-bit chars are not valid unicode codepoints, so there is no valid UTF-8 sequence that corresponds to them. Those sequences are thus "free" for us to use to encode invalid 16-bit codepoints.

I'm not 100% sure this needs to be addressed in the spec, but @Yoric suggested I make the issue here because it may need to be addressed here.

@mroch
Copy link

mroch commented Jun 18, 2018

WTF-8 seems like a similar scheme

@kannanvijayan-zz
Copy link
Collaborator Author

@mroch yeah, it's exactly that scheme :) I just didn't realize it. We can replace my whole comment with "use WTF-8 for encoding JS strings".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants