Behavior of `&#0;` and lone surrogates unicode entities #146

nicolo-ribaudo · 2022-03-07T17:06:08Z

This might be another one to spec in JSX btw, because there’s likely divergence between implementations. There are a bunch of different things not allowed by XML/HTML/markdown (such as \0 or lone surrogates)

It looks like Babel and TS currently behave the same (they translate `&#0;` to `\0` and `&#xD800;` to `\uD800`). I didn't test other parsers.
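To make the reported behavior concrete, here is a minimal sketch of a decoder that maps a numeric character reference straight to its code unit(s) without filtering anything. The names `decodeNumericEntityNaive` and `encodeCodePoint` are hypothetical and are not taken from Babel or TS; the upper-bound check is an assumption added so the helper stays well defined.

```typescript
// Hypothetical helper: turn a code point into a JS string (UTF-16).
// Surrogate code points are emitted as-is, i.e. as lone surrogates.
function encodeCodePoint(cp: number): string {
  if (cp <= 0xffff) return String.fromCharCode(cp);
  const offset = cp - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Hypothetical helper: decode `&#NNN;` or `&#xHHH;` with no special cases,
// mirroring the behavior described above (NUL and lone surrogates pass through).
function decodeNumericEntityNaive(entity: string): string | null {
  const match = /^&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));$/.exec(entity);
  if (match === null) return null; // not a numeric character reference
  const cp = match[1] !== undefined ? parseInt(match[1], 16) : parseInt(match[2], 10);
  if (cp > 0x10ffff) return null; // assumption: out-of-range values are rejected
  return encodeCodePoint(cp);
}

const nul = decodeNumericEntityNaive("&#0;"); // a string holding U+0000
const lone = decodeNumericEntityNaive("&#xD800;"); // a string holding a lone high surrogate
```

With this approach the resulting string can contain U+0000 or an unpaired surrogate, which is exactly the case the issue asks the spec to address.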


Huxpro · 2022-03-08T03:47:01Z

Hmm. Could any of you help me understand this issue better?

My understanding is that the current JSX spec allows `&#0;` and both Babel and TS are conforming here. Is the concern that `&#0;` is actually NOT allowed by the XML/HTML/markdown specs, and that implementations such as MDX currently behave differently?

wooorm · 2022-03-08T08:36:40Z

For security reasons, several (numeric) character references don’t turn into their corresponding character code according to HTML. They are replaced with U+FFFD (�) or even a different character. At a high level, it’s described in: https://html.spec.whatwg.org/multipage/syntax.html#character-references.

> The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

More concretely, see: https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state.

Note that there are even some C1 control characters (which would thus be disallowed) that have a meaning in the Windows 1252 encoding, and HTML maps them to those characters. E.g., U+0080 is a “padding character”, but a € in Windows 1252. So HTML turns `&#x80;` into €.
I don’t particularly recommend this part.
But I definitely see value in prohibiting \0, whitespace, lone surrogates, and noncharacters, just as values above 0x10FFFF are prohibited.
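As a rough illustration of the HTML rules referenced above, the sketch below resolves a numeric reference's code point roughly the way the "numeric character reference end state" does: NUL, out-of-range values, and surrogates become U+FFFD, and some C1 values are remapped via Windows 1252. The function names are hypothetical, the remap table is deliberately abridged to three entries, and parse-error reporting is omitted.

```typescript
const REPLACEMENT_CHARACTER = "\uFFFD";

// Abridged excerpt of the C1 remapping table (Windows 1252 equivalents).
// The full table in the HTML spec has more entries than shown here.
const C1_REMAP: { [code: number]: number } = {
  0x80: 0x20ac, // EURO SIGN
  0x85: 0x2026, // HORIZONTAL ELLIPSIS
  0x99: 0x2122, // TRADE MARK SIGN
};

// Hypothetical helper: turn a code point into a JS string (UTF-16).
function encodeCodePoint(cp: number): string {
  if (cp <= 0xffff) return String.fromCharCode(cp);
  const offset = cp - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Hypothetical helper: resolve the numeric value of a character reference
// in the HTML style, replacing unsafe values instead of passing them through.
function resolveHtmlNumericReference(cp: number): string {
  if (cp === 0 || cp > 0x10ffff) return REPLACEMENT_CHARACTER; // NUL or out of range
  if (cp >= 0xd800 && cp <= 0xdfff) return REPLACEMENT_CHARACTER; // surrogate
  const remapped = C1_REMAP[cp];
  return encodeCodePoint(remapped !== undefined ? remapped : cp);
}
```

A JSX rule along the lines suggested in this comment would presumably reject such references rather than replace them, and would skip the Windows 1252 remapping; the replacement behavior here only mirrors HTML.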
