Provide named character entities for invisible and ambiguous Unicode characters #10297

r12a · 2024-04-25T09:50:32Z

What problem are you trying to solve?

It is much easier for content authors to spot and work with invisible Unicode characters if they are coded using named entities. Some users have to deal with many such characters on a regular basis (Arabic authors work with 12 or more regularly), and it is difficult to remember the Unicode code points. Others use these characters only infrequently, and find it equally difficult to remember the appropriate code point value when needed. In addition, invisible characters in the code can be problematic to work with, especially if they affect the display (such as paired directional embeddings in RTL scripts), because they are easily overlooked, duplicated, or miscopied.

What solutions exist today?

Some of these characters have named character entities, but some of the more frequently used ones do not.
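
One way to check which code points already have a named entity is to look them up in the HTML named character reference table that ships with Python's standard library (`html.entities.html5`). This is only a sketch: the helper name `names_for` is made up for illustration, and the result depends on the table bundled with the installed Python version.

```python
from html.entities import html5

def names_for(char: str) -> list[str]:
    """Return the existing named character references that expand to exactly `char`."""
    # Keys in html5 look like "zwnj;"; a few legacy names also appear
    # without the trailing ";", and those duplicates are skipped here.
    return sorted(
        f"&{name}"
        for name, value in html5.items()
        if value == char and name.endswith(";")
    )

# A few code points from the proposal: two with entities, two without.
for cp in (0x200C, 0x200E, 0x202A, 0x2066):
    found = names_for(chr(cp))
    print(f"U+{cp:04X}", ", ".join(found) if found else "(no named entity)")
```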

How would you solve it?

The W3C i18n WG proposes the following additions. For convenience, the list includes characters for which we already have named entities; these are indicated using ✅. Possible named entities are suggested for the new items; these are derived from standard Unicode abbreviations, where available.

Latin 1 Supplement — Latin-1 punctuation and symbols

✅ U+00A0 NO-BREAK SPACE &nbsp;
✅ U+00AD SOFT HYPHEN &shy;

Combining Diacritical Marks — Grapheme joiner

U+034F COMBINING GRAPHEME JOINER &cgj;

Arabic — Format character

U+061C ARABIC LETTER MARK &alm;

Ogham — Space

U+1680 OGHAM SPACE MARK

Mongolian — Format controls

U+180B MONGOLIAN FREE VARIATION SELECTOR ONE &fvs1;
U+180C MONGOLIAN FREE VARIATION SELECTOR TWO &fvs2;
U+180D MONGOLIAN FREE VARIATION SELECTOR THREE &fvs3;
U+180E MONGOLIAN VOWEL SEPARATOR &mvs;
U+180F MONGOLIAN FREE VARIATION SELECTOR FOUR &fvs4;

General Punctuation — Spaces

U+2000 EN QUAD &nqsp;
U+2001 EM QUAD &mqsp;
✅ U+2002 EN SPACE &ensp;
✅ U+2003 EM SPACE &emsp;
✅ U+2004 THREE-PER-EM SPACE &emsp13;
✅ U+2005 FOUR-PER-EM SPACE &emsp14;
U+2006 SIX-PER-EM SPACE &6msp;
✅ U+2007 FIGURE SPACE &numsp;
✅ U+2008 PUNCTUATION SPACE &puncsp;
✅ U+2009 THIN SPACE &thinsp; AND &ThinSpace;
✅ U+200A HAIR SPACE &hairsp; AND &VeryThinSpace; AND part of &ThickSpace; (U+205F U+200A)

General Punctuation — Format character

✅ U+200B ZERO WIDTH SPACE &ZeroWidthSpace; AND &NegativeMediumSpace; AND &NegativeThickSpace; AND &NegativeThinSpace; AND &NegativeVeryThinSpace;
✅ U+200C ZERO WIDTH NON-JOINER &zwnj;
✅ U+200D ZERO WIDTH JOINER &zwj;
✅ U+200E LEFT-TO-RIGHT MARK &lrm;
✅ U+200F RIGHT-TO-LEFT MARK &rlm;
U+202A LEFT-TO-RIGHT EMBEDDING &lre;
U+202B RIGHT-TO-LEFT EMBEDDING &rle;
U+202C POP DIRECTIONAL FORMATTING &pdf;
U+202D LEFT-TO-RIGHT OVERRIDE &lro;
U+202E RIGHT-TO-LEFT OVERRIDE &rlo;
✅ U+2060 WORD JOINER &NoBreak;
U+2066 LEFT-TO-RIGHT ISOLATE &lri;
U+2067 RIGHT-TO-LEFT ISOLATE &rli;
U+2068 FIRST STRONG ISOLATE &fsi;
U+2069 POP DIRECTIONAL ISOLATE &pdi;

We would also like to coin a new &zwsp; entity name for U+200B, in addition to the overly long and complicated &ZeroWidthSpace;.
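
To illustrate why the paired directional controls listed above are easy to get wrong when they are invisible, here is a minimal sketch of a checker that flags unmatched openers and closers. It assumes strict nesting, which is a simplification and not the matching rules of the Unicode Bidirectional Algorithm; the function name `unbalanced_controls` is hypothetical.

```python
# Openers and closers among the directional controls in the list above.
EMBEDDING_OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF, PDI = "\u202C", "\u2069"

def unbalanced_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, reason) pairs for unmatched directional controls.

    Simplified strict-nesting check, not the full bidi algorithm.
    """
    problems = []
    stack = []  # entries are (index of opener, closer it expects)
    for i, ch in enumerate(text):
        if ch in EMBEDDING_OPENERS:
            stack.append((i, PDF))
        elif ch in ISOLATE_OPENERS:
            stack.append((i, PDI))
        elif ch in (PDF, PDI):
            if stack and stack[-1][1] == ch:
                stack.pop()
            else:
                problems.append((i, f"unexpected U+{ord(ch):04X}"))
    # Anything left on the stack was opened but never closed.
    problems.extend((i, f"unclosed U+{ord(text[i]):04X}") for i, _ in stack)
    return problems
```

For example, `unbalanced_controls("\u202Babc")` reports the RLE at index 0 as unclosed, which is the kind of miscopied or missing control that is hard to see in source.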

General Punctuation — Separators

U+2028 LINE SEPARATOR &lsep;
U+2029 PARAGRAPH SEPARATOR &psep;

General Punctuation — Space

U+202F NARROW NO-BREAK SPACE &nnbsp;
✅ U+205F MEDIUM MATHEMATICAL SPACE &MediumSpace; AND part of &ThickSpace; (U+205F U+200A)

General Punctuation — Invisible operators

✅ U+2061 FUNCTION APPLICATION &af; AND &ApplyFunction;
✅ U+2062 INVISIBLE TIMES &it; AND &InvisibleTimes;
✅ U+2063 INVISIBLE SEPARATOR &ic; AND &InvisibleComma;
U+2064 INVISIBLE PLUS

CJK Symbols And Punctuation — CJK symbols and punctuation

U+3000 IDEOGRAPHIC SPACE &idsp;

Emoji Variation Selectors - turns on and off colour

U+FE0E: VARIATION SELECTOR-15 &vs15;
U+FE0F: VARIATION SELECTOR-16 &vs16;

Potential additional candidates

Hangul Jamo — Old initial consonants

U+115F HANGUL CHOSEONG FILLER &hcf;

Hangul Jamo — Medial vowels

U+1160 HANGUL JUNGSEONG FILLER &hjf;

Hangul Compatibility Jamo — Special character

U+3164 HANGUL FILLER &hf;

Halfwidth And Fullwidth Forms — Halfwidth Hangul variants

U+FFA0 HALFWIDTH HANGUL FILLER &hwhf;

General Punctuation — Invisible operators

U+206D ACTIVATE ARABIC FORM SHAPING &aafs;

Shorthand Format Controls — Shorthand format controls

U+1BCA0 SHORTHAND FORMAT LETTER OVERLAP
U+1BCA1 SHORTHAND FORMAT CONTINUING OVERLAP
U+1BCA2 SHORTHAND FORMAT DOWN STEP
U+1BCA3 SHORTHAND FORMAT UP STEP

Musical Symbols — Beams and slurs

U+1D173 MUSICAL SYMBOL BEGIN BEAM
U+1D174 MUSICAL SYMBOL END BEAM
U+1D175 MUSICAL SYMBOL BEGIN TIE
U+1D176 MUSICAL SYMBOL END TIE
U+1D177 MUSICAL SYMBOL BEGIN SLUR
U+1D178 MUSICAL SYMBOL END SLUR
U+1D179 MUSICAL SYMBOL BEGIN PHRASE
U+1D17A MUSICAL SYMBOL END PHRASE

Anything else?

There are other invisible characters which probably do not need entities. The list above selects those most likely to be useful. In particular, only 2 of the many, many variation selectors are listed here – these are the two that are regularly used for emojis.

There may also be a need to support Egyptian hieroglyph formatting controls, some of which will come out with Unicode 16 later this year.

annevk · 2024-04-25T10:05:08Z

Can this be folded into #5121 or vice versa? I'm not sure why we need two issues for this.

Psychpsyo · 2024-04-29T10:07:53Z

I get that &6msp; for the SIX-PER-EM SPACE might be derived from a standard Unicode abbreviation (couldn't actually find the relevant standard at the moment) but given that THREE-PER-EM SPACE and FOUR-PER-EM SPACE are already &emsp13; and &emsp14;, shouldn't this one be &emsp16; for consistency?
The way it is right now seems confusing.

Similarly, it might make sense to change &nqsp; and &mqsp; to &enqsp; and &emqsp; for consistency with the other em/en related ones as well.

ntounsi · 2024-05-02T13:15:01Z

Thanks @r12a for bringing this up.

Being able to use named entities is very welcome, compared with writing out the hexadecimal digits of the code point in the numeric reference syntax &#xHHHH;.

About formatting characters, one can perhaps remember the very common 202A/202B and 202C used to delimit bidi sentences (although you still have to remember that A is for left and B is for right...). But then there are the others: lro/rlo at 202D/202E, and lri/rli, fsi, and pdi at 2066/2067, 2068, and 2069, which are not in the same range...

This matters especially because some HTML editors replace numeric references such as &#xHHHH; with the corresponding Unicode characters U+HHHH, which are invisible in the source, whereas for a named entity they show a visible mark instead.

(BTW, I've always wondered why &lrm;/&rlm; and not &lre; etc.?)
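
As a rough illustration of the point about editors hiding these characters, the following sketch makes them visible again by rewriting each one as a character reference: a named entity where the HTML table in Python's standard library already has one, and a numeric reference otherwise. The set `INVISIBLES` is only a small subset of the proposed list, and the function name `reveal` is made up for this example. It is meant for inspection only and does not escape literal `&` characters.

```python
from html.entities import html5

# Illustrative subset of the code points in the proposal, not the full list.
INVISIBLES = {
    0x00A0, 0x00AD, 0x061C, 0x200B, 0x200C, 0x200D, 0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E, 0x2066, 0x2067, 0x2068, 0x2069,
}

# Shortest existing entity name per character, e.g. "\u200e" -> "lrm;".
SHORTEST_NAME: dict[str, str] = {}
for name, value in html5.items():
    if name.endswith(";") and len(value) == 1 and ord(value) in INVISIBLES:
        if value not in SHORTEST_NAME or len(name) < len(SHORTEST_NAME[value]):
            SHORTEST_NAME[value] = name

def reveal(text: str) -> str:
    """Replace the listed invisible characters with visible character references."""
    out = []
    for ch in text:
        if ord(ch) in INVISIBLES:
            name = SHORTEST_NAME.get(ch)
            # Fall back to a hexadecimal numeric reference when no name exists.
            out.append(f"&{name}" if name else f"&#x{ord(ch):04X};")
        else:
            out.append(ch)
    return "".join(out)
```

With the entity table as it stands today, the embedding controls fall back to numeric references, which is the gap this issue is about.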

r12a added the addition/proposal, needs implementer interest, and i18n-needs-resolution labels on Apr 25, 2024

r12a mentioned this issue on Apr 25, 2024: create an issue against html requesting the list of named entities based on work in action 73, w3c/i18n-actions#77 (Closed)

w3cbot mentioned this issue on Apr 25, 2024: Provide named character entities for invisible and ambiguous Unicode characters, w3c/i18n-activity#1847 (Open)
