Prepended_Concatenation_Marks should not be zero-width #119

Open
Jules-Bertholet opened this issue Feb 12, 2024 · 2 comments

@Jules-Bertholet

UAX 44, Prepended_Concatenation_Mark:

A small class of visible format controls, which precede and then span a sequence of other characters, usually digits. These have also been known as "subtending marks", because most of them take a form which visually extends underneath the sequence of following digits.

Because they have a visible display before the characters they modify, these marks should not be considered zero-width. However, this library incorrectly treats them as such.
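
A minimal sketch of the requested behaviour, not this library's actual code: format characters (General_Category = Cf) are treated as zero-width unless they carry the Prepended_Concatenation_Mark property. The function name `is_zero_width_format` and the constant are hypothetical, and the code point list is an assumption based on PropList.txt as of Unicode 15.0, so it should be checked against the UCD version being targeted.

```python
import unicodedata

# Assumed Prepended_Concatenation_Mark set (PropList.txt, Unicode 15.0).
# Verify against the UCD version you actually target.
PREPENDED_CONCATENATION_MARKS = frozenset(
    list(range(0x0600, 0x0606))  # ARABIC NUMBER SIGN .. ARABIC NUMBER MARK ABOVE
    + [0x06DD, 0x070F, 0x0890, 0x0891, 0x08E2, 0x110BD, 0x110CD]
)


def is_zero_width_format(ch: str) -> bool:
    """Hypothetical helper: True if a Cf character should get width 0.

    Prepended concatenation marks are excluded because they always
    have a visible display.
    """
    if ord(ch) in PREPENDED_CONCATENATION_MARKS:
        return False
    return unicodedata.category(ch) == "Cf"
```

With this sketch, U+200B ZERO WIDTH SPACE is still reported as a zero-width format character, while U+0600 ARABIC NUMBER SIGN and U+06DD ARABIC END OF AYAH are not.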

Jules-Bertholet (Author) commented Feb 12, 2024

Unicode §5.21 - "Characters Ignored for Display" - "Default Ignorable Code Point" says:

A small number of format characters (General_Category = Cf) are also not given the Default_Ignorable_Code_Point property. This may surprise implementers, who often assume that all format characters are generally ignored in fallback display. The exact list of these exceptional format characters can be found in the Unicode Character Database. There are, however, three important sets of such format characters to note:

  • prepended concatenation marks
  • interlinear annotation characters
  • Egyptian hieroglyph format controls

The prepended concatenation marks always have a visible display. See “Prepended Concatenation Marks” in Section 23.2, Layout Controls for more discussion of the use and display of these signs.

The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters, U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR, and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE. These characters should have a visible glyph display for fallback rendering, because if they are not displayed, it is too easy to misread the resulting displayed text. See “Annotation Characters” in Section 23.8, Specials, as well as Section 11.4, Egyptian Hieroglyphs for more discussion of the use and display of these characters.

Software that interprets the interlinear annotation characters should probably do that processing before passing text to wcswidth, so assuming fallback rendering (and non-zero width) likely makes sense for them. Additionally, almost no rendering implementations currently support the Egyptian hieroglyph format controls (though this could change in the future), so assuming a fallback rendering may make sense there as well.
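
As a rough illustration of that suggestion, the two ranges named in the quoted passage could be kept in a small exception table that a width routine consults before treating a format character as invisible. The names `VISIBLE_FORMAT_RANGES` and `needs_fallback_glyph` are hypothetical; the ranges are copied from the quotation above, and the width to assign to a fallback glyph is left open.

```python
# Ranges taken from the quoted Unicode passage; all names are hypothetical.
VISIBLE_FORMAT_RANGES = (
    (0xFFF9, 0xFFFB),    # interlinear annotation characters
    (0x13430, 0x1343F),  # Egyptian hieroglyph format controls
)


def needs_fallback_glyph(ch: str) -> bool:
    """Return True if ch falls in a range that should not be treated
    as invisible in fallback display."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VISIBLE_FORMAT_RANGES)
```

A caller would check `needs_fallback_glyph` before applying any "all Cf characters are zero-width" rule, so that these characters keep a non-zero width.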

jquast (Owner) commented Feb 15, 2024

Thanks for all of the detailed information, I'll look into it soon.
