wrong width for U+00AD #8

stevengj · 2015-03-11T22:08:12Z

Hi, I was looking at your wcwidth library for comparison, since in the utf8proc library we are also implementing a similar feature (see JuliaStrings/utf8proc#2). The first disagreement that I came across between your implementation and ours was for U+00AD (soft hyphen), where you seem to give 1

>>> from wcwidth import wcwidth
>>> wcwidth(unichr(173))
1

and we give zero (a soft hyphen is used for line breaking, but is ordinarily not printed). In general, we return 0 for most characters in category Cf (formatting control characters). The wcwidth function on MacOS 10.10.2 also returns -1 (not printable) for this code point.
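For reference, the general category of U+00AD can be checked with Python's standard library. This is only a minimal sketch, assuming Python 3 (where `chr` replaces the `unichr` call used above) and the Unicode tables bundled with the interpreter. It says nothing about how a terminal renders the character.

```python
import unicodedata

SOFT_HYPHEN = chr(0x00AD)  # same code point as unichr(173) above

# General category of U+00AD; "Cf" stands for "Other, format".
category = unicodedata.category(SOFT_HYPHEN)
name = unicodedata.name(SOFT_HYPHEN)
print(name, category)
```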

Am I calling your implementation incorrectly? This is for git master of wcwidth.


stevengj · 2015-03-11T22:11:09Z

In case it is helpful, the draft table of character widths that we are currently planning to use can be found in this CharWidths.txt gist (each line of which is codepoints; width) where non-printing characters are assigned a width of 0. This is generated automatically from the unicode 7 tables combined with font metrics from GNU unifont, as described in JuliaStrings/utf8proc#27

jquast · 2015-03-11T22:22:10Z

Interesting, what terminal are you testing the "character cells consumed when printed" on OSX? I too am using OSX, and on iTerm2 it displays as "a-b", consuming 3 characters, so wcwidth would be correct, here... I would need to see evidence of it not forwarding the cell when printed on at least some terminal emulators, and file bugs for the others. Just to be very clear, the purpose of wcwidth is "printable width on a terminal", and not firefox or anything else (for which such character is hidden).

Also, I don't necessarily trust the OS-provided 'wcwidth', they are typically based on very old (5-10 years old) unicode specifications. I have a program I've tested on osx and linux, both are wildly different, and in each case my version was correct: https://github.com/jquast/wcwidth/blob/master/bin/wcwidth-libc-comparator.py

The combining and wide character tables are programmatically updated by "python setup.py update", which is similar to your https://github.com/JuliaLang/utf8proc/pull/27/files#diff-3832b9cfe2fc10d35ac5c63d9b7b8133R20

There are no Unicode specification reference tables for 0-width characters that I know of, so it's just hardcoded here https://github.com/jquast/wcwidth/blob/master/wcwidth/wcwidth.py#L161-171

jquast · 2015-03-11T22:38:39Z

Using the 'Cf' category listings on iTerm2, it appears the following all consume 1 character cell, some with symbols, some simply by blanks

And the following consume 0 cells:

180E
200B
200C
200D
FEFF

which may indeed need to be supported by wcwidth once I test a few more terminals

stevengj · 2015-03-12T02:08:53Z

(We don't trust the system-provided wcwidth either, for the same reason as you, which is why we compute the widths independently. However, the OSX 10.10.2 wcwidth agrees with our results when it returns a nonnegative value, so it mostly seems to have errors of omission—it returns -1 for many valid printable characters from recent Unicode standards. Moreover, U+00AD has been part of Unicode since 1993, so I would think that most wcwidth implementations would handle it properly.)

There is an interesting article on the soft hyphen, which apparently has had a controversial history, and is rendered in different ways depending on the font and the rendering system. I'm not sure what the right answer is here, but the Unicode standard seems to somewhat favor the viewpoint that it should be invisible although it leaves it up to the implementation. However, the article mentions that the Unicode FAQ does say In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances and maybe that is what is done in practice.

cc: @jiahao and @StefanKarpinski.

stevengj · 2015-03-12T02:17:43Z

Note that the Arabic characters U+0601 etc. are defined by the Unicode standard as exceptions to the usual rule that Cf characters are invisible:

Unlike most other format control characters, however, they should be rendered with a visible glyph, even in circumstances where no suitable digit or sequence of digits follows them in logical order. — Unicode Standard v.6.2.0, Section 8.2 - Arabic (p.256)

In contrast, e.g. U+200E is a left-to-right mark, and in my understanding is defined as an invisible formatting character that controls the direction of the text. Some terminals may give it a nonzero width (although the MacOS Terminal with the default font gives it zero width on my machine), but that seems like a bug in the terminal (or the font); it seems like it is better to return what the Unicode standard says rather than propagating a particular buggy implementation.
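Since U+0601 and U+200E share the same general category, a rule based on the category alone cannot separate them. A minimal sketch of a category check with an exception list follows. The names `VISIBLE_CF` and `is_invisible_format_char` are hypothetical, and the exception set holds only the one code point named above, so a real table would need the full set from the standard.

```python
import unicodedata

# Hypothetical exception list, for illustration only: Cf characters that
# should be rendered with a visible glyph. Only U+0601 is named in the
# discussion; this is not a complete table.
VISIBLE_CF = {0x0601}

def is_invisible_format_char(ch):
    """True for Cf characters that are not in the visible-exception set."""
    return unicodedata.category(ch) == "Cf" and ord(ch) not in VISIBLE_CF

for cp in (0x0601, 0x200E, 0x00AD):
    ch = chr(cp)
    print("U+%04X" % cp, unicodedata.category(ch), is_invisible_format_char(ch))
```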

jiahao · 2015-03-12T02:27:07Z

I remember that article about the soft hyphen. Under "Modern Unicode semantics" it references UAX 14 for Unicode 7.0.0, §5.4, which says:

Unlike U+2010 hyphen, which always has a visible rendition, the character U+00AD soft hyphen (shy) is an invisible format character that merely indicates a preferred intraword line break position. If the line is broken at that point, then whatever mechanism is appropriate for intraword line breaks should be invoked, just as if the line break had been triggered by another hyphenation mechanism, such as a dictionary lookup.

The description in the following paragraphs suggests that the rendering of a soft hyphen is accomplished not by printing the soft hyphen itself, but rather by inserting an additional, printable hyphen glyph:

The inserted hyphen glyph can take a wide variety of shapes, as appropriate for the situation. Examples include shapes like U+2010 hyphen, U+058A armenian hyphen, U+180A mongolian nirugu, or U+1806 mongolian todo soft hyphen.

Based on this description it would seem that the character U+00AD by itself is nonprintable and should have a width of 0 or -1.

stevengj · 2015-03-12T02:31:17Z

Interestingly, the Unicode FAQ entry that the SHY article quoted seems to no longer exist. From that passage in the Unicode 7.0.0 standard that @jiahao quoted, it seems like the Unicode consortium decided to put its foot down and declare that the soft hyphen is definitely invisible, ISO 8859-1 be damned.

jquast · 2015-04-21T21:03:32Z

I really appreciate all of the research, @stevengj and @jiahao.

My decision is to use the common denominator across the most popular terminal emulators
for wcwidth. I might make a note of it in the readme that it deviates from the standard, as the
primary purpose of this project is how text is displayed by the most common (utf-8 capable)
terminal emulators.

I've made a checklist:

create a bin/cf-print-test.py or amend bin/wcwidth-browser.py to display the 'Cf' category

Then, test the following and report:

I'm not sure how to gauge the "popular terminal emulators", this is just from memory.

sidenote: More importantly, how should their weight be factored into wcwidth for any given
difference? Perhaps there could be some way to configure how the printable width
of such discrepancies is reported if the consumer of wcwidth
knows their target audience's emulator. Unfortunately, all such terminals
borrow the common value "xterm" or "xterm-256color" for the TERM
environment variable. One could use the response to
the "answerback sequence" (^E), which at least PuTTY replies
to, but I'm afraid that's far out of scope for wcwidth, as it would
require interaction with a terminal driver.

Finally, we can make a PR and release any update.

jiahao · 2015-04-21T21:37:12Z

@jquast thanks for your detailed consideration. As you had stated above, iTerm seems to have different needs from us at this point.

jiahao · 2015-04-21T21:45:03Z

However, I don't think it is possible to provide consistency across terminal environments without considering also the interactions with the choice of users' fonts. Many fonts simply have wrong advance widths for some code points.

Here is a simple rendering test for the fixed width fonts on my system. Consider

U+003C9 U+00302= \omega\hat =  ω̂

should render with the hat combining character on the omega.

U+00302 U+003C9 = \hat\omega =  ̂ω

should render with a hat to the left of omega.
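The distinction between the two sequences comes from U+0302 being a combining mark, which attaches to the character that precedes it. A minimal sketch using only Python's `unicodedata` module shows how the two code points are classified. It does not measure anything about fonts or advance widths.

```python
import unicodedata

OMEGA = "\u03c9"  # GREEK SMALL LETTER OMEGA
HAT = "\u0302"    # COMBINING CIRCUMFLEX ACCENT

# Category "Mn" (nonspacing mark) and a nonzero canonical combining class
# identify a combining character; a base letter has combining class 0.
hat_info = (unicodedata.category(HAT), unicodedata.combining(HAT))
omega_info = (unicodedata.category(OMEGA), unicodedata.combining(OMEGA))
print(hat_info, omega_info)
```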

jquast · 2015-04-21T21:48:04Z

You are correct, but terminal emulators don't typically care, they're the ones who handle the width of "printable cells" -- What is your system, is it a terminal emulator?

jiahao · 2015-04-21T21:50:45Z

The screenshots I pasted were taken from an IPython notebook rendering test HTML using those fonts. I can see the same spacing issues if I manually change the font in OSX Terminal and generate these characters in the Julia console REPL.

jquast · 2015-09-14T06:33:23Z

Version wcwidth 0.1.5 which includes better combining character width determination by PR #11 is available on pypi.

A terminal sequence may be emitted to elicit a cursor position response from the terminal emulator.

This can be used to manually display all questionable characters across different popular font face profiles and terminal emulators, and programmatically determine whether they treat such characters as 0 width, producing a report of the most common discrepancies and resolving any by weighing on the side of "most correct".
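A rough sketch of that measurement technique, assuming a POSIX terminal that answers the standard "Device Status Report" query `ESC [ 6 n` with `ESC [ row ; col R`. The function names here are hypothetical and are not part of wcwidth. The measurement assumes the text does not wrap at the right margin.

```python
import os
import re
import sys
import termios
import tty

# Reply to the cursor position query ESC [ 6 n has the form ESC [ row ; col R
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")

def parse_cursor_report(reply):
    """Extract (row, col) from a cursor position report string."""
    match = CPR_REPLY.search(reply)
    if match is None:
        raise ValueError("not a cursor position report: %r" % reply)
    return int(match.group(1)), int(match.group(2))

def query_cursor():
    """Ask the terminal for the cursor position (needs a real tty)."""
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)            # read the reply without waiting for Enter
        sys.stdout.write("\x1b[6n")
        sys.stdout.flush()
        reply = ""
        while not reply.endswith("R"):
            chunk = os.read(fd, 1)
            if not chunk:            # EOF: no terminal answered
                break
            reply += chunk.decode("ascii", "replace")
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return parse_cursor_report(reply)

def measured_width(text):
    """Columns the cursor advanced after printing text from column 1."""
    sys.stdout.write("\r")
    sys.stdout.flush()
    _, before = query_cursor()
    sys.stdout.write(text)
    sys.stdout.flush()
    _, after = query_cursor()
    return after - before
```

Calling `measured_width("\u00ad")` in an interactive terminal would then report how many cells that emulator advanced for the soft hyphen.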

Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.
- added pytest-benchmark plugin, example use:

    # baseline
    tox -epy312 -- --verbose --benchmark-save=original
    # compare
    tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json

jquast · 2023-10-30T19:32:14Z

This is closed by #91

About U+00AD in particular: it is part of the Cf category, and the entire 'Cf' category is now classified as zero-width, along with 'Mc', 'Zl', 'Zp', and part of the 'Sk' category. I have written a specification that describes precisely how the width of characters is determined: https://github.com/jquast/wcwidth/blob/master/docs/specs.rst#width-of-0 I hope it is helpful.
This issue also discussed the need to best match the behavior of popular terminals. I have also published an automatic testing tool for wide, zero-width, combining, and emoji ZWJ sequences. Though it only works with Python's wcwidth, the technique would be very easy to copy to, or use to aid, other languages or wcwidth implementations: https://pypi.org/project/ucs-detect/
And finally, "BIDI" text was mentioned. I suggest the related resource https://gist.github.com/XVilka/a0e49e1c65370ba11c17 about the state of BIDI, which has had some traction in the last few years. In any case, the 'ucs-tool' appears to verify that left-to-right text is ok with wcwidth. The LTR marker is 0-width.

avih · 2024-03-13T15:57:28Z

For reference, in glibc wcwidth(0xad) appears to be 1.

Judging by this discussion: https://sourceware.org/bugzilla/show_bug.cgi?id=22073 which concluded that it should be 1.

That discussion took place in 2017 - after the main discussion in this issue, but before the last #8 (comment) here.

Also, in musl-libc, 0xad is also of wcwidth 1.

jquast · 2024-03-13T18:03:07Z

It's a bit ambiguous isn't it? From https://codepoints.net/U+00AD,

is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens if they are fall on the line end but remain invisible within the line.

I will add a test to ucs-detect and whichever measured width (0 or 1) that is used among the most popular and compliant terminals will be used in this library.

avih · 2024-03-13T18:53:12Z

For what it's worth, the musl-libc maintainer, @richfelker said on IRC that he thinks it should be 1 because historically it was 1 (in most/all implementations?), and, quoting, "(dalias) unless there's widespread agreement between terminals and wcwidth implementations, all you get by changing it is screen corruption".

Additionally, it was not discussed on the musl mailing lists, possibly because that was acceptable (or no one noticed or cared?).

Additionally, he noted that if anything, it should have probably been -1 and not 0, because if applied, then it affects formatting, not unlike carriage-return or newline or form-feed etc.

And finally, he mentions that "it's widely unused anyway", which is probably true, hence probably not too important overall, though agreement between wcwidth implementations would still be nice.

jquast · 2024-03-13T18:58:14Z

Thanks for relaying @richfelker's thoughts. I'm in full agreement with all of them, especially for -1, as this kind of character is meant to be managed by the terminal emulator, and its width is indeterminate (like \n, \t, etc). But if the most popular terminal emulators measure it as width of 1 then I'd like to match


avih · 2024-03-13T19:19:57Z

if the most popular terminal emulators measure it as width of 1 then I’d like to match

Right.

I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides? (not including windows terminal, which brings its own implementation, because on windows there's no system wcwidth).

And so ultimately, I would think the goal should be agreement between wcwidth implementations, rather than between this implementation and the behavior of popular terminal emulators?

stevengj · 2024-03-13T19:37:10Z

Ultimately, the utf8proc library also decided to report a width of 1 for U+00AD, in order to agree with other wcwidth implementations, and with typical terminal programs which display a soft hyphen as a visible - glyph.

avih · 2024-03-13T20:07:26Z

I would guess that terminals measure its width according to the wcwidth implementation which they use? And I would also guess that typically that would be whatever libc provides?

Well, that was not a good argument, and I would agree that if this was the only or main wcwidth implementation, then it should try to match the common terminal emulators behavior.

But because this is one of several wcwidth implementations, its goal should be to agree with other wcwidth implementations rather than the terminals.

That being said, it would still be nice to know how terminals handle it.

In which case, the test should be dual:

In the middle of a line - where semantically it should be 0.
Towards the end of a line, where Unicode suggests that if a word doesn't fit, then it should have width 1 with visible hyphen[-like] glyph, followed by newline.

I would guess that most terminals don't handle it dually like the Unicode semantics suggests (and would imply a -1 wcwidth value), hence they probably treat it as always 1 or always 0, though that's a guess.

avih · 2024-03-16T08:53:54Z

In which case, the test should be dual...

So, I tested it in the following terminals on Alpine Linux 3.19.1, and all the tested terminal emulators treat it either as hard 0 or hard 1. I.e. no terminal handles it dually, as 0 in the middle of the line and as hyphen plus word break in a word which spills over the end of the line.

Specifically, I tested using this script, and observed the result on-screen (not automated). The SHY byte is always in the word xxx<SHY>yyy:

EDITED: THIS SCRIPT IS BROKEN AND THE RESULTS ARE INVALID. See fixed script at the next post.

test-shy.sh (broken)

#!/bin/sh

dots() {
    R=
    while [ ${#R} -lt $1 ]; do R=$R.; done
    echo "$R"
}

has() { command -v "$1" >/dev/null; }

nth() { shift $1; printf %s\\n "$1"; }

cols() {
      if [ "${COLUMNS-}" ]; then echo $COLUMNS
    elif has stty;       then nth 2 $(stty size)
    elif has ttysize;    then nth 1 $(ttysize)
    else echo 80; fi
}

cols=$(cols)
printf "$(dots $cols)\n\n"
printf "SHY mid line: aaa xxx\255yyy bbb\n\n"
printf "no SHY: $(dots $((cols - 16))) aaa xxxyyy bbb\n\n"
printf "SHY before last column: $(dots $((cols - 34))) aaa xxx\255yyy bbb\n\n"
printf "SHY at the last column: $(dots $((cols - 33))) aaa xxx\255yyy bbb\n\n"

All the terminals were invoked with UTF-8 locale, e.g.:

LC_ALL=en_US.UTF-8 xterm

Results:

xterm 388, VTE (tested {gnome,xfce4,lx}-terminal), konsole 23.08.4, and st 0.9: always display it as U+FFFD REPLACEMENT CHARACTER, as if wcwidth(0xad) == 1:

urxvt: similar to xterm etc. above, but always displays it as a hyphen, as if wcwidth(0xad) == 1.

alacritty 0.12.3 and kitty 0.31.0: seem to ignore it at the input, as if wcwidth(0xad) == 0:

So while 1 is common, I don't think it's black and white.

So I would think the goal should be to match other wcwidth implementations, where the value appears to be 1 at least in glibc, musl, and utf8proc.

avih · 2024-03-18T06:36:40Z

Actually, the test script above is wrong. It printed the byte 0xad (which is an invalid UTF-8 sequence) rather than the UTF-8 sequence for U+00AD, which is 0xc2 0xad.
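The byte values can be double-checked with a few lines of Python. This is only an illustrative sketch, separate from the shell test itself. It encodes U+00AD as UTF-8, shows the octal form that a printf format string would use, and confirms that a lone 0xad byte does not decode.

```python
SOFT_HYPHEN = "\u00ad"

# UTF-8 encoding of U+00AD is the two-byte sequence 0xc2 0xad
encoded = SOFT_HYPHEN.encode("utf-8")
octal = ["%o" % byte for byte in encoded]  # octal digits, as in "\302\255"
print(encoded, octal)

# A lone 0xad byte is a continuation byte without a lead byte
try:
    b"\xad".decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)
```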

This is the revised script:

fixed test-shy.sh

#!/bin/sh

sf="\302\255"  # printf fmt of UTF-8 of U+00AD SOFT-HYPHEN

dots() {
    R=
    while [ ${#R} -lt $1 ]; do R=$R.; done
    echo "$R"
}

has() { command -v "$1" >/dev/null; }

nth() { shift $1; printf %s\\n "$1"; }

cols() {
      if [ "$COLUMNS" ]; then echo $COLUMNS
    elif has stty;       then nth 2 $(stty size)
    elif has ttysize;    then nth 1 $(ttysize)
    else echo 80; fi
}

cols=$(cols)
printf "$(dots $cols)\n\n"
printf "SHY mid line: aaa xxx${sf}yyy bbb\n\n"
printf "no SHY: $(dots $((cols - 16))) aaa xxxyyy bbb\n\n"
printf "SHY before last column: $(dots $((cols - 34))) aaa xxx${sf}yyy bbb\n\n"
printf "SHY at the last column: $(dots $((cols - 33))) aaa xxx${sf}yyy bbb\n\n"

And these are the results at the various terminals (kitty doesn't have "kitty" at the title, and xfce4-terminal and gnome-terminal have the same result as lxterminal - as all are VTE-based):

Like before, this is on Alpine linux 3.19.1 with the terminals installed from the distro packages repository, and all terminals were invoked after exporting LC_ALL=en_US.UTF-8.

Results:

xterm, alacritty, st, and rxvt-unicode always display it as hard-hyphen, as if wcwidth(0xad) == 1.
VTE terminals (xfce4-terminal, gnome-terminal, lxterminal), and konsole always display it as hard space, as if wcwidth(0xad) == 1.
Kitty seems to ignore it at the input, as if wcwidth(0xad) == 0.

jquast added the needs-research label Mar 11, 2015

Screwtapello mentioned this issue Feb 6, 2019

Proposal: Select by character or display indices mawww/kakoune#2724

Closed

jquast added the bug label Jun 1, 2020

jquast mentioned this issue Oct 19, 2023

Bugfixes for zero-width characters #91

Merged

jquast closed this as completed Oct 30, 2023

jquast reopened this Mar 13, 2024

stevengj closed this as completed Mar 13, 2024

stevengj reopened this Mar 13, 2024
