Should wcwidth have "Treat ambiguous-width as wide" option? #123

Open
keatonLiu opened this issue Mar 30, 2024 · 7 comments

Comments

@keatonLiu

keatonLiu commented Mar 30, 2024

import wcwidth

if __name__ == '__main__':
    print(wcwidth.wcswidth("①你好"))
    print(wcwidth.wcswidth("你好啊"))

results in:
image
But ① displays at a width of 2 characters in a monospace font:
image
image

@jquast
Owner

jquast commented Mar 30, 2024

Which terminal emulator are you using in this example?

For iTerm2, this is correct,

image

As well as WezTerm,

image

And also Kitty,

image

@keatonLiu
Author

Maybe wcwidth only focuses on terminal fonts? I'm using PrettyTable, which depends on wcwidth, to generate tables, and I want to display the table text in a browser. For example, I'm using Chrome, and I found that monospace fonts work fine most of the time. But some Unicode characters are displayed with a different width.

@keatonLiu
Author

It would be helpful if I could provide the font family and get a more general result. Is that possible?

@jquast
Owner

jquast commented Mar 30, 2024

wcwidth is primarily focused on terminals; that is, if browsers and terminals disagree, we would rather match terminals. I would expect a JavaScript or browser-based library that focuses on browser width to exist, but I cannot find one at the moment; please suggest one if you do.

Browsers are able to communicate directly with the font engine of the operating system, while wcwidth in Python and other languages cannot, so we generally take a more naive approach. And this is probably why most terminals are also wrong in this case while browsers are not.

In this case, the problem with ① (https://codepoints.net/U+2460) is that it is of Ambiguous width (https://unicode.org/reports/tr11/#Ambiguous) and,

They have a “resolved” width of either narrow or wide depending on the context of their use.

In the following code blocks I use the same character, once with ASCII digits on the same line,

①2345
12345

and once in your example, with your Mandarin Chinese "hello",

①你好
12345

Although they render differently sized, at least on my browser (Firefox 120.0.1), they have approximately the same width. I will say that monospace fonts do not always align vertically in browsers (note how the number '5' does not align in the first example), while they always do in terminals.

Screenshot of the above,
image
(End screenshot)

It would require more experimentation, but maybe for a page of Chinese locale it would render differently, such as in your original screenshot, I'm not really sure.

In any case, there are options on many terminals, to cause ambiguous width characters to display as 2 cells,

I'm not certain, but maybe this option is more frequently used by East Asian language users in terminals?

But it is very problematic -- the entire software stack needs to agree to "treat ambiguous width as wide". For example, here is an "$LD_PRELOAD-able library and a wrapper script" that patches POSIX wcwidth for this option, and references many issues and bugs about this option. https://github.com/fumiyas/wcwidth-cjk

The "Terminal Working Group" tried to come to a consensus about this and other issues, https://gitlab.freedesktop.org/terminal-wg/specifications/-/issues/9#note_406682 -- there was a great deal of discussion but this "Working Group" specifications project has failed to come to any consensus at all on any single issue (the "accepted" folder is empty, 31 open issues)

And maybe this library could also provide such an option, to "treat ambiguous width as wide". I will rewrite this GitHub issue to match that request.
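To make the proposed option concrete, here is a minimal sketch of what "treat ambiguous width as wide" could mean, written with only the standard library's `unicodedata` module. The names `cell_width`, `string_width`, and the `ambiguous_width` parameter are hypothetical and are not wcwidth's API; the sketch also ignores control characters, emoji sequences, and the other special cases wcwidth handles.

```python
import unicodedata


def cell_width(char: str, ambiguous_width: int = 1) -> int:
    """Approximate cell width of one character (hypothetical helper)."""
    if unicodedata.combining(char):
        return 0  # combining marks occupy no cell of their own
    eaw = unicodedata.east_asian_width(char)
    if eaw in ("W", "F"):
        return 2  # Wide and Fullwidth always take two cells
    if eaw == "A":
        return ambiguous_width  # Ambiguous: resolved by the caller's choice
    return 1  # Narrow, Halfwidth, Neutral


def string_width(text: str, ambiguous_width: int = 1) -> int:
    """Sum of the per-character widths of text."""
    return sum(cell_width(ch, ambiguous_width) for ch in text)


print(string_width("①你好"))                     # ambiguous as narrow: 5
print(string_width("①你好", ambiguous_width=2))  # ambiguous as wide: 6
```

With the option enabled, "①你好" measures the same as "你好啊", which is what the original report expected. As noted above, this only helps if the terminal or renderer is configured the same way.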

jquast changed the title from "Some unicode width not correct" to "Should wcwidth have "Treat ambiguos-width as wide" option?" on Mar 30, 2024
@GalaxySnail
Collaborator

It's also rendered with a width of 1 in Windows Terminal.

What's more, it's rendered with a width of 1 in my web browser (Chromium).

I personally agree that "①" should be East Asian Wide, but unfortunately it is East Asian Ambiguous (and a similar character U+2780 is East Asian Neutral). In my opinion, it may need to be addressed in Unicode, but I'm not sure. Unicode is a bit chaotic. ¯\_(ツ)_/¯
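The East Asian Width property of these characters can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch; the values reported depend on the Unicode version bundled with the Python interpreter, and the `widths` dictionary is a name chosen for this example.

```python
import unicodedata

# Map each sample character to its East Asian Width property value.
widths = {
    char: unicodedata.east_asian_width(char)
    for char in ("①", "\u2780", "你", "a")
}

for char, eaw in widths.items():
    print(f"U+{ord(char):04X} {unicodedata.name(char)}: {eaw}")
```

Here "A" means Ambiguous, "N" Neutral, "W" Wide, and "Na" Narrow, following the categories in UAX #11.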

@keatonLiu
Author

keatonLiu commented Mar 30, 2024

Thank you for all this work! You are very helpful.
I tested this in Windows Terminal and got the same result.
image

I understand this happens because ① is an East Asian Ambiguous character, which is given a different width in different contexts. I agree that there could be a "treat ambiguous width as wide" option, because in most cases it displays at the same size as an East Asian character in my locale.
You can visit this website for an intuitive demo: https://www.zhonghuazidian.com/zi/%E2%91%A0
In my browser, Chrome:
image
Even in Word:
image
I think it will be a wide character if you use a monospace font-family in a browser.

@keatonLiu
Author

Interesting. I'm using Chrome, and it displays differently:
image
