Drop UNICODE_VERSION ? #104

jquast · 2023-11-22T17:45:09Z

From the work and results of ucs-detect, https://ucs-detect.readthedocs.io/results.html

I have discovered that terminals do not support a single version of Unicode. At this time, very few support a single version of the specification completely.

For specific types, like wide characters, support may vary at any version. For example, GNOME Terminal (https://ucs-detect.readthedocs.io/sw_results/GNOMETerminal.html#gnometerminal) supports 93% of characters unique to version 15.0, and 90% of characters unique to version 14.
It may not be immediately obvious, because "Language Support" is a bit of a proxy for "Zero-Width support": combining characters are best tested with the characters they are expected to combine with. But a terminal's support for combining characters, or the tables used in its code, does not necessarily match its latest wide table. In fact, most terminals only update their wide tables in response to the most popular demand, emoji support.
And of course, though ZWJ and VS-16 came out at roughly Unicode versions 8 and 9, very few terminals that support Unicode 9 or higher in their wide tables also support ZWJ and VS-16, see this specific part of the table:

Because of those results, I think it's perfectly fine to drop support for UNICODE_VERSION. I very much doubt it is used, or useful to anyone when it is, because it cannot correctly describe the terminal's support to wcwidth.

Is it a useful idea?

I was interested whether terminal emulator authors would have feedback about UNICODE_VERSION, and whether they would consider exporting it. I have not received any feedback.

However, with tools like 'ucs-detect', we can programmatically determine, with black-box testing, which wide and zero-width characters are supported and whether ZWJ and VS-16 are supported, right down to exactly which ones. By expressing this as a delta from the expected terminal support, using ranges of codepoints, maybe it is possible to describe it with a complex environment variable.

Just spitballing an idea of what it might look like,

UNICODE_SUPPORT="zero[8.0:!category:Mc,Mn,!1001-1002,!1003],wide[15.1:!zwj,!vs16,!9009-9010]"
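
To make the shape of that string concrete, here is a minimal parsing sketch. It assumes a grammar inferred only from the single example above (`NAME[VERSION:ITEM,ITEM,...]`, sections joined by commas); the function name `parse_unicode_support` is hypothetical, and nothing like this exists in wcwidth today. It deliberately leaves the exception tokens uninterpreted, since their meaning has not been specified.

```python
import re

# Hypothetical grammar, inferred only from the one example in this issue:
#   NAME[VERSION:ITEM,ITEM,...] with sections joined by commas.
_SECTION = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_unicode_support(value):
    """Split a UNICODE_SUPPORT-style string into {name: (version, [items])}."""
    result = {}
    for name, body in _SECTION.findall(value):
        # Everything before the first ':' is the base Unicode version;
        # the remainder is a comma-separated list of raw exception tokens.
        version, _, rest = body.partition(":")
        items = [tok for tok in rest.split(",") if tok]
        result[name] = (version, items)
    return result


example = "zero[8.0:!category:Mc,Mn,!1001-1002,!1003],wide[15.1:!zwj,!vs16,!9009-9010]"
parsed = parse_unicode_support(example)
```

Here `parsed["wide"]` holds the base version `"15.1"` and the raw tokens `!zwj`, `!vs16` and `!9009-9010`. Note that a naive comma split cannot tell whether `Mn` belongs to the `!category:` list, which is one ambiguity a real specification would need to settle.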


danschwarz · 2023-11-29T17:39:15Z

I've been following your work, as I am a contributor to the toot Mastodon client, which uses wcwidth.

My use case: display characters correctly in our TUI across all terminals where it's possible. The main challenge is emoji, where we attempt to display a double width emoji on a terminal that doesn't support a version of Unicode with that emoji, and defaults to single width. We then get display glitches: an off-by-one error on that line.

As you note, UNICODE_VERSION doesn't solve the problem. Your suggestion of a much more granular report, down to the individual symbol level, could be the basis for an application level workaround for terminal deficiencies. I hope...?

But generating that report on a per-session basis could be time consuming and glitch prone. And a set of Python utility functions would be handy to parse the report and insert spaces (?) or cursor movements (?) after unsupported double width symbols.
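
A rough sketch of what the "insert spaces" variant of such a utility might look like, under stated assumptions: the function name `pad_narrow_rendered` is hypothetical, and the set of codepoints that the terminal draws one cell too narrow is assumed to come from some per-terminal report that does not exist yet.

```python
def pad_narrow_rendered(text, narrow_on_this_terminal, filler=" "):
    """Append filler after each character the terminal draws one cell too narrow.

    narrow_on_this_terminal is a set of integer codepoints that the
    application treats as double width but the terminal renders in a
    single cell (assumed input, e.g. from a per-terminal report).
    """
    out = []
    for ch in text:
        out.append(ch)
        if ord(ch) in narrow_on_this_terminal:
            # One extra cell restores the column position the app expects.
            out.append(filler)
    return "".join(out)


# U+1FAE0 MELTING FACE was added in Unicode 14.0; assume an older
# terminal draws it in one cell.
padded = pad_narrow_rendered("a\U0001FAE0b", {0x1FAE0})
```

This only realigns the columns that follow; it does not change how the glyph itself is drawn.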

Another idea is to promote your utility as a test suite that all terminal emulators should run to assess Unicode compliance. Open a bunch of issues against the popular terminals asking them to achieve 100% compliance for a given Unicode version. If this catches on it could be an accepted part of the terminal emulator development cycle.

jquast · 2023-11-29T18:35:01Z

I'm really happy for your feedback and attention on this issue @danschwarz, thank you for writing!!

I am planning to write an article to bring awareness to the issue you are discovering, with an overview and explanation of the results of the ucs-detect utility. I will share that link in this issue for your review.

Then, as you suggest, I will open issues for the most popular terminals, linking directly to the results and that article, and suggest that they use the ucs-detect tool (or replicate the technique) to test for compliance.

As a stretch goal, @GalaxySnail and I improved the tables to be generated using Jinja2, so that we could easily add a template file to generate tables for C, C++, js, ruby, etc, towards #103. I would also like to take the effort to improve our code generation process to also generate code for use in their projects. If I can also submit up-to-date tables to them in the language of the emulator, it would go a long way towards getting it solved, as that appears to be the pain point.

> But generating that report on a per-session basis could be time consuming and glitch prone.

The --quick option of ucs-detect finishes under 0.3 seconds on my computer, and is able to correctly detect that my terminal is v12.0 wide, 15.0 ZWJ, and has no VS-16 support. But wcwidth doesn't have the facilities to understand that right now. My last statement is about, "wcwidth could help understand specific qualities of a terminal's support level by an environment variable".

That environment variable could be produced by ucs-detect's code and incorporated into TUI applications, or the emulator could just export the variable directly. However, I don't believe anyone has ever implemented UNICODE_VERSION in the wild. I received no feedback at all from https://www.jeffquast.com/post/terminal_wcwidth_solution/ and so I expect that a call for UNICODE_SUPPORT would also be ignored.

> And a set of Python utility functions would be handy to parse the report and insert spaces (?) or cursor movements (?) after unsupported double width symbols.

Upstream compliance is much better than downstream workarounds. Although we can write code to add extra spaces to specific emojis for specific terminals, by using ucs-detect's code to detect when it is needed, it is just too many layers away from the problem to properly resolve it, and it would certainly be slow or "glitch-prone".

danschwarz · 2023-12-01T05:12:02Z

I agree that upstream compliance is the right goal here.

To that end, I suggest:

The existing scorecard with A+...F grades should be replaced with more precise metrics. For example...

UNICODE 15 SUPPORT: WIDE: x out of y characters supported. (and whatever the appropriate metrics are for LANG, ZWJ, VS-16).

and so on.
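
As an illustration only, a metric line of that shape could be produced from two sets of codepoints. The function name `support_line` and its inputs are hypothetical; this is not how ucs-detect currently reports results.

```python
def support_line(unicode_version, category, tested, supported):
    """Format an 'x out of y' support metric for one category.

    tested and supported are sets of codepoints; only codepoints that
    were actually tested count toward the score.
    """
    total = len(tested)
    passed = len(tested & supported)
    percent = 100.0 * passed / total if total else 0.0
    return (f"UNICODE {unicode_version} SUPPORT: {category}: "
            f"{passed} out of {total} characters supported ({percent:.1f}%)")


line = support_line("15", "WIDE", {1, 2, 3, 4}, {1, 2, 3})
# → "UNICODE 15 SUPPORT: WIDE: 3 out of 4 characters supported (75.0%)"
```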

I'm not sure if all the categories are equally important to implement. I suppose it depends on the use case. But if it's easy enough to support all categories, simply by updating some data tables, then this should be done.

A badge for "UNICODE 15 COMPLIANT - VERIFIED BY WCWIDTH" (or similar) could be created and offered to any terminal developer whose current production release passes the test 100%. This badge could be updated automatically (not sure how, but I've seen it done in other contexts).

danschwarz · 2023-12-10T14:37:04Z

Re: your stretch goal of generating tables, and even code to consume the table data for use in specific terminal apps: my recommendation is, don't try to boil the ocean. Roll it out in stages:

1. Test suite with compliance badges
2. Data tables that anyone can write code to consume on their own
3. File issues with popular terminal emulators referencing the test suite and data tables
4. Code generation

jquast · 2023-12-14T05:26:50Z

'2.' and '4.' are solved, please see https://github.com/jquast/wcwidth/blob/master/bin/update-tables.py and its use of Jinja2 templates, https://github.com/jquast/wcwidth/tree/master/code_templates. Feel free to add support for more languages in a separate PR.

I decline to volunteer my time to make "compliance badges". If you'd like to contribute to the ucs-detect tool, you are welcome to submit Issues or PRs for whatever you have in mind, but I really don't wish to add the complications of hosting a special web service, or to add automatic image generation to a tool that otherwise has very few dependencies. I cannot be a gatekeeper of compliance testing; I have very limited windows of time during periods of unemployment to volunteer to these projects.

I will begin submitting issues with terminal developers tomorrow. I am cleaning up the last of this article: https://www.jeffquast.com/post/ucs-detect-test-results/

erf · 2024-01-08T03:22:17Z

In WezTerm you can configure the Unicode version, and set it via escape codes. But I think all terminal emulators should strive to support the latest version, although this might break some TUI apps.

jquast mentioned this issue Nov 22, 2023

wcwidth should have a "C Extension" #103

Open

jquast added the question label Dec 29, 2023

This was referenced Jan 12, 2024

Unicode glyphs ruin the TUI in my terminal ihabunek/toot#420

Open

Update unicode table to the version 15.1.0 urwid/urwid#744

Merged
