wcwidth should have a "C Extension" #103

jquast · 2023-11-22T17:32:23Z

Previously discussed,

Remove python2 compat codes, introduce dataclasses and type hints for bin/update-tables.py #58 (comment)
Use a new jinja2 templating process to update library to latest unicode (14.0) #61 (comment)

I'm open to any specific solution. Plenty were discussed in the past.

Why compile?

As wcwidth is likely a "hot path" to intensive applications (like pymux, a terminal emulator)
and, the _bisearch() and code tables are basic "if/else" statements and large arrays of integers.
pre-compiling machine code and providing build hooks could very significantly improve the performance of any dependent applications.
and, with our jinja2 framework in bin/update-tables.py, it would be very easy to systematically generate these "code tables" for any type language

My only suggested requirement is that wcwidth install without error on minority operating systems/environments that can't build or fetch a matching pre-built package: that those systems should succeed to install anyway and continue to use the pure-python implementation.

I think using just the basic C language is a fine choice, our use of the language and build would be the most basic and supportable across all kinds of systems and I know C well enough so I don't mind that at all.

Python-like languages like Cython are also very "inclusive" for outside developers to dissect and contribute to, as they are very likely to be python developers, whereas using Rust or something to create a foreign function interface might be very alienating.

The text was updated successfully, but these errors were encountered:

jquast · 2023-11-22T17:51:09Z

Related, I created issue #104 because the code for parsing UNICODE_VERSION to a matching table is a bit complicated, it aims to be very lenient but best-matching. I am confident in writing a "safe" C _bisearch, but it makes it difficult to write wcwidth() and wcswidth() in C because they call out to _wcmatch_version() which does a lot of work (and caches its results), and this would have to be re-implemented in C and its really just a big chore for very little gain!

I think if we make a C extension, we should also drop UNICODE_VERSION support.

SlySven · 2023-12-23T17:16:52Z

I was recently made aware of this wcwidth project and wanted to draw your attention to another related one: widecharwidth that provides a wcwidth.h header file that has been included in the Mudlet MUD Client that I contribute code to.

jquast · 2023-12-24T16:44:40Z

Hello @SlySven that's really great work!! It looks like we have very similar interests :)

I will take more time to look at widecharwidth soon, and file issues or conversations there, my brief findings just by browsing code,

it is not measuring emoji ZWJ and VS-16 sequences (must parse emoji-variation-sequences.txt and emoji-zwj-sequences.txt in generate.py)

many "Wide" characters in EastAsianWidth.txt are actually zero-width (?!?!), be sure to "subtract" from those tables as we do here

wcwidth/bin/update-tables.py

Lines 332 to 336 in 0ba0278

    
           # subtract(!) wide characters that were defined above as 'W' category in EastAsianWidth, 
        
           # but also zero-width category 'Mn' or 'Mc' in DerivedGeneralCategory! 
        
           table[version].values.discard(parse_category(fname=UnicodeDataFile.DerivedGeneralCategory(version), 
        
                                                        category_codes=('Mn', 'Mc'), 
        
                                                        wide=0).values)

but it looks like you have the same logic that our library had, that it measures for zero-width before wide characters, so the duplication of characters in both tables means its a bit of a non-bug

We should also be able to adapt the https://github.com/jquast/ucs-detect utility to use widechar_width.py and can make a clear report of any differences in any case.

Also that's great that you also contribute to Mudlet! I have been meaning to work on extending https://github.com/jquast/telnetlib3 to support the MUD protocol extensions and I have been using Mudlet as a reference client. I've had a hard time keeping maintenance of telnetlib3 but I hope to resume again soon

SlySven · 2023-12-25T19:56:13Z

Ah, I should point out that widecharwidth is not my project - I am one of the main coders of Mudlet though, and what we do is use Qt's QTextBoundaryFinder in QTextBoundaryFinder::Grapheme mode to identifier the first code-point in each grapheme and then feed just that to the (int) widechar_wcwidth(uint32_t) function defined in the widechar_width.h file that we import from @ridiculousfish 's project.

jquast · 2024-01-06T15:12:07Z

widecharwidth provides only "wcwidth", but does not provide a "wcswidth" function, so, it cannot support VS-16 or ZWJ. Probably not much of a problem for MUD's. Anyway I hope to continue my work on telnetlib3 and use mudlet again soon, thanks again for your work there.

jquast mentioned this issue Nov 29, 2023

Drop UNICODE_VERSION ? #104

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wcwidth should have a "C Extension" #103

wcwidth should have a "C Extension" #103

jquast commented Nov 22, 2023

jquast commented Nov 22, 2023 •

edited

SlySven commented Dec 23, 2023 •

edited

jquast commented Dec 24, 2023

SlySven commented Dec 25, 2023

jquast commented Jan 6, 2024

wcwidth should have a "C Extension" #103

wcwidth should have a "C Extension" #103

Comments

jquast commented Nov 22, 2023

Why compile?

jquast commented Nov 22, 2023 • edited

SlySven commented Dec 23, 2023 • edited

jquast commented Dec 24, 2023

SlySven commented Dec 25, 2023

jquast commented Jan 6, 2024

jquast commented Nov 22, 2023 •

edited

SlySven commented Dec 23, 2023 •

edited