Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wcwidth should have a "C Extension" #103

Open
jquast opened this issue Nov 22, 2023 · 5 comments
Open

wcwidth should have a "C Extension" #103

jquast opened this issue Nov 22, 2023 · 5 comments

Comments

@jquast
Copy link
Owner

jquast commented Nov 22, 2023

Previously discussed,

I'm open to any specific solution. Plenty were discussed in the past.

Why compile?

  • As wcwidth is likely a "hot path" to intensive applications (like pymux, a terminal emulator)
  • and, the _bisearch() and code tables are basic "if/else" statements and large arrays of integers.
  • pre-compiling machine code and providing build hooks could very significantly improve the performance of any dependent applications.
  • and, with our jinja2 framework in bin/update-tables.py, it would be very easy to systematically generate these "code tables" for any type language

My only suggested requirement is that wcwidth install without error on minority operating systems/environments that can't build or fetch a matching pre-built package: that those systems should succeed to install anyway and continue to use the pure-python implementation.

I think using just the basic C language is a fine choice, our use of the language and build would be the most basic and supportable across all kinds of systems and I know C well enough so I don't mind that at all.

Python-like languages like Cython are also very "inclusive" for outside developers to dissect and contribute to, as they are very likely to be python developers, whereas using Rust or something to create a foreign function interface might be very alienating.

@jquast
Copy link
Owner Author

jquast commented Nov 22, 2023

Related, I created issue #104 because the code for parsing UNICODE_VERSION to a matching table is a bit complicated, it aims to be very lenient but best-matching. I am confident in writing a "safe" C _bisearch, but it makes it difficult to write wcwidth() and wcswidth() in C because they call out to _wcmatch_version() which does a lot of work (and caches its results), and this would have to be re-implemented in C and its really just a big chore for very little gain!

I think if we make a C extension, we should also drop UNICODE_VERSION support.

@SlySven
Copy link

SlySven commented Dec 23, 2023

I was recently made aware of this wcwidth project and wanted to draw your attention to another related one: widecharwidth that provides a wcwidth.h header file that has been included in the Mudlet MUD Client that I contribute code to.

@jquast
Copy link
Owner Author

jquast commented Dec 24, 2023

Hello @SlySven that's really great work!! It looks like we have very similar interests :)

I will take more time to look at widecharwidth soon, and file issues or conversations there, my brief findings just by browsing code,

  • it is not measuring emoji ZWJ and VS-16 sequences (must parse emoji-variation-sequences.txt and emoji-zwj-sequences.txt in generate.py)
  • many "Wide" characters in EastAsianWidth.txt are actually zero-width (?!?!), be sure to "subtract" from those tables as we do here
    # subtract(!) wide characters that were defined above as 'W' category in EastAsianWidth,
    # but also zero-width category 'Mn' or 'Mc' in DerivedGeneralCategory!
    table[version].values.discard(parse_category(fname=UnicodeDataFile.DerivedGeneralCategory(version),
    category_codes=('Mn', 'Mc'),
    wide=0).values)
    • but it looks like you have the same logic that our library had, that it measures for zero-width before wide characters, so the duplication of characters in both tables means its a bit of a non-bug
  • We should also be able to adapt the https://github.com/jquast/ucs-detect utility to use widechar_width.py and can make a clear report of any differences in any case.

Also that's great that you also contribute to Mudlet! I have been meaning to work on extending https://github.com/jquast/telnetlib3 to support the MUD protocol extensions and I have been using Mudlet as a reference client. I've had a hard time keeping maintenance of telnetlib3 but I hope to resume again soon

@SlySven
Copy link

SlySven commented Dec 25, 2023

Ah, I should point out that widecharwidth is not my project - I am one of the main coders of Mudlet though, and what we do is use Qt's QTextBoundaryFinder in QTextBoundaryFinder::Grapheme mode to identifier the first code-point in each grapheme and then feed just that to the (int) widechar_wcwidth(uint32_t) function defined in the widechar_width.h file that we import from @ridiculousfish 's project.

@jquast
Copy link
Owner Author

jquast commented Jan 6, 2024

widecharwidth provides only "wcwidth", but does not provide a "wcswidth" function, so, it cannot support VS-16 or ZWJ. Probably not much of a problem for MUD's. Anyway I hope to continue my work on telnetlib3 and use mudlet again soon, thanks again for your work there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants