Propose new function, width(control_codes='ignore') #79

Open
jquast opened this issue Sep 23, 2023 · 6 comments
@jquast
Owner

jquast commented Sep 23, 2023

Problem

Just about every downstream library has some issue with the POSIX wcwidth() and wcswidth() functions, whether in C or in this Python library, which is why a new "width" function is needed.

This is mainly because both functions may return -1, and the return value must be checked, but it often is not.

Although using wcswidth() on a string is the most popular use case, it has the possibility to return -1 by POSIX definition, and Markus Kuhn's 2007 implementation returns -1 for control characters.

The return value is often left unchecked and passed straight to sum(), slice() or screen positioning functions, with surprising results.
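To illustrate the kind of surprise described above, here is a minimal sketch. `toy_wcwidth` and `toy_wcswidth` are hypothetical stand-ins written only for this example (they approximate the -1 behaviour using the standard library's `unicodedata`); they are not the library's implementation.

```python
import unicodedata


def toy_wcwidth(char):
    """Stand-in for wcwidth(): -1 for control characters, else 0, 1 or 2."""
    codepoint = ord(char)
    if codepoint == 0:
        return 0
    if codepoint < 32 or 0x7F <= codepoint < 0xA0:
        return -1  # C0 controls, DEL and C1 controls
    if unicodedata.combining(char):
        return 0
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1


def toy_wcswidth(text):
    """Stand-in for wcswidth(): -1 as soon as one character measures -1."""
    total = 0
    for char in text:
        char_width = toy_wcwidth(char)
        if char_width < 0:
            return -1
        total += char_width
    return total


line = "name\tvalue"  # nine printable cells plus one tab

# Unchecked sum(): the tab's -1 is silently subtracted from the total.
unchecked_sum = sum(toy_wcwidth(char) for char in line)

# Unchecked padding: -1 makes the pad wider than the column instead of failing.
column = 12
padding = " " * (column - toy_wcswidth(line))

# Checked call: the caller has to test for -1 explicitly.
measured = toy_wcswidth(line)
checked = measured if measured >= 0 else None
```

Neither unchecked use raises an error; both simply produce a wrong number, which is why the problem tends to go unnoticed.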

Solution

Provide a new function, width(), that always returns a "best effort" measurement. It may either ignore or measure control codes. If "catching unexpected control codes" is a desired feature, we can continue to provide it through an optional keyword argument that raises an exception rather than returning -1.

Perhaps a new keyword argument control_codes with default value 'ignore', in a similar spirit to the 'errors' argument of https://docs.python.org/3/library/stdtypes.html#bytes.decode:

  • 'ignore': measure all individual control codes and terminal sequences as 0 width.
  • 'strict': raise ValueError on any control codes.
  • 'replace': make best effort to account for horizontal width movement in terminal sequences.
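A rough sketch of how the three modes could behave, assuming a simplified measurement based on `unicodedata` rather than the library's own tables. The regular expression, the helper names and the decision to leave 'replace' unimplemented are all assumptions of this example, not part of the proposal.

```python
import re
import unicodedata

# CSI sequences such as '\x1b[0m' or '\x1b[2K' (pattern assumed for this sketch).
CSI_SEQUENCE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def is_control(char):
    """True for C0 controls, DEL and C1 controls."""
    codepoint = ord(char)
    return codepoint < 32 or 0x7F <= codepoint < 0xA0


def cell_width(char):
    """Simplified cell width: 0 for combining marks, 2 for wide, else 1."""
    if unicodedata.combining(char):
        return 0
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1


def width(text, control_codes="ignore"):
    """Hypothetical width(): never returns -1."""
    if control_codes not in ("ignore", "strict", "replace"):
        raise ValueError(f"unknown control_codes value: {control_codes!r}")
    if control_codes == "replace":
        # Accounting for cursor movement is out of scope for this sketch.
        raise NotImplementedError("'replace' is not modelled here")
    if control_codes == "ignore":
        # Whole terminal sequences count as 0 width, not just the escape byte.
        text = CSI_SEQUENCE.sub("", text)
    total = 0
    for index, char in enumerate(text):
        if is_control(char):
            if control_codes == "strict":
                raise ValueError(f"control code {char!r} at index {index}")
            continue  # 'ignore': control codes measure as 0
        total += cell_width(char)
    return total
```

With this sketch, `width("\x1b[1mbold\x1b[0m")` measures only the four letters, while `width("a\tb", control_codes="strict")` raises ValueError instead of returning -1.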

Workaround

As a workaround, I have suggested calling wcwidth() directly on each individual character and clipping the possible -1 return value to 0, for example: https://github.com/jquast/blessed/blob/a34c6b1869b4dd467c6d1ab6895872bb72db7e0f/blessed/sequences.py#L364

This does the same job as wcswidth() but gives a "best guess". However, this method cannot handle the coming changes to wcswidth() for zero width joiner (ZWJ) sequences.
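The workaround can be sketched roughly as follows. `toy_wcwidth` is a hypothetical stand-in built on `unicodedata` so the example runs without the library; it is not the linked blessed code.

```python
import unicodedata


def toy_wcwidth(char):
    """Stand-in for wcwidth(): -1 for control characters, else 0, 1 or 2."""
    codepoint = ord(char)
    if codepoint == 0:
        return 0
    if codepoint < 32 or 0x7F <= codepoint < 0xA0:
        return -1  # C0 controls, DEL and C1 controls
    if unicodedata.category(char) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters such as ZWJ
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1


def clipped_width(text):
    """Measure per character, clipping each -1 to 0."""
    return sum(max(toy_wcwidth(char), 0) for char in text)


# Control characters no longer poison the result.
two_lines = clipped_width("ab\ncd")

# MAN + ZWJ + WOMAN + ZWJ + GIRL: measured one character at a time, each
# emoji is counted separately, so a terminal that joins them into one
# glyph would need fewer cells than this reports.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
family_width = clipped_width(family)
```

Because each character is measured in isolation, no amount of clipping can make this approach aware of a ZWJ sequence as a whole, which is the limitation noted above.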

@GalaxySnail
Collaborator

Just out of curiosity, are there any real-world examples where this enhancement would be beneficial?

@jquast
Owner Author

jquast commented Sep 30, 2023

I have struck through the suggestion about terminal sequences; I will save it for another issue.

As for the need for a "width" function, just about every downstream library has some issue with the POSIX wcwidth and wcswidth functions.

This is mainly because both functions may return -1, and the return value must be checked, but it often is not.

And I think all downstream users wish for us to have a single function that makes a "best effort". If a zero width joined emoji sequence also contains a newline or other control character, it is best to just return our best estimate of the measurement rather than -1 as wcswidth() does.

wcswidth()

Although using wcswidth() on a string is the most popular use case, it has the possibility to return -1 by POSIX definition, and Markus Kuhn's 2007 implementation returns -1 for control characters, chr(1) through chr(31).

  • This is the function most used by downstream users of this library. The return value is often left unchecked and passed straight to sum(), slice() or screen positioning functions, with surprising results.

wcwidth()

As a workaround, I have suggested calling wcwidth() directly on each individual character and clipping the possible -1 return value to 0, for example: https://github.com/jquast/blessed/blob/a34c6b1869b4dd467c6d1ab6895872bb72db7e0f/blessed/sequences.py#L364

This does the same job as wcswidth() but gives a "best guess". However, this method cannot handle the coming changes to wcswidth() for zero width joiner (ZWJ) sequences.

@jquast
Owner Author

jquast commented Sep 30, 2023

Although I am open to changing wcswidth() to never return -1 and make a "best effort" instead, that would deviate from the original 2007 implementation and the POSIX specification. This is why I suggest an entirely new function name, and strongly suggest that the docstrings of wcswidth and wcwidth recommend it as the best alternative.

@GalaxySnail
Collaborator

Thank you for the clarification!

@jquast
Owner Author

jquast commented Oct 19, 2023

I have created it in a development branch, but I will make a bugfix release first and then open a PR for this:

wcwidth/wcwidth/wcwidth.py

Lines 262 to 277 in 1f1443b

def width(text, unicode_version='auto'):
    """
    Given a unicode string, return its printable length on a terminal.
    Unlike :func:`wcswidth`, ``-1`` is never returned when a C0 or C1 control
    character is encountered.
    :param str text: Measure width of given unicode string.
    :param str unicode_version: An explicit definition of the unicode version
        level to target for measurement, may be ``auto`` (default), which uses
        the Environment Variable, ``UNICODE_VERSION`` if defined, or the latest
        available unicode version, otherwise.
    :rtype: int
    :returns: Approximate number cells needed to display the characters of ``text``.
    """
    return _wcswidth(text, unicode_version=_wcmatch_version(unicode_version), errors='ignore')

@jquast jquast changed the title Suggest new easier public API function, "width" There should be wcswidth() that ignores control characters Oct 21, 2023
@jquast jquast changed the title There should be wcswidth() that ignores control characters There should be wcswidth() that doesn't return -1 for c0 and c1 control characters Oct 21, 2023
@jquast jquast changed the title There should be wcswidth() that doesn't return -1 for c0 and c1 control characters Needs variant of wcswidth() that doesn't return -1 for c0 and c1 control characters Oct 21, 2023
@jquast jquast changed the title Needs variant of wcswidth() that doesn't return -1 for c0 and c1 control characters Variant of wcswidth() that doesn't return -1 for ctrl chars? Oct 21, 2023
@jquast jquast changed the title Variant of wcswidth() that doesn't return -1 for ctrl chars? Have wcswidth() ignore ctrl chars? Oct 21, 2023
@jquast jquast changed the title Have wcswidth() ignore ctrl chars? Can wcswidth() that ignores control characters? Oct 21, 2023
@jquast jquast changed the title Can wcswidth() that ignores control characters? New width ignores control characters? Oct 21, 2023
@jquast jquast changed the title New width ignores control characters? New width() function that ignores control characters? Oct 21, 2023
@jquast
Owner Author

jquast commented Oct 21, 2023

I have revised this description and the related issue #92.

And I do think they are closely related. Control characters like \b are just as much terminal sequences as \x1b[0m. Ignoring the '\x1b' is not enough: I think we should measure the full sequence \x1b[0m as 0 instead of 3 (char lengths 0, 1, 1, 1), and provide a choice for ambiguous characters like '\b' and '\x1b[D' to be measured as either -1 (moving backwards, 'parse') or 0 (ignored).
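As a rough illustration of that idea, the sketch below tokenizes CSI sequences as a unit and offers a choice for backwards movement. The function name `measure`, the `backwards` keyword and the regular expression are assumptions made for this example, and every printable character is counted as one cell to keep it short.

```python
import re

# One token per match: a whole CSI sequence, or else any single character.
TOKEN = re.compile(
    r"\x1b\[(?P<params>[0-?]*)[ -/]*(?P<final>[@-~])|(?P<char>.)",
    re.DOTALL,
)


def measure(text, backwards="ignore"):
    """Hypothetical measurement; backwards is 'ignore' (0) or 'parse' (-1)."""
    if backwards not in ("ignore", "parse"):
        raise ValueError(f"unknown backwards value: {backwards!r}")
    total = 0
    for match in TOKEN.finditer(text):
        char = match.group("char")
        if char is not None:
            codepoint = ord(char)
            if char == "\b":
                if backwards == "parse":
                    total -= 1  # backspace moves one cell to the left
            elif codepoint < 32 or 0x7F <= codepoint < 0xA0:
                pass  # other control characters measure as 0
            else:
                total += 1  # simplification: wide characters are not handled
        elif match.group("final") == "D" and backwards == "parse":
            # Cursor Backward: '\x1b[D' moves 1 cell, '\x1b[3D' moves 3.
            params = match.group("params")
            total -= max(int(params), 1) if params.isdigit() else 1
        # Any other complete sequence, such as '\x1b[0m', contributes 0.
    return total


# Ignoring only the escape character still counts '[', '0' and 'm'.
naive = sum(0 if char == "\x1b" else 1 for char in "\x1b[0m")
```

Here `naive` counts the three trailing characters of the sequence, while `measure("\x1b[0m")` treats the sequence as a single zero-width token. The result of 'parse' is a net horizontal displacement and is not clamped at column 0 in this sketch.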

@jquast jquast changed the title New width() function that ignores control characters? Propose new function, width(control_codes='ignore') Oct 21, 2023