Support all Unicode Versions #23

Merged: jquast merged 45 commits into master from towards-versioned-unicodes on Jun 1, 2020

Conversation

jquast commented Mar 10, 2017

Support all versions of Unicode. The version is selected by the UNICODE_VERSION environment variable, when defined, or, outside of a shell, explicitly by passing the unicode_version argument to the wcwidth family of functions.

A demonstration utility that determines the Terminal's Unicode Version is available as a separate package, https://github.com/jquast/ucs-detect/, which contains a Problem and Solution statement, copied here:

Problem

Chinese, Japanese, Korean, and Emoticon characters are "double-wide", occupying 2 cells, instead of 1, and some other special characters are "zero-width".
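
As a rough illustration of these width classes, the sketch below uses only Python's standard unicodedata module. The helper name approx_width is hypothetical, and this is not how wcwidth computes its result: the standard module reflects whichever single Unicode version the running Python was built with, which is exactly the limitation described in this statement.

```python
import unicodedata


def approx_width(char):
    """Roughly estimate how many terminal cells one character occupies.

    Hypothetical helper for illustration only; it ignores control
    characters and many special cases that a real wcwidth handles.
    """
    # Combining marks attach to the previous character and occupy no cell.
    if unicodedata.combining(char):
        return 0
    # East Asian Wide (W) and Fullwidth (F) characters occupy two cells.
    if unicodedata.east_asian_width(char) in ("W", "F"):
        return 2
    return 1


# Latin letter, CJK ideograph, combining acute accent.
for sample in ("a", "\u4e2d", "\u0301"):
    print("U+%04X" % ord(sample), approx_width(sample))
```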

Any terminal application that formats and displays these characters may have trouble determining how they will be displayed to the end-user.

This problem happens often, because the Unicode Consortium releases new versions of the Unicode Standard periodically, but the source code of libraries and applications is not updated at the same time, or at all!

Many languages and libraries continue to conform only to Unicode 5.0, the version targeted by the last release of wcwidth.c, published by Markus Kuhn in 2007.

Solution

The most important factor is to determine: What version of Unicode is the Terminal Emulator using?

This program, ucs-detect, is able to automatically detect the version of Unicode that the connecting Terminal supports. The python wcwidth library supports all Unicode versions, 4.1.0 through 12.1.0 at the time of this writing, and so it is able to select the matching width tables and return the correct value by using the given value of the UNICODE_VERSION environment variable.
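
One plausible way to map a requested version onto the tables a library actually ships is to pick the newest table that is not newer than the request. The sketch below assumes that policy; the names parse_version, match_version and SUPPORTED are hypothetical and the list is abbreviated, so this should not be read as the library's own matching code.

```python
def parse_version(text):
    """Convert '9', '9.0' or '9.0.0' into a three-part tuple of ints."""
    parts = [int(part) for part in text.split(".")]
    # Pad short forms so that '9' compares equal to '9.0.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def match_version(requested, supported):
    """Return the newest supported version not newer than `requested`.

    Falls back to the oldest supported version when `requested`
    predates every available table.
    """
    wanted = parse_version(requested)
    ordered = sorted(supported, key=parse_version)
    best = ordered[0]
    for candidate in ordered:
        if parse_version(candidate) <= wanted:
            best = candidate
    return best


# Abbreviated, illustrative list of table versions.
SUPPORTED = ["4.1.0", "5.0.0", "9.0.0", "12.1.0"]
print(match_version("10.0.0", SUPPORTED))
```

With this policy an unknown or in-between version degrades to the closest older table instead of raising an error, which seems a reasonable default for a value that comes from the environment.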

jquast changed the title from "draft: towards unicode version levels" to "review: towards unicode version levels" on Oct 22, 2017

jayvdb commented Oct 2, 2019

Hi @jquast, I am wondering if you might be interested in the "multiple unicodedata versions" problem being solved in a separate library. I created fonttools/unicodedata2#28 about this, as I know that project is already partially solving that problem.

jquast commented Mar 1, 2020

More than anything, I've been mulling over the idea, "How best should users select their unicode version support level?"

And recently, woah! iTerm2 supports a way to switch versions, see "Unicode Version" in https://iterm2.com/documentation-escape-codes.html
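
As far as I can tell from the linked iTerm2 documentation, the version switch is a proprietary OSC 1337 sequence with a UnicodeVersion key. The sketch below only builds that string; the function name is hypothetical, and the exact form of the sequence should be checked against the iTerm2 page before relying on it.

```python
OSC = "\x1b]"  # Operating System Command introducer
BEL = "\x07"   # terminator commonly accepted in place of ST


def iterm2_unicode_version_sequence(level):
    """Build the iTerm2 sequence that requests width tables for `level`.

    Assumed form: OSC 1337 ; UnicodeVersion=<n> BEL. Terminals other
    than iTerm2 are expected to ignore it.
    """
    return "%s1337;UnicodeVersion=%d%s" % (OSC, level, BEL)


# Show the escaped form instead of sending it to the terminal.
print(repr(iterm2_unicode_version_sequence(9)))
```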

And, I think I can devise a way to determine the support version by introspection of the terminal: display 1 double-width char that is new for each unicode support level, and use report-cursor-position to measure how far the cursor advanced, which tells us what support level the connected terminal is at.
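
The probing idea above could be sketched roughly as follows. Only the pure logic is shown: parsing the cursor position report and walking the probe characters. The names parse_cursor_report and detect_level are hypothetical, the probe table is supplied by the caller, and the terminal I/O (raw mode, writing the query, reading the reply) is deliberately left out.

```python
import re

# A terminal answers the "report cursor position" query (ESC [ 6 n)
# with ESC [ <row> ; <column> R.
CPR_QUERY = "\x1b[6n"
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cursor_report(reply):
    """Return (row, column) from a cursor position report, or None."""
    match = CPR_REPLY.search(reply)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def detect_level(probes, measure_width, default):
    """Return the newest level whose probe character is drawn 2 cells wide.

    `probes` maps a version level to one character that first became
    double-width at that level. `measure_width(char)` is expected to
    write the character and return how many columns the cursor advanced.
    """
    detected = default
    for level in sorted(probes):
        if measure_width(probes[level]) != 2:
            break  # the terminal does not know this level yet
        detected = level
    return detected
```

In a real program measure_width would write the character, send CPR_QUERY, and subtract the column reported before from the column reported after.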

So, in the years since I first developed wcwidth for python, there have been some enhancements to the general ecosystem for determining or setting the version level, but nothing particularly universal or portable/common.

I waited for a few years to add 24-bit color support for https://github.com/jquast/blessed because there was no way to determine whether the terminal would support it, and I couldn't decide how to expose an easy API to select 24-bit color support. Over the years, all terminals implementing 24-bit colors added a COLORTERM environment variable definition to announce their support, http://jdebp.eu/Softwares/nosh/guide/TerminalCapabilities.html

So now the code is perfectly clear and straightforward for me as a library author, and downstream applications, and even users, do not have to specify this terminal support level. Existing applications that use the library can support 24-bit colors without changes by users or the application developers.
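
The COLORTERM check mentioned above can be sketched in a few lines. This is not the code used by blessed; the function name is hypothetical, and "truecolor" and "24bit" are the values I understand terminals commonly advertise.

```python
import os


def supports_truecolor(environ=None):
    """Guess 24-bit color support from the COLORTERM environment variable."""
    if environ is None:
        environ = os.environ
    value = environ.get("COLORTERM", "").strip().lower()
    return value in ("truecolor", "24bit")


print(supports_truecolor())
```

Because the terminal announces the capability itself, neither the application nor the user has to configure anything, which is the property being sought for the Unicode version as well.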

So anyway, I do think an environment variable is the best way to go, at least from a terminal support level perspective.

@@ -92,26 +97,48 @@ def flushout():
assert 'narrow Python build' in err.args[0], err.args
LIMIT_UCS = 0x10000

-#: printable length of highest unicode character
+#: printable length of highest unicode character description
Owner Author

Mistaken comment, revert

if inp.code == term.KEY_ENTER:
    break
elif inp.code == term.KEY_ESCAPE or inp == chr(3):
    text = None
Owner Author

Should not return None

for version, boundaries in ZERO_WIDTH.items():
    for (begin, end) in boundaries:
        if version == _wcmatch_version(unicode_version):
            for val in [_val for _val in
Owner Author

Cleanup


.. autofunction:: wcwidth._get_package_version

.. autofunction:: wcwidth._wcmatch_version
Owner Author

Duplicate. Should make function public.

jquast commented Mar 1, 2020

The documentation for the wcwidth and wcswidth functions will show a default value of None for the unicode_version argument, which means that the value of the UNICODE_VERSION environment variable will be used, or 8 if unspecified.
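
The lookup order described above (explicit argument, then environment variable, then 8) might look like the following minimal sketch. The function and constant names are hypothetical, and the merged code may resolve the value differently.

```python
import os

DEFAULT_UNICODE_VERSION = "8"  # fallback level named in this comment


def resolve_unicode_version(unicode_version=None, environ=None):
    """Pick the Unicode version level to use for width lookups.

    An explicit argument wins; otherwise UNICODE_VERSION from the
    environment is used; otherwise the default applies.
    """
    if environ is None:
        environ = os.environ
    if unicode_version is not None:
        return unicode_version
    # An empty variable is treated the same as an unset one.
    return environ.get("UNICODE_VERSION") or DEFAULT_UNICODE_VERSION


print(resolve_unicode_version())
```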

And in the README, we will clearly spell out this transitional time of terminal support, and how to set the environment variable for version level 9, if you like, for terminals like iTerm2, to see the results magically appear in any downstream programs like bpython without changes.

And that's the real goal here: if terminal applications or power users can start exporting this variable, we can have a language-independent solution for unicode version level selection.

codecov bot commented Jun 1, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@ce8acd8).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master      #23   +/-   ##
=========================================
  Coverage          ?   97.84%           
=========================================
  Files             ?        3           
  Lines             ?       93           
  Branches          ?       18           
=========================================
  Hits              ?       91           
  Misses            ?        1           
  Partials          ?        1           

Last update ce8acd8...5177f98.

jquast changed the title from "review: towards unicode version levels" to "Support *all* Unicode Versions" on Jun 1, 2020
jquast changed the title from "Support *all* Unicode Versions" to "Support all Unicode Versions" on Jun 1, 2020
jquast merged commit 16a762f into master on Jun 1, 2020
jquast deleted the towards-versioned-unicodes branch on June 1, 2020 16:48