Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unicode double-width character support #34

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

obfusk
Copy link

@obfusk obfusk commented Feb 16, 2021

Patch mentioned in #33 (but for master).

@obfusk obfusk force-pushed the unicode branch 2 times, most recently from 3634ce2 to a1dd08f Compare February 16, 2021 23:33
@obfusk obfusk changed the title [proof of concept] unicode support unicode support Feb 17, 2021
@obfusk obfusk marked this pull request as ready for review February 17, 2021 02:07
@obfusk
Copy link
Author

obfusk commented Feb 17, 2021

I managed to get rid of the FIXMEs and it seems to work quite well 😄
It's still a bit rough though.

@mattip
Copy link
Member

mattip commented Feb 17, 2021

Does this work with python2/python3?

@mattip
Copy link
Member

mattip commented Feb 17, 2021

I guess so, tests are passing

@obfusk
Copy link
Author

obfusk commented Feb 17, 2021

Does this work with python2/python3?

Yes. The tests originally failed with python2, but I fixed that.
I've now tested it myself with pypy, pypy3, and cpython3.9.

@obfusk
Copy link
Author

obfusk commented Feb 17, 2021

It doesn't handle e.g. zero width joiners, though stand-alone emoji work fine.
But readline doesn't really handle those either, and neither do many terminals AFAIK.

@blueyed blueyed added the enhancement New feature or request label Feb 17, 2021
@blueyed
Copy link
Collaborator

blueyed commented Feb 17, 2021

Thanks for this already!
I wonder how good test coverage in this regard really is, and think that it would be great to get e.g. the corner cases you've fixed covered.

Unfortunately codecov is acting up, at least with the latest commits, since it would be useful to revisit/see its patch status (i.e. how much / which code is covered in tests after all).

@obfusk
Copy link
Author

obfusk commented Feb 17, 2021

I wonder how good test coverage in this regard really is, and think that it would be great to get e.g. the corner cases you've fixed covered.

The tests don't really seem to cover any of the curses stuff. Which is of course hard to test.
At least the tests helped me fix the python2 compatibility.

I think I may have fixed one or two existing bugs. But I may also have introduced some 😅

@blueyed blueyed closed this Feb 25, 2021
@blueyed blueyed reopened this Feb 25, 2021
@blueyed

This comment has been minimized.

Copy link
Collaborator

@blueyed blueyed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def width(c):
return 2 if unicodedata.east_asian_width(c) in "FW" else 1
def wlen(s):
return sum(map(width, s))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using https://pypi.org/project/wcwidth/?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking in from the PyPy side, wcwidth seems like a small package with JIT-freindly structure (no complicated classes or deep function chains). It was dormant from 2016 to 2020, but seems to be active again. The license is MIT, which is fine for PyPy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that that bit of code could be improved, but what would be the advantage of using wcwidth over python's built-in unicode support? It seems to provide mostly the same functionality except with its own unicode tables that need to be kept up to date.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wcwidth can also return 0 or -1, which the current code doesn't know how to handle. And I'm not sure what the best way to handle those would be tbh.

For improved performance, we could use e.g. @functools.lru_cache.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(note that pyrepl already shows e.g. control characters as ^c etc.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is possible to find the Unicode version of a terminal by somewhat intrusive detection https://github.com/jquast/ucs-detect

According to the table in the README, lxterminal and kitty support the same unicode version (11.0.0).
But as my screenshot shows, they don't display emoji the same way.

(Slightly OT: I'd consider switching to kitty, but unfortunately it doesn't support Japanese input 😿)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly tested using wcwidth: obfusk/pyrepl-wide@eba676a.

  • I don't notice any improvement atm.
  • I had to modify the original prompt length calculation (because that subtracts the length of ansi escape sequences).
  • _my_unctrl() currently turns a zero-width joiner into "\u200d" and I'm not sure returning it unmodified won't break something else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blueyed is there a suspicion that this PR is worse in some cases than the current code? If not maybe it can be merged as-is and other alternatives could be parts of future PRs?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with wcwidth's upcoming width() function (and setting aside the differences between terminals), handling zero-width joiners etc. properly is not easy, since width(s) == sum(width(c) for c in s) is no longer true for all strings. This would presumably require more careful and extensive modifications to the code than the current PR.

I think wcwidth is likely to be quite useful in improving unicode support even more in the future, but for now I think this PR is a good first step. Further improvements get a lot more complicated and should probably be separate, future PRs.

I'd also point out that readline doesn't really seem to handle emoji any better.

AFAIK this PR doesn't make anything worse. Unfortunately (but understandably since I don't really know a good way to test all the curses stuff), the test suite can't rule that out.

I do however think the PR is still a little rough and could use a code review and probably some updates to comments and documentation before merging.

Copy link
Author

@obfusk obfusk Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking in from the PyPy side, wcwidth seems like a small package with JIT-freindly structure (no complicated classes or deep function chains). It was dormant from 2016 to 2020, but seems to be active again. The license is MIT, which is fine for PyPy.

I don't think wcwidth supports python2 though, which I presume is an issue for PyPy. [edit: I was mistaken]

@obfusk
Copy link
Author

obfusk commented Feb 25, 2021

(Manual codecov link: https://app.codecov.io/gh/pypy/pyrepl/compare/34/diff)

That link gives me a 403.

@mattip
Copy link
Member

mattip commented Mar 15, 2021

Any more thoughts here? I would like to get this in the upcoming PyPy release.

@obfusk
Copy link
Author

obfusk commented Mar 19, 2021

Any more thoughts here? I would like to get this in the upcoming PyPy release.

To summarise:

  • This handles double-width characters as found in e.g. Japanese and Chinese reasonably well.
  • It seems to be an improvement over the current lack of unicode support and AFAIK doesn't make anything worse than before.
  • It doesn't handle emoji (but that's hard and neither does readline afaik).
  • It doesn't seem to handle e.g. Arabic (but I don't know much about that), but I don't think it makes it worse either.
  • I've tested it with Japanese and found no (further) bugs, but it would be nice to have a comprehensive test suite.
  • I'd like someone to review the code to make sure I didn't miss anything.
  • The PR is a bit rough and could probably use a bit of refactoring; some old comments should probably be updated.
  • Future improvements would probably benefit from switching to wcwidth, but this will require more extensive modifications as some invariants/assumptions (such as width(a) + width(b) == width(a+b)) will no longer hold.

@obfusk obfusk changed the title unicode support unicode double-width character support Apr 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants