Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add letters regex to match for more non ASCII chars #51

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

faxemaxee
Copy link
Contributor

@faxemaxee faxemaxee commented Sep 12, 2020

Hey, it's me again! :D

I realised that the letters rule only checks for a very small amount of chars thus causing issues in our 14 language support product. Me, as a German cannot use Umlaute (ä, ö, ü) or other special letters like ß. After debugging and reading up on regex I decided to make another PR included SOME but not all charsets I would like to see in the package. I also updated the readme with the necessary information about the supported chars.

Regarding tests. First and foremost I wanted to make sure that the general behaviour did not change and added negative test for all the other rules. I then started to make complete checks for the other charsets (not included in the PR) and quickly realises that something like letters(128) took several seconds to finish (we should consider capping the count value here, maybe 10 or so, everything above 5 sounds unreasonable to me, but who knows). And those 128 were only the greek and coptic alphabet. Not even starting with the CJK set... :D

Okay, let me know what you think!

@tarunbatra
Copy link
Owner

Hey @faxemaxee, glad to see you again. :D

This is really a very tricky issue and that's why I took some time to analyze it. I have been wanting to do this, but wanted to avoid the handpicking of unicode characters to achieve this. This two broad issues with these are:

  1. Maintaining all the character sets in the library is not maintainable. And choosing few character sets over others feel wrong. Eg, the languages I speak (Hindi and Punjabi) are not included in this PR.
  2. The same will need to be done for digits and symbols, which is a can of worms.

I was looking to use Unicode Property Escapes to achieve this result such that a regex like /\p{L}/ will be able to replace /a-zA-Z/ and include all the letters from all the scripts.

As of now I am concerned about the support of this in browsers and the performance. A quick test on regex101 seem promising but needs more effort.

@faxemaxee
Copy link
Contributor Author

Hey @tarunbatra,

I totally get your concern, I basically had the same. AFAIK /\p{L}/ is far from good browser support. I figured that not supporting any is worse than supporting some charsets. That's why I also updated the readme to clearly state which sets are supported and invite others to contribute the charsets they want to see supported.

I did some research and found this: https://github.com/slevithan/xregexp

But I am fine with not accepting this PR when it is not meeting your standards regarding this issue. :)

@tarunbatra
Copy link
Owner

Hey @faxemaxee, sorry for late reply.

The browser support doesn't look very bad to me as it seems every major browser except IE (surprise) and Firefox Android supports Unicode Property Escapes. I do agree with you that something is better than nothing but at this point I feel like I'd rather wait or use a library like the one you suggested.

I understand that this feature is important for you and I'm sorry this couldn't be shipped. I appreciate the effort you have put in and hope you'll be able to use a fork of the library until we can ship it. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants