add letters regex to match for more non ASCII chars #51

faxemaxee · 2020-09-12T11:05:33Z

Hey, it's me again! :D

I realised that the letters rule only checks for a very small set of characters, which causes issues in our product, which supports 14 languages. As a German, I cannot use umlauts (ä, ö, ü) or other special letters like ß. After debugging and reading up on regex, I decided to make another PR that includes SOME, but not all, of the charsets I would like to see in the package. I also updated the readme with the necessary information about the supported chars.

Regarding tests: first and foremost, I wanted to make sure that the general behaviour did not change, so I added negative tests for all the other rules. I then started to write complete checks for the other charsets (not included in the PR) and quickly realised that something like letters(128) took several seconds to finish (we should consider capping the count value here, maybe at 10 or so; everything above 5 sounds unreasonable to me, but who knows). And those 128 were only the Greek and Coptic alphabet. Not even starting with the CJK set... :D
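To make the capping idea concrete, here is a rough sketch. The names `MAX_COUNT`, `clampCount`, and `hasAtLeast` are hypothetical and not part of the package, and the limit of 10 is just the value floated above. It also assumes the count can be checked by counting matches in a single pass rather than by encoding it in the pattern itself.

```javascript
// Hypothetical cap; 10 is only the value suggested in the comment above.
const MAX_COUNT = 10;

// Clamp a requested minimum count into the range [1, MAX_COUNT].
// Anything that is not a positive integer falls back to 1.
function clampCount(count) {
  if (!Number.isInteger(count) || count < 1) {
    return 1;
  }
  return Math.min(count, MAX_COUNT);
}

// Check that `str` contains at least `count` (after clamping) characters
// matching `charClass`, e.g. '[a-z]'. Matches are counted in one pass,
// so the size of the count does not change the shape of the regex.
function hasAtLeast(str, charClass, count) {
  const matches = str.match(new RegExp(charClass, 'g')) || [];
  return matches.length >= clampCount(count);
}
```

Note that clamping silently treats letters(128) like letters(10); rejecting out-of-range values with an error would be the stricter alternative.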

Okay, let me know what you think!

tarunbatra · 2020-09-14T18:36:35Z

Hey @faxemaxee, glad to see you again. :D

This is really a very tricky issue, and that's why I took some time to analyze it. I have been wanting to do this, but wanted to avoid handpicking Unicode characters to achieve it. The two broad issues with handpicking are:

Keeping all the character sets in the library is not maintainable. And choosing a few character sets over others feels wrong. E.g., the languages I speak (Hindi and Punjabi) are not included in this PR.
The same will need to be done for digits and symbols, which is a can of worms.

I was looking to use Unicode Property Escapes to achieve this, such that a regex like /\p{L}/u would be able to replace /[a-zA-Z]/ and include all the letters from all the scripts.
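For illustration, a minimal sketch of the difference between the two patterns. It assumes a runtime that supports Unicode property escapes (they require the `u` flag), and the helper names `hasAsciiLetter` and `hasUnicodeLetter` are made up for this example, not part of the library.

```javascript
// Hypothetical helpers for comparison; not part of the library's API.

// ASCII-only character class: matches a-z and A-Z, nothing else.
const asciiLetter = /[a-zA-Z]/;

// \p{L} matches any code point in the Unicode "Letter" general category.
// The `u` flag is required; without it, \p is not a property escape.
const unicodeLetter = /\p{L}/u;

function hasAsciiLetter(str) {
  return asciiLetter.test(str);
}

function hasUnicodeLetter(str) {
  return unicodeLetter.test(str);
}
```

With these, a string such as 'ß' or 'äöü' fails the ASCII check but passes the Unicode one, while a digits-only string fails both.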

As of now I am concerned about browser support for this and about its performance. A quick test on regex101 seems promising, but it needs more effort.
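One possible way to address the browser-support concern is to feature-detect property escapes at runtime and fall back to the ASCII class. This is only a sketch under that assumption, not the library's implementation, and `buildLetterRegex` is a made-up name.

```javascript
// Hypothetical helper: returns a letter-matching regex, preferring
// Unicode property escapes when the engine supports them.
function buildLetterRegex() {
  try {
    // Using the RegExp constructor defers pattern parsing to runtime,
    // so an unsupported pattern throws a catchable SyntaxError here
    // instead of breaking the whole script at parse time.
    return new RegExp('\\p{L}', 'u');
  } catch (err) {
    // Engines without property-escape support land here.
    return /[a-zA-Z]/;
  }
}

const letter = buildLetterRegex();
```

The trade-off is that behaviour then differs between engines: the fallback still rejects non-ASCII letters, so it would need to be documented.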

faxemaxee · 2020-09-22T21:01:48Z

Hey @tarunbatra,

I totally get your concern; I basically had the same one. AFAIK /\p{L}/ is far from having good browser support. I figured that not supporting any charsets is worse than supporting some. That's why I also updated the readme to clearly state which sets are supported and to invite others to contribute the charsets they want to see supported.

I did some research and found this: https://github.com/slevithan/xregexp

But I am fine with this PR not being accepted if it does not meet your standards regarding this issue. :)

tarunbatra · 2020-10-04T21:52:55Z

Hey @faxemaxee, sorry for the late reply.

The browser support doesn't look very bad to me as it seems every major browser except IE (surprise) and Firefox Android supports Unicode Property Escapes. I do agree with you that something is better than nothing but at this point I feel like I'd rather wait or use a library like the one you suggested.

I understand that this feature is important for you and I'm sorry this couldn't be shipped. I appreciate the effort you have put in and hope you'll be able to use a fork of the library until we can ship it. :)

add letters regex to match for more non ASCII chars

7f6e505
