Is case-insensitive string match supported? #132

Open
twang18 opened this issue Nov 5, 2020 · 13 comments

Comments

@twang18

twang18 commented Nov 5, 2020

Hi, thanks for the great work! I am wondering whether case-insensitive string matching is supported. For example, if "information system" is in the built trie, can it also match "Information system", "Information System" and "INFORMATION SYSTEM"?

@Dobatymo

Dobatymo commented Nov 6, 2020

To support case-insensitive matching you should insert lowercase words into the trie and then lowercase your input before you search for it.
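
A minimal sketch of this suggestion in plain Python. It does not use the trie itself: `find_all_case_insensitive` is a hypothetical helper, and the naive `str.find` scan only stands in for the automaton search. The point is that both the stored keywords and the searched text are lowercased first.

```python
def find_all_case_insensitive(keywords, text):
    """Return sorted (start, end, keyword) tuples, ignoring case.

    Hypothetical helper: a real setup would insert the lowercased
    keywords into the trie instead of scanning with str.find.
    """
    lowered_keywords = [keyword.lower() for keyword in keywords]
    lowered_text = text.lower()

    matches = []
    for keyword in lowered_keywords:
        start = lowered_text.find(keyword)
        while start != -1:
            matches.append((start, start + len(keyword), keyword))
            start = lowered_text.find(keyword, start + 1)
    return sorted(matches)


matches = find_all_case_insensitive(
    ["Information System"],
    "INFORMATION SYSTEM and information system",
)
```

Note that the reported offsets refer to the lowercased text, which only matches the original text when lowercasing does not change its length.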

@twang18
Author

twang18 commented Nov 6, 2020

Thanks for the suggestion! This could be one solution.

@WojciechMula
Owner

There's no case-insensitive search option and, TBH, I'd rather not add one to the core library. What I can suggest as an enhancement is an extra Python wrapper class that keeps the same interface but does the lowercase conversion underneath.
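
A rough sketch of what such a wrapper might look like. `CaseInsensitiveAutomaton` is a hypothetical name, not part of the library, and it only assumes the wrapped object exposes `add_word(key, value)`, `make_automaton()` and `iter(text)`. `NaiveAutomaton` is a stand-in written for this example so the sketch runs without the C extension.

```python
class CaseInsensitiveAutomaton:
    """Hypothetical wrapper: lowercases keys and haystacks before delegating."""

    def __init__(self, automaton):
        # Any object with add_word / make_automaton / iter will do.
        self._automaton = automaton

    def add_word(self, key, value):
        return self._automaton.add_word(key.lower(), value)

    def make_automaton(self):
        return self._automaton.make_automaton()

    def iter(self, text):
        # Offsets yielded here refer to the lowercased text.
        return self._automaton.iter(text.lower())


class NaiveAutomaton:
    """Stand-in for this example only; yields (end_index, value) pairs."""

    def __init__(self):
        self._words = {}

    def add_word(self, key, value):
        self._words[key] = value
        return True

    def make_automaton(self):
        pass  # nothing to build for the naive scan

    def iter(self, text):
        for end in range(len(text)):
            for key, value in self._words.items():
                # True when `key` ends exactly at position `end` (inclusive).
                if text.endswith(key, 0, end + 1):
                    yield end, value


wrapper = CaseInsensitiveAutomaton(NaiveAutomaton())
wrapper.add_word("Information System", "info-sys")
wrapper.make_automaton()
results = list(wrapper.iter("An INFORMATION SYSTEM"))
```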

@pombredanne
Collaborator

IMHO case insensitive is something best done outside (that's the way I do this)

@zhu

zhu commented Mar 24, 2021

https://github.com/nppoly/cyac#unicode
Wrong offsets may be returned when the text is converted to lowercase outside the library.

@WojciechMula
Owner

@zhu Thanks, that's important. So, a safe way would be to use UTF-32 (4 bytes per code point), which is not very memory-friendly.

@zhu

zhu commented Mar 26, 2021

> @zhu Thanks, that's important. So, a safe way would be to use UTF-32 (4 bytes per code point), which is not very memory-friendly.

I think UTF-32 cannot solve it. u"İ".lower() generates two Unicode code points, and u"İ".lower().upper() != u"İ".
There are some special casing rules:
https://stackoverflow.com/questions/23524231/lower-case-of-turkish-character-dotted-i
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

casing incurs a change in string length or is dependent on context or locale
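
A small sketch of the offset problem, using only built-in `str` methods (the sample sentence is made up for illustration). The position found in the lowercased text no longer lines up with the original text, because "İ" lowercases to two code points.

```python
# Illustrative only: lowercasing outside the matcher shifts the offsets.
text = "İstanbul Information System"
needle = "information system"

lowered = text.lower()  # "İ" becomes "i" followed by U+0307 (combining dot above)
start_in_lowered = lowered.find(needle)
start_in_original = text.find("Information System")

# Slicing the original text with the offset taken from the lowered text
# lands one character too far to the right.
wrong_slice = text[start_in_lowered:start_in_lowered + len(needle)]
```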

@WojciechMula
Owner

@zhu Thanks a lot. This is more than strange. So there is no perfect solution.

@pombredanne
Collaborator

I actually also experienced this same rare issue (not specific to pyahocorasick):
https://github.com/nexB/scancode-toolkit/blob/f3ef5b4ad823577e507d673a7fbc65d5efe4f6af/src/licensedcode/match.py#L1432

NOTE: we have a rare Unicode bug/issue because of some
Unicode codepoint such as some Turkish characters that
decompose char + punct when casefolded.
See nexB/scancode-toolkit#1872
See also: https://bugs.python.org/issue34723#msg359514

This is actually a well-known problem, known as "The Turkish-I Problem". See:

The way I solved it on my side is to check whether the length changes when folding case, and if it does, do something special. This slows everything down, of course. Even more so in my case because I split on punctuation before I lowercase (which makes me think that I could simplify my case by lowercasing after my split ...)
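
A minimal sketch of such a length check. The "something special" here is an assumption made for illustration, not what scancode-toolkit does: any character whose folded form is longer than one code point is simply left unfolded, so offsets stay aligned at the cost of missing a few matches. Both function names are hypothetical.

```python
def folding_changes_length(text):
    """Return True when case folding would shift offsets in `text`."""
    return len(text.casefold()) != len(text)


def length_preserving_casefold(text):
    """Fold case character by character, keeping offsets stable.

    Assumption for this sketch: characters whose folded form is not
    exactly one code point (for example "İ" or "ß") are kept as-is.
    """
    if not folding_changes_length(text):
        return text.casefold()  # fast path: plain folding is safe

    folded_chars = []
    for char in text:
        folded = char.casefold()
        folded_chars.append(folded if len(folded) == 1 else char)
    return "".join(folded_chars)


sample = "İSTANBUL Straße"
folded_sample = length_preserving_casefold(sample)
```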

See also https://bugs.python.org/issue34723#msg359514 and in particular these messages:

I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

Here is the document

We don't update the unicodedata database in patch releases because updates are backwards incompatible. Python 3.9 will ship with 13.0. Python 3.10 is going to ship with 14.0.

IMHO we should wait for Python 3.10

@pombredanne
Collaborator

Note that the rationale for my suggestion to wait is that Python is implementing Unicode, and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lowercase dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14.

It could be fixed here too, of course, but it would only hold if you assume that we are processing Unicode strings, which may not always be true.

@zhu

zhu commented Mar 30, 2021

> Note that the rationale for my suggestion to wait is that Python is implementing Unicode, and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lowercase dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14.
>
> It could be fixed here too, of course, but it would only hold if you assume that we are processing Unicode strings, which may not always be true.

German 'ß'.lower() == 'ß', 'ß'.upper() == 'SS', 'ẞ'.lower() == 'ß', 'ß'.upper().lower() == 'ss', 'ß'.casefold() == 'ss'. http://www.personal.psu.edu/ejp10/blogs/gotunicode/2008/07/a-new-german-unicode-letter-ca.html
It seems scancode-toolkit cannot handle this case.

BTW: Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. Unicode normalization may also change the text length, so it has similar issues.

It would be better to find a library that can keep track of character offsets when normalizing and casefolding text.
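
As a rough sketch of what such offset tracking could look like for case folding alone (normalization is not handled here, and both function names are hypothetical): fold one character at a time and record, for every folded character, the index of the original character it came from.

```python
def casefold_with_offsets(text):
    """Return (folded_text, offsets).

    offsets[i] is the index in `text` of the character that produced
    folded_text[i]. Normalization is deliberately not covered.
    """
    folded_parts = []
    offsets = []
    for index, char in enumerate(text):
        folded = char.casefold()
        folded_parts.append(folded)
        offsets.extend([index] * len(folded))
    return "".join(folded_parts), offsets


def find_original_span(text, needle):
    """Find `needle` ignoring case; return (start, end) in the original text."""
    folded_needle = needle.casefold()
    if not folded_needle:
        return None
    folded_text, offsets = casefold_with_offsets(text)
    start = folded_text.find(folded_needle)
    if start == -1:
        return None
    last = start + len(folded_needle) - 1
    # Map both ends back to original indices; `end` is exclusive.
    return offsets[start], offsets[last] + 1


text = "İstanbul Information System"
span = find_original_span(text, "information system")
```

With the span mapped back, `text[span[0]:span[1]]` gives the matched substring of the original text even though folding "İ" lengthens the string.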

@WojciechMula
Owner

Thank you @pombredanne for the explanation. I always thought that Unicode is just for assigning the numbers to characters. :)

@pombredanne
Collaborator

So, to @WojciechMula's point, I would rather keep this in a separate add-on and not in the C core. @zhu we now support bytes and unicode, so there should be enough to do something with all the details gathered here. Would you be willing to help craft this?
