Is case-insensitive string match supported? #132
Comments
To support case-insensitive matching, you should insert lowercase words into the trie and then lowercase your input before you search for it.
Thanks for the suggestion! This could be one solution.
There's no case-insensitive search option and, TBH, I'd rather not add one to the core library. What I could suggest as an enhancement is an extra Python wrapper class that keeps the same interface but does the lowercase conversion underneath.
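A minimal sketch of what such a wrapper might look like. It is not part of pyahocorasick; `CaseInsensitiveMatcher` and `NaiveBackend` are hypothetical names. The wrapper only assumes the wrapped object exposes `add_word(key, value)`, `make_automaton()` and `iter(text)` yielding `(end_index, value)` pairs, as `ahocorasick.Automaton` does. `NaiveBackend` is a slow stand-in so the example runs without the library installed.

```python
class NaiveBackend:
    """Hypothetical stand-in backend so the sketch runs without pyahocorasick."""

    def __init__(self):
        self._words = {}

    def add_word(self, key, value):
        self._words[key] = value
        return True

    def make_automaton(self):
        pass  # nothing to build for the naive scan

    def iter(self, text):
        # Yield (end_index, value) for every stored key ending at end_index.
        for end in range(len(text)):
            for key, value in self._words.items():
                start = end - len(key) + 1
                if start >= 0 and text[start:end + 1] == key:
                    yield end, value


class CaseInsensitiveMatcher:
    """Lowercases keys on insert and text on search, then delegates."""

    def __init__(self, backend):
        self._backend = backend

    def add_word(self, key, value=None):
        # The lowercase form is the key; the original spelling is the default value.
        return self._backend.add_word(key.lower(), key if value is None else value)

    def make_automaton(self):
        return self._backend.make_automaton()

    def iter(self, text):
        lowered = text.lower()
        # Assumption: refuse input whose length changes when lowercased,
        # because the reported end offsets would no longer match `text`.
        if len(lowered) != len(text):
            raise ValueError("lowercasing changed the text length")
        return self._backend.iter(lowered)


matcher = CaseInsensitiveMatcher(NaiveBackend())
matcher.add_word("information system")
matcher.make_automaton()
hits = list(matcher.iter("An INFORMATION SYSTEM and an Information System"))
```

With the real library, one would presumably pass `ahocorasick.Automaton()` in place of `NaiveBackend()`; the end offsets then refer to positions in the original text because lowercasing did not change its length.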
IMHO case-insensitive matching is something best done outside the library (that's the way I do it).
https://github.com/nppoly/cyac#unicode
@zhu Thanks, that's important. So a safe way would be to use UCS-4 (4 bytes per code point), which is not very memory-friendly.
I think UCS-4 cannot solve it.
@zhu Thanks a lot. That is more than strange. So there is no perfect solution.
I actually also experienced this same rare issue (not specific to pyahocorasick):
This is actually a well-known problem, known as "The Turkish-I Problem". See:
The way I solved it on my side is to check whether the length changes when folding case and, if it does, do something special. This slows everything down, of course, even more so in my case because I split on punctuation before I lowercase (which makes me think I could simplify things by lowercasing after the split...). See also https://bugs.python.org/issue34723#msg359514 and in particular these messages:
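A minimal sketch of the length check described above, not the commenter's actual code. `lower_preserving_length` is a hypothetical helper: it lowercases the whole string when that keeps the length unchanged, and otherwise falls back to lowercasing character by character, leaving untouched any character whose lowercase form is longer than one code point (such as `İ`, U+0130).

```python
def lower_preserving_length(text):
    """Lowercase `text` without ever changing its length."""
    lowered = text.lower()
    if len(lowered) == len(text):
        # Common case: lowercasing kept a one-to-one character mapping.
        return lowered
    # Special case: some character expands when lowercased.
    # Keep such characters as they are so offsets stay aligned.
    pieces = []
    for ch in text:
        low = ch.lower()
        pieces.append(low if len(low) == 1 else ch)
    return "".join(pieces)


turkish = "İstanbul"
result = lower_preserving_length(turkish)
```

The trade-off is that an expanding character is left in its original case, so a lowercase key will not match it; the benefit is that match offsets in the lowered text remain valid in the original text.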
IMHO we should wait for Python 3.10.
Note that the rationale for my suggestion to wait is that Python is implementing Unicode, and this is a bug in Unicode that is fixed in Unicode 14+ by introducing a new lowercase dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14. It could be fixed here too, of course, but that would only hold if you assume that we are processing Unicode strings, which may not always be true.
German BTW: Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. Unicode normalization may also change the text length, so it has similar issues. It would be better to find a library that can keep track of character offsets when normalizing and case-folding text.
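As a rough illustration of offset tracking, here is a sketch that folds and normalizes one character at a time and records, for each output character, the index of the input character it came from. `fold_with_offsets` is a hypothetical helper, not an existing library function. Because it works per character, it cannot compose sequences that span several characters (for example a base letter followed by a combining mark), so it is only an approximation of full normalization.

```python
import unicodedata


def fold_with_offsets(text):
    """Return (folded, offsets); offsets[i] is the index in `text`
    of the character that produced folded[i]."""
    pieces = []
    offsets = []
    for index, ch in enumerate(text):
        # Case-fold, then apply NFKC to the folded piece.
        piece = unicodedata.normalize("NFKC", ch.casefold())
        pieces.append(piece)
        # One input character may expand to several output characters.
        offsets.extend([index] * len(piece))
    return "".join(pieces), offsets


folded, offsets = fold_with_offsets("Straße")
```

A match ending at index `e` in `folded` can then be mapped back to `offsets[e]` in the original text.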
Thank you @pombredanne for the explanation. I always thought that Unicode was just for assigning numbers to characters. :)
So, to @WojciechMula's point, I would rather keep this in a separate add-on and not in the C core. @zhu we now support bytes and unicode, so there should be enough to do something with all the details gathered here. Would you be willing to help craft this?
Hi, thanks for the great work! I am wondering whether case-insensitive string matching is supported. For example, when "information system" is in the built trie, can it also match "Information system", "Information System" and "INFORMATION SYSTEM"?