Is case-insensitive string match supported? #132

twang18 · 2020-11-05T08:50:53Z

Hi, thanks for the great work! I am wondering if case-insensitive string match is supported. For example, when there is "information system" in the built Trie, and it can match "Information system", "Information System" and "INFORMATION SYSTEM" as well?

Dobatymo · 2020-11-06T02:19:44Z

To support case-insensitive matching you should insert lowercase words into the trie and then lowercase your input before you search for it.

twang18 · 2020-11-06T02:45:24Z

Thanks for the suggestion! This could be one solution.

WojciechMula · 2020-12-23T11:10:40Z

There's no case insensitive search option and TBH I don't like to add this to the core library. What I may suggest as an enhancement to the library is to provide an extra python wrapper class that would keep the interface, but do lowercase conversion underneath.

pombredanne · 2021-01-31T09:06:34Z

IMHO case insensitive is something best done outside (that's the way I do this)

zhu · 2021-03-24T08:19:04Z

https://github.com/nppoly/cyac#unicode
Wrong offset may returns when convert to lowercase outside.

WojciechMula · 2021-03-24T15:57:07Z

@zhu Thanks, that's important. So, a safe way would be to use UCS-32 (4 bytes per code point). Which is not very memory-usage friendly.

zhu · 2021-03-26T01:34:00Z

@zhu Thanks, that's important. So, a safe way would be to use UCS-32 (4 bytes per code point). Which is not very memory-usage friendly.

I think UCS-32 cannot solve it. u"İ".lower() generate 2 unicode code point. And u"İ".lower().upper() != u"İ".
There's some special casing:
https://stackoverflow.com/questions/23524231/lower-case-of-turkish-character-dotted-i
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

casing incurs a change in string length or is dependent on context or locale

WojciechMula · 2021-03-27T19:03:50Z

@zhu Thank a lot. It is more than strange. So there is no perfect solution.

pombredanne · 2021-03-28T18:02:59Z

I actually also experienced this same rare issue (not specific to pyahocorasick):
https://github.com/nexB/scancode-toolkit/blob/f3ef5b4ad823577e507d673a7fbc65d5efe4f6af/src/licensedcode/match.py#L1432

NOTE: we have a rare Unicode bug/issue because of some
Unicode codepoint such as some Turkish characters that
decompose char + punct when casefolded.
See nexB/scancode-toolkit#1872
See also: https://bugs.python.org/issue34723#msg359514

This is eventually a well known problem known as "The Turkish-I Problem" See:

The way I solved it on my side is to check the length does not change when folding case, and if so do something special. This slows down everything of course. Even more so in my case because I split on punctuation before I lowercase (which makes me think that I could simplify my case by lower after my split ...)

See also https://bugs.python.org/issue34723#msg359514 and in particular these messages:

I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

Here is the document

We don't update the unicodedata database in patch releases because updates are backwards incompatible. Python 3.9 will ship with 13.0. Python 3.10 is going to ship with 14.0.

IMHO we should wait for Python 3.10

pombredanne · 2021-03-29T08:18:31Z

Note that the rationale for my suggestion to wait is that Python is implementing Unicode and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lower case dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14

It could be fixed here too of course, but it would only hold if you make an assumption that we are processing unicode strings, which may not be always true.

zhu · 2021-03-30T01:24:49Z

Note that the rationale for my suggestion to wait is that Python is implementing Unicode and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lower case dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14

It could be fixed here too of course, but it would only hold if you make an assumption that we are processing unicode strings, which may not be always true.

German 'ß'.lower() == 'ß', 'ß'.upper() == 'SS', 'ẞ'.lower() == 'ß', 'ß'.upper().lower() == 'ss', 'ß'.casefold() == 'ss'. http://www.personal.psu.edu/ejp10/blogs/gotunicode/2008/07/a-new-german-unicode-letter-ca.html
Seems the scancode-toolkit cannot handle this case.

BTW: Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. Unicode normalization also may change the text length. It has similar issues.

It'd better find a library which can remember the char offset when normalizing and casefolding text.

WojciechMula · 2021-04-07T20:46:21Z

Thank you @pombredanne for the explanation. I always thought that Unicode is just for assigning the numbers to characters. :)

pombredanne · 2023-01-14T15:58:26Z

So to the point of @WojciechMula I would rather keep this in a separate add-on and not in the C core. @zhu we now support bytes and unicode, so there should enough to do something with all the detailed gathered here. Would you be willing to help to craft this?

WojciechMula added the enhancement label Dec 23, 2020

pombredanne added the help-wanted label Jan 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is case-insensitive string match supported? #132

Is case-insensitive string match supported? #132

twang18 commented Nov 5, 2020

Dobatymo commented Nov 6, 2020

twang18 commented Nov 6, 2020

WojciechMula commented Dec 23, 2020

pombredanne commented Jan 31, 2021

zhu commented Mar 24, 2021

WojciechMula commented Mar 24, 2021

zhu commented Mar 26, 2021

WojciechMula commented Mar 27, 2021

pombredanne commented Mar 28, 2021

pombredanne commented Mar 29, 2021

zhu commented Mar 30, 2021

WojciechMula commented Apr 7, 2021

pombredanne commented Jan 14, 2023

Is case-insensitive string match supported? #132

Is case-insensitive string match supported? #132

Comments

twang18 commented Nov 5, 2020

Dobatymo commented Nov 6, 2020

twang18 commented Nov 6, 2020

WojciechMula commented Dec 23, 2020

pombredanne commented Jan 31, 2021

zhu commented Mar 24, 2021

WojciechMula commented Mar 24, 2021

zhu commented Mar 26, 2021

WojciechMula commented Mar 27, 2021

pombredanne commented Mar 28, 2021

pombredanne commented Mar 29, 2021

zhu commented Mar 30, 2021

WojciechMula commented Apr 7, 2021

pombredanne commented Jan 14, 2023