
support for equivalence classes #404

Closed
BurntSushi opened this issue Oct 7, 2017 · 8 comments

@BurntSushi (Member)

See "character equivalents" here: https://www.regular-expressions.info/posixbrackets.html#eq

The major change this seems to require (aside from actually determining character equivalents) is introducing the concept of a locale into the regex engine. I'm not sure that's a good idea, but this issue should at least be investigated at some point.

@albfan commented Nov 5, 2017

I've been researching grep's implementation. It uses the libc regex engine for this.

Basically, it looks up the equivalence table defined by the locale:

link to code in glibc

link to doc

order_start  forward;backward
UNDEFINED    IGNORE;IGNORE
<LOW>
<space>      <LOW>;<space>
...          <LOW>;...
<a>          <a>;<a>
<a-acute>    <a>;<a-acute>
<a-grave>    <a>;<a-grave>
<A>          <a>;<A>
<A-acute>    <a>;<A-acute>
<A-grave>    <a>;<A-grave>
<ch>         <ch>;<ch>
<Ch>         <ch>;<Ch>
<s>          <s>;<s>
<eszet>      "<s><s>";"<eszet><eszet>"
order_end

The locale knows which characters are equivalent to which others; the table above shows the equivalents of a.
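
To make that concrete, here is a minimal Rust sketch of what expanding an equivalence class against such a table could look like. The table contents are hard-coded assumptions for illustration (loosely following the excerpt above); a real implementation would have to read them from the locale's collation data:

use std::collections::HashMap;

// Hypothetical equivalence table, hard-coded for illustration only.
// Each character maps to every character sharing its primary weight.
fn equivalence_table() -> HashMap<char, Vec<char>> {
    let mut table = HashMap::new();
    let a_class = vec!['a', 'á', 'à', 'A', 'Á', 'À'];
    for &c in &a_class {
        table.insert(c, a_class.clone());
    }
    table
}

// Expand [[=a=]] into an explicit character class, e.g. [aáàAÁÀ].
fn expand_equivalence_class(c: char) -> String {
    match equivalence_table().get(&c) {
        Some(class) => format!("[{}]", class.iter().collect::<String>()),
        None => format!("[{}]", c), // no known equivalents: match c alone
    }
}

fn main() {
    assert_eq!(expand_equivalence_class('a'), "[aáàAÁÀ]");
}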

I designed a simple test case:

$ cat >testfile <<EOF
bar
camión
bazz
foo
EOF
$ grep '[[=o=]]' testfile

which should result in:

camión
foo

Trying to debug it:

$ gdb --args grep '[[=o=]]' testfile

I can't get anywhere: libc has to be compiled with optimizations, so I can't inspect a single variable.

If there's a way to retrieve the result of that findidx call from Rust, then the rest is just detecting "[=" and "=]" in bracket expressions. I'll try to find where the Rust regex engine parses bracket expressions and propose a WIP for this.
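
For what it's worth, here is a rough sketch of what that pre-pass over the pattern string could look like, with a hypothetical expand function standing in for the locale lookup. It only handles a bracket expression that consists of nothing but the equivalence class, i.e. [[=o=]], not [x[=o=]]:

// Hypothetical expansion; a real version would query the locale.
fn expand(inner: &str) -> String {
    match inner {
        "o" => "[oóOÓ]".to_string(),
        "a" => "[aáAÁ]".to_string(),
        other => format!("[{}]", other),
    }
}

// Rewrite every [[=c=]] in the pattern into an explicit class before
// handing the pattern to the regex engine.
fn rewrite_equivalence_classes(pattern: &str) -> String {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(start) = rest.find("[[=") {
        out.push_str(&rest[..start]);
        match rest[start..].find("=]]") {
            Some(end) => {
                out.push_str(&expand(&rest[start + 3..start + end]));
                rest = &rest[start + end + 3..];
            }
            None => {
                // Unterminated class: pass the rest through untouched.
                out.push_str(&rest[start..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    assert_eq!(rewrite_equivalence_classes("gr[[=o=]]p"), "gr[oóOÓ]p");
}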

@albfan commented Nov 5, 2017

I'm a little lost on Rust's locale story, so rust-locale seems like a decent choice to me (sorry if I'm wrong).

I think all the pieces are in place, but I don't have the skills to write bindings that call findidx in libc, or to understand what libc does with that info (here, the values for es_ES):

_NL_COLLATE_TABLEMB = 2207821300
_NL_COLLATE_WEIGHTMB = 2207822324
_NL_COLLATE_EXTRAMB = 2208009096
_NL_COLLATE_INDIRECTMB = 2208041452

to retrieve the characters equivalent to a given character:

findidx (table, indirect, extra, &cp, -1);

So for 'o' in Spanish it would start with:

findidx (2207821300, 2208041452, 2208009096, &'o', -1);
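
One caveat: findidx is internal to glibc and not an exported symbol, so it can't be bound directly. The portable collation entry points can be reached from Rust via the libc crate, though. A minimal sketch, assuming the es_ES.UTF-8 locale is installed; strxfrm encodes all weight levels, so this only shows the FFI plumbing, not the primary-weight extraction that equivalence classes actually need:

use std::ffi::CString;

// Compute the collation sort key for `s` under the current LC_COLLATE.
fn collation_key(s: &str) -> Vec<u8> {
    let c = CString::new(s).unwrap();
    unsafe {
        // First call with a zero-length buffer to get the required size.
        let n = libc::strxfrm(std::ptr::null_mut(), c.as_ptr(), 0);
        let mut buf = vec![0u8; n + 1];
        libc::strxfrm(buf.as_mut_ptr() as *mut libc::c_char, c.as_ptr(), n + 1);
        buf.truncate(n);
        buf
    }
}

fn main() {
    let loc = CString::new("es_ES.UTF-8").unwrap();
    unsafe {
        libc::setlocale(libc::LC_COLLATE, loc.as_ptr());
    }
    // 'o' and 'ó' share a primary weight, so their keys share a prefix.
    println!("{:?}", collation_key("o"));
    println!("{:?}", collation_key("ó"));
}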

@BurntSushi (Member, Author)

@albfan Thanks for doing this research! I don't have time to dig in myself at the moment, but I'd like to say a couple of things:

  1. I'm hopeful that we can come up with an implementation plan before doing the work in a PR.
  2. How do other regex engines handle this?
  3. I'd prefer not to depend on any new third party crates with respect to locales. Rust doesn't really have a locale story yet, so we shouldn't try to pick a winner at this point.

@albfan commented Nov 5, 2017

libc's regex relies on locale data; many other regex engines have no such concept.

I think the suggestion on rust-locale is correct:

rust-locale/rust-locale#26 (comment)

Relying on Unicode instead, there's the concept of normalization forms for characters.

When searching for a character, all canonically equivalent combining sequences in the searched text should match it. That's a good design, because it doesn't rely on predefined locale metadata, but I guess it could make for a slow implementation in rg.
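
A minimal sketch of that idea using the unicode-normalization crate. The accent-stripping here is a simplification of full canonical equivalence, and nothing like locale-aware collation:

use unicode_normalization::char::is_combining_mark;
use unicode_normalization::UnicodeNormalization;

// Decompose to NFD and drop combining marks, so 'ó' and 'o' compare equal.
fn base_chars(s: &str) -> String {
    s.nfd().filter(|c| !is_combining_mark(*c)).collect()
}

fn main() {
    assert_eq!(base_chars("camión"), "camion");
    // A line would match an [[=o=]]-style search for 'o' if its base form does:
    assert!(base_chars("camión").contains('o'));
}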

@BurntSushi (Member, Author)

Yeah, the regex engine isn't getting support for normalization, sorry. In particular, the regex engine is normalization agnostic: if you care about it, then you need to normalize the regex and the searched text consistently.

That means that if ripgrep wants to care about normalization, then it has to normalize the regex and every file before it searches.
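
For a plain literal pattern, that could be as small as this sketch (using the regex and unicode-normalization crates; general patterns are much harder, as discussed below):

use regex::Regex;
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Normalize pattern and haystack to the same form (NFD) before matching.
    let pattern: String = "camión".nfd().collect();
    let haystack: String = "un camión rojo".nfd().collect();
    let re = Regex::new(&regex::escape(&pattern)).unwrap();
    assert!(re.is_match(&haystack));
}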

@albfan commented Nov 5, 2017

Yes. I think I can fork ripgrep, provide an implementation that does that, and see how much it slows down searches (without any intention of turning it into a PR).

As for regex, again, forking for my special needs could be a decent solution: preprocessing all the characters for a(2) e(2) i(2) o(2) u(3) n(2) A(2) E(2) I(2) O(2) U(3) = 24 equivalences is worthwhile at that point.

I see all this as refining my own toolset, though, not as a generally useful contribution.

I suggest closing this issue.

@tmccombs commented

@BurntSushi wrote: "That means that if ripgrep wants to care about normalization, then it has to normalize the regex and every file before it searches."

Normalizing the file is (relatively) straightforward. However, normalizing the regex, in general, can be very complicated.

My main concern is with canonical equivalence. In many cases, a single code point is equivalent to a sequence of two or more code points (generally as part of a single grapheme). Because of the semantics of regexes, a simple normalization of the string representation is insufficient: depending on the position of a character, composing or decomposing it can significantly change the meaning of the regex.

As a concrete example, consider the regex "[a\u00e0-\u00e5]" which is in NFC form (composed). If this needs to be normalized to NFD (decomposed), the ideal representation would be something like: a[\u0300-\u0303\u0308\u030a]?. And if the original regex were instead [a\u00e0-\u00e6], then it would need to be normalized to (?:a[\u0300-\u0303\u0308\u030a]?|\u00e6).

A more naive implementation could, of course, iterate over every code point in the range, and generate an alternation of all the normalized results. But that would perform poorly, both for normalizing the regex, and the search.
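
To see the difference concretely, here is a small check with the regex and unicode-normalization crates, using the class from the example above:

use regex::Regex;
use unicode_normalization::UnicodeNormalization;

fn main() {
    let nfd_text: String = "à".nfd().collect(); // "a\u{0300}"
    // The NFC class only finds the bare 'a' and misses the accent...
    let nfc_re = Regex::new(r"[a\u{00e0}-\u{00e5}]").unwrap();
    assert_eq!(nfc_re.find(&nfd_text).unwrap().as_str(), "a");
    // ...while the decomposed pattern matches the full sequence.
    let nfd_re = Regex::new(r"a[\u{0300}-\u{0303}\u{0308}\u{030a}]?").unwrap();
    assert_eq!(nfd_re.find(&nfd_text).unwrap().as_str(), nfd_text);
}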

Normalizing to NFC is even more problematic. For example, consider something like (a|u)\u0308?. That could roughly be converted to ([aäuü]), and maybe that is what the user wants, but maybe it isn't.

Maybe it doesn't make sense for normalization or equivalence to be part of the regex crate, and there should be a separate crate to handle this. However, normalizing the regex and the input is not a trivial workaround.

@BurntSushi (Member, Author)

@tmccombs Indeed! The "Canonical Equivalents" section of UTS#18 elaborates on this (and even brings up some of the same issues that you do). Their suggested workaround is pretty straightforward, though:

  1. Putting the text to be matched into a defined normalization form (NFD or NFKD).
  2. Having the user design the regular expression pattern to match against that defined normalization form. For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.
  3. Applying the matching algorithm on a code point by code point basis, as usual.

In other words, instead of trying to devise a general algorithm for turning a regex into a particular normalization form, you have to ask the end user to do it themselves. That is, you ask the user to be aware of normalization forms and craft their regex appropriately. Yes, I agree it isn't an ideal user experience, but I think it's the best that can probably be done in this space until someone comes up with a better idea. (And I wouldn't hold my breath on that front.)
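
In miniature, a sketch under those rules: the tool normalizes the haystack to NFD, and the pattern is assumed to have been written by the user in NFD terms already:

use regex::Regex;
use unicode_normalization::UnicodeNormalization;

fn main() {
    // User-authored pattern, already in NFD: base letter plus combining grave.
    let re = Regex::new(r"a\u{0300}").unwrap();
    // The tool's job is only to put the searched text into NFD.
    let haystack: String = "voilà".nfd().collect();
    assert!(re.is_match(&haystack));
}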

There is also a middle ground where you might be able to fix 80% of the cases. That is, you put all literal strings that occur in a regex into NFD form. We don't always need to focus on 100% correctness. For example, this approach would seemingly fix sharkdp/fd#1535.
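
A sketch of that middle ground, with a deliberately rough metacharacter check standing in for real pattern analysis: anything that looks like plain literal text gets NFD-normalized, and everything else passes through untouched:

use regex::Regex;
use unicode_normalization::UnicodeNormalization;

// Rough approximation: treat the pattern as a literal only if it has no
// regex metacharacters at all.
fn normalize_if_literal(pattern: &str) -> String {
    let is_literal = !pattern.chars().any(|c| r"\.+*?()|[]{}^$".contains(c));
    if is_literal {
        pattern.nfd().collect()
    } else {
        pattern.to_string()
    }
}

fn main() {
    let haystack: String = "café".nfd().collect(); // file contents in NFD
    let re = Regex::new(&normalize_if_literal("café")).unwrap(); // NFC input
    assert!(re.is_match(&haystack));
}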
