Please support locale-specific/unicode matching with `--basic-regexp` #234

rrthomas · 2022-09-19T12:58:28Z

GNU grep, for example, matches accented letters with [:alpha:] in UTF-8 locales.

Yes, I could either use GNU grep, or use -P to match \p{L} instead, but it would be nice to be able to use "standard" patterns with ugrep -G and not have to worry!

I also appreciate that ugrep currently documents the POSIX character classes as being ASCII-only, so it might be necessary to add a flag to support locale-sensitive POSIX regexes. For my use case that would be fine: I expect to have to use a shell alias or similar to "configure" ugrep to work the way I want.


genivia-inc · 2022-09-19T13:47:33Z

Thanks for the feedback! I agree with your assessment. ugrep (based on RE/flex) started out as a more developer-centric tool that always uses the "C" locale. I will revisit the design choices to see what can be added, but this will have to wait a bit.

genivia-inc · 2024-05-24T02:13:20Z

> Yes, I could either use GNU grep, or use -P to match \p{L} instead, but it would be nice to be able to use "standard" patterns with ugrep -G and not have to worry!

You don't need to use -P for Unicode pattern matching: \p{L} and \w match Unicode by default. They are limited to ASCII only if you use -U to disable Unicode.
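As an analogy only, the difference between Unicode-aware and ASCII-only `\w` can be seen in Python's `re` module. This is not ugrep's engine, and the variable names are made up for illustration; the `re.ASCII` flag here plays a role loosely comparable to what -U is described as doing above.

```python
import re
import unicodedata

# Normalize to NFC so accented letters are single precomposed code points.
text = unicodedata.normalize(
    "NFC", "Voix ambiguë d’un cœur qui au zéphyr préfère les jattes de kiwi."
)

# Python's \w is Unicode-aware for str patterns by default.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

With the ASCII-only flag, a word such as "zéphyr" is broken into fragments around the accented letter, which is the kind of surprise the original request is about.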

The [:alpha:] character class and the other forms of this are kept non-Unicode for legacy compatibility reasons.

In a future update I want to update option -w to perform Unicode word boundary matching and also make the word boundaries \b \< \> Unicode. They are Unicode with option -P, but not without at the moment.

rrthomas · 2024-05-24T10:15:21Z

Thanks for the hint! Of course, the attraction of using [:alpha:] and similar is that they work with non-PCRE-supporting tools.

genivia-inc · 2024-05-29T14:49:46Z

I will take a closer look at [:alpha:] and friends to make them use the current locale.

In the meantime, I've updated my dev version to implement option -w (--word-regexp) and boundaries \b, \B, \< and \> to obey Unicode (not locale specific). This updated -w also runs faster by up to 15% depending on the match frequency.

genivia-inc · 2024-05-30T19:39:30Z

Testing GNU grep 3.11 matching with \w and [[:alnum:]] with and without -P produces different results for these inputs:

Voix ambiguë d’un cœur qui au zéphyr préfère les jattes de kiwi.
Ταχίστη αλώπηξ βαφής ψημένη γη, δρασκελίζει υπέρ νωθρού κυνός.

On macOS, ggrep -P does not match any accented characters or Greek letters (it does match them without this option). Strangely, on Debian Linux -P does match accented characters with the French locale, which is the other way round. Also on Linux, Greek is not matched at all, even when the Greek locale is explicitly set on the system, but it is matched with -P.

GNU grep 3.11 matches nothing on macOS or Linux for Chinese characters, with either [[:alnum:]] or \w, and with or without -P:

敏捷的棕色狐狸跨过懒狗

By contrast, with ugrep I made sure that \w matches the entire Unicode word-like character set, which is a huge set of about 136,000 characters to support all languages.

ripgrep and the silver searcher also do not match any accented characters in the French text with the pattern [[:alpha:]], even when the locale is explicitly changed to French. Chinese characters are not matched with [[:alpha:]] by these tools either. So what are [[:alpha:]] and [[:alnum:]] matching, precisely?

For \w it's different too. The silver searcher does not match any accented characters or Chinese characters with \w, but ripgrep (rg) does.

We could make [[:alpha:]] match \p{L}, which is also a very large set, or make it locale-specific like GNU grep, but the latter is surprisingly unintuitive. Strangely, I can't find details on what [[:alpha:]] is supposed to match precisely. Maybe I'm overlooking something?

EDIT: the results differ between macOS and Linux.

genivia-inc · 2024-06-01T16:38:27Z

So it seems that [[:alpha:]] should match [\p{Lu}\p{Ll}], i.e. lower and upper case letters only, judging from some experiments. With ugrep this can be specified as [\u\l], where \u is \p{Lu} and \l is \p{Ll} (these short escape forms are not standard). Perhaps [[:alnum:]] should match [\p{Lu}\p{Ll}\p{Nd}].

This looks good to me to move forward. Correct me if I am wrong with these:

[[:alpha:]] matches Unicode Ll or Lu
[[:alnum:]] matches Unicode Ll or Lu or Nd
[[:punct:]] matches Unicode P (punctuation)
[[:print:]] matches any Unicode character except control C (Cc + Cf)
[[:graph:]] matches [[:print:]] except space Zs
[[:space:]] matches Unicode Zs
[[:digit:]] matches Unicode Nd
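The proposed mapping can be tried out with Python's `unicodedata` module. This is only a sketch of the list above, not ugrep's implementation; the function name `posix_classes` is hypothetical, and "print" is taken literally as everything outside Cc and Cf.

```python
import unicodedata


def posix_classes(ch):
    """Return the POSIX class names a character falls into under the
    proposed Unicode mapping (hypothetical helper, for illustration)."""
    cat = unicodedata.category(ch)  # two-letter general category, e.g. "Ll"
    classes = set()
    if cat in ("Lu", "Ll"):
        classes |= {"alpha", "alnum"}
    if cat == "Nd":
        classes |= {"digit", "alnum"}
    if cat.startswith("P"):  # Pc, Pd, Ps, Pe, Pi, Pf, Po
        classes.add("punct")
    if cat == "Zs":
        classes.add("space")
    if cat not in ("Cc", "Cf"):
        classes.add("print")
        if cat != "Zs":
            classes.add("graph")
    return classes


for ch in ["é", "Τ", "敏", "’", "5", " "]:
    print(repr(ch), unicodedata.category(ch), sorted(posix_classes(ch)))
```

One consequence worth noting: CJK ideographs have general category Lo, so under this mapping they fall into "print" and "graph" but not "alpha" or "alnum".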

Note that ugrep does not match newlines as part of these classes, such as [[:space:]]. To match a newline \n, we must also specify \n explicitly, like [[:space:]\n] or [\s\n], which is the same. This is for compatibility with GNU/BSD grep, which never match across multiple lines, i.e. to avoid getting unintentional multiline matches.
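The reason for keeping newlines out of a whitespace class can be shown with Python's `re`, again purely as an analogy. Python's own `\s` does include `\n`, so the sketch below builds a newline-free class by hand; the pattern names are made up for illustration.

```python
import re

sample = "a \nb"  # "a" and "b" sit on two different lines

# Python's \s includes the newline, so this pattern spans two lines.
multiline_prone = re.compile(r"a\s+b")

# [^\S\n] means "whitespace, but not newline": matches stay on one line.
line_safe = re.compile(r"a[^\S\n]+b")

# Newlines are matched only when asked for explicitly.
opt_in = re.compile(r"a(?:[^\S\n]|\n)+b")

print(bool(multiline_prone.search(sample)))
print(bool(line_safe.search(sample)))
print(bool(opt_in.search(sample)))
```

The second pattern behaves the way a line-oriented grep user would expect, and the third shows the explicit opt-in, comparable in spirit to writing [[:space:]\n].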

rrthomas · 2024-06-03T17:29:52Z

Looks good, thanks for your work on this!

genivia-inc added the enhancement New feature or request label Sep 19, 2022

genivia-inc self-assigned this Dec 20, 2023

genivia-inc closed this as completed in 37f866f Jun 3, 2024
