Non-syntactic way to match from the beginning of text, or "implicit \A". #974

Closed
iago-lito opened this issue Apr 9, 2023 · 2 comments

@iago-lito

iago-lito commented Apr 9, 2023

From the first lines in the docs, I have learned this:

In this crate, every expression is executed with an implicit .*? at the beginning and end, which allows it to match anywhere in the text.

I have always been happy with this so far, and I am accustomed to using \A to anchor my searches to the beginning of input and elide the implicit .*?.

Today, I am designing a regex-based, general purpose "lexing/tokenizing" API supposed to help users consume input bit by bit from left to right with their own regexes. Users can write their own patterns, but then the semantics of these patterns differ from the above semantics: When a user writes "a+b*c" within this context, what they mean is r"\Aa+b*c" and not ".*?a+b*c".

To work around this, I have considered then dismissed a few options:

  • ⛈️ Document that patterns passed through my API must always start with \A.
    • This is noisy and easy to forget. Bugs happening downstream when the user does forget are rather confusing.
  • 🌧️ Interrupt the user if they forget \A.
    • The user won't get the bugs, but I need to run additional checks under the hood and the noise in the API remains.
  • ☁️ Prepend \A to all the patterns handed out by the user.
    • The API would be fixed so it's a good way to move forward, but again:
      • I would need to copy all received patterns into owned Strings so as to prepend \A.
      • The user cannot provide pre-compiled regexes anymore, unless I recompile them with constructs like Regex::new(&format!(r"\A{}", their_regex.as_str())).

As a consequence, and unless I am missing an obvious way to work around this, I am suggesting that "implicit \A prefix" vs. "implicit .*? prefix" could be something configurable within the regex crate itself. This could take any of several forms, sorted here from the most comfortable/flexible to the least, though I guess the best fit would eventually depend on the actual implementation of regex:

  • 🌞 Extend the Regex interface with e.g. Regex::find_at_start(haystack) in addition to Regex::find(haystack). The former would match from the beginning of the input only, in a way similar to Python's re.match / re.search duo. The same would go for Regex::is_match_at_start(haystack), Regex::captures_at_start(haystack) etc.

  • ☁️ Extend the RegexBuilder interface with e.g. RegexBuilder::matches_from_start(bool). The semantics of RegexBuilder::new("pattern").matches_from_start(false).build() would be today's default, while RegexBuilder::new("pattern").matches_from_start(true).build() would introduce implicit \A instead of implicit .*?.

  • 🌨️ Extend control over the "implicit .*?" semantics at the crate level with a crate feature, so that in my cargo [dependencies], I would rather write regex = {version = "1.7", features = ["implicit-start-anchor"]} instead of the current default.

Of course, if this is something the regex crate would be willing to feature, then I assume the same issue could be addressed for \z in addition to \A.

@BurntSushi
Member

All righty, I think I'm going to try and respond to this from the bottom up,
because I think it might help make my position a little clearer.

Extend the Regex interface with e.g. Regex::find_at_start(haystack)
in addition to Regex::find(haystack). The former would match from the
beginning of the input only, in a way similar to Python's re.match /
re.search duo. The same would go for Regex::is_match_at_start(haystack),
Regex::captures_at_start(haystack) etc.

So I really do not want to do this because it makes the API surface bigger.
I don't think it just ends there. There are also the find_at routines. And
the iteration routines. And the split routines. And the replace routines.
All of those would likely benefit from anchored routines too. It might not
be immediately obvious why, but supporting anchored searches is useful in an
iteration context because it provides you with a guarantee that all matches
found are adjacent.

One could say, "no we should just have is_match/find/captures and that's
it." But it's going to be easy for folks to say, "well given that they exist
for those routines and I have a use case for needing it with the other APIs,
then why not add them?"

Finally, I'd also say that I specifically did not like how Python set up their
API. I always get mixed up between whether search or match is the
unanchored API. Granted, I suppose that's more of a naming issue. But I did
things the way I did to make it uniform: it's always unanchored and if you want
anchored, then you need to put a ^ in there.

Extend the RegexBuilder interface with e.g.
RegexBuilder::matches_from_start(bool). The semantics of
RegexBuilder::new("pattern").matches_from_start(false).build()
would be today's default, while
RegexBuilder::new("pattern").matches_from_start(true).build() would
introduce implicit \A instead of implicit .*?.

This is probably the path I would be least unhappy with, but I don't really
see much benefit from it compared to just doing format!("^(?:{})", pattern).
You mentioned why just adding ^ to the pattern string doesn't work as well
for you (and I'll get to that later), but don't mention why a RegexBuilder
option does work for you.

Extend control over the "implicit .*?" semantics at the crate level with
a crate feature, so that in my cargo [dependencies], I would rather write
regex = {version = "1.7", features = ["implicit-start-anchor"]} instead of the
current default.

This can absolutely never happen. It is absolutely critical that crate features
not change the semantics of regexes. Crate features are allowed to change
the set of patterns that are valid, but they cannot change how a valid pattern
behaves. Imagine, for example, if someone depends on regex without the
implicit-start-anchor feature and then they bring in your library that
enables it. Now their use of regex is impacted as well. Disaster ensues.

From the first lines in the docs, I have learned this:

In this crate, every expression is executed with an implicit .*? at the
beginning and end, which allows it to match anywhere in the text.

I have always been happy with this so far, and I am accustomed to using \A to
anchor my searches to the beginning of input and elide the implicit .*?.

For some added context, it is worth mentioning that "anchored search" and \A
are orthogonal things. Their differences cannot be observed with Regex::find,
but can be observed with Regex::find_at. For example, if you have the regex
^abc and the haystack fooabc, then regex.find_at(haystack, 3) will
actually not return any matches while regex.find(&haystack[3..]) will return
a match. Why? Because ^ only matches at position 0.

Today, I am designing a regex-based, general purpose "lexing/tokenizing" API
supposed to help users consume input bit by bit from left to right with their
own regexes. Users can write their own patterns, but then the semantics of
these patterns differ from the above semantics: When a user writes "a+b*c"
within this context, what they mean is r"\Aa+b*c" and not ".*?a+b*c".

To work around this, I have considered then dismissed a few options:

  • Document that patterns passed through my API must always start with \A.
    • This is noisy and easy to forget. Bugs happening downstream when the user does forget are rather confusing.
  • Interrupt the user if they forget \A.
    • The user won't get the bugs, but I need to run additional checks under the hood and the noise in the API remains.
  • Prepend \A to all the patterns handed out by the user.
    • The API would be fixed so it's a good way to move forward, but again:
      • I would need to copy all received patterns into owned Strings so as to prepend \A.
      • The user cannot provide pre-compiled regexes anymore, unless I recompile them with constructs like Regex::new(&format!(r"\A{}", their_regex.as_str())).

I don't really get why "Prepend \A" is being dismissed here. The downside of
having to copy all received patterns should be a non-issue. You don't say why
you don't want to do it, but I assume it's because you think there's a cost to
doing so? There is technically an absolute cost, but it's absolutely dwarfed by
regex compilation itself, which does a lot more than just copy the patterns. It
shouldn't be something you can observe in any meaningful benchmark.

Now the second reason against it is interesting, and having to re-compile
regexes is indeed a bummer. But I would argue that if you're letting your users
pass a pre-compiled Regex, then they should absolutely be responsible and
aware of the details of whether their search is unanchored or not. After all,
if they're building a Regex, then they are themselves a consumer of the
regex crate.

Of course, if this is something the regex crate would be willing to
feature, then I assume the same issue could be addressed for \z in addition
to \A.

Usually the $ case is not as important as the ^ case. But, yes, you're
right. And indeed, once #656 lands, you'll be able to do what you want here.
And it will handle both the ^ and $ cases. But you'll have to drop down to
using the regex-automata crate's lower level APIs to do it. From above, it
sounds like you're specifically trying to support the caller passing a Regex,
in which case, this might not work for you.

@iago-lito
Author

Hi @BurntSushi and thanks a lot for your detailed answer, which makes your position very clear indeed :)

I think I do understand all of it. And I am convinced. I have learned about this subtle difference between \A and "anchoring" which I did not know about. I have also been terrified by the things you say "cargo features" would be capable of doing: if true then this is definitely not the way to go here ^^"

In a nutshell, you are pointing me towards #656 again, and I cannot but wave at this very cool initiative. This would solve this particular issue and also so much more. Since you are also reassuring me about the cost of copying pattern strings vs. regex compilation itself, I am now reconsidering the idea of prepending \A to every received pattern :)

Just in case this is of interest to anyone, my current plan to enforce this while still avoiding useless regex recompilation and paving the way towards future leveraging of #656 is to newtype Regex into some sort of RegexAnchoredAtStart. My API would accept these instead of Regexes, and the only way to construct such pre-compiled values would be from strings, which get \A prepended prior to compilation.

So.. I can now move forward. Thank you for your patience :) I am closing this to temper the noise around #656.
