Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support empty alternations #524

Closed
BurntSushi opened this issue Oct 8, 2018 · 14 comments
Closed

support empty alternations #524

BurntSushi opened this issue Oct 8, 2018 · 14 comments

Comments

@BurntSushi
Copy link
Member

Today, if you try to compile a regex with an empty alternation, e.g., a||b, then you'll get this error message:

alternations cannot currently contain empty sub-expressions

When I initially built the regex crate, I don't think I was clear on what an empty alternation meant, so I simply made them illegal. However, an empty alternation should have the same match semantics as an empty regex. That is, a||b should match a, b or the empty string.

When I rewrote the regex-syntax crate, I specifically made sure to support empty alternations, which I believe were forbidden in the previous version of regex-syntax. The intent was to propagate that through the regex compiler. However, when I did that, I discovered that it did not implement the correct match semantics. Fixing it did not seem easy, so I simply made the compiler return an error if it found an empty alternate:

regex/src/compile.rs

Lines 491 to 500 in 488fe56

if prev_entry == self.insts.len() {
// TODO(burntsushi): It is kind of silly that we don't support
// empty-subexpressions in alternates, but it is supremely
// awkward to support them in the existing compiler
// infrastructure. This entire compiler needs to be thrown out
// anyway, so don't feel too bad.
return Err(Error::Syntax(
"alternations cannot currently contain \
empty sub-expressions".to_string()));
}

Part of my plans for the future are to rethink a lot of the regex internals, and the compiler itself is at the top of that list. So I plan to tackle this problem when I rework the compiler.

@zackw
Copy link

zackw commented Oct 8, 2018

As a workaround, I believe a||b can always be rewritten (?:a|b)?.

@oceanlewis
Copy link

oceanlewis commented Feb 8, 2019

I recently ran into this issue making a user agent parser. The nice thing about how I'm trying to go about it is that there's already this community that happens to maintain similar (identical in practice?) libraries that do the same thing and consume the same regex.yaml from uap-core. I would love to create the same for the Rust ecosystem since I have need of such a library myself, but when compiling the regexes inside this regex.yaml I get this error. I don't think this would be helpful for you at all, but this might be a good set of regexes for a test suite? Maybe I can help in some way.

@BurntSushi
Copy link
Member Author

Making the supported syntax be identical to some other arbitrary set of regex engines is not a hard goal of this crate. The are very likely other (possibly very subtle) differences that are being missed too. You will almost certainly need to apply some kind of transform to any pre-existing set of regexes.

In any case, this issue is really just about supporting empty alternations, and nothing more than that. This isn't going to happen until the compiler is reworked, as I said above.

@BurntSushi
Copy link
Member Author

Here's another bug that appears to be fallout from failing to handle empty alternations correctly:

fn main() {
    let re = regex::Regex::new(r"^(?:(j(?:))|([?]))").unwrap();
    let s = "john.doe@www.example.com:123/forum/questions/?tag=networking&order=new%20est#top";
    assert!(re.is_match(s));
    assert!(re.find(s).is_some());
}

The NFA compiler in regex-automata appears to handle this case correctly:

$ regex-automata-debug find '(()/)|([?])' <(echo "foo/bar")
compile time: 359.036µs
      memory: 26624
   read time: 18.051µs
  match time: 617ns
 match count: 1

I want to hopefully soon replace the matching engines in regex with the stuff in regex-automata, so I think that's how this bug will be fixed.

@Alphare
Copy link

Alphare commented Mar 24, 2020

@BurntSushi IIUC regex-automata does not have the same compilation-time guarantees and is not recommended for unknown patterns. I chose this morning to investigate using regex inside Mercurial for a little bit and I am now worried that - while that empty alternation problem would be taken care of - regex would be losing this very useful property. Maybe I don't understand the problem though.

FYI (because you asked for some numbers in case I ever had some) with the latest stable regex, compiling the same regex consistently takes 5ms where Re2 takes 1.5ms on a very stable machine. I would be happy to talk more directly if you are interested.

@BurntSushi
Copy link
Member Author

BurntSushi commented Mar 24, 2020

IIUC regex-automata does not have the same compilation-time guarantees and is not recommended for unknown patterns.

Sure. That's why I'm working on adding proper APIs for all of the regex engines inside this crate in regex-automata. At that point, regex-automata can move into this repo. regex-automata will become an "internal" but published dependency of regex, just like regex-syntax.

The idea that I would force all users of this crate to start compiling all regexes to full DFAs ahead of time is a bit crazy. :-)

FYI (because you asked for some numbers in case I ever had some) with the latest stable regex, compiling the same regex consistently takes 5ms where Re2 takes 1.5ms on a very stable machine. I would be happy to talk more directly if you are interested.

Could you please file an issue with a complete reproduction? Thanks. (The performance difference may be a wontfix bug, but I'm happy to investigate.)

@BurntSushi
Copy link
Member Author

I chose this morning to investigate using regex inside Mercurial for a little bit

Could you elaborate a bit more on this? Why does Mercurial need empty alternation support?

I've generally regarded this as a corner case that is "nice to have." But if this is a singular blocker for using this crate inside of Mercurial, then I could be convinced to just try and fix this so that you don't have to wait for the Great Regex Engine Refactor to be done. (Which could take months or even years, depending on how things go.)

@Alphare
Copy link

Alphare commented Mar 24, 2020

The idea that I would force all users of this crate to start compiling all regexes to full DFAs ahead of time is a bit crazy. :-)

That is also what I thought, thanks for the confirmation!

Could you please file an issue with a complete reproduction? Thanks. (The performance difference may be a wontfix bug, but I'm happy to investigate.)

Sure, I just need to make sure I make all tests pass with regex beforehand to make sure that my escaping isn't broken or something.

Could you elaborate a bit more on this? Why does Mercurial need empty alternation support?

It does not need it per se, but the current prefix used for relative glob patterns in .hgignore is (?:|.*/). So for example, if your .hgignore file contains glob:*.rs, the resulting regex string would be (?:|.*/)[^/]*\.rs(?:/|$). I am currently working around this issue by replacing (?:|.*/) by (?:.*/)? which looks like the same thing(?).

I've generally regarded this as a corner case that is "nice to have." But if this is a singular blocker for using this crate inside of Mercurial, then I could be convinced to just try and fix this so that you don't have to wait for the Great Regex Engine Refactor to be done. (Which could take months or even years, depending on how things go.)

In the end it's something that people might use in their own .hgignore and when comparing two otherwise very similar engines it still counts as a drawback. I have no way of knowing if it is an edge-case or not: maybe some big company somewhere has a pattern with an empty alternation in their .hgignore and gets a significant slowdown because we fall back to Python. So if you tell me "won't fix, too much work for no measurable benefit", I'll take it. :)

@BurntSushi
Copy link
Member Author

I am currently working around this issue by replacing (?:|.*/) by (?:.*/)? which looks like the same thing(?).

It does to me as well. I'm not sure why an empty alternation is being used there. I wonder if any of the core Mercurial developers might know?

In the end it's something that people might use in their own .hgignore and when comparing two otherwise very similar engines it still counts as a drawback. I have no way of knowing if it is an edge-case or not: maybe some big company somewhere has a pattern with an empty alternation in their .hgignore and gets a significant slowdown because we fall back to Python. So if you tell me "won't fix, too much work for no measurable benefit", I'll take it. :)

Aye, okay. For now, I'll plan on having this fixed (along with many other bugs) in the cut-over to regex-automata.

Sure, I just need to make sure I make all tests pass with regex beforehand to make sure that my escaping isn't broken or something.

Perfect, thanks.

@BurntSushi
Copy link
Member Author

Your questioning prompted me to elaborate a bit more on my future plans. Please see #656 for more details. :-)

@Alphare
Copy link

Alphare commented Mar 24, 2020

It does to me as well. I'm not sure why an empty alternation is being used there. I wonder if any of the core Mercurial developers might know?

I've asked the other devs and only have received a "maybe it's faster?"... but it's not, at least not in Python's re. So I think it was simply because it's a simple pattern and that it worked at the time it was written.

Aye, okay. For now, I'll plan on having this fixed (along with many other bugs) in the cut-over to regex-automata.

Sounds good.

Your questioning prompted me to elaborate a bit more on my future plans. Please see #656 for more details. :-)

Glad to see it, thank you!

@BurntSushi
Copy link
Member Author

For folks monitoring this issue, support for empty sub-expressions is now available in regex 1.3.8. Major thanks to @sliquister for doing this in #677!

@Alphare
Copy link

Alphare commented May 29, 2020

@BurntSushi This is the last nail in re2's coffin for Mercurial. What a relief! I might write a blogpost to tell the world.

@BurntSushi
Copy link
Member Author

Oh please do! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants