Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

byte regex can produce empty matches between UTF-8 code units #484

Closed
BurntSushi opened this issue Jun 3, 2018 · 3 comments
Closed

byte regex can produce empty matches between UTF-8 code units #484

BurntSushi opened this issue Jun 3, 2018 · 3 comments
Labels

Comments

@BurntSushi
Copy link
Member

Consider this program:

extern crate regex;

use regex::bytes::Regex;

fn main() {
    let re = Regex::new("").unwrap();
    for m in re.find_iter("☃".as_bytes()) {
        println!("{:?}", (m.start(), m.end()));
    }
}

its output is

(0, 0)
(1, 1)
(2, 2)
(3, 3)

Also, consider this program, which is a different manifestation of the same underlying bug:

extern crate regex;

use regex::bytes::Regex;

fn main() {
    let re = Regex::new("").unwrap();
    for m in re.find_iter(b"b\xFFr") {
        println!("{:?}", (m.start(), m.end()));
    }
}

its output is:

(0, 0)
(1, 1)
(2, 2)
(3, 3)

In particular, the empty pattern matches everything, including the locations between UTF-8 code units and otherwise invalid UTF-8.

A related note here is that find_iter is implemented slightly differently in bytes::Regex when compared with Regex. Namely, upon observing an empty match, the iterator forcefully advances its current position by a single character. For Unicode regexes, a character is a Unicode codepoint. For byte oriented regexes, a character is any single byte. The problem here is that the bytes::Regex iterator always assumes the byte oriented definition, even when Unicode mode is enabled for the entire regex (which is the default).

We could fix part of this issue by making the bytes::Regex iterator respect the value of the unicode flag when set via bytes::RegexBuilder. Namely, we could make the iterator advance one Unicode codepoint in the case of an empty match when Unicode mode is enabled for the entire regex. The problem here is the behavior in the second example, when Unicode mode is enabled, but we match at invalid UTF-8 boundaries. In that case, "skipping ahead one Unicode codepoint" doesn't really make sense, because it kind of assumes valid UTF-8. This is why the bytes::Regex iterator works the way it does. The intention was to rely on the matching semantics themselves to preserve the UTF-8 guarantee.

I guess ideally, the empty regex shouldn't match at locations that aren't valid UTF-8 boundaries when Unicode mode is enabled. This would completely fix the entire issue. I'm not entirely sure what the best way to implement this would be though.

This bug was initially reported as a bug in ripgrep in BurntSushi/ripgrep#937.

@BurntSushi
Copy link
Member Author

To make my current intention for this bug clear, my plan at this point is to see if this can be fixed in the compiler when that part of the crate gets refactored.

@BurntSushi
Copy link
Member Author

I've looked into this as part of #656. And it has in particular motivated the ability to configure lower level matching iterators to change whether character advancement is "one Unicode scalar value" or "one byte."

So we could fix this as I hinted at above by simply moving the byte oriented iterators over to using "one Unicode scalar value" instead of "one byte." The problem here is that we really don't want to enforce that behavior all the time. So, we could either presume this behavior under the current "Unicode" flag, or we could add a new flag, or we could mark this as wontfix (justified by the fact that this is an API on &[u8] after all).

I'm currently leaning towards wontfix. In part because this is an API on &[u8] after all, so I think we can get away with it. There are also backward compatibility concerns. Empty matches are a bit weird, but an empty regex isn't that weird, so someone could be relying on this.

I do not want to rely on the current Unicode flag, since I'd like to keep that in sync with the syntactic (?u) flag. Giving special significance to the RegexBuilder::unicode seems non-ideal.

We also could add a new flag that changes how the iterators work. But it would likely have to be opt-in. I think we could do that at any point, so I'd prefer to wait until there is a solid use case for it.

@BurntSushi
Copy link
Member Author

OK, so I think that overall this behavior by default is correct and consistent. Namely, the bytes::Regex API specifically sets an internal flag that permits matching invalid UTF-8. That in and of itself means that the matches produced here are fully consistent with that. So I'm going to close this as wontfix.

As a separate issue, we might consider exposing the "allow invalid UTF-8" flag in a bytes::RegexBuilder. It would be enabled by default (as it is today), but disabling it would ensure any matches always fall on valid UTF-8 boundaries. And it would in turn also prevent an empty match from splitting a codepoint. But I'm not sure if there is a use case for this, since the "ensure valid UTF-8" is really a construction meant to target Rust's guaranteed-to-be-valid-UTF-8 &str type. Outside of that, I don't think you care as much about UTF-8 boundaries? Either way, if someone has a use case, please open a new issue for it.

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Aug 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant