Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex with many literal alternates can erroneously match the empty string #999

Closed
BurntSushi opened this issue May 25, 2023 · 5 comments · Fixed by #1000
Closed

regex with many literal alternates can erroneously match the empty string #999

BurntSushi opened this issue May 25, 2023 · 5 comments · Fixed by #1000
Labels

Comments

@BurntSushi
Copy link
Member

This program panics:

fn main() {
    let needles = vec![
        "aA", "bA", "cA", "dA", "eA", "fA", "gA", "hA", "iA", "jA", "kA", "lA",
        "mA", "nA", "oA", "pA", "qA", "rA", "sA", "tA", "uA", "vA", "wA", "xA",
        "yA", "zA",
    ];
    let pattern = needles.join("|");
    let re = regex::Regex::new(&pattern).unwrap();
    let hay = "FUBAR";
    assert_eq!(0, re.find_iter(hay).count());
}

But it should run without panicking as the regex should not match anything in FUBAR.

This is another bug caused by the literal optimizer. This one is pretty hard to hit. All of the following need to be true:

  • The literals extracted need to be "complete." That is, the language described by the regex is small and finite.
  • There needs to be at least 26 distinct starting bytes among all of the elements in the language described by the regex.
  • There needs to be fewer than 26 distinct ending bytes among all of the elements in the language described by the regex.
  • Possibly other criteria...

This causes a weird code path in src/exec.rs that results in using an "empty" prefix literal matcher that matches at every position.

I'll plan to get a fix up for this soon, but this bug does not impact the rewrite. (Its literal infrastructure is far more principled than the hodge podge on master. We'll see if it lives up to my expectations though.)

@BurntSushi BurntSushi added the bug label May 25, 2023
@BurntSushi
Copy link
Member Author

This was originally reported via ripgrep: #999

BurntSushi added a commit that referenced this issue May 25, 2023
This bug results in some regexes reporting matches at every position
even when it should't. The bug happens because the internal literal
optimizer winds up using an "empty" searcher that reports a match at
every position. This is technically correct whenever the literal
searcher is used as a prefilter (although slower than necessary), but an
optimization added later enabled the searcher to run on its own and not
as a prefilter. i.e., Without the confirm step by the regex engine. In
that context, the "empty" searcher is totally incorrect.

So this bug corresponds to a code path where the "empty" literal
searcher is used, but is also in a case where the literal searcher is
used directly to find matches and not as a prefilter. I believe at
least the following are required to trigger this path:

* The literals extracted need to be "complete." That is, the language
described by the regex is small and finite.
* There needs to be at least 26 distinct starting bytes among all of
the elements in the language described by the regex.
* There needs to be fewer than 26 distinct ending bytes among all of
the elements in the language described by the regex.
* Possibly other criteria...

The actual fix is to change the code that selects the literal searcher.
Indeed, there was even a comment in the erroneous code saying that the
path was impossible, but of course, it isn't. We change that path to
return None, as it should have long ago. This in turn results in the
case outlined above not using a literal searcher and just the regex
engine.

Fixes #999
BurntSushi added a commit that referenced this issue May 25, 2023
This bug results in some regexes reporting matches at every position
even when it should't. The bug happens because the internal literal
optimizer winds up using an "empty" searcher that reports a match at
every position. This is technically correct whenever the literal
searcher is used as a prefilter (although slower than necessary), but an
optimization added later enabled the searcher to run on its own and not
as a prefilter. i.e., Without the confirm step by the regex engine. In
that context, the "empty" searcher is totally incorrect.

So this bug corresponds to a code path where the "empty" literal
searcher is used, but is also in a case where the literal searcher is
used directly to find matches and not as a prefilter. I believe at
least the following are required to trigger this path:

* The literals extracted need to be "complete." That is, the language
described by the regex is small and finite.
* There needs to be at least 26 distinct starting bytes among all of
the elements in the language described by the regex.
* There needs to be fewer than 26 distinct ending bytes among all of
the elements in the language described by the regex.
* Possibly other criteria...

The actual fix is to change the code that selects the literal searcher.
Indeed, there was even a comment in the erroneous code saying that the
path was impossible, but of course, it isn't. We change that path to
return None, as it should have long ago. This in turn results in the
case outlined above not using a literal searcher and just the regex
engine.

Fixes #999
BurntSushi added a commit that referenced this issue May 25, 2023
This bug results in some regexes reporting matches at every position
even when it should't. The bug happens because the internal literal
optimizer winds up using an "empty" searcher that reports a match at
every position. This is technically correct whenever the literal
searcher is used as a prefilter (although slower than necessary), but an
optimization added later enabled the searcher to run on its own and not
as a prefilter. i.e., Without the confirm step by the regex engine. In
that context, the "empty" searcher is totally incorrect.

So this bug corresponds to a code path where the "empty" literal
searcher is used, but is also in a case where the literal searcher is
used directly to find matches and not as a prefilter. I believe at
least the following are required to trigger this path:

* The literals extracted need to be "complete." That is, the language
described by the regex is small and finite.
* There needs to be at least 26 distinct starting bytes among all of
the elements in the language described by the regex.
* There needs to be fewer than 26 distinct ending bytes among all of
the elements in the language described by the regex.
* Possibly other criteria...

The actual fix is to change the code that selects the literal searcher.
Indeed, there was even a comment in the erroneous code saying that the
path was impossible, but of course, it isn't. We change that path to
return None, as it should have long ago. This in turn results in the
case outlined above not using a literal searcher and just the regex
engine.

Fixes #999
@BurntSushi
Copy link
Member Author

This is fixed in regex 1.8.3 on crates.io.

rhysd added a commit to rhysd/hgrep that referenced this issue May 29, 2023
@addisoncrump
Copy link
Contributor

Out of curiosity, does this also affect #978?

@BurntSushi
Copy link
Member Author

It does not. This bug was the result of the literal optimization infrastructure in regex proper. All of that stuff was completely re-thought and rewritten.

(It's plausible I introduced new bugs by doing this, but, well, that's what fuzzing is for. I plan to do more of it before the regex 1.9 release. And also let OSS-fuzz attack it for some period of time.)

@addisoncrump
Copy link
Contributor

Yeah, I was curious as to whether we could reproduce this with the differential fuzzer.

crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jun 12, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | patch | `1.8.1` -> `1.8.4` |

---

### Release Notes

<details>
<summary>rust-lang/regex</summary>

### [`v1.8.4`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;184-2023-06-05)

[Compare Source](rust-lang/regex@1.8.3...1.8.4)

\==================
This is a patch release that fixes a bug where `(?-u:\B)` was allowed in
Unicode regexes, despite the fact that the current matching engines can report
match offsets between the code units of a single UTF-8 encoded codepoint. That
in turn means that match offsets that split a codepoint could be reported,
which in turn results in panicking when one uses them to slice a `&str`.

This bug occurred in the transition to `regex 1.8` because the underlying
syntactical error that prevented this regex from compiling was intentionally
removed. That's because `(?-u:\B)` will be permitted in Unicode regexes in
`regex 1.9`, but the matching engines will guarantee to never report match
offsets that split a codepoint. When the underlying syntactical error was
removed, no code was added to ensure that `(?-u:\B)` didn't compile in the
`regex 1.8` transition release. This release, `regex 1.8.4`, adds that code
such that `Regex::new(r"(?-u:\B)")` returns to the `regex <1.8` behavior of
not compiling. (A `bytes::Regex` can still of course compile it.)

Bug fixes:

-   [BUG #&#8203;1006](rust-lang/regex#1006):
    Fix a bug where `(?-u:\B)` was allowed in Unicode regexes, and in turn could
    lead to match offsets that split a codepoint in `&str`.

### [`v1.8.3`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;183-2023-05-25)

[Compare Source](rust-lang/regex@1.8.2...1.8.3)

\==================
This is a patch release that fixes a bug where the regex would report a
match at every position even when it shouldn't. This could occur in a very
small subset of regexes, usually an alternation of simple literals that
have particular properties. (See the issue linked below for a more precise
description.)

Bug fixes:

-   [BUG #&#8203;999](rust-lang/regex#999):
    Fix a bug where a match at every position is erroneously reported.

### [`v1.8.2`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#&#8203;182-2023-05-22)

[Compare Source](rust-lang/regex@1.8.1...1.8.2)

\==================
This is a patch release that fixes a bug where regex compilation could panic
in debug mode for regexes with large counted repetitions. For example,
`a{2147483516}{2147483416}{5}` resulted in an integer overflow that wrapped
in release mode but panicking in debug mode. Despite the unintended wrapping
arithmetic in release mode, it didn't cause any other logical bugs since the
errant code was for new analysis that wasn't used yet.

Bug fixes:

-   [BUG #&#8203;995](rust-lang/regex#995):
    Fix a bug where regex compilation with large counted repetitions could panic.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS4xMTEuMCIsInVwZGF0ZWRJblZlciI6IjM1LjExMS4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1912
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants