Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incorrect case for word boundaries #579

Closed
BurntSushi opened this issue May 8, 2019 · 5 comments
Closed

incorrect case for word boundaries #579

BurntSushi opened this issue May 8, 2019 · 5 comments

Comments

@BurntSushi
Copy link
Member

Here's a reproduction:

use regex::bytes::Regex;

fn main() {
    let hay = "I have 12, he has 2!";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}

Actual output:

"I "
"12"

Expected output:

"I "
"12"
", "
"he"
" 2"

Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=55914c890dfb6a68fc72b9c6fd986298

The same bug is present even if we use ASCII word boundaries: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eef23f309c9f608eb683aac982648301

Here's a smaller reproduction:

use regex::bytes::Regex;

fn main() {
    let hay = "az,,b";
    let re = Regex::new(r"\b..\b").unwrap();
    for m in re.find_iter(hay.as_bytes()) {
        println!("{:?}", String::from_utf8_lossy(m.as_bytes()));
    }
}

Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c7507e4d095141004909f9deb1c6cdd7

Originally reported against ripgrep: BurntSushi/ripgrep#1275

@pcpthm
Copy link

pcpthm commented Jun 10, 2019

I figured out how this bug occurs.
When a match is found on a forward DFA, reverse DFA matching is done on &text[start..]

regex/src/exec.rs

Lines 704 to 716 in 9921922

// Now run the DFA in reverse to find the start of the match.
match dfa::Fsm::reverse(
&self.ro.dfa_reverse,
self.cache,
false,
&text[start..],
end - start,
) {
Match(s) => Match((start + s, end)),
NoMatch(i) => NoMatch(i),
Quit => Quit,
}
}

However, a word boundary check cannot be done only with the substring. For example, when text = "a," and start = 1 then &text[start..] = "," but its index 0 (EOF for reverse matching) should be treated as a word boundary.

@sergeevabc
Copy link

@pcpthm, so how should we proceed?

@pcpthm
Copy link

pcpthm commented Feb 27, 2020

@sergeevabc I was not completely sure how to fix it and this is why I just commented on this issue and didn't write a fix.
An idea is to modify the Byte struct

regex/src/dfa.rs

Line 1725 in a0f541b

impl Byte {
so that there are two special eof Bytes where one returns true to .is_ascii_word() and other one returns false. Then, here
prev_si = match self.next_state(qcur, qnext, prev_si, Byte::eof()) {
, we have to use the appropriate type of eof depending on whether "text[-1]" is a word-byte or not (of course a slice cannot index at negative so we have to pass it as an additional argument). We have to account for the two kinds of EOF bytes for DFA table layouts e.g.

regex/src/dfa.rs

Lines 1522 to 1525 in a0f541b

fn num_byte_classes(&self) -> usize {
// We add 1 to account for the special EOF byte.
(self.prog.byte_classes[255] as usize + 1) + 1
}
and it might lead to slight performance degradation in some case. I'm not sure it can be improved.

@BurntSushi
Copy link
Member Author

This will be fixed as part of #656.

@malaire
Copy link

malaire commented Jan 4, 2021

I found a problem with \b - is this same issue as discussed here?

This returns None even though there are two possible matches at locations 3 and 4.

fn main() {
    let re = regex::Regex::new(r"\b.").unwrap();
    println!("{:?}", re.find_at("foo bar", 3));
}

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3892e7b72507f13b32a1110e94aee066

BurntSushi added a commit that referenced this issue Jul 5, 2023
I usually close tickets on a commit-by-commit basis, but this refactor
was so big that it wasn't feasible to do that. So ticket closures are
marked here.

Closes #244
Closes #259
Closes #476
Closes #644
Closes #675
Closes #824
Closes #961

Closes #68
Closes #510
Closes #787
Closes #891

Closes #429
Closes #517
Closes #579
Closes #779
Closes #850
Closes #921
Closes #976
Closes #1002

Closes #656
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue Jul 18, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex (regex)</summary>

### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07)

[Compare Source](rust-lang/regex@1.9.0...1.9.1)

\==================
This is a patch release which fixes a memory usage regression. In the regex
1.9 release, one of the internal engines used a more aggressive allocation
strategy than what was done previously. This patch release reverts to the
prior on-demand strategy.

Bug fixes:

-   [BUG #&#8203;1027](rust-lang/regex#1027):
    Change the allocation strategy for the backtracker to be less aggressive.

### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05)

[Compare Source](rust-lang/regex@1.8.4...1.9.0)

\==================
This release marks the end of a [years long rewrite of the regex crate
internals](rust-lang/regex#656). Since this is
such a big release, please report any issues or regressions you find. We would
also love to hear about improvements as well.

In addition to many internal improvements that should hopefully result in
"my regex searches are faster," there have also been a few API additions:

-   A new `Captures::extract` method for quickly accessing the substrings
    that match each capture group in a regex.
-   A new inline flag, `R`, which enables CRLF mode. This makes `.` match any
    Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and
    `(?m:$)` match after and before both `\r` and `\n`, respectively, but never
    between a `\r` and `\n`.
-   `RegexBuilder::line_terminator` was added to further customize the line
    terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
-   The `std` Cargo feature is now actually optional. That is, the `regex` crate
    can be used without the standard library.
-   Because `regex 1.9` may make binary size and compile times even worse, a
    new experimental crate called `regex-lite` has been published. It prioritizes
    binary size and compile times over functionality (like Unicode) and
    performance. It shares no code with the `regex` crate.

New features:

-   [FEATURE #&#8203;244](rust-lang/regex#244):
    One can opt into CRLF mode via the `R` flag.
    e.g., `(?mR:$)` matches just before `\r\n`.
-   [FEATURE #&#8203;259](rust-lang/regex#259):
    Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
-   [FEATURE #&#8203;476](rust-lang/regex#476):
    `std` is now an optional feature. `regex` may be used with only `alloc`.
-   [FEATURE #&#8203;644](rust-lang/regex#644):
    `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
-   [FEATURE #&#8203;675](rust-lang/regex#675):
    Anchored search APIs are now available in `regex-automata 0.3`.
-   [FEATURE #&#8203;824](rust-lang/regex#824):
    Add new `Captures::extract` method for easier capture group access.
-   [FEATURE #&#8203;961](rust-lang/regex#961):
    Add `regex-lite` crate with smaller binary sizes and faster compile times.
-   [FEATURE #&#8203;1022](rust-lang/regex#1022):
    Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

-   [PERF #&#8203;68](rust-lang/regex#68):
    Added a one-pass DFA engine for faster capture group matching.
-   [PERF #&#8203;510](rust-lang/regex#510):
    Inner literals are now used to accelerate searches, e.g., `\w+@&#8203;\w+` will scan
    for `@`.
-   [PERF #&#8203;787](rust-lang/regex#787),
    [PERF #&#8203;891](rust-lang/regex#891):
    Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have
specific issues devoted to them.)

Bug fixes:

-   [BUG #&#8203;429](rust-lang/regex#429):
    Fix matching bugs related to `\B` and inconsistencies across internal engines.
-   [BUG #&#8203;517](rust-lang/regex#517):
    Fix matching bug with capture groups.
-   [BUG #&#8203;579](rust-lang/regex#579):
    Fix matching bug with word boundaries.
-   [BUG #&#8203;779](rust-lang/regex#779):
    Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
-   [BUG #&#8203;850](rust-lang/regex#850):
    Fix matching bug inconsistency between NFA and DFA engines.
-   [BUG #&#8203;921](rust-lang/regex#921):
    Fix matching bug where literal extraction got confused by `$`.
-   [BUG #&#8203;976](rust-lang/regex#976):
    Add documentation to replacement routines about dealing with fallibility.
-   [BUG #&#8203;1002](rust-lang/regex#1002):
    Use corpus rejection in fuzz testing.

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi44LjExIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957
Reviewed-by: crapStone <crapstone01@gmail.com>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants