Nested characters class do not report the same result as two character classes in an | #680

Closed
Ekleog opened this issue May 15, 2020 · 1 comment · Fixed by #860

Ekleog commented May 15, 2020

Hello,

Just wanted to report a weird behavior I've just found: r#"^([-.[:alnum:]]|[[:^ascii:]])+"# matches (on a find) the whole string élégance.fr, while r#"^[-.[:alnum:][:^ascii:]]+"# matches only its first character.

However, as far as I understand, these two variants should behave the same.

Anyway thank you for the regex crate! And for regex_automata too, as I'm probably going to switch to it soon as per your advice in the other issue :)
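
To make the expectation in the report concrete, here is a minimal std-only sketch of what the single class `[-.[:alnum:][:^ascii:]]` is supposed to accept. It is a hand-written model, not the regex crate's implementation, and the names `in_class` and `anchored_match_len` are made up for illustration. Under this model, `^[...]+` should consume the whole of élégance.fr, the same as the alternation form.

```rust
/// Hand-written model of the class `[-.[:alnum:][:^ascii:]]`.
/// Hypothetical helper: it only encodes the intended semantics.
fn in_class(c: char) -> bool {
    c == '-' || c == '.' || c.is_ascii_alphanumeric() || !c.is_ascii()
}

/// Length in bytes of the longest prefix made only of class members,
/// i.e. what an anchored `^[...]+` should consume (0 means no match).
fn anchored_match_len(haystack: &str) -> usize {
    haystack
        .char_indices()
        .find(|&(_, c)| !in_class(c))
        .map(|(i, _)| i)
        .unwrap_or(haystack.len())
}

fn main() {
    let haystack = "élégance.fr";
    let len = anchored_match_len(haystack);
    // Every character is either '-', '.', ASCII alphanumeric, or non-ASCII,
    // so the expected match is the entire haystack.
    println!("matched {:?}", &haystack[..len]);
}
```

A match that stops after the first character, as reported for the nested form, means the ASCII members of the class were lost somewhere.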

@BurntSushi
Member

Nice, thank you for the report. This appears to be a bug in the translator from AST to HIR. Here's the HIR:

$ regex-cli debug hir '^[[:alnum:][:^ascii:]]+'
^[[:alnum:][:^ascii:]]+
-----------------------
    parse time:  43.919µs
translate time:  11.904µs

Hir {
    kind: Concat(
        [
            Hir {
                kind: Anchor(
                    StartText,
                ),
                info: HirInfo {
                    bools: 343,
                },
            },
            Hir {
                kind: Repetition(
                    Repetition {
                        kind: OneOrMore,
                        greedy: true,
                        hir: Hir {
                            kind: Class(
                                Unicode(
                                    ClassUnicode {
                                        set: IntervalSet {
                                            ranges: [
                                                ClassUnicodeRange {
                                                    start: "0x80",
                                                    end: "\u{10ffff}",
                                                },
                                            ],
                                        },
                                    },
                                ),
                            ),
                            info: HirInfo {
                                bools: 1,
                            },
                        },
                    },
                ),
                info: HirInfo {
                    bools: 1,
                },
            },
        ],
    ),
    info: HirInfo {
        bools: 85,
    },
}

Which is definitely wrong: the class contains only the non-ASCII range, and [:alnum:] is missing entirely. The AST, however, appears correct:

$ regex-cli debug ast '^[[:alnum:][:^ascii:]]+'
^[[:alnum:][:^ascii:]]+
-----------------------
parse time:  43.203µs

Concat(
    Concat {
        span: Span(Position(o: 0, l: 1, c: 1), Position(o: 23, l: 1, c: 24)),
        asts: [
            Assertion(
                Assertion {
                    span: Span(Position(o: 0, l: 1, c: 1), Position(o: 1, l: 1, c: 2)),
                    kind: StartLine,
                },
            ),
            Repetition(
                Repetition {
                    span: Span(Position(o: 1, l: 1, c: 2), Position(o: 23, l: 1, c: 24)),
                    op: RepetitionOp {
                        span: Span(Position(o: 22, l: 1, c: 23), Position(o: 23, l: 1, c: 24)),
                        kind: OneOrMore,
                    },
                    greedy: true,
                    ast: Class(
                        Bracketed(
                            ClassBracketed {
                                span: Span(Position(o: 1, l: 1, c: 2), Position(o: 22, l: 1, c: 23)),
                                negated: false,
                                kind: Item(
                                    Union(
                                        ClassSetUnion {
                                            span: Span(Position(o: 2, l: 1, c: 3), Position(o: 21, l: 1, c: 22)),
                                            items: [
                                                Ascii(
                                                    ClassAscii {
                                                        span: Span(Position(o: 2, l: 1, c: 3), Position(o: 11, l: 1, c: 12)),
                                                        kind: Alnum,
                                                        negated: false,
                                                    },
                                                ),
                                                Ascii(
                                                    ClassAscii {
                                                        span: Span(Position(o: 11, l: 1, c: 12), Position(o: 21, l: 1, c: 22)),
                                                        kind: Ascii,
                                                        negated: true,
                                                    },
                                                ),
                                            ],
                                        },
                                    ),
                                ),
                            },
                        ),
                    ),
                },
            ),
        ],
    },
)

@BurntSushi BurntSushi added the bug label May 15, 2020
BurntSushi added a commit that referenced this issue May 17, 2022
This fixes a bug in how ASCII class unioning was implemented. Namely, it
previously and erroneously unioned together two classes and then applied
negation/case-folding based on the most recently added class, even if
the class added previously wasn't negated. So for example, given the
regex '[[:alnum:][:^ascii:]]', this would initialize the class with
'[:alnum:]', then add all '[:^ascii:]' codepoints and then negate the
entire thing because of the negation in '[:^ascii:]'. Negating the
entire thing is clearly wrong and not the intended semantics.

We fix this by applying negation/case-folding only to the class we're
dealing with, and then we union it with whatever existing class we're
building.

Fixes #680
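
The ordering problem described in the commit message can be sketched with plain sets. This is a toy model under stated assumptions, not the regex-syntax translator: the universe is cut down to codepoints 0x00..=0xFF so negation stays small, and `push_buggy` / `push_fixed` are hypothetical names. It also assumes the un-negated ASCII ranges were added before the whole-class negation, which is what the HIR dump earlier in this thread suggests.

```rust
use std::collections::BTreeSet;

// Toy universe (assumption): codepoints 0x00..=0xFF instead of all of Unicode.
const MAX: u32 = 0xFF;

type Set = BTreeSet<u32>;

/// Complement of `set` within the toy universe.
fn negate(set: &Set) -> Set {
    (0..=MAX).filter(|cp| !set.contains(cp)).collect()
}

/// Model of `[:ascii:]`.
fn ascii() -> Set {
    (0..=0x7F).collect()
}

/// Model of `[:alnum:]`.
fn alnum() -> Set {
    (0..=0x7Fu32)
        .filter(|&cp| char::from_u32(cp).map_or(false, |c| c.is_ascii_alphanumeric()))
        .collect()
}

/// Buggy order: union first, then negate the whole accumulated class.
fn push_buggy(acc: &mut Set, base: &Set, negated: bool) {
    acc.extend(base.iter().copied());
    if negated {
        let flipped = negate(&*acc);
        *acc = flipped;
    }
}

/// Fixed order: negate only the item being added, then union it in.
fn push_fixed(acc: &mut Set, base: &Set, negated: bool) {
    let item = if negated { negate(base) } else { base.clone() };
    acc.extend(item);
}

fn main() {
    // Model of `[[:alnum:][:^ascii:]]`.
    let mut buggy = Set::new();
    push_buggy(&mut buggy, &alnum(), false);
    push_buggy(&mut buggy, &ascii(), true);

    let mut fixed = Set::new();
    push_fixed(&mut fixed, &alnum(), false);
    push_fixed(&mut fixed, &ascii(), true);

    println!("buggy contains 'a': {}", buggy.contains(&('a' as u32)));
    println!("fixed contains 'a': {}", fixed.contains(&('a' as u32)));
    println!("buggy contains 0xE9: {}", buggy.contains(&0xE9));
    println!("fixed contains 0xE9: {}", fixed.contains(&0xE9));
}
```

In this model the buggy order ends up with only 0x80..=0xFF, mirroring the single non-ASCII range in the HIR dump, while the fixed order keeps the alphanumerics alongside the non-ASCII codepoints.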
BurntSushi added a commit that referenced this issue May 18, 2022
otc-zuul bot pushed a commit to opentelekomcloud-infra/cloudmon-plugin-smtp that referenced this issue Jun 7, 2022
Bump regex from 1.5.4 to 1.5.6

Bumps regex from 1.5.4 to 1.5.6.

Changelog
Sourced from regex's changelog.

1.5.6 (2022-05-20)
This release includes a few bug fixes, including a bug that produced incorrect
matches when a non-greedy ? operator was used.

[BUG #680](rust-lang/regex#680):
Fixes a bug where [[:alnum:][:^ascii:]] dropped [:alnum:] from the class.
[BUG #859](rust-lang/regex#859):
Fixes a bug where Hir::is_match_empty returned false for \b.
[BUG #862](rust-lang/regex#862):
Fixes a bug where 'ab??' matches 'ab' instead of 'a' in 'ab'.

1.5.5 (2022-03-08)
This release fixes a security bug in the regex compiler. This bug permits a
vector for a denial-of-service attack in cases where the regex being compiled
is untrusted. There are no known problems where the regex is itself trusted,
including in cases of untrusted haystacks.

SECURITY #GHSA-m5pq-gvj9-9vr8:
Fixes a bug in the regex compiler where empty sub-expressions subverted the
existing mitigations in place to enforce a size limit on compiled regexes.
The Rust Security Response WG published an advisory about this:
https://groups.google.com/g/rustlang-security-announcements/c/NcNNL1Jq7Yw




Commits

9aef5b1 1.5.6
2931b07 syntax: bump minimum regex-syntax version to 0.6.26
b41bde0 regex-syntax-0.6.26
d98da65 changelog: 1.5.6
1c19619 syntax: fix literal extraction for 'ab??'
88a2a62 syntax: fix 'is_match_empty' predicate
72f09f1 syntax: fix ascii class union bug
b537286 doc: fix some typos
258bdf7 changelog: 1.5.5
d130381 1.5.5
Additional commits viewable in compare view





Reviewed-by: Artem Goncharov <Artem.goncharov@gmail.com>