Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 3 structure-aware fuzzers #848

Closed
wants to merge 1 commit into from
Closed

Add 3 structure-aware fuzzers #848

wants to merge 1 commit into from

Conversation

addisoncrump
Copy link
Contributor

These are the fuzzers used to find CVE-2022-24713.

Effectively, it just marks all AST members as Arbitrary when the fuzzer feature is enabled, then generates ASTs which are then converted back to statements in regex. Then, the fuzzer parses the generated string and executes against an input.

@5225225
Copy link
Contributor

5225225 commented Mar 14, 2022

I was using

fuzz_target!(|data: (&str, &str, [bool; 7])| {
    let (
        pattern,
        input,
        [case_insensitive, multi_line, dot_matches_new_line, swap_greed, ignore_whitespace, unicode, octal],
    ) = data;

    let r = regex::RegexBuilder::new(pattern)
        .case_insensitive(case_insensitive)
        .multi_line(multi_line)
        .dot_matches_new_line(dot_matches_new_line)
        .swap_greed(swap_greed)
        .ignore_whitespace(ignore_whitespace)
        .unicode(unicode)
        .octal(octal)
        .dfa_size_limit(1024)
        .size_limit(1024)
        .build();

    if let Ok(re) = r {
        re.is_match(input);
    }
});

Looks like the only option that shows up as an option in the RegexBuilder that doesn't show up in the flags is octal. Might be worth fuzzing that option as well?

@BurntSushi
Copy link
Member

Thanks so much!

I am not sure if I am a fan of adding dependencies to regex-syntax, even optional ones. Because they now become part of the API of the crate and require careful management when it comes to release maintenance, particularly given that it create a public dependency on another crate.

I'd really strongly prefer that we don't introduce new dependencies to regex-syntax. Adding a dev-dependency would be quite all right though. Is there a way to structure the fuzzer to make that possible I wonder?

@addisoncrump
Copy link
Contributor Author

Looks like the only option that shows up as an option in the RegexBuilder that doesn't show up in the flags is octal. Might be worth fuzzing that option as well?

@5225225, I saw your PR earlier (it was actually what inspired me to write this one!) and would be happy to add some of the testing capabilities yours demonstrates.

I'd really strongly prefer that we don't introduce new dependencies to regex-syntax. Adding a dev-dependency would be quite all right though. Is there a way to structure the fuzzer to make that possible I wonder?

@BurntSushi, let me see what I can do. 🙂

@addisoncrump
Copy link
Contributor Author

@BurntSushi seems that this option isn't really available: rust-fuzz/cargo-fuzz#256

Unfortunately, it would be necessary to make it a feature. However, it is possible to test if it's being used in a fuzzer environment, so we can set up something like the following:

#[cfg(any(all(feature = "fuzzer", not(fuzzing)), all(feature = "arbitrary", not(fuzzing))))]
compile_error!("You cannot enable the fuzzer feature without fuzzing!");

This prevents people from pulling in the fuzzer features or arbitrary without fuzzing. Does this suffice?

@CityOfLight77
Copy link

@VTCAKAVSMoACE with this fuzzers how long it takes to find cve-2022-24713? I'm tested locally from your repo I can't trigger the crash. I have 50ish corpus and using regex dictionary

@addisoncrump
Copy link
Contributor Author

@VTCAKAVSMoACE with this fuzzers how long it takes to find cve-2022-24713? I'm tested locally from your repo I can't trigger the crash. I have 50ish corpus and using regex dictionary

Sorry for the late response. You need to set a timeout of a second or two, and you do not want to use a corpus.

@BurntSushi
Copy link
Member

I've given this some thought, and I'm just not okay with adding a public dependency on arbitrary for regex-syntax. I could stomach it as a dev-dependency, but even one that's optional and difficult to enable is too risky for me. And doing it just to enable some convenient testing does not feel like the right trade off to me. The problem it being in the dependency chain is that it's still considered part of the main set of dependencies and I believe there are various things that can impact building regex-syntax based on things that arbitrary does. (Perhaps the setting of rust-version, IIRC.)

@BurntSushi
Copy link
Member

I'm going to close this out for largely similar reasons as stated here: #959 (comment)

@BurntSushi BurntSushi closed this May 1, 2023
@addisoncrump
Copy link
Contributor Author

Hm, this seems a bit of a shame to me. Grammar fuzzing discovers hard-to-find bugs very fast in regex (e.g. reproducing the ReDoS in <1 minute); it seems a bit more than convenient testing. I'll think about alternative implementation strategies here since I think it's important, but I don't have much time to do this at the moment. I will try to nerd snipe some my friends into this 😉

@BurntSushi
Copy link
Member

Yeah I agree it's a shame. And honestly, my suggested work-around sucks. It blows chunks. I don't like it and I don't want it, haha.

This really does feel like a flaw in the cargo fuzz tooling. Basically, whenever it needs some additional infrastructure for testing, it mandates that it be added as a non-dev dependency. That's just not good at all. It might be acceptable for some crates to do that, but this is really what dev-dependencies are supposed to be for. So I really think cargo fuzz needs to figure out a way to support them.

I'll see if I can open up an issue with them and see if I can get anywhere that way.

@BurntSushi
Copy link
Member

I'm also reconsidering the other work-around suggested, which was mirroring the full Ast type in the fuzzer targets, and making them impl Arbitrary and then defining a From impl. All of the Ast types have a public representation, so this is possible to do. It is painful. But the Ast types very rarely change. So it might be feasible. At least, I think I like this better than my idea of copying and patching the entire regex-syntax crate.

@BurntSushi
Copy link
Member

I've added a request here: rust-fuzz/cargo-fuzz#338

@Manishearth
Copy link
Member

I closed the cargo-fuzz issue as unimplementable, but repeating some of my suggestions from there:

For this specific case, I'd really just recommend adding an arbitrary feature the way a lot of crates have serde features: other people may also want to include your types in their fuzzers. I recognize that this is similar to what has been proposed here and has already been rejected.

The hack for getting this to work is to instead manually define fuzz targets under examples/ using the libfuzzer crate, and figure out the sanitizer flags cargo fuzz passes in and pass them yourself using RUSTFLAGS=foo cargo rustc --examples bar -- <more flags>. That's basically all cargo-fuzz does: it creates a crate that depends on libfuzzer, sets up boilerplate, and then builds with the appropriate flags.

@BurntSushi
Copy link
Member

@Manishearth gotya, thanks. I think the RUSTFLAGS hack might not work, but I'm not sure. In particular, this project is hooked up to OSS-fuzz, and I'm not sure whether it's possible in that context to say, "do RUSTFLAGS=foo cargo rustc ... instead of cargo fuzz."

@addisoncrump
Copy link
Contributor Author

Regarding OSS-Fuzz: this is something that you would put into the config file. Quite possible 🙂

@BurntSushi
Copy link
Member

Ah! There's a lot that I don't know apparently. Haha.

@BurntSushi BurntSushi reopened this May 4, 2023
@BurntSushi
Copy link
Member

OK, so I'm going to move forward with just adding arbitrary as an optional dependency to regex-syntax. And it will be an advertised feature. Things that changed my mind:

  1. There's just not really any good alternative given the cargo fuzz setup today.
  2. This drastically speeds up fuzzing. And given that regex 1.9 is going to be a complete rewrite, getting it under a good fuzzer before a release sounds like a very responsible thing to do.
  3. I reviewed arbitrary itself and it looks like a solid crate with good dependency hygiene itself.
  4. I added it as an optional dependency to regex-syntax locally and played around with it. Everything still builds on Rust 1.60.
  5. If this ends up not working out, I can always back out the change because I can do new breaking change releases of regex-syntax at a higher frequency than regex itself.

@addisoncrump
Copy link
Contributor Author

addisoncrump commented May 4, 2023

So, what do you need from me? 🙂 Rebase and test?

@BurntSushi
Copy link
Member

Ah right, nope, you'll all set. I'm rolling this into #978. (I have other fuzz changes in that PR too, in order to improve coverage across various things.)

@addisoncrump
Copy link
Contributor Author

Okay. I think potentially something worthwhile is to also perform differential testing between the rewrite and the old version. This should be fairly quick to set up and I'll create a separate repository for this since it doesn't seem appropriate here.

@BurntSushi
Copy link
Member

@addisoncrump Sounds great. What does it entail? (Or feel free to avoid indulging my curiosity and just show me whenever the new repo is ready. :-))

@addisoncrump
Copy link
Contributor Author

I should be done in like 5 minutes? I'll just push it here since this PR is encapsulated by other changes.

@addisoncrump
Copy link
Contributor Author

Aha, yup, this immediately finds some differences. Test with cargo fuzz run -s none diff_new_fuzzer and enjoy your failing testcases 😉

@addisoncrump
Copy link
Contributor Author

Note that currently, it uses generated ASTs from the old implementation. If there are new features present in the rewrite, this won't test them, or it will test them wrong.

@BurntSushi
Copy link
Member

Yeah unfortunately not only are there new features, but there are bug fixes in match semantics too.

I'll take a look at some of the failing cases to confirm they fall in those buckets, but this might make differential testing in this way tricky. (I've done my own form of differential non-fuzz testing by ensuring that the old implementation passes the new test suite and that the new implementation passes the old test suite.) The new test suite is much bigger than the old one, and tests each individual internal engine much more thoroughly. It's not fuzz testing, but it's not nothing either. :)

@addisoncrump
Copy link
Contributor Author

addisoncrump commented May 4, 2023

And the differential fuzzer also just repro'd the old CVE on the new implementation with: (?:){250000}?... Seems to be in parse-time, again, too.

@BurntSushi
Copy link
Member

@addisoncrump Yeah I caught that one a few days ago. I might not have pushed my branch since then. But it's in the new NFA compiler, not parser. Happened for a different reason.

@BurntSushi
Copy link
Member

BurntSushi commented May 5, 2023

@addisoncrump Hmmm. Probably the simplest thing is to copy the properties from this section of UNICODE.md: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl12-properties

But that doesn't include each of the general category or script values. For example, neither Letter (a general category) nor Greek (a script) are listed there.

Otherwise there really isn't anything in regex-syntax (internal or otherwise) that makes such a thing convenient because it isn't an operation that is ever really needed. Technically everything you need are in the Unicode property name/alias tables, but you'd have to dig around in there. But none of that is exposed in a way that lets your fuzz target take advantage of it.

One thing worth mentioning is that the AST parser doesn't check whether the thing inside a \p{...} is valid or not. That's only done when the AST is translated to an HIR. The AST parser might do some basic validation in terms of which characters are allowed in \p{...}. I can't remember. But it specifically does not, for example, reject \p{foo}:

$ regex-cli debug ast '\p{foo}'                                        
parse time:  54.929µs

Class(
    Unicode(
        ClassUnicode {
            span: Span(Position(o: 0, l: 1, c: 1), Position(o: 7, l: 1, c: 8)),
            negated: false,
            kind: Named(
                "foo",
            ),
        },
    ),
)

@addisoncrump
Copy link
Contributor Author

I'll find a way 😁 But not this evening.

@addisoncrump
Copy link
Contributor Author

addisoncrump commented May 6, 2023

So, I looked a little further. I believe that using -use_cmp=0 will also improve the performance of the fuzzers that generate ASTs by arbitrary, as "cmp" will attempt to identify subsequences in the input to replace with the strings that are discovered dynamically. This will unfortunately poison the corpus with undesirable values now that the unicode class arbitrary is implemented, and even in previous versions this wouldn't work because of how arbitrary generates these values.

@Manishearth you might also wish to know this; maybe worth documenting in the rust fuzz book. I believe -use_cmp=1 (enabled by default) may be actively harmful for fuzz campaigns which utilise arbitrary as it will cause libfuzzer to suggest useless mutations for arbitrary-generated data. It will only be useful if libfuzzer is able to find data which is used as-is without transformation (such as arbitrary => fixed string); in the case of the PR above, this never happens, so the inferred mutation injects data which is not useful for the fuzz campaign.

BurntSushi pushed a commit that referenced this pull request May 14, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 14, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 18, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 18, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 18, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 18, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 21, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 21, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 22, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 22, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 22, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 22, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 24, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 24, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 24, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 24, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request May 25, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request May 25, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
@addisoncrump
Copy link
Contributor Author

oops, I just obliterated this while making another PR by accident. Closing, since these changes are already added.

BurntSushi pushed a commit that referenced this pull request Jun 13, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request Jun 13, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
BurntSushi pushed a commit that referenced this pull request Jul 5, 2023
This makes uses of the new 'arbitrary' feature in 'regex-syntax' to make
fuzzing much more targeted and complete.

Closes #848
BurntSushi added a commit that referenced this pull request Jul 5, 2023
It turns out that the way we were dealing with quit states in the DFA
was not quite right. Basically, if we entered a quit state and a match
had been found, then we were returning the match instead of the error.
But the match might not be the correct leftmost-first match, and so, we
really shouldn't return it. Otherwise a regex like '\B.*' could match
much less than it should.

This was caught by a differential fuzzer developed in #848.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants