Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switching to a magic number-based collection generator? #90

Open
Ekleog opened this issue Aug 18, 2022 · 7 comments
Open

Switching to a magic number-based collection generator? #90

Ekleog opened this issue Aug 18, 2022 · 7 comments

Comments

@Ekleog
Copy link
Contributor

Ekleog commented Aug 18, 2022

Hey!

I've just recently had a look at the TypeGenerator for Vec, and noticed that the collection generators seem to use a tag-length-value format.

Such a format is very quickcheck-friendly (because all inputs are valid), but it is not at all fuzzer-friendly, because a change of one in the length of one struct is going to completely mess with the whole structure of the rest of the input.

Google recommends a magic-number-based parser, as per https://github.com/google/fuzzing/blob/master/docs/split-inputs.md#magic-separator : instead of having a length, have a separator used to go to the next input.

With this scheme, quickcheck is much less happy (because it needs to somehow figure out this magic number), but fuzzers are much happier (because they can easily change the length of the contents of one cell without messing with the rest).

So I'm thinking, do you think it'd make sense to change the generators of cargo-bolero so that they would use a magic-number scheme? The only issue I can see would be with the quickcheck-based backend, which I think could instead always use mutate so it doesn't have the issue that most inputs would have a single element in the collection (unless the magic number magically appeared)

@camshaft
Copy link
Owner

Interesting! I wasn't aware of that recommendation. Yeah I think it would be worth doing. The driver would need to be updated in order to report what "mode" it was in. We do have the DriverMode enum already but it might be good to separate the concepts so they're not conflated.

@Ekleog-NEAR
Copy link
Contributor

So now that I understand Driver & DriverMode a bit: I think the way this would make most sense would be as an alternative Driver, like FuzzByteSliceDriver; that’d always run in Direct mode because it would really make no sense to have it run in Forced mode.

However, doing so would mean that we should hoist quite a lot more functions into Driver, basically adding a function that generates a list. This way the current drivers could use the current behavior; while FuzzByteSliceDriver would instead use a "magic separator"-based implementation.

Does that make sense to you? If not, I guess a DriverMode::DirectFuzz could work too; but TBH I still don’t really feel that good about DriverMode already, and adding one that would not even really be on the same axis as the current Direct and Forced would feel bad 😅

@Ekleog-NEAR
Copy link
Contributor

An example API for that would be:

trait Driver {
    // ...
    type SequenceIterator<'a, F, T>: Iterator<Item = T>;
    type SequenceDriver: Driver;
    /// Note that SequenceIterator is supposed to panic if it is dropped without having been polled to completion first
    fn gen_sequence<'a, F, T>(&'a mut self, allowed_lens: Range, gen: F) -> SequenceIterator<'a, F, T>
    where
        F: FnMut(&mut Self::SequenceDriver) -> T;
}

WDYT?

@camshaft
Copy link
Owner

That seems pretty complicated to me... Why not have a function to indicate to the driver that we're starting/ending a list/item and the driver can decide what to do with that information.

@Ekleog
Copy link
Contributor Author

Ekleog commented Jan 30, 2023

Hmm… so the thing is, in order to support magic number-based collection generation, we'd need to make sure the driver is the one actually defining the length.

This being said, something like this would probably work:

pub struct SequenceToken;

trait Driver {
    // ...
    fn start_sequence(&mut self, allowed_lens: Range) -> Option<SequenceToken>;
    /// Returns true iff the sequence has not completed yet
    fn next_sequence_item(&mut self, id: &SequenceToken) -> Option<bool>;
}

Where the usage would be expected to be like:

let s = d.start_sequence(len_range)?;
while d.next_sequence_item(&s)? {
    gen_item(&mut d)?;
}

This would work too, but then the API could be misused by buggy generators much more easily, eg. by not handling nesting properly, or by parsing an element before calling next_sequence_item.

Also, it would probably actually make the implementation of each Driver more complex, as eg. the RngDriver would need to manually remember information like how many items remain in each sequence of the current sequence stack, without being able to rely on the call stack to handle that.

However, it would definitely be an alternative API to the one I suggested above! It's just for these reasons that it's not the one I suggested at first :)

@Ekleog
Copy link
Contributor Author

Ekleog commented Jan 30, 2023

Oh and I actually didn't think of that when writing the message above (which stays correct), but: even more importantly, the API above does not actually work.

Indeed, in order to have a magic-number-based collection generator, we need the FuzzByteSliceDriver (thereafter fuzzdriver) to be able to decide how much of the data each child is allowed to consume.

For instance, assume input like the following, with an item parser that needs two integers:

MAGIC_START 0 1 MAGIC_NEXT 2 MAGIC_END MAGIC_END

Then we want to have the fuzzer fail to produce the input, so that it learns that it needs two integers between MAGIC_NEXT and MAGIC_END. And if we use a next_sequence_item-based scheme, then maybe the item generator already consumed MAGIC_END, and then it will think that MAGIC_END is part of the item instead.

This is especially important when nesting collections, as then the magic number must be different, and a sub-collection could otherwise way too easily eat the "next item" from the super-collection.

Does that make sense to you?

(Also, well, for the person who writes a Generator the first API just seems nicer to me, and people will more often write Generators than Drivers, but I guess that's just a subjective thing)

@Ekleog
Copy link
Contributor Author

Ekleog commented Jan 30, 2023

Oh. I think I may have just figured out an API to rule them all? Though it'd still be quite complicated:

/// This type's Drop implementation should panic if get_item did not return None yet
trait CollectionDriver {
    type ItemDriver<'a>: Driver;
    /// Returns None once there are no longer any elements in the collection
    fn get_item<'a>(&'a mut self) -> Option<ItemDriver<'a>>;
}
    
trait Driver {
    // ...
    type CollectionDriver<'a>: CollectionDriver;
    /// Note that SequenceIterator is supposed to panic if it is dropped without having been polled to completion first
    fn start_collection<'a>(&'a mut self, allowed_lens: Range) -> Self::CollectionDriver<'a>;
}

This is close to my first idea above; but instead of using Iterator and a gen function that forced all items of the collection to be of the same type, we use a CollectionDriver trait that returns a sub-driver that the user can use however they want to generate their collection.

(I've also renamed sequence to collection because it matches "collection of distinct types" better I think)

With this scheme other ideas like making TypeGenerator actually use a Collection for everything it generates would become possible, though I'm not sure how worth it this would be (more structure… but not necessarily better fuzzing? I guess it depends on where there is variable-length stuff, basically anything variable-length should probably be in a collection… but then it could also make sense to skip MAGIC_NEXT when the child itself is fixed-length… well lots of things to think about here)

Oh and by the way we may also need the ItemDriver to not necessarily be the same as the parent driver, because of the need for escaping magic numbers so literally every output stays representable in the input.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants