Idea: Using machine learning to suggest new inputs #1495
Comments
That idea is already many years old, and so far, to my knowledge, not a single paper or implementation has managed to produce something useful enough to justify the ML overhead.
Thanks for the reply. It's not too surprising that it hasn't been successful. It seems that for applications where one leaves a computer in the closet to fuzz indefinitely, the overhead hardly matters. Once the fuzzer is just spinning its wheels without making progress, anything that might find a novel input is useful.
@vanhauser-thc that is correct (to the best of my knowledge). @benjaminy obviously feel free to experiment with that. There are a couple of approaches that were tried in the past that sound very similar to what you describe, for example: https://arxiv.org/pdf/1807.05620.pdf and https://arxiv.org/pdf/1701.07232.pdf. Again, to the best of my knowledge, those approaches weren't easily reproduced by independent groups. I believe the fundamental problem with most ML-based papers is that when you are training on the existing corpus, you are training on all the things that the fuzzer already knows about (and doesn't care about anymore). If you are training on the current corpus, the ML won't be able to learn the unknown unknowns needed to find more coverage. Fundamentally I believe you'd need something like GPT-3, trained on millions of different file formats, to ensure the AI is able to find common patterns in man-made file formats and guess what the not-yet-discovered parts would look like.
Thanks for the links @vanhauser-thc. I don't anticipate having the time to experiment any time soon, but 🤞 Training on all the world's programs/data formats sounds promising in principle, but... uh... ambitious. I agree with your point that a network trained on observed executions of a program will tend to produce more inputs that drive the program along the same paths. Here's a sketch of how I think one might be able to break out of that.
I don't think a reasonably sized, reasonably fast ML model could help penetrate deeper into more edges/paths. AFL does a good job at this, because it has assembly-level instrumentation that gives it feedback. Just my 50 cents, I may be wrong :D
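For contrast, the feedback loop being referred to can be sketched in a few lines. This is a toy illustration, not how AFL is implemented: the `target` function and its string-matching "coverage" are invented for the example, standing in for real edge coverage gathered from instrumented binaries.

```python
import random

# Toy target: the fuzzer must "discover" a magic prefix byte by byte.
# Each correctly matched byte counts as one newly covered "edge".
def target(data: bytes) -> set:
    covered = set()
    magic = b"FUZZ"
    for i, b in enumerate(data[:len(magic)]):
        if b != magic[i]:
            break
        covered.add(i)
    return covered

def mutate(data: bytes) -> bytes:
    # Flip one random byte to a random value.
    out = bytearray(data)
    out[random.randrange(len(out))] = random.randrange(256)
    return bytes(out)

def fuzz(seed: bytes, iterations: int = 50000):
    random.seed(0)
    corpus = [seed]
    global_cov = target(seed)
    for _ in range(iterations):
        candidate = mutate(random.choice(corpus))
        cov = target(candidate)
        if cov - global_cov:      # new coverage -> keep this input
            global_cov |= cov
            corpus.append(candidate)
    return corpus, global_cov

corpus, cov = fuzz(b"AAAA")
print(len(corpus), len(cov))
```

The key line is the `cov - global_cov` check: coverage feedback is what turns blind mutation into a guided search, which is exactly the signal AFL's instrumentation supplies cheaply.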
This is just the very beginning of an idea, but your "ideas" page says you're interested in "machine learning something something", so I figured I'd toss it out there. This idea starts from the general issue that coverage-guided fuzzers don't "understand" program logic in a very deep way, and so can get stuck in a rut after many iterations.
Every iteration of a fuzzer starts with some input and results in some projection of the program's behavior (coverage, traces, etc).
The idea is to take all those input-behavior pairs, flip them around, and train an ML model to predict program inputs from a description of the program's behavior.
Given such a model, you can then throw in different program behavior descriptions and see what program input comes out. You could imagine starting with existing program behavior descriptions and tweaking them a little bit ("Like this run, but take that branch"). Or throw in a lot of randomness just to see what happens.
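A minimal sketch of that flipped pipeline, under heavy simplifying assumptions: the `behavior` function, its named "edges", and the `InverseModel` class are all hypothetical, and a nearest-neighbour lookup over the corpus stands in for a trained ML model.

```python
def behavior(data: bytes) -> frozenset:
    # Toy "behavior description": which checks in a fake parser passed.
    edges = set()
    if data[:1] == b"#":
        edges.add("comment")
    if b"=" in data:
        edges.add("assignment")
    if data.endswith(b"\n"):
        edges.add("newline")
    return frozenset(edges)

class InverseModel:
    """Predict an input from a desired behavior description.

    Stand-in for a trained model: returns the corpus input whose
    observed behavior overlaps most with the requested behavior.
    """
    def __init__(self):
        self.pairs = []  # (behavior, input) training pairs, flipped

    def train(self, corpus):
        self.pairs = [(behavior(x), x) for x in corpus]

    def predict(self, wanted: frozenset) -> bytes:
        return max(self.pairs, key=lambda p: len(p[0] & wanted))[1]

corpus = [b"# note\n", b"x=1", b"y=2\n"]
model = InverseModel()
model.train(corpus)

# "Like this run, but take that branch": start from an observed
# behavior and add an edge not yet combined with it.
tweaked = behavior(b"x=1") | {"newline"}
suggestion = model.predict(tweaked)
print(suggestion)  # -> b"y=2\n"
```

A real version would replace the lookup with a generative model so it can emit inputs outside the corpus; the interesting question is whether the queried behavior descriptions can be tweaked far enough from observed ones to escape the known paths.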
I expect the training process would be quite expensive, so I'm imagining this as a completely separate "background" process that just pops up every once in a while to suggest a different input. I expect a model like the one I'm suggesting would be very noisy, but if it managed to "understand" something nontrivial about the structure of the program, it might be useful.