It has been several months since my last update on this thread, but I am pleased to report some progress on what fuzzing infrastructure might look like in TVM. Namely, I’ve implemented some prototype scripts that demonstrate how this could work.
Test Oracles
A test oracle is a function responsible for determining whether the compiler behaves correctly on a given generated program. Since it is very difficult to automatically check that the compiler conforms to the entire language specification, test oracles typically test simpler conditions that can be checked quickly and easily for large numbers of generated programs. The file `oracle.py` gives the overall interface for test oracles, and `common_oracles.py` gives examples of some simple test oracles.
Oracles can take configuration through a JSON file and are given a directory for any output they wish to produce (ideally logs of any errors encountered or information that could help reproduce the results). An oracle tests the compiler’s behavior by running a check that makes assertions: it should either return normally on success or exit with an error of some kind on failure.
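For concreteness, here is a rough sketch of what such an interface could look like; the actual class and method names in `oracle.py` may differ.

```python
import json
import os


class TestOracle:
    """Base-class sketch for test oracles: each oracle checks one property
    of one generated module (illustrative, not the actual oracle.py code)."""

    def __init__(self, config_path=None, output_dir=None):
        # Optional JSON file with oracle-specific parameters.
        self.config = {}
        if config_path is not None and os.path.exists(config_path):
            with open(config_path) as f:
                self.config = json.load(f)
        # Directory where the oracle may write logs or reproduction info.
        self.output_dir = output_dir

    def check(self, module):
        """Run the oracle on a generated IRModule: return normally on
        success, raise (or otherwise exit with an error) on failure."""
        raise NotImplementedError
```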
Here are the example oracles I’ve implemented:
- Type checking: It ensures that `relay.transform.InferType()` passes for the given program (this should always work for the fuzzer I’ve implemented, though in principle we may eventually want to generate programs that we know fail type checking in order to test the error cases).
- Parser round-tripping: Given a program, it tests that pretty-printing it and then parsing it again returns the same program (see the sketch after this list).
- Default build: It attempts to compile the program using `relay.create_executor` and simply checks that there is no crash at compile time.
- Differential equality: Given a program, the oracle compiles it using two different executors and checks that the results are the same between the two executors. We constrain these programs to take only tensor arguments and return a tensor value, since it is easy to generate random tensor values (in principle, we could extend this to handle all Relay types, though).
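As an illustration, a parser round-tripping check could be as simple as the following sketch. The exact parser entry point and the handling of metadata vary across TVM versions, so treat this as an approximation rather than the actual code in `common_oracles.py`.

```python
import tvm


def check_parser_round_trip(module):
    # Pretty-print the module (metadata/constant handling is glossed over here)...
    text = module.astext()
    # ...parse the printed text back...
    reparsed = tvm.parser.fromtext(text)
    # ...and require that the round trip yields a structurally identical module.
    assert tvm.ir.structural_equal(module, reparsed), \
        "module changed after pretty-printing and re-parsing"
```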
Generating Programs at Scale
The script `generate_programs.py` calls a program generator and generates a specified number of programs, invoking the program generators in parallel to maximize throughput. There is a simple mechanism for dispatching program generators by name and setting generation parameters through a JSON file.
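Such a driver could look roughly like the sketch below; the registry name `GENERATORS` and the helper functions are illustrative, not the actual identifiers in the script.

```python
import multiprocessing
import os
import pickle

# Registry mapping generator names to generator functions that each return
# a tvm.IRModule; populated elsewhere (illustrative placeholder).
GENERATORS = {}


def _generate_one(args):
    gen_name, params, out_dir, idx = args
    module = GENERATORS[gen_name](**params)
    # Save the generated module without going through the text format.
    with open(os.path.join(out_dir, f"prog_{idx}.pkl"), "wb") as f:
        pickle.dump(module, f)


def generate_programs(gen_name, params, out_dir, count, jobs=None):
    os.makedirs(out_dir, exist_ok=True)
    work = [(gen_name, params, out_dir, i) for i in range(count)]
    # Run the generators in parallel to maximize throughput.
    with multiprocessing.Pool(jobs) as pool:
        pool.map(_generate_one, work)
```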
Note that generated programs are saved as Pickle files. This is done to avoid having the testing infrastructure depend on the parser or text format, since these are among the things we may want to test (cf. the parser round-tripping oracle). I realize that Pickle files are not a great choice in general because of the security issues they can pose (executing arbitrary Python code), so I would welcome alternative suggestions, but I do think that whatever mechanism we use to save generated modules should be independent of the parser and text format. It would not be difficult in principle to serialize ASTs as JSON or another such format if that is desirable.
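For example, saving and loading modules could stay parser-independent either via Pickle or via TVM’s JSON-based node serialization; the sketch below shows both, assuming `tvm.ir.save_json`/`tvm.ir.load_json` behave as in recent TVM releases.

```python
import pickle

import tvm


def save_module_pickle(module, path):
    with open(path, "wb") as f:
        pickle.dump(module, f)


def load_module_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def save_module_json(module, path):
    # JSON serialization of the AST itself; does not use the Relay text format.
    with open(path, "w") as f:
        f.write(tvm.ir.save_json(module))


def load_module_json(path):
    with open(path) as f:
        return tvm.ir.load_json(f.read())
```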
Running Tests at Scale
The script `apply_oracle.py`, similarly to `generate_programs.py`, applies test oracles to the saved (pickled) generated programs in parallel, producing a list of all tests that end in failure. The script dispatches oracles by name and can also pass parameters to the oracles via JSON files. As discussed above, oracles may produce auxiliary files to aid in debugging; this script ensures each oracle has a directory in which to deposit its output files. Additionally, the script logs the standard output and standard error of each oracle run and places them in the appropriate directory; it also separately logs the stack trace in the case of an error, which we may use for categorizing errors (see below).
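A single oracle application could be isolated in a subprocess along the following lines; `run_oracle.py` is a hypothetical per-program entry point, and a function like this would be mapped over the pickled programs with a process pool, much as in the generation sketch above.

```python
import os
import subprocess
import sys


def apply_oracle(oracle_name, program_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    # Run the oracle in its own process so crashes and nonzero exits are
    # isolated and per-run stdout/stderr can be captured.
    result = subprocess.run(
        [sys.executable, "run_oracle.py", oracle_name, program_path, out_dir],
        capture_output=True,
        text=True,
    )
    with open(os.path.join(out_dir, "stdout.txt"), "w") as f:
        f.write(result.stdout)
    # On failure, stderr will typically contain the Python stack trace,
    # which is kept separately for later categorization.
    with open(os.path.join(out_dir, "stderr.txt"), "w") as f:
        f.write(result.stderr)
    return result.returncode == 0  # True if the oracle passed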
Next Steps
Given the relative ease with which these scripts can generate random programs and apply oracles to them, the next hurdle will be analyzing the results: It is easy to mark which test cases fail, but there can be multiple failures all corresponding to the same root cause. It might therefore be helpful to group failing tests based on whether the errors look similar (which could be accomplished by grouping based on the stack traces).
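As a first cut, grouping could bucket stack traces by their innermost frames (ignoring directories and line numbers), so that failures that die at the same place in the compiler land in the same group; the sketch below is one possible heuristic, not part of the current prototype.

```python
import os
import re
from collections import defaultdict


def trace_signature(trace_text, num_frames=3):
    # Extract (file, function) pairs from 'File "...", line N, in func' lines,
    # drop directories and line numbers, and keep the innermost frames.
    frames = re.findall(r'File "([^"]+)", line \d+, in (\S+)', trace_text)
    return tuple((os.path.basename(path), func) for path, func in frames[-num_frames:])


def group_failures(trace_paths):
    # Map each signature to the list of failing runs that produced it.
    groups = defaultdict(list)
    for path in trace_paths:
        with open(path) as f:
            groups[trace_signature(f.read())].append(path)
    return groups
```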
I welcome feedback on the infrastructure prototypes presented here and further ideas regarding infrastructure for analysis. If this post is not sufficiently detailed as to how the infrastructure works, I would be happy to elaborate further.
As it happens, I’ve gotten some failure messages from the small-scale fuzzer test runs with some of these oracles, so I will have to investigate the reasons (I suspect it may have to do with various Relay executors not supporting all of the language’s features).
Broader discussion topics, some of which were also raised in my original post:
- As a community, how would this fuzzing infrastructure be used? There would have to be people in charge of running and maintaining the infrastructure as well as investigating the resulting errors (they do not have to be the same people in all cases). We could look at how the LLVM community handles fuzzing, though I am not familiar with it in detail. (Perhaps @Flandini might be able to offer some suggestions based on his experiences?)
- Where should the fuzzing infrastructure be implemented? There was a pre-RFC on a tools directory, but there have not been updates recently.
- We may encounter fairly trivial errors simply from targeting executors and passes with generated programs they do not expect. This may nevertheless be useful, since it would encourage documenting these requirements, but it would also require the maintainers of the fuzzer to encode whatever constraints are needed. Does this seem reasonable? Are there examples of constraints that might be tricky to encode in fuzzers? (Or any examples of documented requirements for comparison now?) Perhaps the otherwise little-used feature flags may be useful for this purpose.