[RFC] Type-Directed Relay Fuzzing Library

slyubomirsky · March 4, 2022, 4:22am

RFC: Type-Directed Relay Fuzzing Library

Summary

This RFC proposes to employ fuzzing (mass generation of random programs) for Relay in order to test the compiler. Fuzz testing for Relay can expose bugs and improve confidence in the compiler by generating vastly more test cases than can be written by hand. In particular, the RFC proposes libraries for making it easy to implement Relay fuzzers (different program generators can implement specialized generation policies, useful for testing particular passes) and infrastructure for running Relay fuzzers at scale in order to identify possible compiler bugs.

Motivation

Fuzzing is a common technique for exposing compiler bugs, namely by generating valid (or invalid) programs and ensuring the compiler produces correct output (defined by a language specification or a test oracle like “the compiler should never crash, even with invalid output” or “the same program should yield the same results when compiled with different standards-compliant compilers”). See this paper for a discussion of the classic CSmith fuzzer (which has found many bugs in C compilers); another well-known program fuzzer is jsfunfuzz.

As I described in my TVMCon talk on this subject (covering work done with my UW colleagues, Edward Misback and Michael Flanders), fuzzing is a promising means for us to test our compiler passes at scale – this RFC concerns fuzzing for Relay specifically, but the community would also do well to consider fuzzing TIR this way (though it would have different considerations; note that TVMFuzz and Tzer have targeted TIR). The large-scale generation of random tests will increase our confidence in our compiler implementation (passes, as well as the parser).

Overview

The main challenge in fuzzing is ensuring that the outputs will be valid programs (or otherwise ensuring that they will only be invalid in clearly defined ways). In the case of Relay, the type system provides rules for creating correct programs–hence my calling this a “type-directed fuzzer”–but the type relations on operators mean that a fuzzer would have to reason about tensor shapes. This RFC discusses different approaches for dealing with shapes in the fuzzer. Incidentally, since this fuzzer has to reason about typing rules, it has the secondary benefit of serving as an executable specification for some of the language rules.

For my TVMCon talk, I implemented a prototype, which is available here. I have since simplified the approach for dealing with tensor shapes and refined the interfaces in this prototype, which I will discuss in detail below.

In this RFC, I propose one way we can implement a fuzzer for Relay and infrastructure to support it. I have recently learned of Tzer, another work that has fuzzed Relay (and have greatly appreciated it), so I would also like to emphasize that this proposal is for one way of implementing fuzzing for Relay, ideally in a manner that will be extensible and allow for incorporating further techniques, rather than a suggestion that this is the only way we should approach fuzzing.

Goals

The additions proposed in this RFC serve two purposes:

To provide reusable libraries for generating Relay programs that type check, to be used for testing specific passes or behavior.
To provide a framework for fuzz-testing Relay at scale, implemented using the above libraries. The fuzzer should be capable of generating code that uses most of the language’s features, which can be used for testing general-purpose passes in Relay. The framework should allow for bugs found when compiling generated programs to be saved and categorized.

I emphasize that these are fuzzing libraries: It may be of interest to generate programs with a specific structure (for example, providing a lot of fusable operators), so these libraries should make it easier to create new generators that fulfill specific needs for testing passes. (Hence my comment on this proposal not coming to the exclusion of other fuzzing techniques.)

Fuzzer Implementation

As a proof of concept, I have written a prototype implementation of the general-purpose fuzzer and fuzzing libraries in 1744 lines of Python, which includes handling for 7 type relations and 21 operators. This is a more modular reimplementation of the prototype presented in my TVMCon talk, which I used to find a bug in the match completeness checker and a parser roundtripping bug (refs of refs would not parse back, though this has not yet been fixed).

Core Approach

The main approach for type-directed fuzzing is to build up programs recursively by following the typing rules “in reverse”:

Given a target type, choose an expression AST node that can yield a value of that type.
Use the target type to determine the types of subexpressions and generate those recursively.

Note that when generating an expression, we must keep track of which variables are in scope and with what type (let us call this the “environment,” a mapping of variables to types). This is because let expressions and function definitions introduce new variables into the scope, which dictates when it is valid for variables to appear.

Base cases for the recursion:

If there is a variable in scope of the appropriate type, we can simply return the variable.
We can return a literal of the appropriate type. In Relay, here are the literals for different types:
- Tensor types: A tensor constant
- Tuple types: A tuple expression
- Reference types: A ref expression
- Function types: A function expression
- ADTs (type calls): A call to a constructor of that ADT

For an example of a recursive case, let us consider the typing rule for let expressions (and for simplicity, let us assume we are not defining a recursive function). A let expression of the form let v1 = e1; e2 has the type T2 in the environment Γ if:

e1 has some type T1 under the environment Γ,
e2 has the type T2 under the environment Γ ∪ {v1 ↦ T1}

This means if we start with a type T2 and choose to produce a let expression, we would have to choose a type T1, generate an expression e1 of type T1, add v1 to the scope with type T1, and finally generate an expression e2 of type T2 using the updated scope.

We can ensure termination by forcing the expression generation to choose a base case after a certain point, e.g., by limiting AST depth. In the prototype, there is a “fuel” parameter that is decremented upon recursive calls and forces base cases once the fuel “runs out” (not a very sophisticated approach).

A note on graph bindings: Graph bindings in Relay can be treated like variable bindings. An expression can be reused for a graph binding as long as it contains only variables that are in scope and does not redefine existing variables (redefining an existing variable would fail Relay’s wellformedness check; we could replace the redefined variable with a fresh one, but then this would no longer be a graph binding). The prototype simply keeps all generated subexpressions in scope (grouped by type), except those that introduce new variables (lets, function definitions, and anything containing a let or function definition), and chooses between them when inserting a graph binding.

Handling Operators

The type checker references type relations to check operator calls. These relations are functions that take the call’s input types and output type (which may be underspecified TypeVars) and check that concrete types are correct or unify TypeVars with concrete types, ultimately determining whether the operator accepts those types. The bulk of the logic tends to deal with tensor shapes. See the type relation for bias addition for a simple (and fairly typical) example.

Generating operator calls differs from ordinary type checking in Relay in that the algorithm described in the previous section starts from the final return type: we know the output type of the operator call, but not the input types and we have not yet generated the input expressions either. This is less information than the type checker has. Yet, if we generate input expressions to the call without type checking them, it is possible that the types will be invalid and will mean that we wasted time generating the input expressions. Hence, it is important for us to solve the type relations by going backwards: given an output type, find input types that satisfy the relation. The main issue is that the relations are implemented in C++, so there is no convenient automatic way to derive solutions to the relations.

In my TVMCon talk, I discuss multiple approaches to generating operator calls:

Formalize type relations in a solver domain (e.g., constraint logic programming, integer linear programming) and solve for valid types – most of relations are only checking constraints on tensor shapes, so this would be feasible, but entails manual effort and requires understanding all of the type relations. Additionally, it introduces a dependency on the solver.
Use brute force to solve type relations: This approach does not require manual reimplementation of the type relations (in fact, the type relation can be reused as-is for checking the solutions). However, brute force seems wasteful and will not scale well to larger tensor shapes.
An entirely different approach: Instead of attempting to “solve” type relations in the normal recursive algorithm described above, instead we can “sample” solutions. In particular, we could randomly choose valid argument types for a type relation and then use the type relation implementation to solve for the result type, giving us a solution to the type relation and a solution. We can keep these sampled solutions in a cache. If we encounter the return type to a solution in the cache, we can reference the solution and use it to generate argument expressions for the operator call. This approach relies on the assumption that it’s easier to randomly generate valid arguments to an operator than to work backwards from the solution to find argument types. However, this approach is limited in generality unless there is a very large cache of solutions (otherwise, we will not be able to match the types very often). Notably, however, this approach does not require reformulating type relations and also does not entail a brute force search.

I was particularly pleased with the sampling approach when I implemented the first prototype, since it did not require a solver or independent reformulation of type relations, but while working on sampling for the more recent version, I realized that it did still entail manual work: It was necessary to define how to sample valid argument types. This, in turn, led me to realize that most of my implementations of the sampling procedure consisted of generating a return type and, well, manually working out how to set up the arguments from that – I had written manual solvers without realizing it!

Hence, the approach I followed in the most recent prototype was to simply write out logic for solving type relations. At least for the operators examined, this proved to be rather simple (simpler than the type relations themselves, since the solver only needs to give a solution with no constraints). These solvers are located in registered_ops.py in the prototype and are generally quite simple. (Note that the solver for conv2d does not handle all potential parameters to it.) Notice that the same file also includes sampling methods, which simply generate a return type and then call the solver.

The following components are used for generating operator calls:

A recognizer: This is a simple function that determines if a given type T is a possible return type for an operator call (i.e., is this an opportunity for an operator call?). These are generally very simple, checking only if a tensor is of a certain rank.
A solver: A function that takes a return type T and returns argument types and attributes that result in a return type of T for an operator.
A sampler: A function that returns a set of valid argument types, attributes, and return type for an operator.

Solvers do not need to be perfect in order to be useful: Even if a solver is not able to find all possible solutions, as long as it always returns correct solutions, the generated programs will type check. In particular, the following invariants must hold:

For a type T, if recognizer(T) returns true, solver(T) must find a valid solution.
If sampler() yields a return type of T, recognizer(T) must be true.

In principle, the sampler can be omitted, but I have included in the prototype in case the community favors this approach (an advantage, as mentioned above, is that solutions can be cached, perhaps in a database, and not require any computation at all, but the solvers presented in this prototype do not seem expensive).

The recognizer could also, in principle, be omitted if the solvers are specified to fail cleanly when given a return type they do not support. However, this proposal keeps solving and recognition of possible operator calls separate. Many operators that do not have the same type relation do share a recognizer in this implementation, meaning that the recognizer could be run only once and determine that several groups of operators are applicable at once.

Operator Registry

There are dozens of operators in TVM and the fuzzer would have to be able to reason about all of them. The prototype addresses this by having a registry of recognizers, solvers, and samplers, referenced by operator. The registry would permit operators to share implementations (this is most useful for recognizers – many operators are capable of returning a tensor of arbitrary shape, so the fuzzer can call one recognizer and know that any of these operators is permitted) and it would also allow for there to be multiple possible choices of implementation (in case it is desirable to have solvers that only return specific kinds of solution or recognizers that will restrict the tensor shapes accepted), with the convention that all operators are expected to have a “default” implementation listed. The registries are defined in op_properties.py. These registries should allow for flexibility and easy reuse. The implementation of recognizers, solvers, and samplers would only need to change if the underlying relation changes.

Expression, Type, and Module Generators

The rest of the prototype is structured in terms of separating out different kinds of generators: There are expression generators, type generators, and a module generator, the last of which is the front-end interface for generating programs. Included in the prototype are default implementations of the expression generator and the type generator, which are extended in the general-purpose fuzzer (which is meant to serve as an example for how to write other generators). These default implementations include some of the basic logic of the type system, allowing extensions to focus on generation policy. Each generator is essentially a visitor “in reverse”: rather than recursing down an AST, as a visitor does, it instead generates new AST nodes, choosing between which node to insert based on a policy (discussed below).

To illustrate the structure of generators (more concisely than in the prototype), here is pseudocode:

class ExprGenerator:
  def generate_expr(self, ty):
    if time_to_generate_literal():
       return self.generate_literal(ty)
    return self.generate_connective(ty)

  def generate_literal(self, ty):
    if isinstance(ty, relay.TensorType):
      return self.generate_tensor_literal(ty)
    if isinstance(ty, relay.TupleType):
      return self.generate_tuple_literal(ty)
    # etc

  def generate_connective(self, ty):
    if time_to_generate_let():
      return self.generate_let_expr(ty)
    if time_to_generate_if():
      return self.generate_if_expr(ty)
    # etc
    
  # implement generators for each case as well

Implementing Specific Generation Policies

The included general-purpose fuzzer includes various parameters to specify generation probabilities and allows them to be customized (see the test cases for examples of customization). However, while this parameterization can be useful (and can be taken beyond what the prototype exhibits), it would become unwieldy to have only a single program generator that is constantly extended with additional generation policies that are selected by parameterization. Instead, the proposal is to implement different generation policies by creating new generators that reuse components, which is why this proposal is phrased in terms of a fuzzing library, rather than having a single fuzzer per se.

For example, suppose it would be useful to test a compiler pass by inserting certain premade blocks into generated programs. This can be accomplished by extending the existing expression generator with a case for inserting the blocks (when the type is appropriate), with some probability, in the generate_expr() function. The prototype does not include such a generator, but it may be useful for testing specific passes, so I put the question to the community as to whether this kind of extensibility is necessary for fuzzing – my position is that it should be easy to create new, specialized program generators because it is difficult to predict what sorts of programs will be needed to test compiler passes.

Fuzzing Infrastructure

Given fuzzer implementations, it is necessary to generate programs at scale to identify possible compiler bugs, which requires further infrastructure to perform the generation of programs, run the tests, and analyze the results.

Program Generation

It is not actually necessary to generate programs particularly often. If existing type relations and language features do not change (and they do not change often), then the programs already generated will continue to be valid test cases. However, new operators may be added more often than that, so it would be desirable to create new test cases for those when the need arises. However, the essence is that program generation for fuzz testing only needs to be done occasionally (the fuzzer can be run in parallel for a defined period and the generated programs can be saved) and then these programs can be reused for tests. In particular, the generation should be offline and not included in the CI – this spares us from flaky tests (though the random generation seeds can be fixed to avoid that too) and limits the amount of computation necessary.

Running Tests

There are different notions of “correctness” that could make sense for generated programs. The simplest test oracle would be ensuring that the generated program compiles without errors or crashes. Other oracles of interest might be testing for parser round-tripping (this was one used in my TVMCon talk) and even comparing execution results between different implementations (comparing the VM and graph runtime, for example, or comparing against the reference interpreter I wrote some time ago for the purpose of fuzz testing). Any cases that fail can be saved to be analyzed. So long as the correctness properties can be easily checked automatically, they will be useful for fuzz testing.

When to Run

Program generation and execution at scale will be computationally expensive and will require resources that would, presumably, otherwise be used for CI. Even if a single large set of generated programs is generated and saved, running all the tests and all the oracles would still be expensive. Hence, fuzz testing should not be run particularly frequently (such as in CI); it would make sense to run the large-scale tests periodically and offline (perhaps after a certain number of merges). Failing test cases should be studied and those that expose important bugs should be added to the CI (though the CI itself should not contain random generation to avoid flakiness).

However, more specialized program generators that are intended for testing and prototyping specific features during development (hence the emphasis on extensibility) should be easily accessible to developers to run at small scales in order to test their features as they work on them.

Processing Errors

If a generated test case results in an error, this will have to be studied, but there is the possible issue of “fuzzer fatigue”: Generating programs at large scale can lead to many errors and if the generation is simple and random, there may be many programs triggering the same error (so it would not be very illuminating to manually examine all of them). Many techniques for addressing this problem have been proposed (they are generally called “fuzzer taming”); it seems that a common heuristic is comparing the tops and bottoms of stack traces and grouping errors based on those, allowing developers to focus their efforts on specific categories. (This would also incentivize having more informative error messages.) A system like Elastic Recheck may be useful for this kind of logging and searching.

It may also be the case that some errors are not worth addressing, either due to limited resources or simply because the errors are too unlikely to arise. If this is determined, it may be necessary to filter out some test failures based on the stack traces or, if there are some AST features that can be identified as the cause for the error, filter the programs out at the time of generation (or even modify the generator not to produce them).

In general, the generation of programs, running of tests, and analysis of errors are tasks that should be easily parallelizable and can be allocated specific computation budgets so as not to compete excessively with CI.

Assessing the Utility of Fuzzing

A common way of determining the efficacy of a fuzzing is tracking code coverage, so we may want to measure the code coverage of the tests (i.e., code coverage within TVM). Reasoning about coverage over the FFI boundary may be challenging, but it is certainly possible and has been done in TVMFuzz and Tzer. This will allow for assessing if the generation algorithm should be adjusted to trigger more paths in the compiler (especially in passes). Additionally, tracking coverage would make it feasible to use “white-box” fuzzing techniques (which guide the generation algorithm using coverage information or further analysis about how to trigger branches in the compiler), as in Tzer.

Possible Drawbacks

There would be a lot of implementation effort involved in creating and maintaining the fuzzer and the infrastructure, as well as in analyzing the results of fuzz tests. In addition to the human effort, there would also be computational resources spent on generating and running test cases. Furthermore, developers would likely have to specify how compiler passes should be tested using the fuzzer, requiring community effort and buy-in to make use of the fuzzer (yet another demand on developers’ attention). Whether these expenditures are worthwhile depends on the quantity and nature of the bugs found, which depends also on the ability of the fuzzers we write to probe the various code paths in the compiler.

(Arguably finding no bugs this way would increase confidence in the compiler – recent versions of LLVM and GCC often turn up no bugs during long fuzzing campaigns.)

Rationale and Alternatives

Without fuzz testing, we are entirely reliant on human review of code, manually crafted test cases, and user bug reports (or their absence) for ensuring the reliability of the compiler. Generating random test cases is likelier to probe cases that are tedious to construct by hand or were not anticipated by the developers.

As discussed in the TVMCon talk, the alternative of using a purely grammar-based fuzzer (much less effort to implement) is not very workable; my collaborators and I attempted to use it and less than 0.1% of the generated programs even type checked, and those that did tended to be trivial cases (incomplete match expressions with zero branches). It seems inevitable that fuzzing Relay programs will require some kind of reasoning about tensor shapes and to the best of my knowledge, I cannot imagine a simpler approach.

Unresolved Questions

Language Features Not Addressed in the Prototype

The prototype is able to generate programs incorporating many Relay features, but with some limitations:

The generated modules include only a new main function (as well as the content from the prelude).
The only polymorphic functions used are those from the prelude (map, fold, etc.); new ones are not generated.
Only the default ADTs (list, option, tree) from the prelude are used; new ones are not generated.
All tensors must have static shapes.
All variables have types annotated.

Of these limitations, (1) would be the easiest to address and would permit mutual recursion. (2) may be possible to address in principle, but would have some complexity (we would have to be careful with type variables and would need to keep track of whether we have a value of those types in scope, since we cannot arbitrarily create literals for type variables). (3) could also potentially be addressed, though it would be necessary to check whether it is possible to create instances of an arbitrary ADT (i.e., we must check that it has a base case). (4) and (5) would be more difficult to address. Dynamic shapes may be possible to address by starting with static shapes and then “relaxing” the solution by replacing some dimensions with Any (I am not sure if this is guaranteed to be correct). Removing annotations would require reasoning about type inference in Relay, which is not clearly specified; this could perhaps be addressed in a conservative manner (identifying specific situations where the type can clearly be inferred and removing annotations only in those cases, which would be useful even if some annotations left are not strictly necessary). Testing type inference would be a worthy aim, since it is a source of bugs.

Design Questions

I welcome all discussion of different aspects of this RFC. Below, I highlight some topics for which I would be particularly curious to have the community’s feedback.

There are tradeoffs related to the language used for implementing the fuzzer. The prototype is written in Python, but speed will be a priority for running tests at scale, so it may be more useful to implement the fuzzer in C++. On the other hand, it is easier to experiment in Python, which might make it easier to write program generators for use in testing a new feature.
There are various configuration options in the general-purpose fuzzer in the prototype, but there are more options I could imagine adding and being useful (e.g., a list of operators to include, more ways to customize generation chances). What are the best options to include for the general-purpose fuzzer and where should we draw the line between adding more configuration into a single general-purpose fuzzer and writing a new generator? My view is that new generators should be used to define new policies altogether.
Fuzzing infrastructure poses the most questions: Should it be defined within TVM’s codebase, or live separately? Where should the scripts for invoking it be defined as well?
What test oracles would be a high priority to include for fuzz testing? Are there pass-specific oracles we may want? (E.g., ensure that a pass accepts all inputs with a given property/correctly rejects inputs that fail the check.)
How often should fuzz tests be run? Where should results be stored? Who would be responsible for analyzing the results? What summary statistics would make sense?
I would be interested in technical advice on tracking code coverage, especially across the C++ and Python boundary, if it is desirable (I think it is).
Does sampling solutions make sense when the solvers are simple to use?
Where in the codebase should solver and recognizer definitions be located? Should they be near the type relation definitions or in a different part of the codebase altogether? (In the latter case, I think they should be grouped the same way the operators are grouped.)
The handling of graph bindings is an addition in the latest prototype that was not present in my earlier one, so I would appreciate particular scrutiny towards their handling and whether it is being done correctly. I could not find a good written specification of how graph bindings are intended to work, so I took a guess. (See these lines in the prototype. Note that variables and constants are not eligible, since they are atomic expressions–if I am mistaken that these do not count as graph bindings, please correct me.)

Future Possibilities

An important additional tool for fuzz testing TVM would be a program minimizer (CReduce is a famous example for C and C++), which takes a failing test case and removes parts of it until it finds the smallest program that yields the same error. This would make error cases easier to analyze and also be useful for analyzing bug reports. It may be the case that a type-directed approach like that used in the prototype program generator would be useful for minimization: Try replacing each AST subtree with a smaller one of the same type.

A program mutator (which simply replaces one subtree with another) may also be useful for creating test cases: A developer could provide a manually specified test case (like an existing model) and apply the mutator to create new test cases by swapping out parts of the program, perhaps including more language features in it. This could be implemented by choosing an AST subtree and applying the existing fuzzer to create a new expression of the same type, though reasoning about changes that change the type may be more difficult in this approach.

It will likely also be valuable to generate programs that are known to be incorrect, for ensuring that the compiler will reject those programs and not crash. The infrastructure presented here can likely be reused for generating known-incorrect programs for testing error cases; this can be done with mutation by replacing a subtree of one type with a value of an incorrect type, though it will be necessary to take care not to replace it with another value that may, coincidentally, also type check. This may be a feasible further line of investigation if the need for it proves significant.

Finally, many members of the community are participating in an exciting effort to design Relax, a successor to Relay. One of Relax’s design aims is focusing on support for symbolic shapes. It may be worthwhile to consider if any insights from this fuzzing work may be applicable to Relax, since fuzzing can be useful at early stages of development. The additional complexity of symbolic shapes may be a compelling argument for the use of CLP or ILP solvers.

Flandini · March 10, 2022, 11:43pm

I have some additional comments though I don’t work on TVM/Relay; I just helped Steven out with some parts of the fuzzer.

I’m in favor of just keeping the fuzzer (the main driving part of the fuzzer) in the top level repo since it is pretty small, but I don’t think it’d be much of a difference keeping it in another repo. As to where to storge component-specific fuzzing parts like oracles or pass-specific fuzzing harnesses, I think the two best options are:

Keep these things next to the code. For example, for transformation-specific fuzzing wrappers, just put these wrappers next to their transforms in the src/relay/transforms.
Keep these things in some tooling or test directory. For example, those same transformation-specific fuzzing wrappers would instead go in tests/python/relay (or some subdir there).

I’m not super familiar with Python, but I think that with the way Python modules work and the fuzzer currently written in Python, maybe the second option would be easier (?).

Probably the most important part of the infrastructure to figure out is how this is going to run. Since it sounds like the fuzzer will run quite frequently, the runs should probably be automated by some CI-fuzzing tool that has statistic and bug reporting abilities. I don’t have much experience with CI-based fuzzing toolchains, but I would like to throw ClusterFuzzLite (GitHub - google/clusterfuzzlite: ClusterFuzzLite - Simple continuous fuzzing that runs in CI.) into the mix for consideration. It is a simplified version of Google’s ClusterFuzz, runs in CI, has the reporting facilities, and can handle mixed Python/C++ projects. I found this a ujson fuzzer for ClusterFuzzLite: oss-fuzz/projects/ujson/hypothesis_structured_fuzzer.py at master · google/oss-fuzz · GitHub which mixes Python and C; this could be used as a starting point or template to getting the Relay fuzzer up and running.

For an earlier evaluation of the fuzzer, we used llvm’s SourceBasedCodeCoverage and SanitizerCoverage to instrument and track code coverage of the C++ parts of Relay. This injects code coverage tracking code at compile time, so we couldn’t get full coverage data because of the Python/C++ split in the fuzzer, but unless others have better ways of getting coverage data, I’m guessing this will be the way it gets coverage to start.

areusch · March 16, 2022, 4:02pm

We discussed this at the TVM Community Meeting this morning. A couple of notes:

What is the best reference paper for Relay? Where was the algebra for the type relations used here sourced from?

Steven: The Relay paper is still the most comprehensive reference. The language has evolved a bit since its introduction, but the type system hasn’t changed.
Have you considered how this may apply to Relax?

Steven: This approach might prove challenging to use with symbolic shapes. It might be necessary to use ILP with Relax, but this approach also might be sufficient. Need to try this.

slyubomirsky · April 29, 2022, 9:24pm

Not having heard objections in this thread or at the community meeting, I plan to make further iterations on this proposal en route to a full RFC:

First, I will implement a prototype testbench using the fuzzer, in which programs are generated according to some time or memory budget and the generated programs are compiled and tested against some set of simple test oracles.
If the prototype testbench works as intended, I will document it and present the material here along with the testbench in a full RFC.

slyubomirsky · October 13, 2022, 11:43pm

It has been several months since my last update on this thread, but I am pleased to report some updates regarding what fuzzing infrastructure might look like in TVM. Namely, I’ve implemented some prototype scripts that demonstrate how this could work.

Test Oracles

A test oracle is a function that is responsible for determining if the compiler behaves correctly for a generated program or not. Since it is very difficult to automatically encode overall whether the compiler behaves completely according to the language specification, test oracles typically test simpler conditions that can be checked quickly and easily for large numbers of generated programs. The file oracle.py gives the overall interface for test oracles and common_oracles.py gives examples of some simple test oracles.

The oracles can take in configuration through a JSON file and are given a directory for any output they wish to generate (ideally for logging any errors encountered or information that could be helpful for reproducing the results). Oracles can test the compiler’s behavior by running a test that makes assertions; they should either return normally or exit with an error of some kind.

Here are the example oracles I’ve implemented:

Type checking: It ensures that relay.transform.InferType() passes for the given program (this should always work for the fuzzer I’ve implemented, though I guess in principle we may eventually want to generate programs that we know fail type checking to test the error cases).
Parser round-tripping: Given a program, it tests that pretty-printing it and then parsing it again returns the same program.
Default build: It attempts to compile the program using relay.create_executor and simply checks that there is no crash at compile time.
Differential equality: Given a program, the oracle compiles it using two different executors and checks that the results are the same between the two executors. We constrain the programs to take only tensor types and return a tensor value, since it is easy to generate random tensor values (in principle, we could extend this to handle all Relay types, though).

Generating Programs at Scale

The script generate_programs.py calls a program generator and generates a specified number of programs, calling the program generators in parallel to maximize throughput. There is a simple mechanism for dispatching program generators by name and setting generation parameters through a JSON file.

Note that generated programs are saved as Pickle files. This is done to avoid having the testing infrastructure depend on the parser or text format, since these are things we may want to test (c.f., the parser round-tripping test oracle). I realize that Pickle files are not a great choice in general due to the possible security issues they may pose (executing arbitrary Python code), so I would welcome alternative suggestions, but I do think that whatever mechanism we use to save generated modules should be independent of the parser and text format. It would not be difficult in principle to serialize ASTs as JSON or another such format if that is desirable.

Running Tests at Scale

The script apply_oracle.py, similarly to generate_programs.py, applies test oracles to the saved (Pickled) generated programs in parallel, making a list of all tests that end in failure. The script dispatches oracles by name and is also capable of passing parameters to the oracles via JSON files. As discussed before, oracles may produce auxiliary files to aid in debugging; this script ensures each oracle has a directory to deposit its output files. Additionally, the script logs the standard output and standard error for each oracle run and places it in the appropriate directory; it also separately logs the stack trace in the case of an error, which we may use for categorizing errors (see below).

Next Steps

Given the relative ease with which these scripts can generate random programs and apply oracles to them, the next hurdle will be analyzing the results: It is easy to mark which test cases fail, but there can be multiple failures all corresponding to the same root cause. It might therefore be helpful to group failing tests based on whether the errors look similar (which could be accomplished by grouping based on the stack traces).

I welcome feedback on the infrastructure prototypes presented here and further ideas regarding infrastructure for analysis. If this post is not sufficiently detailed as to the infrastructure

As it happens, I’ve gotten some failure messages from the small-scale fuzzer test runs with some of these oracles, so I will have to investigate the reasons (I suspect it may have to do with various Relay executors not supporting all of the language’s features).

Broader discussion topics, some of which were also raised in my original post:

As a community, how would this fuzzing infrastructure be used? There would have to be people in charge of running and maintaining the infrastructure as well as investigating the resulting errors (they do not have to be the same people in all cases). We could compare how the LLVM community handles fuzzing, though I am not familiar in detail with it. (Perhaps @Flandini might be able to offer some suggestions based on his experiences?)
Where should the fuzzing infrastructure be implemented? There was a pre-RFC on a tools directory, but there have not been updates recently.
We may encounter fairly trivial errors just from targeting executors and passes with unexpected output. This may nevertheless be useful since it would encourage documenting these requirements, but it would also require the maintainers of the fuzzer to encode whatever constraints are needed. Does this seem reasonable? Are there examples of constraints that might be tricky to encode in fuzzers? (Or any examples of documented requirements for comparison now?) Perhaps the otherwise little-used feature flags may be useful for this purpose.

mbaret · October 17, 2022, 2:22pm

I’ll preface this by saying I’m very supportive of an effort to introduce a fuzzer to Relay and think it has the potential to greatly improve our testing coverage and overall code quality. We can only cover a tiny fraction of Relay programs in our unit/integration tests without such a tool. Also cc @Mousius who might be interested.

There’s a Relay ‘interpreter’ in TVM at the moment that I think could be/become a golden reference for these sort of tests. Do you think it’d be appropriate here and if not could we improve the interpreter so it can be used in this way? I mention this because the 3 main executors (graph/VM/AOT) all have some degree of feature/completeness divergence (a problem in its own right) which could make establishing a ground truth challenging.

Furthermore, how do you think we should handle accuracy in these tests? There’s a wide variety of tolerances used for equality testing in TVM, so maybe this sort of proposal would be a good point to establish a definitive notion of what ‘accurate’ means for a Relay program.

slyubomirsky · October 18, 2022, 11:22pm

Yes, I was thinking of using the interpreter as the reference for the differential equality oracle. There is also py_converter.py, which I originally wrote for fuzzing purposes, but I may have to update its interface to work over entire modules to make it useful for this fuzzer. I agree on the trickiness of “ground truth” in this context.

Accuracy is a tricky issue for differential equality. I am not sure what tolerances make sense, as there are subtle ways that differences could be introduced into floating point calculations that may not be indicative of an error (I don’t think our specification of Relay is so detailed as to specify the exact sequence in which floating point operations must be performed). Not sure what a perfect solution would be and it might make sense to re-examine this question after doing some small-scale tests.

mbaret · October 19, 2022, 1:29pm

Yeah agreed a perfect solution here is pretty challenging to arrive at. The TOSA effort immediately comes to mind as one such project trying to tackle this problem, but I don’t think it has dealt with floating-point yet.

I think stage one in tackling this is to establish what the ‘accurate’ result should be, then we can consider what an appropriate tolerance w.r.t. that result is. One option here I can think of would be to have ‘reference’ TE/TIR implementations of every Relay operator registered. We can have these be naive and slow, with an emphasis on readability/correctness. We might even consider aligning to TOSA where cross-over exists if we have no good reason to choose a different numerical behaviour. Then if the interpreter always selects these ‘reference’ strategies, it could provide our ground truth for differential equality testing.

slyubomirsky · October 19, 2022, 6:38pm

That sounds a little like what py_converter.py is trying to do, then. If we could indeed plug in some trusted reference implementations of operators into it, we would pretty much have what you’re proposing.

yuchenj · October 19, 2022, 11:36pm

Thanks @slyubomirsky for this pre-RFC and the POC, I’m very excited about the idea of introducing a fuzzing infrastructure to improve the robustness and testing coverage of TVM!

I have 2 questions:

TVM has Relay and TIR languages at the moment, and it will possibly have Relax and other dialects in the future. Is it possible to build a fuzzing infra that is able to co-generate TIR, Relax, and Relay programs and do testing? Also cc @ganler who has worked on TVMFuzz.
The POC solves the type relations by going backward. Does it mean that it cannot directly leverage the type/shape inference function on the C++ side, and when adding a new op, the developer needs to add recognizer, solver, and sampler to the fuzzer? If so, what are the pros and cons of solving the type/shape relations by working forward?

slyubomirsky · October 20, 2022, 12:08am

Good questions.

Regarding a unified fuzzing infrastructure, I would have to have a more specific idea of what the TIR and Relax fuzzers would look like. The fuzzer I’ve described here is specialized to Relay’s type system and the special considerations it leads to. There are some high-level aspects we could easily share across languages like the formats we use for configuring generators, expressing constraints for oracles, etc. It’s harder to judge what might be similar about program generation beyond the broadest APIs. I would be happy to discuss deduplication of effort with TVMFuzz with @ganler.

You are correct about not being able to reuse the type inference functions, which does result in some additional work for supporting operators in the fuzzer. I can only really imagine one way we can effectively reuse the type inference functions, which is to attempt to sample input types and use the type relation to produce the output type. Such an approach would either be inefficient (if we try to randomly sample types) or require manually specifying constraints on the input types and hence would require additional work from users anyway (this is what inspired the samplers in the current proposals). While manually registering recognizers, solvers, and samplers for existing operators may be slightly inconvenient (technically you can manage with only a sampler or only a recognizer and solver, but having all of them is more helpful), these did not prove to be very hard to implement and allow for a great degree of control over the generation policy. It’s also very easy to test the recognizers, solvers, and samplers against the existing type relation (see the unit tests in the POC).

I think it would make the most sense that if a new op is added, the developer should also add the basic callbacks to the solver. This would not be a large requirement and it has an immediate benefit: The developer can immediately use these properties to start fuzz-testing the new operator. So even though it is more work for the developer, there is also an immediate positive result from doing it.

I agree that it would be ideal if there were some way to use only the type relation to generate solutions to type relations, but it is not apparent to me how it could be done efficiently.

ganler · October 20, 2022, 1:12am

Thanks @slyubomirsky @yuchenj for cc’ing me! Fuzzing or automated testing is definitely a very promising and affordable direction to improve the software quality/safety while not wasting developers’ precious time to come up with complete test cases.

I have not completely gone through this pre-RFC but I will be reviewing it little by litte carefully and try to share something that I know. To get started, I’d like to share a bit about the related work (as I am specifically working in this research area).

The core technique of a language fuzzer is how to generate a random but still well-formed program.

In our case, we generate relay IR instead of TIR (by TVMFuzz or Tzer) or other similar C-like language (by CSmith/graphicsfuzz/clangfuzz/etc.). Therefore, I don’t think there is any “duplication” in Steven’s proposal. Meanwhile, TVMFuzz and Tzer are more like a “prototype”. Building a reliable and stable fuzzer for a specific system (the best example in my mind is Syzkaller for OS) still requires better effort. But of course, once the fuzzer is built, the effort of bug detection can be converted from manual time to machine time!

That said, Relay is a graph-level or high-level language which is different from those prior work targetting C-like low-level languages, and building a Relay fuzzer, or more fundamentally, building a generator of Relay programs, requires adopting specifc assumptions in the language (types, semantics, etc.), motivating the necessity.

I think a more closely related and better engineered tool should be NNSmith (sorry for the advertisement but I think it is really related), where we generate random DNNs (static graph tho) and translate it to PyTorch, ONNX and TensorFlow to test themselves and downstream compilers/engines (and we found 34 bugs in TVM during our prototyping stage, see ASPLOS'23-NNSmith-Bugs - Google Sheets). However, NNSmith models DNNs in DAGs which is a subset of representations that Relay is capable of. Therefore, it is still important to directly perform model generation using Relay if we want to find bugs in say if statements and while loops. Another project that I have noticed is GenCoG which generates Relay programs and the project maintainer has been submitting some bug reports to TVM before.

(I will give comments regarding the details of the pre-RFC in another reply).

slyubomirsky · October 20, 2022, 3:59pm

Thank you for the pointers. I had not previously been aware of GenCoG and will have to acquaint myself with it. I’ll study the interfaces of the prior tools and see whether I can improve the fuzzer I’ve implemented here using them (and whether GenCoG’s approach to dealing with type constraints is better).

ganler · October 20, 2022, 6:00pm

The RFC looks very cool! I like the formulation of type-directed fuzzing since Relay IR’s representation is a superset to “DAGs”.

I have a few questions or comments:

Is the operator constraints/type relation of operators implemented in a unified and approachable way that can be directly extracted by the fuzzer? In NNSmith, we hoped to not be coupled with any specific implementation of DL systems and wanted to keep everything in Python so we manually implemented such constraints from scratch. More specifically, we modeled “abstract operator” and “abstract tensors” (which are basically tensor types in the current context) and the rules need to be given by humans (see Listing 2 for an example) or the code.

Overall, I think it would be very beneficial to unify the way we write concrete constraints in type checking and make the information of type constraints extensible and approachable for the fuzzer.

In addition to the validity constraints, there are some “practical” constraints to “stabilize” the fuzzer. For example, it would be great to have some computation constraints to limit the FLOPS or memory consumption of the generated program, otherwise, it might lead to a slowdown and OOM.
It is great to see the “sampler” idea as actually I was building something similar. Meanwhile, in my experience, the overhead of the SMT solver is often negligible compared with TVM’s compilation. For example, the current NNSmith always uses SMT solving but still generates a 10-node ONNX model within 100ms but the compilation could take over 10 seconds (but it might be due to the bottleneck of TVM’s ONNX importer). But this idea can be very useful for other DL frameworks with fast/lightweight compilation.
The idea to make it a fuzzing library is amazing. This helps developers to reuse the generator to easily access inputs in desired patterns when writing tests.
Regarding the coverage, I built a tool called “memcov” to compile TVM or any other Python-C++ systems with runtime coverage information accessible in Python and has been used for TVM in my projects for a long time. With this, it is easy to write coverage-guided fuzzing loop in Python or just access uncompressed or compressed CFG coverage). The patch for TVM can be extremely simple as well!
Regarding the tradeoff of using C++ or Python: I think often time the bottleneck is not in generation (at least for most cases in TVM). And even if we say we only use it to test a few passes instead of the whole pipeline, the bottleneck IMO is in the solver so using C++ might not make much difference in performance. That said, I would suggest Python for making the tool more approachable if there is no functional limitation.

(still not fully finished with this thread. but will be sharing more comments later little by little)

wzh99 · October 21, 2022, 3:33am

@ganler Thanks for inviting me to this discussion. @slyubomirsky Your RFC is very nice and your efforts in developing a Relay fuzzer is appreciated. As is previously introduced by @ganler, I have implemented GenCoG, a computation graph generator for Relay. Since you are familiar with NNSmith, I may mainly discuss the similarities and differences between GenCoG and NNSmith here. @ganler Please point out if I get anything wrong about your work here.

Similarities:

Both GenCoG and NNSmith generate pure dataflow graphs without control flow.
Both methods require developers to manually specify constraints of operators and leverage a constraint solver to find solutions.
Both methods incrementally construct a computation graph.

Differences (actually there are many, but I may only discuss the major ones):

The most prominent difference is how developers write specifications of operators. GenCoG provides a simple DSL for specifying the constraints in a more organized, concise and comprehensible way. Let’s take reshape operator as an example. In GenCoG, we may specify its constraints as follows:

TypeSpec(
  attrs={
    Attr('newshape', List(Var(ran=dl_rank_ran), 
                          lambda _: Var(int, ran=dim_ran, tmpl=True)))
  },
  in_num=1,
  in_ranks=[Var()],
  in_shapes=[List(IN[0].rank, lambda _: Var(tmpl=True))],
  in_dtypes=[Var()],
  extra=[
    Reduce(IN[0].shape, ArithOp.MUL, 1) == Reduce(a('newshape'), ArithOp.MUL, 1)
  ],
  out_num=1,
  out_ranks=[Len(a('newshape'))],
  out_shapes=[a('newshape')],
  out_dtypes=[IN[0].dtype]
)

The solver in GenCoG automatically processes this specification and solves constraints involved. For NNSmith, the specification of reshape is much longer:

@mark_materialize("core")
class Reshape(UnaryOpBase):
    num_var_param = int_range(1, 4)
    in_dtypes = [(i,) for i in DTYPE_ALL]
    out_dtypes = [(i,) for i in DTYPE_ALL]

    def __init__(self, *target_shape):
        super().__init__()
        self.inp_ranks = [int_range(1, 4)]
        self.out_ranks = [(len(target_shape),)]
        self.target_shape: List[Union[int, z3.ExprRef]] = list(target_shape)

    def type_transfer(self, input_shapes: List[AbsTensor]) -> List[AbsTensor]:
        __MAX_SOLVE_SYMBOL__ = 8
        # otherwise OOM.
        ConstraintCheck.le(
            input_shapes[0].ndims + len(self.target_shape), __MAX_SOLVE_SYMBOL__
        )

        if -1 not in self.target_shape:
            return [AbsTensor(self.target_shape, dtype=input_shapes[0].dtype)]
        # else
        abs_tensor = AbsTensor(self.target_shape, dtype=input_shapes[0].dtype)
        auto_dim = -1
        accum = 1
        for i, v in enumerate(self.target_shape):
            # TODO: What to do about bitvectors here?
            if v == -1:
                SanityCheck.eq(auto_dim, -1)
                auto_dim = i
            else:
                accum = nnsmith_mul(accum, v)

        abs_tensor.shape[auto_dim] = nnsmith_div(
            reduce(lambda x, y: nnsmith_mul(x, y), input_shapes[0].shape, 1), accum
        )

        return [abs_tensor]

    def requires(self, input_shapes):
        ret = []

        inp = input_shapes[0]
        src_len, dst_len = inp.ndims, len(self.target_shape)
        if src_len == 0:
            src_len = 1  # special handling for scalar
        if dst_len == 0:
            dst_len = 1  # special handling for scalar
        gres_config = os.getenv("NNSMITH_GRES", "4")
        if gres_config == "5":
            ng = 1
        elif gres_config == "3":
            ng = min(src_len, dst_len)
        elif gres_config == "4":
            ub = min(src_len, dst_len)
            ng = random.choices(
                range(1, ub + 1), k=1, weights=[2**i for i in range(ub)]
            )[0]
        else:
            raise ValueError(f"NNSMITH_GRES={gres_config} is not recognized")
        src_group = random_group(src_len, ng)
        dst_group = random_group(dst_len, ng)
        self.ng = ng
        self.src_group = src_group
        self.dst_group = dst_group
        assert len(src_group) == len(dst_group) == ng, (src_group, dst_group)

        # group constraints
        src_vars = inp.shape
        dst_vars = self.target_shape
        if len(src_vars) == 0:
            src_vars = [1]  # special handling for scalar
        if len(dst_vars) == 0:
            dst_vars = [1]  # special handling for scalar
        cons_group = []
        for gid in range(ng):
            src_idx = src_group[gid]
            dst_idx = dst_group[gid]
            src_prod = reduce(nnsmith_mul, [src_vars[i] for i in src_idx], 1)
            dst_prod = reduce(nnsmith_mul, [dst_vars[i] for i in dst_idx], 1)
            cons_group.append(nnsmith_eq(src_prod, dst_prod))

        ret.extend(cons_group)
        if os.getenv("NNSMITH_CONS_RESHAPE", "off") != "off":
            # should not be too extreme!
            __DIM_LIMIT__ = 4096
            lim = __DIM_LIMIT__
            for s in self.target_shape[::-1]:
                ret.append(nnsmith_le(s, lim))
                lim //= 2
                lim = max(lim, 1)
        assert -1 not in self.target_shape
        return ret

    def deduct_inp_ranks_and_dtype(
        self, out_abs_tensor: List[AbsTensor]
    ) -> List[Tuple[int, DType]]:
        return [(-1, out_abs_tensor[0].dtype)]

Broadly speaking, the specification in GenCoG is more declarative while the one in NNSmith is more imperative (e.g., some manual sampling code is involved). I am aware that the specification in NNSmith contains more fine-grained handling of some corner cases, and therefore it may not be directly comparable with the one in GenCoG. However, GenCoG is still able to potentially make the specification simpler.

GenCoG does not contain the value searching technique in NNSmith, which improves numerical validity. It is still possible that GenCoG can be extended with this part.

That is what I have to share for now. I may add some more details and discuss other related issues later.

ganler · October 21, 2022, 3:52am

Thanks for the input. For the “the specification of reshape is much longer”, just want to quickly explain that the validity spec for reshape in NNSmith can be as short as one liner like return product(*self.target_shape) == product(*input_shape). It is longer, not because we handled more corner cases or the way we describe specification is not declaritive, but because we did some non-trivial optimization to constraint the solve space to make SMT solving affordable. This is because the complexity of the MIP problem of a * b * c == d * f * g is too high. If you read the code more carefully, we did constraint sub-grouping (credit to my collaborator Jinkun) to convert the original high-complexity constraint to a low-complexity one, trading off a bit solving space for efficiency. I will read other comments later. Thanks for the input again!

wzh99 · October 21, 2022, 6:10am

Many thanks. I did not completely understand what the code in requires is doing in the first place and made incorrect judgement in my comment. I apologize for that. Now I know and thanks again for your kind explanation.

There is one thing I am considering. Since the specification should be provided by human, I guess it should be as simple as possible. Maybe the developer only needs to write product(*self.target_shape) == product(*input_shape) in the specification and rest constraint solving implementation (including the optimizations like constraint sub-grouping you mentioned) can be done by the solver.

Then I may talk about my own experience. In the implementation of GenCoG, there is no such optimization that improves the solving efficiency. The solver directly feeds these non-trivial constraints to the SMT solver. However, the graph generation time of GenCoG is not so long. It takes less than one second to generate a 32-node graph. (It is not significantly slower than NNSmith, right?) In my opinion, though the general complexity of the MIP problem is high, for the cases that we encounter in operator generation, their actual complexity may still be acceptable.

I am not questioning the quality of your work. Definitely you did an excellent job in DL compiler fuzzing, and that’s why it is widely recognized by the community. I just want to discuss what may happen if we make different decisions in design and implementation of a graph-level fuzzer. Please feel free to point out my mistakes and I am looking forward to our further discussion.

ganler · October 21, 2022, 8:05am

Thanks for the discussion! I agree in most cases SMT solving are light-weight, given the majority of operators have simple constraints (e.g., element-wise or injective). Such optimization should only be considered in very rare cases. Taking the example of reshape again, solving one single reshape operator could bring up to 8 second solving time subjective to the number of symbols and prior constraints, which is why sometimes it is meanful to do some optimization.

Examples [click to expand]

But it could be specific to z3. There are other alternative solvers such as cvc5 and mip which might solve those constraints faster. BTW, we can talk more about the prior work elsewhere and let’s keep this thread more focused on RFC’s content.

slyubomirsky · October 21, 2022, 8:38pm

@ganler and @wzh99, I thank you both very much for sharing some insights about your explorations in fuzzing TVM. I am glad to hear that in your experience, SMT solvers have not proven a large burden to use. I preferred to avoid one here to avoid introducing another external dependency to TVM, but perhaps the convenience of being able to have a compact specification could be worth it in the long run.

Unfortunately (in response to @ganler’s question above), the type relations used in Relay’s compiler are opaque functions in C++ and so cannot easily be exported into SMT queries, so any approach will necessitate manually reimplementing the type constraints, either as solver queries or using the functions I defined here to directly generate solutions. It might be worth considering if reimplementing Relay’s type relations using a solver could be worth doing (I wondered about this in my TVMCon talk), but that would be a large project in its own right.

In addition to the validity constraints, there are some “practical” constraints to “stabilize” the fuzzer. For example, it would be great to have some computation constraints to limit the FLOPS or memory consumption of the generated program, otherwise, it might lead to a slowdown and OOM.

Implementing such restrictions would be a very good idea, though I am not sure about the best way to do it. I guess there could be counters approximating the performance of operators and some way to short-circuit generation if they prove too expensive. This also reminds me that I need to implement some kind of timeout in the execution phase because there is nothing to stop the fuzzer from generating an infinite loop…

Regarding the coverage, I built a tool called “memcov” to compile TVM or any other Python-C++ systems with runtime coverage information accessible in Python and has been used for TVM in my projects for a long time. With this, it is easy to write coverage-guided fuzzing loop in Python or just access uncompressed or compressed CFG coverage). The patch for TVM can be extremely simple as well!

Regarding the tradeoff of using C++ or Python: I think often time the bottleneck is not in generation (at least for most cases in TVM). And even if we say we only use it to test a few passes instead of the whole pipeline, the bottleneck IMO is in the solver so using C++ might not make much difference in performance. That said, I would suggest Python for making the tool more approachable if there is no functional limitation.

Thank you also for these comments. The subject of tracking our test coverage has come up outside the context of fuzzing, so this might be good to adopt in TVM’s CI as well (@driazati, perhaps you might be interested?). I also appreciate the insight that generation time may not necessarily be a huge impediment; it was certainly easy to assemble this prototype.

ganler · October 21, 2022, 9:05pm

The approximation for the memory side could be estimated by calculating the sum of tensor sizes according to their data types (to be more precise we can do some maximum liveness analysis to get the expected peak memory usage). Or a simpler way can be limiting the dimension sizes to a reasonable number. A timeout aborting can be implemented by process forking. However, timeout could be a bug theoretically that in the future maybe we can put some constraints to generate a program that must terminate reasonably.