[RFC] Type-Directed Relay Fuzzing Library

Yeah, coverage can be an important metric for developers to understand which parts of the compiler are not properly exercised, which goes beyond fuzzing alone.

Just want to mention that there are actually some subtle differences in coverage in terms of implementation and use cases. To get readable (i.e., source-code-level) coverage reports for developers, we can use LLVM-COV or GCOV, and they work smoothly with TVM after adding a few compile flags. In fuzzing, however, source code coverage is not used: AFL, libFuzzer, and Tzer (i.e., memcov) all implement CFG-level (LLVM IR level) coverage with LLVM's SanitizerCoverage, which is more efficient and admits many optimizations (say, hashing the coverage to reduce the memory footprint). I also have a blog article on coverage usage in C++ (if it helps).
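To make the distinction concrete, here is a sketch of the two kinds of instrumentation mentioned above, assuming Clang/GCC toolchains (the file names are hypothetical):

```shell
# Source-level coverage (readable reports for developers), with Clang:
clang++ -fprofile-instr-generate -fcoverage-mapping -o my_test my_test.cc
./my_test                                   # writes default.profraw
llvm-profdata merge -sparse default.profraw -o my_test.profdata
llvm-cov show ./my_test -instr-profile=my_test.profdata

# GCC equivalent, consumed by gcov:
g++ --coverage -o my_test my_test.cc        # shorthand for -fprofile-arcs -ftest-coverage

# Edge-level coverage for fuzzing (SanitizerCoverage): callbacks fire per
# CFG edge rather than per source line, which is what libFuzzer-style
# fuzzers consume.
clang++ -fsanitize-coverage=trace-pc-guard -o my_fuzz_target my_fuzz_target.cc
```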


We discussed a few aspects of this at the community meeting this morning:

  • When should we generate programs to test? When should the tests be run?
    • @slyubomirsky Other fuzzers have many different approaches, some generate and save programs and others generate them each time. Maybe we should do both in order to test the fuzzer
    • @driazati we should get the fuzzer running in CI at some scale now, we can grow or change it as we try it out
  • As many (potentially thousands) of generated programs hit the same kinds of bugs, how do we group stack traces to simplify error reporting / bug hunting?
    • @slyubomirsky Fuzzing frameworks also differ here, some look at parts of the stack trace
    • @gromero Maybe it’s possible to look at program features to identify what triggers similar bugs
  • How do we save generated programs without relying on the Relay text format?
    • @slyubomirsky pickling the IRModule works out of the box but is pretty opaque
    • @driazati pickle might be ok, we can probably disable the arbitrary code execution parts of it (another note: pickle also has a text format that might be easier to stomach, but there is no parser for it afaik)
    • @sebastianboblestetas We should do the JSON format if the serializer is simple enough
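On the point about disabling pickle's arbitrary code execution: the standard-library docs describe doing this by overriding `Unpickler.find_class`, since every global (class or function) a pickle stream wants to resolve goes through that hook. A minimal sketch, with a hypothetical allow-list (a real fuzzer would list the permitted TVM node types instead):

```python
import io
import pickle

class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to resolve any global except an explicit allow-list."""

    # Hypothetical allow-list for illustration only.
    ALLOWED = {("builtins", "set"), ("builtins", "frozenset")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global {module}.{name} is forbidden")

def restricted_loads(data: bytes):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data (ints, strings, lists, dicts) needs no globals and round-trips:
payload = pickle.dumps({"op": "add", "shape": [1, 3, 224, 224]})
print(restricted_loads(payload))

# A stream that references a disallowed global is rejected before any code runs:
try:
    restricted_loads(pickle.dumps(pickle.PickleError("x")))
except pickle.UnpicklingError as e:
    print("blocked:", e)
```

This stops `os.system`-style gadgets, though it does not address pickle's opacity, which was the other concern raised.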

I’ve attempted to implement a JSON serialization format, but I’m facing some difficulties when trying to serialize the Relay Prelude and get it back. I’m not sure what is going on, but the crux of the matter is that preserving reference equality where Relay expects it is fairly tricky, so it would take a fair bit of validation to make sure that this serialization is correct. (Incidentally, I’m still not completely sure what is going wrong here.) The implementation was also fairly large (~700 lines), though much of that was boilerplate. (Fixed; see the edits below.) I’ll post the serializer when I have tests ready and we can decide if that’s preferable to pickle.

If we are very intent on having this JSON format rather than pickle, would anyone be interested in discussing how it should be implemented or trying to figure out what issues are arising with the Relay Prelude?

Edit: @driazati, do you think we might be able to cryptographically sign off on our Pickle files somehow to ensure potentially malicious ones can’t be inserted? I’d be willing to explore other options related to pickle, as it’s seeming like developing another serializer will be a nontrivial task.
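One concrete way to do the sign-off would be an HMAC over the pickle bytes, verified before unpickling: anything without a valid tag from our key never reaches `pickle.loads`. A minimal sketch, assuming the key is provisioned from a CI secrets store (the key name here is made up):

```python
import hashlib
import hmac
import pickle

# Hypothetical shared secret; in practice this would come from a CI
# secrets store, never from the repository itself.
SECRET_KEY = b"example-key-loaded-from-ci-secrets"

def sign_pickle(obj) -> bytes:
    """Serialize obj and prepend an HMAC-SHA256 tag over the pickle bytes."""
    data = pickle.dumps(obj)
    tag = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    return tag + data

def verify_and_load(blob: bytes):
    """Unpickle only if the tag matches; refuse tampered or unsigned blobs."""
    tag, data = blob[:32], blob[32:]
    expected = hmac.new(SECRET_KEY, data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad signature: refusing to unpickle")
    return pickle.loads(data)

blob = sign_pickle({"params": [1, 2, 3]})
print(verify_and_load(blob))                    # round-trips under a valid tag
try:
    verify_and_load(b"\x00" * 32 + blob[32:])   # forged tag is rejected
except ValueError as e:
    print(e)
```

This protects against maliciously inserted files, though it assumes the signing key itself stays out of reach of whoever could insert them.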

Edit 2: I got a new idea for a simpler approach to JSON serialization (just cache all types and exprs by their pointer value and use that to determine reference equality) and I’ll see if that works instead.
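The idea in miniature, using a toy `Node` class rather than the real Relay expr/type classes: every node is cached by its `id()` ("pointer value") into a flat node table, and parents refer to children by table index, so a shared subtree gets exactly one slot and reference equality survives the round trip.

```python
import json

class Node:
    """Toy stand-in for a Relay expr/type node (not the real TVM classes)."""
    def __init__(self, kind, children=()):
        self.kind = kind
        self.children = list(children)

def to_json(root: Node) -> str:
    index = {}   # id(node) -> slot in the node table
    table = []   # flat node table; children stored as slot indices
    def visit(node):
        key = id(node)
        if key in index:
            return index[key]          # already serialized: reuse the slot
        slot = len(table)
        index[key] = slot
        table.append(None)             # reserve the slot before recursing
        table[slot] = {"kind": node.kind,
                       "children": [visit(c) for c in node.children]}
        return slot
    return json.dumps({"root": visit(root), "nodes": table})

def from_json(text: str) -> Node:
    doc = json.loads(text)
    # Create all nodes first, then wire up children, so shared (or even
    # cyclic) references resolve to the same object.
    nodes = [Node(d["kind"]) for d in doc["nodes"]]
    for node, d in zip(nodes, doc["nodes"]):
        node.children = [nodes[i] for i in d["children"]]
    return nodes[doc["root"]]

shared = Node("var")
root = Node("add", [shared, shared])   # the same object used twice
restored = from_json(to_json(root))
print(restored.children[0] is restored.children[1])  # True: sharing survives
```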

Edit 3: The simpler approach is still encountering the same odd problem when trying to serialize the Prelude. I have no clue why this doesn’t work, since it tracks reference equality for every part of the AST. Happy to provide code for debugging this, but it’s strange and makes me think the pickle option is the better one.

Edit 4: I think I may have found the issue. It turns out the contents of attrs could be arbitrary TVM objects…

I’ve implemented a JSON serializer if we think that’s preferable to using pickle. The implementation is about 700 lines, much of which is boilerplate, but it’s not very complicated other than having to deal with numpy values (in constants) and attributes nodes. See the implementation here: https://github.com/slyubomirsky/tvm/blob/fuzzer-poc/python/tvm/relay/testing/ast_dict.py

Tests here (I’ve run a lot of fuzzing trials on it without failures, so I’m fairly confident in it): https://github.com/slyubomirsky/tvm/blob/fuzzer-poc/tests/python/relay/fuzzing/test_ast_dict.py

It seems to be a bit slower than pickle, but it could definitely be implemented in C++ and thus be a lot faster.

CC: @driazati and @sebastianboblestetas (since we had discussed it in the community meeting)


LGTM on a first glance, maybe a small comment: you often use the construct set(map(lambda …)) like here:

Couldn’t you simply use {gv.name_hint for gv in mod.get_global_vars()}?

Thank you for taking a look. Yeah I think you’re right that there’s no advantage to using map in such cases.

I suppose it would make sense to present the JSON serializer as one option in the eventual RFC.

Given that it’s implemented and (I’m assuming) fast enough, is there any reason to use pickle/cpickle at this point?

IMO, the main reasons to use pickle over the JSON serialization format, besides speed, are that the TVM community would not be on the hook for maintaining pickle and that pickle will almost certainly be futureproof with respect to any changes in Relay.


Note for future reference: this paper contains a survey of stack trace deduplication methods. The most effective rely on some learned components, which we wouldn’t be able to train until we’ve already done some fuzzing, so I will initially try some of the simpler edit distance–based approaches.

I’m pleased to note that I’ve added a simple script for clustering stack traces, allowing for duplicate stack traces to be easily grouped together and thus making it easier to start analyzing the bugs that have emerged. The script is in tests/scripts/relay_fuzzing/cluster_traces.py in the fuzzer proof-of-concept fork. (Note: It requires the Levenshtein library and SK-Learn.)

With this tool, I think it would be fair to say that the “fuzzing pipeline” here is complete: With the code in my proof-of-concept, it is possible to generate programs, run tests for different conditions on them, and analyze the resulting stack traces to start debugging. With these changes, I think it would be useful to start discussing how workable this infrastructure would be and perhaps to start moving on to a formal RFC.

In my small-scale tests, I found it effective to use a very simple form of clustering: Just agglomerative (hierarchical) clustering based on the Levenshtein distance (i.e., edit distance) between the string dumps of the stack traces. The script simply uses a similarity percentage (Levenshtein distance divided by the length of the longer of the two strings) with a 10% similarity cutoff. It’s a very simple method and requires no training.
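A dependency-free sketch of the same idea (the actual script uses the Levenshtein package and scikit-learn; here the edit distance is a plain DP table, and merging every pair under the threshold via union-find is equivalent to single-linkage agglomerative clustering). The example traces are made up:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance (the real script uses the Levenshtein package)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster_traces(traces, cutoff=0.10):
    """Group stack traces whose normalized edit distance is under `cutoff`.

    Distance = levenshtein(a, b) / len(longer string); merging any pair
    under the threshold gives single-linkage agglomerative clusters.
    """
    parent = list(range(len(traces)))
    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(traces)):
        for j in range(i + 1, len(traces)):
            d = levenshtein(traces[i], traces[j]) / max(len(traces[i]), len(traces[j]))
            if d < cutoff:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(traces)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

traces = [
    "TVMError: check failed: type mismatch at add",
    "TVMError: check failed: type mismatch at mul",   # near-duplicate of the first
    "Segmentation fault in GraphExecutor::Run",       # a different kind of crash
]
print(cluster_traces(traces))   # the two TVMError traces land in one cluster
```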

In the future, we can explore using one of the more advanced methods for clustering stack traces if this proves inadequate in practice, but from my cursory examination, the resulting clusters made sense (different kinds of crashes were indeed in different clusters).

Special thanks to @ehsanmok for giving me some advice on this subject.
