This RFC outlines the steps we would like to take to introduce an AOT compiler for deploying models. Some of the proposed approaches depend on the Unified IR enhancements, but we feel it is good to first discuss the technical choices with the community so we can be prepared and drive the related designs in the right direction.
Motivation
For domains where functional safety requirements are important, some users would like an AOT compilation mode of TVM that does not depend on the graph runtime, and that potentially allows a user to ship smaller binaries with consistent and minimal dependencies, e.g. for embedded devices.
Here, AOT means that we compile not only the operators but also the graph interpretation part of the execution.
This is a draft RFC to outline the key design decisions that are relevant to an AOT compiler.
Runtime API
We will build a consistent runtime.Module-based API as in [DISCUSS] Module based Model Runtime Interface.
The same API can then be used for both the AOT and the graph runtime.
The only technical challenge for the runtime is that we will need an alternative minimal version that only wraps the C API (without C++11, due to the lack of support typically found on microcontrollers). Such a runtime can be implemented in languages like C or Rust, or generated through codegen.
Example Raw C API usage:
// handles to the loaded module and its packed functions
void *lib, *fset_input, *frun, *fget_output;
TVMModLoadFromFile("resnet.so", "so", &lib);
TVMModGetFunction(lib, "set_input", 0, &fset_input);
TVMModGetFunction(lib, "get_output", 0, &fget_output);
TVMModGetFunction(lib, "run", 0, &frun);
// call into the PackedFuncs
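For completeness, here is a minimal sketch of how the retrieved packed functions could then be invoked, assuming the TVMFuncCall/TVMValue calling convention of the TVM C runtime API; the "data" input name and the input_tensor variable are illustrative only, not part of the proposal.

// Sketch only: invoke the packed functions fetched above via TVMFuncCall.
// The input name "data" and input_tensor are illustrative placeholders.
DLTensor* input_tensor = /* prepared by the application */ NULL;
TVMValue args[2];
int type_codes[2];
TVMValue ret_val;
int ret_type_code;

// set_input("data", input_tensor)
args[0].v_str = "data";
type_codes[0] = kTVMStr;
args[1].v_handle = input_tensor;
type_codes[1] = kTVMDLTensorHandle;
TVMFuncCall(fset_input, args, type_codes, 2, &ret_val, &ret_type_code);

// run() takes no arguments
TVMFuncCall(frun, args, type_codes, 0, &ret_val, &ret_type_code);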
Graph AOT vs Fully Featured Relay AOT
As a starting point, we could begin with a Graph AOT that supports a limited subset of Relay programs. One goal would be to eventually support a fully featured Relay AOT, which would bring a dependency on a dynamic memory allocator but also support advanced features like control flow.
Runtime State Data Structure
The runtime should still depend on a minimal set of basic primitives, in particular ways to allocate an array of DLTensor and to set up the memory space so that it can be accessed from generated code. This means we need a data structure (let us name it GraphRuntimeState) that holds the array of DLTensor and the PackedFuncs. This data structure needs to be accessible from the generated code, which means it is best implemented in a C ABI compatible way.
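As an illustration of what such a C ABI compatible structure could look like, here is a hypothetical sketch; the field names and layout are examples, not part of the proposal.

#include <stddef.h>
#include <tvm/runtime/c_runtime_api.h>  // DLTensor, TVMFunctionHandle

// Hypothetical sketch of a C ABI compatible runtime state; fields are
// illustrative, not a proposed layout.
typedef struct {
  DLTensor** data;             // input, output and intermediate tensors
  size_t num_data;
  TVMFunctionHandle* funcs;    // packed functions of the fused operators
  size_t num_funcs;
} GraphRuntimeState;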
One way to unify this data structure with the runtime system is to make use of the Object protocol, so that the GraphRuntimeState can be accessed from any of the frontend languages in a compatible way.
Possible Technical Paths
P0: Relay -> C/C++
The simplest approach, based on the Relay AOT POC, is to directly transpile a Relay program into C API calls into the TVM runtime. The drawback of this approach is that it brings a dependency on the C API. We could also create an LLVM backend; however, see P1.
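To make this concrete, here is a rough sketch of what P0-style transpiled C could look like for a small graph, reusing the hypothetical GraphRuntimeState sketch from above and the TVMFuncCall C API; all names are illustrative.

// Hypothetical output of a Relay -> C transpilation (P0): each fused
// operator in the graph becomes one call into the TVM C runtime API.
void graph_run(GraphRuntimeState* state) {
  TVMValue args[2];
  int codes[2] = {kTVMDLTensorHandle, kTVMDLTensorHandle};
  TVMValue ret;
  int ret_code;

  // layer0: data[0] -> data[1]
  args[0].v_handle = state->data[0];
  args[1].v_handle = state->data[1];
  TVMFuncCall(state->funcs[0], args, codes, 2, &ret, &ret_code);

  // layer1: data[1] -> data[2]
  args[0].v_handle = state->data[1];
  args[1].v_handle = state->data[2];
  TVMFuncCall(state->funcs[1], args, codes, 2, &ret, &ret_code);
}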
P1: Relay -> TIR::Function -> runtime.Module
As an alternative approach, we can first lower the Relay function into a TIR::Function that corresponds to the low-level actions taken by the runtime. Then we can call the existing code generators to lower the TIR::Function into the final runtime.Module.
This is a more desirable approach in the world of the unified IR, because we don't have to build a specific code generator backend for Relay and can directly reuse TIR's code generators.
Most of the key technical challenges in this path come down to making TIR::Function expressive enough to represent the low-level operations of a graph executor. The code below shows a mocked-up text representation of what the low-level IR could look like. In order to lower this IR, we will need to be able to handle objects (GraphRuntimeState and Array) in the TIR. Once we are able to do that, we can bring in flexible implementations, including support for additional data structures (via Object).
# mocked up syntax to show the corresponding low-level IR
def @graph_init():
  # allocate the tensors used by the graph
  %arr = @Array.Create()
  @Array.push_back(%arr, @NDArray.empty([%const_shape0]))
  @Array.push_back(%arr, @NDArray.empty([%const_shape1]))

def @graph_run():
  # fetch the runtime state and invoke the fused operators in order
  %ctx = @context.GetGraphRuntimeState()
  @call_packed("layer0", %ctx.data[0], %ctx.data[1])
  @call_packed("layer1", %ctx.data[1], %ctx.data[2])
  @call_packed("layer2", %ctx.data[2], %ctx.data[3])