Software Compatibility for AI Accelerators

Hi all,

AI accelerators are becoming more and more popular, but before they ship in countless end devices, the binary compatibility of AI applications must be considered. Without compatibility, countless applications have to be rebuilt for many different devices; even devices that share the same accelerator ISA may have different memory/buffer sizes. All of those releases also need to be stored and maintained somewhere for users, which would be a huge cost for companies in industry.

The main compatibility challenge for accelerators is that, unlike other platforms, cache/buffer management is done by statically compiled software rather than by the runtime/OS/hardware. The storage rewrite pass in TVM does that job when we build operators: it analyzes the data flow inside constant-shape operators and tries to produce the best allocation plan for the accelerator's constant buffers. However, when we have another specification of the same accelerator ISA, i.e. one with a different buffer/memory size, the binary needs to be re-compiled. For example, it is very common for the same phone model to come in high/mid/low-end variants.
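
Here is a minimal sketch (using TVM's te API; the buffer size and tile factor below are hypothetical, not from any real spec) of how such a buffer-size-dependent decision ends up frozen in the compiled operator:

```python
# The tile factor is derived from an assumed on-chip buffer size and baked into
# the module at compile time, so a variant with a different buffer size needs a rebuild.
import tvm
from tvm import te

ON_CHIP_BUFFER_BYTES = 64 * 1024          # hypothetical spec of one accelerator variant
TILE = ON_CHIP_BUFFER_BYTES // (4 * 128)  # rows per tile for float32 data with 128 columns

A = te.placeholder((1024, 128), name="A")
B = te.compute((1024, 128), lambda i, j: A[i, j] * 2.0, name="B")

s = te.create_schedule(B.op)
io, ii = s[B].split(B.op.axis[0], factor=TILE)   # buffer-size decision frozen here

mod = tvm.build(s, [A, B], target="llvm")        # a variant with a different buffer
                                                 # size needs a rebuild of this module
```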

So, how can we solve this challenge in the TVM stack?

Thanks,

We have a compatibility issue for vendor libraries, and also for application developers' customized operators, since TVM gives application developers the chance to develop their own AI operators.

@tqchen @thierry @merrymercy

That's a very good question; I don't think there is an easy solution to this problem. One solution, as you hinted, is to move away from static compilation and adopt runtime-based JIT-ing. This is our approach in VTA, which to some extent allows runtime flexibility to adapt the same schedule to different VTA architectures. You can find the runtime source under vta/src/runtime.cc. However, our runtime doesn't define how on-chip buffer management is done, since that is specified by the TVM schedule with a combination of the tiling, reordering, and compute_at schedule primitives.
It seems like what you are asking for is a binary that automatically re-configures its data access/caching scheme to take advantage of the underlying resources. That is difficult to achieve in TVM, since by design the data access and caching scheme is dictated by the schedule, which is specified statically.
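
For illustration, a small sketch (using a generic "local" scope rather than VTA's real buffer scopes) of how tiling, reordering, and compute_at pin down the on-chip buffering before codegen:

```python
# On-chip buffering is expressed entirely in the schedule, so it is fixed
# before the binary is built.
import tvm
from tvm import te

A = te.placeholder((1024, 1024), name="A")
B = te.compute((1024, 1024), lambda i, j: A[i, j] + 1.0, name="B")

s = te.create_schedule(B.op)
AA = s.cache_read(A, "local", [B])             # stage data through an on-chip scope
io, jo, ii, ji = s[B].tile(B.op.axis[0], B.op.axis[1], x_factor=32, y_factor=32)
s[B].reorder(io, jo, ii, ji)                   # loop order chosen statically
s[AA].compute_at(s[B], jo)                     # load one 32x32 tile at a time

print(tvm.lower(s, [A, B], simple_mode=True))  # buffer extents are already fixed here
```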

Let's assume that for now this is acceptable, since we may not have that many variants of the same accelerator. But I can see how this could lead to library/binary bloat as more variants/devices start to appear. I think it's worth thinking about how one could expose knobs directly in the binary, so that when grabbing a schedule we feed the parameters to the binary rather than to the compiler that then produces the binary. I think it might be doable from a TVM autotuner/compiler/code-gen perspective. Maybe we can work on an issue together to clearly define the problem at hand.


This is a challenge that exists for all hardware and deep learning compilers, including CPUs like ARM, if you want the best performance. A general solution recipe is as follows:

Schedule -> some storage state -> runnable

Choices of Storage State

Different solutions make different choices about what the intermediate storage state is.

    1. The simplest way is to make the state the final program (a DLL) and rely on dynamic library loading.
    2. Alternatively, you can make the state the schedule parameters (or some generic intermediate code) and rely on a runtime generator.
    3. Use (1), but compress the code of the multiple variant runtime modules.

All of these rely on runtime module loading, which TVM already supports. You can view all of the approaches as different forms of compression. In the case of code generation, the schedule parameters correspond to the compressed data, and the final runnable module is the de-compressed result. As a matter of fact, (3) may be just as effective as (2) without introducing too much engineering overhead, if the code only differs slightly between hardware targets.
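
A sketch of choice (1) using TVM's export_library / load_module (the file names and tile factors below are made up for illustration):

```python
# Ship one shared library per hardware variant and rely on TVM's runtime
# module loading to pick it up on the device.
import tvm
from tvm import te

def build_variant(tile, path):
    A = te.placeholder((1024,), name="A")
    B = te.compute((1024,), lambda i: A[i] + 1.0, name="B")
    s = te.create_schedule(B.op)
    s[B].split(B.op.axis[0], factor=tile)          # variant-specific tiling
    tvm.build(s, [A, B], target="llvm", name="add_one").export_library(path)

build_variant(64, "add_one_small_buffer.so")       # one .so per hardware variant
build_variant(256, "add_one_large_buffer.so")

# On the device: load the library that matches this variant.
mod = tvm.runtime.load_module("add_one_large_buffer.so")
add_one = mod["add_one"]
```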

What do existing approaches do

You will find that existing approaches are generally special cases of the recipe above.

  • CPU libraries might derive schedule parameters at runtime and then run the code, and such runtime derivation is usually not optimal.
  • CUDA libraries, on the other hand, pre-pack all the possible code and do runtime selection on the fly (sketched below).
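
A hypothetical sketch of the pre-pack + runtime-selection pattern (the variant keys and file names are made up):

```python
# Every variant ships in the package; a device query picks one at startup.
VARIANT_TABLE = {
    # (accelerator generation, on-chip buffer KB) -> pre-built library
    ("v1", 64):  "conv2d_v1_64k.so",
    ("v1", 128): "conv2d_v1_128k.so",
    ("v2", 128): "conv2d_v2_128k.so",
}
DEFAULT = "conv2d_generic.so"

def select_variant(generation, buffer_kb):
    """Return the best matching pre-packed variant, falling back to a generic build."""
    return VARIANT_TABLE.get((generation, buffer_kb), DEFAULT)
```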

Eager vs Lazy Fetch of the Required Modules

The second difference, a mainly mechanical one, is how you get the corresponding modules onto the device. We can either pre-pack all the device variants and apply compression (in whatever form), or lazily download the module for the specific device (when the network is available). The lazy download approach is not too bad, as long as there is a default fallback.
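
A rough sketch of the lazy path with a fallback (the URL, cache path, and device key are hypothetical):

```python
# Try to download the module built for this device; otherwise fall back to a
# default module that is always shipped with the application.
import os
import urllib.request
import tvm

def load_op_module(device_key, cache_dir="/data/local/tmp/tvm_cache"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"ops_{device_key}.so")
    if not os.path.exists(path):
        try:
            url = f"https://example.com/tvm-modules/ops_{device_key}.so"
            urllib.request.urlretrieve(url, path)
        except OSError:
            path = "ops_default.so"   # eager-shipped fallback, works everywhere
    return tvm.runtime.load_module(path)
```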

AutoTVM provides infrastructure for both of these.

Take away

Software compatibility is always a problem. For normal programs, performance is not as much of an issue, so as long as the instruction set is compatible (e.g. ARM) it is fine. If you want the best performance, this will always be a problem that everyone has to solve, and it is simply a matter of making one of these choices. The choices boil down to:

  • What storage state and compression algorithm you want to use
  • Whether to use lazy fetching or eager pre-packing of the library

Thanks for your replies, @tqchen @thierry.

We can summarize the solutions into two categories, both of which can be found in the CUDA solution.
https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#application-compatibility

  1. Virtual ISA + JIT.
    The abstraction level of the IR used as the virtual ISA depends on the problems we want to solve. For example, we can save the IR before TVM goes into storage rewrite, then reload and compile this IR at runtime.
    In this case, we need to keep the original compute before JIT and lazily execute the schedule transformations in the JIT. Concretely, we would need to:
    a) redesign TVM for JIT, e.g. removing the dependency on Python, since there is no official Python release for Android;
    b) review and reorder the passes in TVM for JIT, deciding which of them are better done AOT and which are needed in the JIT, and design two flows (AOT and JIT): optimizations on the original compute can be done AOT, while the tiling schedule and its transformations need to be in the JIT;
    c) make some schedule primitives, such as split and tile, lazily auto-execute in a pass;
    d) make codegen primitives such as DMA copy/tensorize lazily auto-execute in a pass, which must come after the schedule passes;
    e) make tiling a pass that takes both the data shape and the machine info as inputs and outputs the tiling schedule plan; the tiling pass can use a cost model trained offline by AutoTVM;
    f) let the other passes transform the IR according to the tiling pass's results;
    g) and, most importantly, define the virtual ISA.

We also need a very lightweight and fast compiler. CUDA is used on PCs and servers, so using NVCC (LLVM-based) as the JIT compiler is not a big issue, but for lightweight terminals this remains another challenge.

  2. Fatbin.
    We can save lots of versions in the library for different machine targets and choose the right one at runtime.

From the vendors' view, having only 2) incurs a huge cost in software distribution; in fact, CUDA has both of them.
From the developers' view, without support for 1) they can never ship their own operators, and the only way to utilize the accelerator is to call vendor libraries, since the compatibility of customized operators cannot be guaranteed. In that case, the burden on vendors becomes even heavier, and the intelligence and potential of developers cannot be unleashed.

All the solutions can be summarized as compression of a certain form; JIT itself just uses the (virtual) ISA as the target of the compression.

If the hardware cycle is around one year per generation, packing everything together or lazy downloading is not a bad solution, as long as the old code can still run (maybe not optimally). Note that we face the same problem in CUDA: PTX code generated for one generation of GPU does not necessarily run best on a newer one.

In short, I think it should not be hard to make old code run on new platforms, and such a virtual ISA (like PTX) can be supported in a low-level PTX-like form, or we could even just use the loop IR.

However, it is very hard to make code run optimally while still having a lightweight runtime. Such a thing is not supported even in CUDA.

@xqdan I like the idea of TVM JIT schedule transformation (1.b); it could be quite useful for reducing DLL bloat. The idea would be to pass the scheduling knobs as dynamic inputs to a program.

Overall, I wonder if this would also facilitate software updates: say the template doesn't change and new schedules are found to perform better, you wouldn't have to download new DLLs or re-compile them, which saves quite a bit of time.

@tqchen brings up a good point: if architectures don't change that often, it might not be worth implementing, especially given the runtime overheads…

To comment on your proposal: a parametrizable IR plus low-level codegen is an interesting idea. Since TVM's IR is serializable at all stages, it is doable by just making a cut. Most Python dependencies are optional, and most of the low-level code generation is in C++.

But my guess is that for Android we should solve the backward-compatibility problem first (make the old code run). Then we can use the AutoTVM infrastructure to let users auto-update the performance-critical operators for new hardware when a network connection is available, or simply do so via a software update. This will likely keep the runtime minimal for mobile devices.
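
A rough sketch of that flow (the log file name is hypothetical; it would come from an offline AutoTVM tuning run on the new hardware):

```python
# Keep one generic model, and when a tuning log for the new hardware becomes
# available (downloaded or pushed via a software update), rebuild only the
# performance-critical operators with it.
import tvm
from tvm import autotvm, relay

def build_with_log(mod, params, target, log_file="tuned_ops_new_device.log"):
    # Apply the best schedules recorded in the AutoTVM log during compilation.
    with autotvm.apply_history_best(log_file):
        with tvm.transform.PassContext(opt_level=3):
            return relay.build(mod, target=target, params=params)
```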

@tqchen I like the serializable and deserializable IR. We can save the serialized IR when building, analogous to PTX, then reload it and continue compiling it at runtime. We can even use this for debugging: save the problematic serialized IR in the testing environment, where debugging is not convenient, then re-construct the stmt AST and feed it to the buggy pass manually in a test case.
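
A small round-trip sketch with TVM's JSON serialization (the module here is fully lowered for brevity; in the proposal the cut would sit before device-specific passes such as storage rewrite):

```python
# Serialize the IR at a chosen cut point at build time, then reload it and
# finish compilation on the device (roughly the PTX + JIT pattern for CUDA).
import tvm
from tvm import te

A = te.placeholder((1024,), name="A")
B = te.compute((1024,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)
ir_mod = tvm.lower(s, [A, B])              # IRModule as the stored state

text = tvm.ir.save_json(ir_mod)            # ship this "virtual ISA" with the app
# ... later, on the device ...
reloaded = tvm.ir.load_json(text)
runnable = tvm.build(reloaded, target="llvm")   # finish compilation at runtime
```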

I love your idea of offline training plus updating and compiling on the device; with it, both compatibility and performance can be ensured.