Hi @MJKlaiber,
I think this is somewhat similar to memory scopes being implemented by @csullivan. There is definitely additional work to be done to handle memory planning in a memory scope world. I think some of that falls under [µTVM] microTVM M2 Roadmap.
There are a couple of points to this proposal I’d like to highlight to drive the discussion:
P1. “map to hardware” is currently ambiguous in TVM. Specifically, the “hardware” part. Ostensibly, at runtime, this means “running a computation on a particular DLDevice.” However, getting from the compiler to the runtime is tricky because:
- the compiler’s concept of a “device type” is merely the `device_type` field of DLDevice
- in particular, BYOC devices are all considered “ext_dev.” For BYOC devices in which the underlying hardware is identical but programmed differently (e.g. imagine several small FPGA instances), this is very limiting. There is no way to express “accelerator type.”
- even for e.g. CPU co-processors, there isn’t a good way to identify them outside of an integer index. How does a programmer know that `DLDevice(kDLCPU, 1)` means the DSP core? Do they have to actually maintain some enumeration in both Python (e.g. to drive TVM) and C (e.g. at runtime)? This seems terrible (see the sketch just after this list).
- We currently conflate the concept of “relay backend” with both the concept of “accelerator type” (e.g. how is this accelerator programmed? is it used for e.g. convolution or pooling?) and the concept of “code generator” (e.g. generating C code implies it will run on `target_host`; to do something different, subclass CodegenC and name it differently).
- In general, in the TVM C++ runtime, this is less of a problem because much of the “device programming” is pushed to “load time,” which in the C++ runtime case is typically done as late as possible by `Module#GetFunction`. This clashes with what you’d expect in microTVM: pushing as much of the programming into the C compiler as possible. In general, the TVM compiler is now somewhat unaware of its runtime environment.
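To make the first point above concrete, here is a minimal Python sketch of what a user is left with today; the `DEVICE_IDS` table and its names are made up for illustration, and a mirror of it would also have to live in the C runtime code:

```python
import tvm

# Today, the only handle on "which CPU" is an integer device id:
dsp = tvm.device("cpu", 1)  # is device id 1 the DSP core? nothing in TVM says so

# In practice the mapping has to be maintained by hand, once here to drive TVM
# and again in C at runtime (these names are made up):
DEVICE_IDS = {
    "host_cpu": 0,
    "dsp_cpu": 1,
}
dsp = tvm.device("cpu", DEVICE_IDS["dsp_cpu"])
```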
P2. There are a couple of different ways to interpret `set_scope` (sketched just after this list):
- Inputs and outputs must be in this memory region
- Only outputs need to be in this memory region
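For example, a minimal TE sketch of where the ambiguity shows up; the scope string `"global.sram"` is purely a placeholder for whatever named region a platform might define:

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
C = te.compute((n,), lambda i: B[i] * 2.0, name="C")

s = te.create_schedule(C.op)
# Attach a memory scope to the stage that computes B (reads A, writes B).
s[B].set_scope("global.sram")  # placeholder scope name
# Interpretation 1: both the input A it reads and the output B it writes must
#                   live in "global.sram".
# Interpretation 2: only the buffer produced for B must live there; A may stay
#                   wherever the planner put it.
```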
P3. When copying between memory scopes, there are typically two flavors (a toy illustration follows the list):
- synchronous copy, handled by the CPU
- asynchronous copy, handled by e.g. DMA or an accelerator
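As a toy, TVM-independent illustration of why this matters to the memory planner and runtime: the async flavor hands back a completion handle and introduces an explicit synchronization point before the consumer may run. Everything below is hypothetical:

```python
import threading

def sync_copy(src: bytearray, dst: bytearray) -> None:
    """Blocking copy done by the CPU; the caller stalls until it completes."""
    dst[:] = src

def async_copy(src: bytearray, dst: bytearray) -> threading.Event:
    """Copy handed off to a DMA-like engine; the caller gets a completion
    handle and must synchronize before touching dst."""
    done = threading.Event()

    def dma_worker():
        dst[:] = src  # stand-in for the DMA transfer
        done.set()    # stand-in for the DMA completion interrupt

    threading.Thread(target=dma_worker, daemon=True).start()
    return done

src = bytearray(b"\x01" * 64)
dst = bytearray(64)
handle = async_copy(src, dst)
# The memory planner / runtime must insert this sync before the consumer runs:
handle.wait()
```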
It’d be good to discuss P2 & P3 a bit further to better understand the impact on the runtime and memory planner.
Towards solving P1
It seems like we could help solve this problem by motivating a new concept in the compiler, e.g. `target_arch`, which is effectively device_type but with additional provision for BYOC accelerators and multi-core SoCs. `target_arch` could be things like:
- `target_host_cpu`
- `dsp_cpu`
- `cuda`
- `fpga_conv2d`
- `accelerator` (maybe there is just one of these in a system)
`target_arch` should be tied to both a codegen and a “load” process, but could largely be a platform-specific string. It should have meaning to the end user, e.g. “dsp_cpu” should be a concrete concept to them.
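As a strawman only (none of these names or structures exist in TVM today), `target_arch` could amount to a small registry that ties a platform-meaningful name to a codegen and a load step:

```python
from typing import Callable, NamedTuple

class TargetArch(NamedTuple):
    name: str                      # platform-specific, meaningful to the end user
    codegen: str                   # which code generator produces code for it
    load: Callable[[bytes], None]  # how the generated artifact gets onto the device

def load_via_host_linker(artifact: bytes) -> None:
    """e.g. link the generated code into the host firmware image."""

def load_via_dsp_bootloader(artifact: bytes) -> None:
    """e.g. flash/boot the generated code onto the DSP core."""

TARGET_ARCHS = {
    "target_host_cpu": TargetArch("target_host_cpu", codegen="llvm", load=load_via_host_linker),
    "dsp_cpu": TargetArch("dsp_cpu", codegen="c", load=load_via_dsp_bootloader),
    "fpga_conv2d": TargetArch("fpga_conv2d", codegen="byoc-fpga", load=load_via_host_linker),
}
```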