Hi, all~
This RFC is to upstream support for our TY-NNP accelerator backend. We are from the AI accelerator toolchain team of Intellifusion, which has been focusing on developing vision processors that accelerate deep neural networks for visual recognition and search on endpoints, such as IP cameras and robots, as well as in the cloud.
Nowadays, TVM has become the most important component in our AI software stack, and we would like to upstream our work back to the community. We believe participating in the open-source ecosystem will benefit both our internal software infrastructure and our customers!
Overall architecture
TY-NNP refers to the neural network accelerator architecture serving a wide range of our edge AI scenarios. TY-NNP follows a typical NPU design, offloading neural network computation workloads to several kinds of domain-specific computing units. Generally, there are three kinds of computing units:
- NU (neural units)

  NU is designed for high-throughput computation of typical neural-network workloads such as Conv/Matmul. Compared to TensorCores in NVGPUs, NU works in a coarse-grained fashion from a software perspective. Instead of software programming of fine-grained M * N * K mma intrinsics, NU provides CISC-style instructions and a bundle of hardware configurations to developers. The NU components automatically load input/weight data from input buffers, execute fine-grained mma operations with hardware tiling control, and store results to output buffers.

  In TVM, we program the NU with customized TIR intrinsics. Developers use schedules to lower the specified computation patterns to NU intrinsics, arrange the on-chip input/output buffers, and perform tuning to determine the best hardware configurations.
- VU (vector units)

  VU accelerates general computation workloads that cannot fit the NU. TY-NNP provides a set of on-chip VU cores, each with its own on-chip buffer (called VM), a set of vector/scalar function units, and physical registers. VU programming is just like general vectorized programming on CPUs.

  In TVM, to offload computation to the VU, developers should schedule the computation into a vectorizable form, arrange the on-chip input/output buffers, and mark the proper computation axis with `vectorize` or replace it with VU intrinsics.
- CU (control units)

  CU can be seen as a small on-chip core without high compute capability. It aims to control the on-chip execution flow, and the whole on-chip kernel execution starts from the CU.
TY-NNP uses an explicitly managed memory hierarchy: each computing unit has its own buffer, and there is a global on-chip buffer (called DM) used to transfer data between the units. Data transfers are performed explicitly by asynchronous DMA operations, and explicit/implicit synchronizations are used to avoid hazards. In TVM, DMA and synchronization are also represented by TIR intrinsics.
An off-chip storage (called DDR) is used to transfer data between host and device; it provides much larger space than the on-chip buffers and supports dynamic memory allocation. In TVM, the DDR storage simply corresponds to the storage scope `kGlobal` and is managed by the runtime.
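To make this programming model concrete, below is a minimal TensorIR sketch (not taken from our codebase). The schedule primitives used (`cache_read`/`cache_write`, `split`, `vectorize`, `tensorize`) are existing upstream APIs, while the on-chip scope tag and the intrinsic name are hypothetical placeholders rather than our actual backend definitions:

```python
import tvm
from tvm.script import tir as T


# A toy elementwise kernel standing in for a real workload.
@T.prim_func
def scale(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (64, 128), "float16")
    B = T.match_buffer(b, (64, 128), "float16")
    for i, j in T.grid(64, 128):
        with T.block("scale"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * T.float16(2)


sch = tvm.tir.Schedule(scale)
blk = sch.get_block("scale")

# Memory hierarchy: stage the DDR ("global") operands into on-chip buffers.
# "global.vm" is a hypothetical scope tag; backend passes would later lower such
# staging copies into asynchronous DMA intrinsics plus the required synchronizations.
sch.cache_read(blk, 0, "global.vm")
sch.cache_write(blk, 0, "global.vm")

# VU: schedule the computation into a vectorizable form and mark the inner axis.
i, j = sch.get_loops(blk)
jo, ji = sch.split(j, factors=[None, 32])
sch.vectorize(ji)

# NU: conv/matmul patterns would instead be rewritten into CISC-style intrinsics, e.g.
#   sch.tensorize(inner_loop, "ty_nnp.nu_conv2d")  # hypothetical TensorIntrin name
```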
Implementation design
The current TVM compilation stack for TY-NNP is as follows:
Relay level
- We use a fusion pass based on a dedicated hardware cost model. Beyond traditional heuristic-based fusion for `conv-bn-relu`-like patterns, it performs a much more aggressive strategy to merge multiple anchor ops (such as conv) into a single device kernel. This brings opportunities to schedule multiple anchor ops simultaneously, which we think is essential to saturate our NPU hardware.
- A schedule-aware layout rewrite mechanism is added. Our TIR schedule phase rewrites tensor layouts to fit hardware features, so we modify the compile engine to give a chance for compatible updates at the Relay level.
TIR level
A key difference from the current CPU/GPU design is that we try to schedule and tune blocks of multiple ops. It is fine to compute a single heavy op per kernel on a GPU device, but we think an NPU prefers to launch a block of consecutive ops to avoid frequent kernel launches. The fusion pass proposed above is one way to achieve this.
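To illustrate what scheduling a block of consecutive ops as one kernel looks like, here is a small sketch using only existing upstream APIs; the two elementwise ops are toy stand-ins for the kind of op blocks produced by the fusion pass, not our actual workloads:

```python
import tvm
from tvm import te

# Two consecutive elementwise ops built into a single PrimFunc, so that one
# tir.Schedule (and ultimately one kernel launch) covers both of them.
A = te.placeholder((1, 64, 56, 56), "float16", name="A")
relu = te.compute(A.shape, lambda *i: te.max(A(*i), tvm.tir.const(0, "float16")), name="relu")
scaled = te.compute(A.shape, lambda *i: relu(*i) * tvm.tir.const(2, "float16"), name="scaled")

prim_func = te.create_prim_func([A, scaled])
sch = tvm.tir.Schedule(prim_func)

# Both blocks are visible to the same schedule and can be tiled, staged, or pipelined together.
relu_block = sch.get_block("relu")
scaled_block = sch.get_block("scaled")
sch.compute_inline(relu_block)  # or keep them separate and pipeline data movement between them
```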
Also, since the main efforts of the TVM community are on CPU/GPU backends, there are still pain points when developing TIR support for an NPU-style backend. It took some struggling to make things work through the standard schedule → lower flow.
- We use the TensorIR schedule ([RFC] TensorIR: A schedulable IR for TVM) to schedule the computations. As far as we know, this is the first trial of the TensorIR schedule on an NPU infrastructure.
- A set of new schedule primitives are added to utilize hardware features.
- A set of new tir passes are added to utilize hardware features.
- We use the `device_scope` attr to mark the kernel part of the code. The community host-dev split mechanism works well for us.
Target level
- For codegen, we developed `class CodeGenTYNNPLLVM : public CodeGenLLVM`.
- For runtime, we developed `class TYNNPDeviceAPI : public DeviceAPI`.
How to run
Dependencies
The TY-NNP backend depends on the following prebuilt binaries:
- LLVM libraries with TY-NNP target support
- TY-NNP assembler
- TY-NNP driver libraries with integrated simulator
They will be made available along with the upstreaming. Also, we would be more than glad to provide Docker environments for anyone interested in our hardware.
Playing
All dependencies are integrated into the codegen and runtime, so users can just use the general TVM interfaces in the normal way, with only two extra cmake options:
```cmake
# enable TY-NNP support in config.cmake
set(USE_TYNNP ${path to TY-NNP toolchains})
set(USE_LLVM ${path to llvm-config of TY-NNP target support})
```
```python
# test from tir
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev)
    b = tvm.nd.array(b_np, dev)
    f = tvm.build(primfunc, target="ty-nnp")
    f(a, b)
```
```python
# test from relay
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev)
    lib = relay.build(relay_module, target="ty-nnp")
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input(0, a)
    m.run()
    b = m.get_output(0)
```
CI Integration
Although we maintain full scenario tests in our internal repositories, it would be great if some key features (e.g., the conv op) could be covered by the community CI. We could provide Docker images that enable the backend testing environment. Any detailed suggestions for CI integration are very welcome!
What we want to contribute
Currently, our backend code lives in the `contrib` subdirectories of the corresponding code directories:
- c++: `src/contrib/ty_nnp` (except codegen/runtime)
- python: `python/tvm/contrib/ty_nnp`
- unittests: `tests/python/contrib/ty_nnp`
They can be summarized into the following aspects:
TY-NNP codegen and runtime
Runtime is in `src/runtime/contrib/ty_nnp` and LLVM codegen is in `src/target/ty_nnp`.
- This will introduce a new device type `kDLTYNNP` and a new target name `ty-nnp`. The corresponding codegen/runtime code is incremental and does not affect upstream source code.
- A set of new `StorageRank` enums has to be added to specify the different on-chip buffer types. We would be glad to know the best way to define this kind of target-related information.
TIR optimizations on TY-NNP target
TIR code is mainly in `src/contrib/ty_nnp/tir`.
- This will introduce a set of backend TIR passes for TY-NNP hardware features, such as DMA intrinsics, synchronizations, and static address allocation. They are designed for our hardware only; users call `ty_nnp.build_config()` to get the specific pass context.
- In the `tvm.build` process, we introduce more flexible configurations, such as disabling standard passes that are incompatible with ours (a sketch follows below).
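For reference, a rough sketch of the kind of context `ty_nnp.build_config()` returns; the config key shown is an existing upstream option used purely as an example, and our real backend configuration differs:

```python
import tvm


def build_config_sketch():
    # A hedged stand-in for ty_nnp.build_config(): bundle backend-specific pass
    # configurations and disable standard passes that conflict with them.
    return tvm.transform.PassContext(
        opt_level=3,
        config={
            "tir.disable_vectorize": True,  # existing upstream option, shown only as an example
        },
    )


with build_config_sketch():
    # tvm.build(primfunc, target="ty-nnp") or relay.build(...) would run under this context
    pass
```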
TensorIR schedule proposal
- We would like to introduce a set of new schedule primitives:
  - Imperative loop partition

    Users can either partition the loops and blocks immediately at the schedule phase, or perform the partition lazily in the `loop_partition` pass. This helps a lot in non-perfect tiling cases or where boundary conditions are not directly supported by the hardware.

    ```python
    _, _, h_axis, w_axis, _ = s.get_loops(block)

    # imperative: partition right away
    partitioned = s.loop_partition([h_axis, w_axis], lazy=False)
    # `partitioned` is a tree-structured data structure tracing the partitioned blocks
    my_visit(partitioned)

    # lazy: only a hint tag is added; the actual partition happens in the loop_partition pass
    s.loop_partition([h_axis, w_axis], lazy=True)
    ```
  - Buffer/loop primitives duality

    TVM already provides very convenient primitives for loops. However, it would be great to explicitly manage memory orders as well as computation orders. We believe that for many NPU scenarios it is essential to control the data layout of on-chip memory buffers. TensorIR can control buffer dim alignment, but that is not enough: on-chip buffers with locality to NPU-specific function units (imagine TensorCore) can take totally different memory layouts. This also benefits any architecture with a manageable memory hierarchy.

    Just like we get nested loops with `get_loops(block)`, we make a dual design to get buffer axes, e.g. `get_write_buffer_axes(block, write_idx)`, and conduct buffer layout scheduling on these axes. Below is a table listing the primitive duality between loop schedules and the proposed buffer schedules:

    | Loop schedule | Buffer schedule |
    |---|---|
    | `get_loops` | `get_write_buffer_axes`, `get_read_buffer_axes` |
    | `split` | `buffer_split` |
    | `fuse` | `buffer_fuse` |
    | `reorder` | `buffer_reorder` |
    | `loop_extent_align` | `buffer_dim_align` |
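    A short sketch of how these proposed buffer primitives might be used; the primitives do not exist upstream yet, and the exact signatures here are illustrative only:

    ```python
    # `s` is a tir.Schedule and `block` writes an NCHW buffer.
    n, c, h, w = s.get_write_buffer_axes(block, 0)   # proposed primitive
    co, ci = s.buffer_split(c, factor=16)            # proposed: NCHW -> NCHW16c on-chip layout
    s.buffer_reorder(n, co, h, w, ci)                # proposed primitive
    ```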
- Accommodated scheduling and tuning, mainly in `python/tvm/contrib/ty_nnp/topi`

  Currently, the schedule/tuning logic is designed for our hardware features only. However, we are very interested in whether there are common methodologies for such NPU schedule designs. We would like to refine our code into more general schedule/tuning support in the TensorIR modules if such opportunities exist!
Relay accommodation
Mainly in `python/tvm/contrib/ty_nnp/relay` and `src/contrib/ty_nnp/relay`, as described in the implementation design above.
- Currently, our fusion pass depends on hardware-specific cost models. We'd like to refine our code into an auto-fusion framework with third-party cost models if possible.
- Schedule-aware layout rewrite transformation. We add a Relay pass to perform a "pre-schedule" which determines the best data/weight layouts, and then the pass rewrites the Relay-level layouts according to the signature of the PrimFunc. Currently, we have to hack the compile engine to find the pre-scheduled PrimFunc from a standalone cache; we would be glad to know the best way to achieve this goal.
- To utilize the scheduling described above, we propose to insert a customization point into the compile engine, which could be different from the fallback schedule, auto-schedule, and meta-schedule.
- We add some customized Relay ops such as `sum_pool2d`. We would be glad to add them as standard Relay ops if they are generally useful.
Summary
- We implemented the TY-NNP runtime and codegen. They are introduced as standalone modules with the `USE_TYNNP` compile option.
- We integrate TensorIR (and the corresponding Relay adaptations) to perform scheduling and optimization for our target. This will introduce some adaptations and new features to upstream code. Perhaps we should split them into standalone PRs/RFCs?
Thanks for all your attention; any suggestions or comments would be appreciated. We look forward to contributing consistently as part of the community.