We have developed a compiler for a new AI processor: the front end is based on Clang, and we added a new LLVM backend for the ASIC.
The ASIC has two on-chip memories, implemented as scratchpad memory (SPM), to buffer the neural network input/output and the synapse (weight) data; we call them NRAM (Neural Data RAM) and WRAM (Weight Data RAM). It also provides a large off-chip memory to hold input and output data.
The ASIC also provides some powerful instructions for common NN operations, e.g. convolution, pooling, activation, matmul, transpose, MLP, etc. It also provides variable-length (not fixed-size) vector instructions, e.g. vector addition, subtraction, less-than, greater-than, etc., as well as IO instructions to load data from off-chip memory into on-chip memory (NRAM or WRAM) and to store on-chip results back to off-chip memory.
The high-level programming language we designed is similar to CUDA: it is an extension of a subset of C/C++. It provides memory-space attributes to indicate where data resides, e.g. __nram__ means the data is buffered in the NRAM, and __wram__ means it is buffered in the WRAM. Besides, the compiler provides a series of intrinsic functions, similar to AVX intrinsics on x86; e.g. __sv_add is lowered into a vector-addition instruction by the LLVM backend, while __conv is lowered into the convolution instructions.
Let’s take the AXPY kernel as an example; the device code is shown below:
#define LEN 1024
__global__ void AxpyKernel(half *a, half *x, half *y, half *out) {
  __nram__ half t_a[LEN];
  __nram__ half t_x[LEN];
  __nram__ half t_y[LEN];
  __nram__ half t_out[LEN];
  __load(t_a, a, LEN * sizeof(half));
  __load(t_x, x, LEN * sizeof(half));
  __sv_mul(t_out, t_a, t_x, LEN);
  __load(t_y, y, LEN * sizeof(half));
  __sv_add(t_out, t_out, t_y, LEN);
  __store(t_out, out, LEN * sizeof(half));
}
The above kernel is a basic implementation for a small data size. However, if the input data size exceeds the on-chip buffer size, a highly tuned implementation requires loop tiling, vectorization, loop unrolling, double buffering to overlap IO with computation, data prefetching, and so on. Manual optimization may take weeks to complete. Moreover, implementing a more complex NN operator, e.g. ROI pooling or proposal in Faster R-CNN, or some newly emerging NN operators, takes even more engineering effort.
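To make the tiling requirement concrete, here is a minimal host-side C sketch of the loop structure such an optimized kernel would have. It is only illustrative: memcpy stands in for the __load/__store IO intrinsics, a scalar loop stands in for __sv_mul/__sv_add, and the names TILE and axpy_tiled are hypothetical, not part of our real API.

```c
#include <stddef.h>
#include <string.h>

#define TILE 256  /* elements per on-chip tile; illustrative size */

/* Process n elements in TILE-sized chunks so the working set always
 * fits in the (simulated) on-chip buffers. */
void axpy_tiled(float *out, const float *a, const float *x,
                const float *y, size_t n) {
  /* Stand-ins for __nram__ tile buffers. */
  float t_a[TILE], t_x[TILE], t_y[TILE], t_out[TILE];
  for (size_t base = 0; base < n; base += TILE) {
    size_t len = (n - base < TILE) ? (n - base) : TILE;
    /* __load: off-chip -> on-chip. With double buffering, the load of
     * tile i+1 would be issued here, overlapping the compute of tile i. */
    memcpy(t_a, a + base, len * sizeof(float));
    memcpy(t_x, x + base, len * sizeof(float));
    memcpy(t_y, y + base, len * sizeof(float));
    /* __sv_mul followed by __sv_add on the current tile. */
    for (size_t i = 0; i < len; ++i)
      t_out[i] = t_a[i] * t_x[i] + t_y[i];
    /* __store: on-chip -> off-chip. */
    memcpy(out + base, t_out, len * sizeof(float));
  }
}
```

Even this simplest tiled form already needs tail handling for sizes that are not a multiple of the tile; adding double buffering, unrolling, and prefetching on top of it by hand is exactly the effort we hope TVM can automate.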
Therefore, we want to retarget TVM to our ASIC. As a first step, we prefer to use TVM as an in-house code generator that emits device kernel code. I have already glanced over TVM’s implementation; much of it is about GPU support. If I ported TVM to our ASIC from scratch, much time would be consumed on runtime support, but I only want to use TVM as an in-house code generator that emits device kernel source code containing the intrinsic functions, or even LLVM IR or LLVM backend assembly. So, do you guys have any constructive suggestions?