Presenting code generation for the Gemmini accelerator

Hello all!

During the past months I have been working on integrating the Gemmini accelerator into the TVM framework. The main goal of this work was to use microTVM to generate C code together with the specific instructions and function calls for the Gemmini accelerator. I am opening this thread to explain to the community what the integration brings and how it works. This is the code that was used to generate the results of the paper “Integration of a systolic array based hardware accelerator into a DNN operator auto-tuning framework” (url). My plan is to upload it to GitHub and open a pull request in the very near future, so I will update this post when that is ready.

Introduction

The Gemmini project is a tightly coupled tensor operation accelerator and part of the Chipyard ecosystem. It communicates with the Rocket core using RoCC commands, which can be generated using custom RISC-V instructions.

The main computation unit is a square systolic array of dimensions DIM x DIM, built from individual tiles, each tile composed of one or more processing elements (multiply-and-accumulate units). The accelerator also includes two internal SRAMs, the scratchpad and the accumulator, and a DMA engine to transfer data between these memories and the external DRAM.

The project is highly configurable in order to tackle a variety of platforms and use cases. As such, the software that targets this accelerator must also be able to generate C code based on the parameters of the implemented hardware. This is why I think the TVM project, with its BYOC workflow, is the ideal framework to generate the necessary code.

About Gemmini’s RoCC instructions

Gemmini exposes a set of instructions to the CPU, giving the programmer the possibility of implementing DNN operators in a variety of different forms. Two different kinds of instructions can be used:

  • Intrinsic instructions: These are the basic, simpler instructions provided by Gemmini. They provide fine-grained control over the accelerator, and consist of:
    • Configuration and maintenance instructions: they are used to flush Gemmini’s internal TLB and to configure the move pipeline and the execution pipeline of the accelerator.
    • Move instructions: these instructions move patches of a given number of rows and columns between the external DRAM and the internal SRAMs.
    • Execute instructions: these instructions execute the actual matrix multiplication using data available in the scratchpad and/or the accumulator.
  • Loop instructions: these are complex instructions that execute a complete operation using hardcoded finite state machines. Two different types of loop instructions exist:
    • Matrix multiplication: performs a tiled matrix multiplication, allowing for matrices larger than the DIM x DIM systolic array. All matrix sizes are configurable at runtime.
    • Convolution: performs a tiled 2D convolution. All matrix sizes are configurable at runtime.

Integration details

Once a prequantized model is imported into TVM using the standard TensorFlow Lite frontend, a Relay preprocessing pass must be executed to modify the IRModule, grouping supported operations into custom Gemmini operators. This pass uses TVM’s standard pattern-matching language to group different Relay operators and replace them with the appropriate Gemmini operator.
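As a rough illustration of how this grouping can be expressed (a simplified sketch only, not the exact patterns used in the integration; the composite name "gemmini.qnn_conv2d" is illustrative), the Relay dataflow pattern language and the MergeComposite pass can be combined like this:

```python
from tvm import relay
from tvm.relay.dataflow_pattern import is_op, wildcard

# Simplified sketch: match qnn.conv2d -> nn.bias_add -> qnn.requantize and mark the
# sequence as a composite function that the Gemmini codegen can later replace with
# its own operator. The real pass uses more detailed patterns and checks.
def make_qnn_conv2d_pattern():
    conv = is_op("qnn.conv2d")(wildcard(), wildcard(), wildcard(),
                               wildcard(), wildcard(), wildcard())
    bias = is_op("nn.bias_add")(conv, wildcard())
    return is_op("qnn.requantize")(bias, wildcard(), wildcard(),
                                   wildcard(), wildcard())

pattern_table = [("gemmini.qnn_conv2d", make_qnn_conv2d_pattern())]

def preprocess(mod):
    # Group the matched operator sequences into composite functions.
    return relay.transform.MergeComposite(pattern_table)(mod)
```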

Supported operators

Some operators have two implementation strategies available, in order to generate code either using only the intrinsic instructions or using the functions provided by the Gemmini developers (which internally use the “CISC type” loop instructions). Having these two strategies available is useful because one can use AutoTVM to tune code based on the intrinsic instructions and compare its throughput against the functions provided by the Gemmini developers.

Support for the following operators will be provided:

  • 2D convolution (pattern matching around qnn.conv2d)
  • Depthwise 2D convolution (pattern matching around qnn.conv2d)
  • Fully connected layer (pattern matching around qnn.dense)
  • Max pooling layer (pattern matching around nn.max_pool2d)
  • Adding layer (pattern matching around qnn.add)

The pattern matching also supports merging operators into other operators. For example, Gemmini can accelerate a convolution followed by a max pooling operator directly in hardware. The integration supports this by replacing the sequence qnn.conv2d + nn.max_pool2d with a single operator, which offloads both the convolution and the max pooling to the hardware.
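A sketch of what such a merged pattern could look like, reusing the hypothetical make_qnn_conv2d_pattern() from the snippet above (again, the names are illustrative, not the integration’s actual ones):

```python
from tvm.relay.dataflow_pattern import is_op

# Simplified sketch: wrap the convolution pattern with nn.max_pool2d so that the
# whole sequence is grouped into a single composite operator.
def make_qnn_conv2d_maxpool_pattern():
    return is_op("nn.max_pool2d")(make_qnn_conv2d_pattern())

# The larger conv+pool pattern is listed before the plain conv pattern so that
# MergeComposite prefers the bigger match when both apply.
pattern_table = [
    ("gemmini.qnn_conv2d_maxpool", make_qnn_conv2d_maxpool_pattern()),
    ("gemmini.qnn_conv2d", make_qnn_conv2d_pattern()),
]
```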

A note about quantization

Defining custom operators for the five previously mentioned layers allows us to implement some algebraic transformations that merge the quantization parameters into the bias of each of these layers. This is similar to how qnn.conv2d manages the output requantization, but it goes one step further: Gemmini only allows scaling the output when reading data from the SRAM into the DRAM, and does not provide a way to offset this data. Therefore, the output offset of the requantization operator that is merged into the custom operator is also folded into the bias of the layer.
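A toy numerical example of the idea (illustrative only, not the actual transformation; the fold is exact up to rounding):

```python
# Standard requantization:                y = round(acc * scale) + zp_out
# Gemmini can only scale on mvout, so the output offset is folded into the bias,
# i.e. pre-added to the int32 accumulator: y = round((acc + zp_out / scale) * scale)
scale, zp_out = 0.05, 3
acc = 1000                                # int32 accumulator (conv result + original bias)
y_reference = round(acc * scale) + zp_out
acc_with_folded_offset = acc + round(zp_out / scale)
y_gemmini = round(acc_with_folded_offset * scale)
print(y_reference, y_gemmini)             # both print 53
```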

Provided examples

For each of the supported layers, the integration provides a tutorial that can be used as an example of how to use it. It also provides a tutorial that compiles an entire MobileNet for the Gemmini accelerator. The tests are executed on the Spike RISC-V ISA Simulator, which already includes a functional simulator of the Gemmini accelerator.

Open questions and pending work

  1. Is it possible to run the Spike simulator in the CI? We would actually need to run a particular patched version of the Spike simulator, which provides the extension for the Gemmini functional simulator.
  2. Where would be the correct place to put the tutorial files? There is no “contrib” folder for tutorials, right? I have placed them under tvm/contrib/gemmini, together with most of the rest of the code, but if there is a better place to put them, please let me know!

Edit: pull request can be found here


@fPecc,

Thanks for making such outstanding work public!

Personally, I am looking forward to seeing this become part of TVM’s contributed features.

Let me try to answer the second question:

 Q2:  A good place and way to plug into TVM is:
      * https://github.com/apache/tvm/blob/main/gallery/tutorial/uma.py
      * https://tvm.apache.org/docs/tutorial/uma.html

UMA is also a TVM contrib module. It is worth considering coupling Gemmini through the UMA interface, which offers a very elegant way into the core layers of TVM: target, strategy, passes, and codegen.

But let’s also hear the core developers’ opinion on this.


I also have a question, perhaps a bit beyond Gemmini:

Related to the auto-{schedule,tune} flows, where operations are optimized and, most importantly for an accelerator (having a dedicated ISA around some kind of CPU), tensorized, there is a problem with TVM’s metric/measurement/evaluation step:

  • Measurements should run on the real target CPU+accelerator (with rpc / remote).
  • In the absence of the real target, auto-tuning/scheduling implies some kind of simulator.

I see that your approach uses Spike.

The question would be whether we could use, “on the fly”, for the measurement / simulation steps:

  • QEMU, a universal simulator (RISC-V ready), already used in some TVM examples.
  • the TVM Verilator module, for cycle-accurate simulation of the extended ISA part.

Looking forward to your opinion on the question; I am also facing the same issue targeting a custom ISA.


Hi @cbalint13, thanks for the feedback!

Regarding UMA, I am already in contact with the developers of this feature in order to see if we can migrate what I did to UMA. We are still working a little on that part.

Regarding the auto-tuning question, you are right: real hardware needs to be used to measure the execution of the different schedules via RPC. Although I will provide functional tests that run on Spike, it’s true that the Spike simulator cannot be used to auto-tune the schedules, because it is only a functional simulator. In my case, I executed the auto-tuning by measuring on an FPGA implementation, using a Rocket Chip RISC-V core coupled with the Gemmini accelerator, communicating with the Linux host and exchanging RPC messages via UART. My idea was to first provide the functional tests with the Spike simulator, and then later see how we can also provide examples for measuring on real hardware during auto-tuning.
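For readers unfamiliar with that flow, here is a minimal sketch of how AutoTVM measurements over RPC are usually wired up (the device key, tracker address and tuning options below are placeholders, not the exact configuration used in this setup, which uses a UART transport):

```python
from tvm import autotvm

# The board registers itself with an RPC tracker under a device key ("gemmini" is a
# placeholder name); kernels are built locally and measured on the remote target.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=60),
    runner=autotvm.RPCRunner(
        "gemmini",            # device key registered with the RPC tracker (placeholder)
        host="127.0.0.1",     # RPC tracker address (placeholder)
        port=9190,
        number=5,             # runs per measurement
        timeout=120,
    ),
)

# A tuning task is then tuned with one of the AutoTVM tuners, e.g.:
# tuner = autotvm.tuner.XGBTuner(task)
# tuner.tune(n_trial=1000, measure_option=measure_option)
```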


Thank you @fPecc for sharing your thoughts.

Facing similar issues here: yes, real hardware is the ultimate reference in the auto-scheduling scheme for accurate results.

I found a way of doing the simulation part differently, and I would like to discuss with the TVM folks whether we can improve this “simulation” step. I will try to describe it here a bit:

Describe the auto-scheduler TIR desc/sched parts, but with T.call_pure_extern() instead of an llvm_intrinsic.

  • a) the extern function can be a NOP, e.g. __asm("nop"), so it would be a pure simulation
  • b) the extern function can be a verilated HDL model that mimics the behaviour of the invoked module.

The results of (a) and (b) are fine: it always ends up with well-tensorized code, and it is always able to emit e.g. C code (just replace the NOP/HDL externs back with the real HW calls). The goal is (auto-)tensorized code without real hardware, but the approach also fits well for simulation on e.g. x86 host hardware.
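To make variant (a) concrete, here is a minimal sketch in the classic TE tensorize style (TVMScript’s T.call_pure_extern is the analogous construct); the extern symbol gemm_nop_stub is a placeholder that, on the host, can simply contain __asm("nop"):

```python
import tvm
from tvm import te

def gemm_nop_intrin(dim=16):
    # Declare a DIM x DIM int8 GEMM whose lowered body is a single extern call, so
    # the schedule tensorizes exactly as it would for a real accelerator kernel.
    a = te.placeholder((dim, dim), dtype="int8", name="a")
    b = te.placeholder((dim, dim), dtype="int8", name="b")
    k = te.reduce_axis((0, dim), name="k")
    c = te.compute(
        (dim, dim),
        lambda i, j: te.sum(a[i, k].astype("int32") * b[k, j].astype("int32"), axis=k),
        name="c",
    )
    Ab = tvm.tir.decl_buffer(a.shape, a.dtype, offset_factor=1)
    Bb = tvm.tir.decl_buffer(b.shape, b.dtype, offset_factor=1)
    Cb = tvm.tir.decl_buffer(c.shape, c.dtype, offset_factor=1)

    def intrin_func(ins, outs):
        ib = tvm.tir.ir_builder.create()
        aa, bb = ins
        cc = outs[0]
        # call_extern instead of an llvm intrinsic: "gemm_nop_stub" is a placeholder
        # C symbol; on the host it can just execute __asm__("nop"), and is later
        # swapped for the real hardware call.
        ib.emit(tvm.tir.call_extern("int32", "gemm_nop_stub",
                                    aa.access_ptr("r"), bb.access_ptr("r"),
                                    cc.access_ptr("w")))
        return ib.get()

    return te.decl_tensor_intrin(c.op, intrin_func, binds={a: Ab, b: Bb, c: Cb})
```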

I will open a new topic on this way of simulating things; if you don’t mind, I will refer to you as Cc.


Hi @fPecc,

I just came across your great work. I am really curious about your work on the integration of Gemmini via UMA. Have you made any progress on this? Looking forward to your answer. Samira

Hi @SamiraAhm ,

I am still planning to tackle that in the near future, but I haven’t started yet.