Getting started with the VTA Chisel backend

Ravenwater · June 18, 2019, 8:50pm

I want to start modifying the Chisel VTA design to add support for arbitrary posit configurations. Is there basic documentation on how to bootstrap a Chisel VTA design so that it can be developed, tested, and validated?

vegaluis · June 18, 2019, 8:58pm

Hi Theo,

I wrote some “alpha” documentation that should allow you to build VTA, run ISA or unit-tests, together with some matrix-multiplication and conv2d workloads used in ResNet-18.

https://github.com/vegaluisjose/vta-tsim-tests/blob/master/README.md

VTA
===

# Setup

1. Install `verilator` and `sbt`
2. Get tvm `git clone https://github.com/dmlc/tvm.git`
3. Change VTA target in `tvm/vta/config/vta_config.json` from `sim` to `tsim`
4. Build [tvm](https://docs.tvm.ai/install/from_source.html#build-the-shared-library)
5. Set environment variables using [Method 1](https://docs.tvm.ai/install/from_source.html#tvm-package)
6. Go to chisel directory `tvm/vta/hardware/chisel`
7. Build hardware shared library by running `make`

# Run unit tests

1. Go to `tvm/vta/tests/python/unittest`
2. Run `python3 test_vta_insn.py`

# Run other tests

This file has been truncated.

Let me know if you find any issue.

Ravenwater · June 18, 2019, 8:58pm

The target boards we have available are:

Zedboard
Ultra96
Xilinx VC707
Micron Advanced Computing Solutions AC-520 module with an Intel Arria 10
Achronix Speedster22i HD

Ravenwater · June 18, 2019, 9:00pm

We’ll also have access to an Achronix Speedster7t, which has special support for DL NN features.

If I am not mistaken, the University of Washington will have some of these boards as well.

From their description: The Speedster7t FPGA family represents a new class of technology. Based on a new, highly optimized architecture, the Speedster7t family goes beyond traditional FPGA solutions, delivering ASIC-like bandwidth performance, FPGA adaptability and enhanced functionality to streamline design. Manufactured on TSMC’s 7nm FinFET process, Speedster7t FPGAs feature a revolutionary new 2D network-on-chip (NoC), an array of new machine learning processors (MLPs) optimized for high-bandwidth and artificial intelligence/machine learning (AI/ML) workloads, high-bandwidth GDDR6 interfaces, 400G Ethernet and PCI Express Gen5 ports — all in a single device.

What we want to explore is how well the TVM compiler and the VTA architecture can target the MLPs, and how the result benchmarks against GPUs and TPUs.

vegaluis · June 18, 2019, 9:06pm

Regarding FPGA support, I am working on a prototype for F1, which will showcase the infrastructure needed to support other boards. Adding an FPGA backend is not only about generating a bitstream; other things have to be figured out so that the user can do import tvm from Python and have everything work.

Ravenwater · June 23, 2019, 12:07pm

Hi Luis: where does the vta python module reside? I am getting the following error after a successful build:

stillwater@sw-desktop-300:~/dev/clones/tvm/vta/tests/python/unittest$ python3 test_vta_insn.py
Traceback (most recent call last):
  File "test_vta_insn.py", line 23, in <module>
    import vta
ModuleNotFoundError: No module named 'vta'

Ravenwater · June 23, 2019, 12:17pm

After I added $TVM_HOME/vta/python to the PYTHONPATH, I ran into an LLVM dependency:

stillwater@sw-desktop-300:~/dev/clones/tvm/vta/tests/python/unittest$ python3 test_vta_insn.py
Traceback (most recent call last):
…
  File "/home/stillwater/dev/clones/tvm/python/tvm/_ffi/_ctypes/function.py", line 209, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (2) /home/stillwater/dev/clones/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7fcd262296c5]
  [bt] (1) /home/stillwater/dev/clones/tvm/build/libtvm.so(+0x3be337) [0x7fcd25a6e337]
  [bt] (0) /home/stillwater/dev/clones/tvm/build/libtvm.so(tvm::codegen::Build(tvm::Array<tvm::LoweredFunc, void> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc9a) [0x7fcd25b62b5a]
  File "/home/stillwater/dev/clones/tvm/src/codegen/codegen.cc", line 46
TVMError: Check failed: bf != nullptr: Target llvm is not enabled

So it looks like two pieces of information are missing from the guidance: add vta/python to PYTHONPATH, and make explicit that LLVM is required.
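For anyone who wants to check the PYTHONPATH part before running the tests, here is a minimal sketch. It only compares strings: it does not import tvm or vta, and it says nothing about whether LLVM was enabled in the build. The helper names are made up for illustration, and the `python` subdirectory for the tvm package is an assumption based on the install guide; only `vta/python` comes from this thread.

```python
import os


def required_python_dirs(tvm_home):
    """Directories expected on PYTHONPATH for `import tvm` and `import vta`."""
    return [
        os.path.join(tvm_home, "python"),         # assumed location of the tvm package
        os.path.join(tvm_home, "vta", "python"),  # vta package, as reported in this thread
    ]


def missing_from_pythonpath(tvm_home, pythonpath):
    """Return the required directories that are absent from a PYTHONPATH string."""
    present = {os.path.normpath(p) for p in pythonpath.split(os.pathsep) if p}
    return [d for d in required_python_dirs(tvm_home)
            if os.path.normpath(d) not in present]


if __name__ == "__main__":
    tvm_home = os.environ.get("TVM_HOME", "")
    if not tvm_home:
        print("TVM_HOME is not set")
    else:
        for d in missing_from_pythonpath(tvm_home, os.environ.get("PYTHONPATH", "")):
            print("missing from PYTHONPATH:", d)
```

Run as a script, it prints one line per missing directory and nothing when both are present.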

Ravenwater · June 23, 2019, 12:44pm

Success!

stillwater@sw-desktop-300:~/dev/clones/tvm/vta/tests/python/unittest$ python3 test_vta_insn.py
Initialize VTACommandHandle…
Load/store test took 619 clock cycles
Padded load test took 8579 clock cycles
GEMM schedule:default test took 648 clock cycles
GEMM schedule:smt test took 755 clock cycles
ALU SHL imm:True test took 1067 clock cycles
ALU MAX imm:True test took 1067 clock cycles
ALU MAX imm:False test took 1654 clock cycles
ALU ADD imm:True test took 1067 clock cycles
ALU ADD imm:False test took 1654 clock cycles
ALU SHR imm:True test took 1067 clock cycles
Relu test took 1738 clock cycles
Shift/scale test took 386 clock cycles
Close VTACommandhandle…

Ravenwater · June 23, 2019, 12:48pm

To be explicit for others that will encounter this problem: $TVM_HOME/vta/python is where the vta python module lives

Ravenwater · June 23, 2019, 2:25pm

Ok, any tests that you want me to write?

vegaluis · June 23, 2019, 2:44pm

Yeah, I forgot to mention this. Glad that you figured it out.

vegaluis · June 23, 2019, 2:45pm

Did you run matrix-multiply and the conv2d-resnet example from the repo I shared?

hjiang · June 24, 2019, 10:02pm

@vegaluis, @Ravenwater, I just tried tsim and want to share my experience in this same thread.
I tried test_vta_insn.py, the sample matrix-multiply, and the ResNet-18 tutorial. The clock-cycle measurement looks awesome, but I also experienced a couple of crashes and have the following questions.

#1. When running test_vta_insn.py, I saw what looked like failed test cases, such as “ALU MAX imm:False test took 1654 clock cycles”.
What does this ‘False’ mean?

#2. When I run the matrix-multiply and ResNet-18 examples, the Python kernel keeps crashing in libvta. Is there any workaround to fix this issue?

vegaluis · June 24, 2019, 10:10pm

Check the code that prints the message: the ALU tests use “immediate” as a boolean, so False means no immediate.

For the crash, can you elaborate a bit more? Are you getting an issue with loading the symbol? Or could you paste a log or something?
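On the `imm:True` / `imm:False` field in the test output: it is part of the test name, not a pass/fail status. A small sketch of how one might parse the report lines makes that explicit. The function names are hypothetical, and the line format is taken from the output posted earlier in this thread, so it may not match other TVM versions.

```python
import re

# Matches lines such as "ALU MAX imm:False test took 1654 clock cycles".
LINE_RE = re.compile(r"^(?P<name>.+?) test took (?P<cycles>\d+) clock cycles$")


def parse_cycle_report(text):
    """Map each test name to its reported clock-cycle count."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # skip lines such as "Initialize VTACommandHandle…"
            results[match.group("name")] = int(match.group("cycles"))
    return results


def uses_immediate(test_name):
    """Return True/False for an imm:True/imm:False suffix, or None if absent."""
    match = re.search(r"imm:(True|False)$", test_name)
    return None if match is None else match.group(1) == "True"
```

With this, `uses_immediate("ALU MAX imm:False")` reports that the test ran without an immediate operand, and the test still completed with a cycle count.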

hjiang · June 24, 2019, 10:38pm

Hi Vegaluis,

Thanks for the prompt reply. The assembly code follows. Somehow libvta cannot find the symbol, even after I changed O2 to O0 in the Chisel Makefile and the TVM CMakeLists.txt.

Regards
Hua

Dump of assembler code for function _ZN3vta4tsim6Device6LaunchEmmmmmmjj:
0x00007f830cf585de <+0>: push %rbp
0x00007f830cf585df <+1>: mov %rsp,%rbp
0x00007f830cf585e2 <+4>: sub $0x30,%rsp
0x00007f830cf585e6 <+8>: mov %rdi,-0x8(%rbp)
0x00007f830cf585ea <+12>: mov %rsi,-0x10(%rbp)
0x00007f830cf585ee <+16>: mov %rdx,-0x18(%rbp)
0x00007f830cf585f2 <+20>: mov %rcx,-0x20(%rbp)
0x00007f830cf585f6 <+24>: mov %r8,-0x28(%rbp)
0x00007f830cf585fa <+28>: mov %r9,-0x30(%rbp)
0x00007f830cf585fe <+32>: mov -0x8(%rbp),%rax
0x00007f830cf58602 <+36>: mov 0x10(%rax),%rax
=> 0x00007f830cf58606 <+40>: mov (%rax),%rax
0x00007f830cf58609 <+43>: add $0x38,%rax

(gdb) i r rax
rax 0x0 0

hjiang · June 25, 2019, 3:04am

After calling simulator.tsim_init("libvta_hw") in the matrix-multiplication example, the crash is gone, but tsim gets stuck at
/tsim/tsim_driver.cc:126 val = dev_->ReadReg(0x00);

Ravenwater · June 25, 2019, 2:33pm

Running into a segfault:

stillwater@sw-desktop-300:~/dev/clones/tvm$ python3 vta/tests/python/integration/test_benchmark_topi_conv2d.py
key=resnet-cfg[1]
Conv2DWorkload(batch=1, height=56, width=56, in_filter=64, out_filter=64, hkernel=3, wkernel=3, hpad=1, wpad=1, hstride=1, wstride=1)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00708832 sec/op, 32.6186 GOPS
key=resnet-cfg[2]
Conv2DWorkload(batch=1, height=56, width=56, in_filter=64, out_filter=64, hkernel=1, wkernel=1, hpad=0, wpad=0, hstride=1, wstride=1)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.0011501 sec/op, 22.3373 GOPS
key=resnet-cfg[3]
Conv2DWorkload(batch=1, height=56, width=56, in_filter=64, out_filter=128, hkernel=3, wkernel=3, hpad=1, wpad=1, hstride=2, wstride=2)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00399722 sec/op, 28.9215 GOPS
key=resnet-cfg[4]
Conv2DWorkload(batch=1, height=56, width=56, in_filter=64, out_filter=128, hkernel=1, wkernel=1, hpad=0, wpad=0, hstride=2, wstride=2)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00031024 sec/op, 41.4036 GOPS
key=resnet-cfg[5]
Conv2DWorkload(batch=1, height=28, width=28, in_filter=128, out_filter=128, hkernel=3, wkernel=3, hpad=1, wpad=1, hstride=1, wstride=1)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00686752 sec/op, 33.6673 GOPS
key=resnet-cfg[6]
Conv2DWorkload(batch=1, height=28, width=28, in_filter=128, out_filter=256, hkernel=3, wkernel=3, hpad=1, wpad=1, hstride=2, wstride=2)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00465832 sec/op, 24.817 GOPS
key=resnet-cfg[7]
Conv2DWorkload(batch=1, height=28, width=28, in_filter=128, out_filter=256, hkernel=1, wkernel=1, hpad=0, wpad=0, hstride=2, wstride=2)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00045912 sec/op, 27.9776 GOPS
key=resnet-cfg[8]
Conv2DWorkload(batch=1, height=14, width=14, in_filter=256, out_filter=256, hkernel=3, wkernel=3, hpad=1, wpad=1, hstride=1, wstride=1)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.0063071 sec/op, 36.6588 GOPS
key=resnet-cfg[9]
Conv2DWorkload(batch=1, height=14, width=14, in_filter=256, out_filter=512, hkernel=3, wkernel=3, hpad=1, wpad=1, hstride=2, wstride=2)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00296182 sec/op, 39.0319 GOPS
key=resnet-cfg[10]
Conv2DWorkload(batch=1, height=14, width=14, in_filter=256, out_filter=512, hkernel=1, wkernel=1, hpad=0, wpad=0, hstride=2, wstride=2)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00025994 sec/op, 49.4155 GOPS
key=resnet-cfg[11]
Conv2DWorkload(batch=1, height=7, width=7, in_filter=512, out_filter=512, hkernel=3, wkernel=3, hpad=1, wpad=1, hstride=1, wstride=1)
----- CONV2D CPU End-to-End Test-------
Time cost = 0.00715512 sec/op, 32.3141 GOPS
key=resnet-cfg[0]
Conv2DWorkload(batch=1, height=224, width=224, in_filter=16, out_filter=64, hkernel=7, wkernel=7, hpad=3, wpad=3, hstride=2, wstride=2)
----- CONV2D End-to-End Test-------
Initialize VTACommandHandle…
Segmentation fault (core dumped)

Any pointers to debug this?

vegaluis · June 27, 2019, 10:04pm

I think your problem might be related to building VTA with the “tsim” target enabled. Can you run the following?

cat /tvm/vta/config/vta_config.json

If you changed the target after building VTA, then it will definitely crash.
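The same check can be scripted. The sketch below is only an illustration: the helper names are hypothetical, and the `"TARGET"` key is an assumption about the layout of vta_config.json, so confirm it against your checkout. It reads the configured target only, so it cannot tell whether the library was built before or after the file was edited.

```python
import json


def target_from_config_text(text):
    """Extract the VTA target from the JSON text of vta_config.json."""
    return json.loads(text)["TARGET"]  # key name is an assumption


def read_vta_target(config_path):
    """Read the VTA target from a vta_config.json file on disk."""
    with open(config_path) as config_file:
        return target_from_config_text(config_file.read())


def is_tsim(target):
    """True when the configured target is the tsim backend discussed in this thread."""
    return target == "tsim"
```

For example, `is_tsim(read_vta_target("vta/config/vta_config.json"))`, run from the TVM checkout, should be true before building the hardware shared library. If the file was changed after the build, rebuild before running the tests.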

vegaluis · June 27, 2019, 10:08pm

This benchmark is the same as the one in the repo I created, resnet18_conv2d.py.

The other one is crashing because the hardware shared library is not being loaded. We are still working on how the time evaluator will behave for VTA-TSIM (the main reason why it is not mainstream yet).

cnjsdfcy · August 12, 2019, 9:14am

Hi Luis,

Both tests in https://github.com/vegaluisjose/vta-tsim-tests failed after updating the TVM repo. Could you please check?

Thanks.