[VTA] [FPGA] Stuck on load module vadd.o

Hej,

I am trying to run the VTA logic either on the DE10-Nano Board or the DE10-Standard. So far I tested it with both.

I am fairly certain that my CMA/DMA is running correctly. I can allocate and free memory in through the cma driver. Once I start the rpc server, it immediatly reserves/takes about 180MB of CMA-Memory, only leaving roughly 100KB left.

When running the vta_get_started.py, my host also properly connects to the rpc server and my custom bit-stream is programmed onto the board. (The provided precombiled bitstream doesn’t work for me and throws an error when trying to flash it)

However once It uploaded the “vadd.o”, both server and get_started script are just looping indefinetly. I do not get any error messages from the kernel module im “dmesg”.

Right now I am out of ideas on where to continue my debugging journey. Does it hang cause I no longer have enough free CMA-Memory available ? Or do I miss something entirely different ? How much CMA-Memory does the VTA-Core need anyway? 180MB seems like a lot to me and pretty aggresive. Could it be that the bitstream generated through chisel and the intel workflow is not properly set up so no communication happens between hps and vta-core?

RPC-Server log:

root@de-10-stdanrad-flspr:~# ./work/tvm/apps/vta_rpc/start_rpc_server.sh
/root/work/tvm
{'INP_WIDTH': 8, 'WGT_WIDTH': 8, 'ACC_WIDTH': 32, 'OUT_WIDTH': 8, 'BATCH': 1, 'BLOCK_IN': 16, 'BLOCK_OUT': 16, 'UOP_BUFF_SIZE': 32768, 'INP_BUFF_SIZE': 32768, 'WGT_BUFF_SIZE': 262144, 'ACC_BUFF_SIZE': 131072, 'OUT_BUFF_SIZE': 32768, 'INP_ELEM_BITS': 128, 'WGT_ELEM_BITS': 2048, 'ACC_ELEM_BITS': 512, 'OUT_ELEM_BITS': 128, 'INP_ELEM_BYTES': 16, 'WGT_ELEM_BYTES': 256, 'ACC_ELEM_BYTES': 64, 'OUT_ELEM_BYTES': 16, 'acc_dtype': 'int32', 'inp_dtype': 'int8', 'wgt_dtype': 'int8', 'out_dtype': 'int8', 'BITSTREAM': '1x16_i8w8a32_15_15_18_17', 'MODEL': 'de10nano_1x16_i8w8a32_15_15_18_17', 'mock_mode': False, '_mock_env': None, '_dev_ctx': None, '_last_env': None}
{'INP_WIDTH': 8, 'WGT_WIDTH': 8, 'ACC_WIDTH': 32, 'OUT_WIDTH': 8, 'BATCH': 1, 'BLOCK_IN': 16, 'BLOCK_OUT': 16, 'UOP_BUFF_SIZE': 32768, 'INP_BUFF_SIZE': 32768, 'WGT_BUFF_SIZE': 262144, 'ACC_BUFF_SIZE': 131072, 'OUT_BUFF_SIZE': 32768, 'INP_ELEM_BITS': 128, 'WGT_ELEM_BITS': 2048, 'ACC_ELEM_BITS': 512, 'OUT_ELEM_BITS': 128, 'INP_ELEM_BYTES': 16, 'WGT_ELEM_BYTES': 256, 'ACC_ELEM_BYTES': 64, 'OUT_ELEM_BYTES': 16, 'acc_dtype': 'int32', 'inp_dtype': 'int8', 'wgt_dtype': 'int8', 'out_dtype': 'int8', 'BITSTREAM': '1x16_i8w8a32_15_15_18_17', 'MODEL': 'de10nano_1x16_i8w8a32_15_15_18_17', 'mock_mode': False, '_mock_env': None, '_dev_ctx': None, '_last_env': None}
2024-05-07 12:12:40.624 INFO bind to 0.0.0.0:9091
2024-05-07 12:13:58.878 INFO connected from ('192.168.137.164', 56286)
2024-05-07 12:13:58.880 INFO start serving at /tmp/tmpihxtvvr6
INFO:root:Skip reconfig_runtime due to same config.
INFO:root:Loading VTA library: /root/work/tvm/vta/python/vta/../../../build/libvta.so
DE10-Nano-Mgr: Programming FPGA from image /tmp/tmpihxtvvr6/output_file.rbf
DE10-Nano-Mgr: Open RBF file                    : PASS in 25 us
DE10-Nano-Mgr: Enable FPGA configuration        : PASS in 10 us
DE10-Nano-Mgr: Wait for FPGA to reset           : PASS in 4 us
DE10-Nano-Mgr:             stat: 0x00000049
DE10-Nano-Mgr:             msel: 9
DE10-Nano-Mgr:             mode: RESET_PHASE
DE10-Nano-Mgr: Release FPGA from reset          : PASS in 3 us
DE10-Nano-Mgr: Wait for configuration phase     : PASS in 1085 us
DE10-Nano-Mgr:             stat: 0x0000004a
DE10-Nano-Mgr:             msel: 9
DE10-Nano-Mgr:             mode: CONFIG_PHASE
DE10-Nano-Mgr: Clear nSTATUS interrupt bit      : PASS in 4 us
DE10-Nano-Mgr: Enable configuration on AXI      : PASS in 3 us
DE10-Nano-Mgr: Write configuration Image        : PASS in 3498339 us
DE10-Nano-Mgr: Write configuration Image        : written 7007204 B
DE10-Nano-Mgr: Wait for CONF_DONE               : PASS in 5 us
DE10-Nano-Mgr:             stat: 0x0000004b
DE10-Nano-Mgr:             msel: 9
DE10-Nano-Mgr:             mode: INIT_PHASE
DE10-Nano-Mgr: Disable configuration on AXI     : PASS in 4 us
DE10-Nano-Mgr: Clear DCLK DONE status           : PASS in 4 us
DE10-Nano-Mgr: Send DCLK for init phase         : PASS in 4 us
DE10-Nano-Mgr: Wait for DCLK                    : PASS in 4 us
DE10-Nano-Mgr: Clear DCLK status flag           : PASS in 3 us
DE10-Nano-Mgr: Wait for FPGA user mode          : PASS in 1081 us
DE10-Nano-Mgr: Release control                  : PASS in 4 us
DE10-Nano-Mgr: EXIT SUCCESS in 3500982 us
INFO:root:Program FPGA with output_file.rbf
INFO:root:Loading VTA library: /root/work/tvm/vta/python/vta/../../../build/libvta.so
2024-05-07 12:14:03.876 INFO load_module /tmp/tmpihxtvvr6/vadd.o

Vta_get_started.py Log:

# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((1, 64, 1, 16), "int32"), B: T.Buffer((1, 64, 1, 16), "int32"), C: T.Buffer((1, 64, 1, 16), "int8")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        A_buf = T.allocate([1024], "int32", "global")
        B_buf = T.allocate([1024], "int32", "global")
        A_buf_1 = T.Buffer((1024,), "int32", data=A_buf)
        for i1, i3 in T.grid(64, 16):
            cse_var_1: T.int32 = i1 * 16 + i3
            A_1 = T.Buffer((1024,), "int32", data=A.data)
            A_buf_1[cse_var_1] = A_1[cse_var_1]
        B_buf_1 = T.Buffer((1024,), "int32", data=B_buf)
        for i1, i3 in T.grid(64, 16):
            cse_var_2: T.int32 = i1 * 16 + i3
            B_1 = T.Buffer((1024,), "int32", data=B.data)
            B_buf_1[cse_var_2] = B_1[cse_var_2]
        A_buf_2 = T.Buffer((1024,), "int32", data=A_buf)
        for i1, i3 in T.grid(64, 16):
            cse_var_3: T.int32 = i1 * 16 + i3
            A_buf_2[cse_var_3] = A_buf_1[cse_var_3] + B_buf_1[cse_var_3]
        for i1, i3 in T.grid(64, 16):
            cse_var_4: T.int32 = i1 * 16 + i3
            C_1 = T.Buffer((1024,), "int8", data=C.data)
            C_1[cse_var_4] = T.Cast("int8", A_buf_2[cse_var_4])
[14:14:03] /home/flspr/work/tvm/src/script/printer/tir/expr.cc:246: Warning: No TScriptPrinterName attribute for tir.vta.command_handle
[14:14:03] /home/flspr/work/tvm/src/script/printer/tir/expr.cc:246: Warning: No TScriptPrinterName attribute for tir.vta.command_handle
[14:14:03] /home/flspr/work/tvm/src/script/printer/tir/expr.cc:246: Warning: No TScriptPrinterName attribute for tir.vta.uop_push
[14:14:03] /home/flspr/work/tvm/src/script/printer/tir/expr.cc:246: Warning: No TScriptPrinterName attribute for tir.vta.coproc_dep_push
[14:14:03] /home/flspr/work/tvm/src/script/printer/tir/expr.cc:246: Warning: No TScriptPrinterName attribute for tir.vta.coproc_dep_pop
[14:14:03] /home/flspr/work/tvm/src/script/printer/tir/expr.cc:246: Warning: No TScriptPrinterName attribute for tir.vta.command_handle
[14:14:03] /home/flspr/work/tvm/src/script/printer/tir/expr.cc:246: Warning: No TScriptPrinterName attribute for tir.vta.coproc_sync
# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((1, 64, 1, 16), "int32"), B: T.Buffer((1, 64, 1, 16), "int32"), C: T.Buffer((1, 64, 1, 16), "int8")):
        T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)})
        vta = T.int32()
        with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 2):
            T.call_extern("int32", "VTALoadBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), A.data, 0, 64, 1, 64, 0, 0, 0, 0, 0, 3)
            T.call_extern("int32", "VTALoadBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), B.data, 0, 64, 1, 64, 0, 0, 0, 0, 64, 3)
            with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_uop_scope", "VTAPushALUOp"):
                T.call_extern("int32", "VTAUopLoopBegin", 64, 1, 1, 0)
                T.tir.vta.uop_push(1, 0, 0, 64, 0, 2, 0, 0)
                T.call_extern("int32", "VTAUopLoopEnd")
            T.tir.vta.coproc_dep_push(2, 3)
        with T.attr(T.iter_var(vta, None, "ThreadIndex", "vta"), "coproc_scope", 3):
            T.tir.vta.coproc_dep_pop(2, 3)
            T.call_extern("int32", "VTAStoreBuffer2D", T.tvm_thread_context(T.tir.vta.command_handle()), 0, 4, C.data, 0, 64, 1, 64)
        T.tir.vta.coproc_sync()