[RFC][VTA] An HLS C VTA bug

ffffc · May 20, 2020, 8:14am

VTA doesn't produce correct results on Zynq UltraScale+ based devices. We once thought this was a coherency problem, but recently my partner @hht and I found what may be a bug in the HLS C VTA hardware.

Only the FINISH instruction can set the VTA_COMPUTE_DONE register in the compute module, and only a GEMM/ALU instruction resets it; LOAD/STORE instructions do not.

The last instruction in the instruction queue is FINISH. The processor polls the VTA_COMPUTE_DONE register to confirm that the VTA compute is complete. But VTA_COMPUTE_DONE remains set until VTA runs again and executes a GEMM/ALU instruction. If the processor reads the register on the next run before VTA executes a GEMM/ALU instruction, it sees the stale value and wrongly concludes that the new compute has already completed.
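The sequence above can be reproduced with a small software model. This is only a sketch of the behaviour as described in this post, not VTA source code; the names `Op`, `StickyDoneModel` and `stale_done_observed` are made up for illustration.

```cpp
#include <cassert>

// Hypothetical instruction kinds, named after the instructions in this post.
enum class Op { LOAD, STORE, GEMM, ALU, FINISH };

// Toy model of the done flag as described above: FINISH sets it,
// GEMM/ALU clear it, LOAD/STORE leave it untouched.
struct StickyDoneModel {
    bool compute_done = false;

    void execute(Op op) {
        switch (op) {
            case Op::FINISH:
                compute_done = true;
                break;
            case Op::GEMM:
            case Op::ALU:
                compute_done = false;
                break;
            case Op::LOAD:
            case Op::STORE:
                break;  // flag keeps whatever value it had
        }
    }

    // Plain read with no side effect, like the original register.
    bool poll() const { return compute_done; }
};

// Returns true if the host sees "done" during the second run
// before that run has executed its own FINISH.
bool stale_done_observed() {
    StickyDoneModel vta;

    // First run: ends with FINISH, so the flag is legitimately set.
    vta.execute(Op::LOAD);
    vta.execute(Op::GEMM);
    vta.execute(Op::STORE);
    vta.execute(Op::FINISH);
    assert(vta.poll());

    // Second run has only issued a LOAD so far; nothing has cleared the flag.
    vta.execute(Op::LOAD);
    return vta.poll();
}
```

In this model `stale_done_observed()` returns true: the host's poll during the second run still reads the flag left over from the first run.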

I modified the source code of the compute module IP to make the VTA_COMPUTE_DONE register clear on read.

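// Two-stage buffer of done_o[0], used to detect its rising edge.
reg [1:0] rise_done_buf;

always @(posedge ACLK) begin
    if (ARESET)
        rise_done_buf <= 2'b0;
    else if (ACLK_EN) begin
        rise_done_buf[0] <= done_o[0];
        rise_done_buf[1] <= rise_done_buf[0];
    end
end
// High for one cycle after done_o[0] goes from 0 to 1.
wire rise_done = rise_done_buf[0] & (~rise_done_buf[1]);

// Sticky done value: set on a rising edge of done, cleared when the done register is read.
reg [31:0] int_done_tmp;
always @(posedge ACLK) begin
    if (ARESET)
        int_done_tmp <= 32'b0;
    else if (ACLK_EN) begin
        if (rise_done)
            int_done_tmp <= 32'b1; // set takes priority over clear
        else if (ar_hs && raddr == ADDR_DONE_O_DATA_0)
            int_done_tmp <= 32'b0; // clear on read
    end
end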
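To check the intent of the Verilog above without running a simulator, here is a cycle-by-cycle software model of the same two always blocks. It is a sketch under assumptions: the names `ClearOnReadDone`, `tick` and `read` are hypothetical, and `read` assumes the bus returns the value held before the clear takes effect, which depends on the rest of the AXI-Lite slave logic not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Behavioural model of the rising-edge detector plus clear-on-read register.
struct ClearOnReadDone {
    bool buf0 = false;      // models rise_done_buf[0]
    bool buf1 = false;      // models rise_done_buf[1]
    uint32_t int_done = 0;  // models int_done_tmp

    // One clock edge. done_o is the raw done signal; read_hs is true when a
    // read address handshake targets the done register in this cycle.
    void tick(bool done_o, bool read_hs) {
        // Non-blocking semantics: every update uses the pre-edge values.
        const bool rise_done = buf0 && !buf1;
        if (rise_done) {
            int_done = 1;  // set has priority, as in the Verilog
        } else if (read_hs) {
            int_done = 0;  // clear on read
        }
        buf1 = buf0;
        buf0 = done_o;
    }

    // Assumed read behaviour: sample the current value, then apply the clear.
    uint32_t read(bool done_o) {
        const uint32_t value = int_done;
        tick(done_o, true);
        return value;
    }
};
```

With `done_o` held high, the register reads 1 once and 0 afterwards, so a host polling at the start of the next run no longer sees the previous run's completion.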

After this modification, I get correct results on the ZCU104 platform.

@thierry @liangfu

liangfu · May 22, 2020, 9:07am

Thanks for locating the bug @ffffc! Is there any way to modify this in HLS instead of modifying the generated Verilog, so that more people could avoid the pitfall?

hht · May 22, 2020, 9:59am

What about modifying the FINISH instruction to write a bool into a piece of shared memory? That way, the CPU can clear it after reading. @liangfu

Clear-on-read in a register is definitely the more efficient way. But if HLS programming cannot easily achieve this, clear-on-read in shared memory might be a workaround.
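A rough sketch of the shared-memory idea, modelled entirely on the host side. `SharedDoneFlag`, `device_finish` and `host_poll_and_clear` are hypothetical names, and `std::atomic` only stands in for a flag word in memory visible to both sides; on real hardware the word would have to live in a buffer that the CPU and VTA see coherently (or is uncached).

```cpp
#include <atomic>
#include <cstdint>

// Models a done flag kept in shared memory instead of a control register.
struct SharedDoneFlag {
    std::atomic<uint32_t> word{0};

    // What FINISH would do in this scheme: write 1 to the flag.
    void device_finish() {
        word.store(1, std::memory_order_release);
    }

    // Host side: read the flag and clear it in one step, so a later poll
    // cannot observe the completion of an earlier run.
    bool host_poll_and_clear() {
        return word.exchange(0, std::memory_order_acq_rel) == 1;
    }
};
```

The exchange makes the read and the clear a single operation; doing them as separate load and store would reopen a small window in which a new completion could be lost.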

hjiang · October 8, 2020, 4:33am

@hht, thanks for this post. The problem this solution tries to address looks very similar to the one targeted by this April PR https://github.com/apache/incubator-tvm-vta/pull/7 by @zhanghaohit. Could you help verify whether #7 on its own fixes your problem, without the Verilog change?

hht · October 8, 2020, 6:43am

@hjiang, thanks for your reply. I am glad the bug has been fixed.

youxiudeshouyeren · January 21, 2022, 8:48am

I encountered this problem again, on the PYNQ-ZU development board. If I keep the default settings, the program displays “RPC server: load module” and then the device gets stuck. If I set kBufferCoherent = false in vta/runtime.cc, I can run vta_get_started.py, but deploy_classification.py gives wrong classification results. If this bug has been fixed, why does it still appear? My environment is TVM 0.8 on a PYNQ-ZU (Zynq UltraScale+ xczu5eg-sfvc784-1-i), which is very similar to the Ultra96.