Hello, I have a question about VTA hardware.
While trying to understand how the VTA hardware works, I am having trouble understanding how the dependence queues work.
Dumping the instructions with debug_flag=0x6 shows the following:
INSTRUCTION 0: NOP-STORE-STAGE
    dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
    l2g_queue = 0, g2l_queue = 0
    s2g_queue = 1, g2s_queue = 0
INSTRUCTION 1: NOP-STORE-STAGE
    dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
    l2g_queue = 0, g2l_queue = 0
    s2g_queue = 2, g2s_queue = 0
INSTRUCTION 2: LOAD UOP
    dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
    DRAM: 0x18600000, SRAM:0x0000
    y: size=1, pad=[0, 0]
    x: size=8, stride=8, pad=[0, 0]
    l2g_queue = 0, g2l_queue = 0
    s2g_queue = 1, g2s_queue = 0
INSTRUCTION 3: GEMM
    dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
    reset_out: 1
    range (0, 8)
    outer loop - iter: 56, wgt: 0, inp: 0, acc: 1
    inner loop - iter: 2, wgt: 0, inp: 0, acc: 448
    l2g_queue = 0, g2l_queue = 1
    s2g_queue = 1, g2s_queue = 0
INSTRUCTION 4: LOAD UOP
    dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
    DRAM: 0x18600008, SRAM:0x0008
    y: size=1, pad=[0, 0]
    x: size=8, stride=8, pad=[0, 0]
    l2g_queue = 0, g2l_queue = 1
    s2g_queue = 0, g2s_queue = 0
INSTRUCTION 5: GEMM
    dep - pop prev: 0, pop next: 0, push prev: 1, push next: 0
    reset_out: 1
    range (8, 16)
    outer loop - iter: 56, wgt: 0, inp: 0, acc: 1
    inner loop - iter: 2, wgt: 0, inp: 0, acc: 448
    l2g_queue = 0, g2l_queue = 2
    s2g_queue = 0, g2s_queue = 0
INSTRUCTION 6: LOAD INP
    dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
    DRAM: 0x06040000, SRAM:0x0000
    y: size=9, pad=[1, 0]
    x: size=56, stride=56, pad=[1, 1]
    l2g_queue = 0, g2l_queue = 1
    s2g_queue = 0, g2s_queue = 0
INSTRUCTION 7: LOAD WGT
    dep - pop prev: 0, pop next: 0, push prev: 0, push next: 1
    DRAM: 0x00600d00, SRAM:0x0000
    y: size=2, pad=[0, 0]
    x: size=9, stride=36, pad=[0, 0]
    l2g_queue = 1, g2l_queue = 1
    s2g_queue = 0, g2s_queue = 0
INSTRUCTION 8: LOAD INP
    dep - pop prev: 0, pop next: 1, push prev: 0, push next: 0
    DRAM: 0x06040000, SRAM:0x0244
    y: size=9, pad=[1, 0]
    x: size=56, stride=56, pad=[1, 1]
    l2g_queue = 1, g2l_queue = 0
    s2g_queue = 0, g2s_queue = 0
INSTRUCTION 9: LOAD WGT
    dep - pop prev: 0, pop next: 0, push prev: 0, push next: 1
    DRAM: 0x00600d48, SRAM:0x0012
    y: size=2, pad=[0, 0]
    x: size=9, stride=36, pad=[0, 0]
    l2g_queue = 2, g2l_queue = 0
    s2g_queue = 0, g2s_queue = 0
My guess is that everything up to instruction 5 is an initialization phase, since the GEMM instructions there have reset_out set, and since LOAD INP and LOAD WGT should appear before LOAD UOP and GEMM.
So I listed what is happening in the dependence queues.
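To check my reading of the counters, here is a minimal Python sketch that replays the pop/push flags from the dump in program order. It rests on assumptions of mine, not on anything I have confirmed in the VTA sources: that LOAD INP and LOAD WGT run on the load module, that LOAD UOP and GEMM run on the compute module, that NOP-STORE-STAGE runs on the store module, and that "prev"/"next" refer to the neighbouring module in the load -> compute -> store order. The names Insn, STAGE_OF, and replay are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass

# Assumed mapping from the opcode text in the dump to the module that runs it.
STAGE_OF = {
    "LOAD INP": "load",
    "LOAD WGT": "load",
    "LOAD UOP": "compute",
    "GEMM": "compute",
    "NOP-STORE-STAGE": "store",
}


@dataclass
class Insn:
    op: str
    pop_prev: int = 0
    pop_next: int = 0
    push_prev: int = 0
    push_next: int = 0


def replay(insns):
    """Return (l2g, g2l, s2g, g2s) after each instruction, in program order."""
    q = {"l2g": 0, "g2l": 0, "s2g": 0, "g2s": 0}
    history = []
    for insn in insns:
        stage = STAGE_OF[insn.op]
        if stage == "load":
            # Load has no previous module; its "next" is compute.
            q["g2l"] -= insn.pop_next
            q["l2g"] += insn.push_next
        elif stage == "compute":
            # Compute sits in the middle: "prev" is load, "next" is store.
            q["l2g"] -= insn.pop_prev
            q["s2g"] -= insn.pop_next
            q["g2l"] += insn.push_prev
            q["g2s"] += insn.push_next
        else:
            # Store has no next module; its "prev" is compute.
            q["g2s"] -= insn.pop_prev
            q["s2g"] += insn.push_prev
        history.append((q["l2g"], q["g2l"], q["s2g"], q["g2s"]))
    return history


# Dependence flags copied from instructions 0 to 9 of the dump.
DUMP = [
    Insn("NOP-STORE-STAGE", push_prev=1),
    Insn("NOP-STORE-STAGE", push_prev=1),
    Insn("LOAD UOP", pop_next=1),
    Insn("GEMM", push_prev=1),
    Insn("LOAD UOP", pop_next=1),
    Insn("GEMM", push_prev=1),
    Insn("LOAD INP", pop_next=1),
    Insn("LOAD WGT", push_next=1),
    Insn("LOAD INP", pop_next=1),
    Insn("LOAD WGT", push_next=1),
]

HISTORY = replay(DUMP)
for index, counts in enumerate(HISTORY):
    print(index, counts)
```

Under these assumptions the replay gives the same l2g/g2l/s2g/g2s values as the dump for all ten instructions, so the counters look like a tally taken in program order rather than a snapshot of the hardware at run time.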
However, even after looking at these queue counts, I'm not sure how task-level parallelism is achieved.
Can anyone tell me how this mechanism works?
Thank you for your help.