Hello everyone! I’m running some experiments based on this link. Currently I’m hitting a bottleneck:
how can I optimize dequantization GEMV on the Hexagon DSP to achieve higher efficiency?
I modified some test scripts; the link is as follows: hexagon_dequant_gemv
Current results:
[1, 2048] * [2048, 2048] = 92 us
dequantize_gemv = 2 ms
Regarding dequantization, part of the schedule may need to be optimized. I’d be grateful for any advice, thanks.
Allocating the weight matrix directly in VTCM is not an option for e2e models, because we somehow need to move the data from global memory (DDR) to VTCM (SRAM). That transfer can be a separate step or part of the computation, but either way its time must be counted during evaluation.
Hi, Hzfengsy
dequantize = T.alloc_buffer((2048, 2048), "float16", scope="global.vtcm")
I tried to have the dequantized variable generated inside the function, but I don’t know much about how to control the lifetime of the variable inside the function.
For example, if the lifetime of the variable can be controlled inside the function, I can use some temporary variables to control the vtcm variable. This may be more flexible for the use of vtcm.
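To make the idea of a short-lived temporary concrete, here is a minimal pure-Python sketch, not the code from the linked script. It dequantizes one column tile at a time into a scratch list that only lives for one loop iteration, instead of materializing the full 2048 x 2048 float16 matrix. The group-wise scale layout, the function names, and the `tile_cols` parameter are all assumptions made for illustration; the thread does not state the actual quantization format.

```python
def dequant_gemv_reference(x, qweight, scales, group_size):
    """Reference: dequantize the whole matrix first, then do the GEMV."""
    k_dim, n_dim = len(qweight), len(qweight[0])
    # Full dequantized matrix (the analogue of one big (K, N) buffer).
    weight = [[qweight[k][j] * scales[k // group_size][j] for j in range(n_dim)]
              for k in range(k_dim)]
    return [sum(x[k] * weight[k][j] for k in range(k_dim)) for j in range(n_dim)]


def dequant_gemv_tiled(x, qweight, scales, group_size, tile_cols):
    """Same result, but only one (K, tile_cols) dequantized tile exists at a time."""
    k_dim, n_dim = len(qweight), len(qweight[0])
    y = [0.0] * n_dim
    for j0 in range(0, n_dim, tile_cols):
        j1 = min(j0 + tile_cols, n_dim)
        # Scratch tile: stands in for a small temporary buffer whose lifetime
        # is a single iteration of this loop (hypothetical VTCM analogue).
        tile = [[qweight[k][j] * scales[k // group_size][j] for j in range(j0, j1)]
                for k in range(k_dim)]
        for k in range(k_dim):
            for j in range(j0, j1):
                y[j] += x[k] * tile[k][j - j0]
    return y


# Tiny example with assumed data: K = 4, N = 3, one scale row per 2 input rows.
x = [1, 2, 3, 4]
qweight = [[1, -2, 3], [0, 1, -1], [2, 2, 2], [-3, 0, 1]]
scales = [[0.5, 1.0, 2.0], [0.25, 0.5, 1.0]]
y_tiled = dequant_gemv_tiled(x, qweight, scales, group_size=2, tile_cols=2)
```

Whether a schedule can express the same scoping on the DSP depends on the TVM version and the schedule primitives used, so this is only a model of the data flow.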
You are right. There are several points:
Thanks for your reply. The modified code is here (updated); it should be a fair comparison now. The weights are loaded in DDR and passed to the kernel, which creates a VTCM buffer for the later calculations. I will learn how to use sch later. The current scheduling optimization is still very simple.
Is there a method to allocate a weight matrix in VTCM for e2e models, considering the need to transfer data from DDR to SRAM? Can this be a separate step or integrated into computation, and how do we account for time during evaluation?
Allocating directly in VTCM does not seem to work, because VTCM is only 8 MB and copying from global DDR to VTCM takes a long time. At present, I have tried using L2 fetch to improve the cache hit rate. However, a dequantized 2048*2048 GEMV still takes about 200 us, which is not very fast.
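As a quick sanity check on the size argument, the arithmetic can be sketched as below. It reads the 8 MB figure quoted above as 8 MiB. The 4-bit packed size is an assumption added for comparison, since the thread does not say which quantized format is used.

```python
MIB = 1024 * 1024
VTCM_BYTES = 8 * MIB  # VTCM size as quoted in the thread, read as 8 MiB


def buffer_bytes(rows, cols, bits_per_element):
    """Size in bytes of a dense rows x cols buffer."""
    return rows * cols * bits_per_element // 8


# Fully dequantized float16 weights for the 2048 x 2048 case.
fp16_bytes = buffer_bytes(2048, 2048, 16)
# Hypothetical 4-bit packed weights (format assumed, not taken from the thread).
int4_bytes = buffer_bytes(2048, 2048, 4)

print(fp16_bytes // MIB, "MiB for float16 weights")
print(int4_bytes // MIB, "MiB for assumed 4-bit weights")
print("float16 weights alone fill VTCM:", fp16_bytes >= VTCM_BYTES)
```

Under these assumptions the dequantized float16 matrix alone occupies all of VTCM, which leaves no room for the input vector, output, or scales.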