Need help on Hexagon HVX Optimization

Hzfengsy · August 28, 2023, 5:11am

Hello everyone!

I’m exploring how to run LLMs on Hexagon, and I began with a GEMV operator. I’ve tested a script on an 8G2 device. The results are interesting: it performs well with N = K = 512, but slows down significantly with N = K = 1024.

# N = K = 512
Time (ms): 0.0078       Total Bytes (MB): 0.50  Memory (GB/s): 64.35
# N = K = 1024
Time (ms): 0.1753       Total Bytes (MB): 2.00  Memory (GB/s): 11.43

The scheduled and the lowered func are:

@T.prim_func
def func(A: T.Buffer((1, 1, 1024), "float16"), B: T.Buffer((1024, 1024), "float16"), C: T.Buffer((1, 1, 1024), "float16")):
    T.func_attr({"tir.noalias": T.bool(True)})
    # with T.block("root"):
    for i0, i1 in T.grid(1, 1):
        for i2_0 in T.parallel(4):
            for i2_1 in range(2):
                for i2_2_init in T.unroll(2):
                    for i2_3_init in T.vectorized(64):
                        with T.block("NT_matmul_init"):
                            v_i0, v_i1 = T.axis.remap("SS", [i0, i1])
                            v_i2 = T.axis.spatial(1024, i2_0 * 256 + i2_1 * 128 + i2_2_init * 64 + i2_3_init)
                            T.reads()
                            T.writes(C[v_i0, v_i1, v_i2])
                            C[v_i0, v_i1, v_i2] = T.float16(0)
                for k_0 in range(256):
                    for i2_2 in T.unroll(2):
                        for k_1 in range(4):
                            for i2_3 in T.vectorized(64):
                                with T.block("NT_matmul_update"):
                                    v_i0, v_i1 = T.axis.remap("SS", [i0, i1])
                                    v_i2 = T.axis.spatial(1024, i2_0 * 256 + i2_1 * 128 + i2_2 * 64 + i2_3)
                                    v_k = T.axis.reduce(1024, k_0 * 4 + k_1)
                                    T.reads(C[v_i0, v_i1, v_i2], A[v_i0, v_i1, v_k], B[v_k, v_i2])
                                    T.writes(C[v_i0, v_i1, v_i2])
                                    C[v_i0, v_i1, v_i2] = C[v_i0, v_i1, v_i2] + A[v_i0, v_i1, v_k] * B[v_k, v_i2]
# from tvm.script import ir as I
# from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(A: T.Buffer((1, 1, 1024), "float16"), B: T.Buffer((1024, 1024), "float16"), C: T.Buffer((1, 1, 1024), "float16")):
        T.func_attr({"tir.noalias": T.bool(True)})
        for i2_0 in T.parallel(4):
            for i2_1 in range(2):
                cse_var_1: T.int32 = i2_0 * 256 + i2_1 * 128
                C_1 = T.Buffer((1024,), "float16", data=C.data)
                C_1[cse_var_1:cse_var_1 + 64] = T.Broadcast(T.float16(0), 64)
                C_1[cse_var_1 + 64:cse_var_1 + 64 + 64] = T.Broadcast(T.float16(0), 64)
                for k_0 in range(256):
                    A_1 = T.Buffer((1024,), "float16", data=A.data)
                    B_1 = T.Buffer((1048576,), "float16", data=B.data)
                    for k_1 in range(4):
                        C_1[cse_var_1:cse_var_1 + 64] = C_1[cse_var_1:cse_var_1 + 64] + T.Broadcast(A_1[k_0 * 4 + k_1], 64) * B_1[k_0 * 4096 + k_1 * 1024 + i2_0 * 256 + i2_1 * 128:k_0 * 4096 + k_1 * 1024 + i2_0 * 256 + i2_1 * 128 + 64]
                    for k_1 in range(4):
                        cse_var_2: T.int32 = cse_var_1 + 64
                        C_1[cse_var_2:cse_var_2 + 64] = C_1[cse_var_2:cse_var_2 + 64] + T.Broadcast(A_1[k_0 * 4 + k_1], 64) * B_1[k_0 * 4096 + k_1 * 1024 + i2_0 * 256 + i2_1 * 128 + 64:k_0 * 4096 + k_1 * 1024 + i2_0 * 256 + i2_1 * 128 + 64 + 64]

Since this is my first time working with Hexagon and I have little knowledge to the target, I have a few noob questions:

Could the performance difference be due to cache misses? I thought I had ensured continuous and aligned data access.
Are there any examples of optimizing GEMM/GEMV on HVX? I found some i8 schedules, but LLM tasks seem to favor fp16.
I attempted to use VTCM by caching buffer A to on-chip memory, but it made things slower. Is this expected?

Any insights would be greatly appreciated. Thank you!

cc @kparzysz @sanirudh

sanirudh · August 28, 2023, 8:36am

Hi @Hzfengsy,

I’m not sure about cache misses and I don’t remember examples off the top of my head, but I’ll certainly post one if I find it.

As for using VTCM, I don’t know whether this would help, but considering you’re using v73 architecture whose VTCM capacity is 8MB, we can allocate all 3 buffers (A,B and C) on VTCM and peform the computation, and atleast for the 1024 case, I’m guessing that should increase the performance.

I tried it out quickly with your script and modifying the ndarray allocation to something like this:

args = [allocate_hexagon_array(device, data=arg, mem_scope="global.vtcm") for arg in np_args]

and I’m getting the below result:

# N = K = 512
Time (ms): 0.0280       Total Bytes (MB): 0.50  Memory (GB/s): 17.93
# N = K = 1024
Time (ms): 0.1060       Total Bytes (MB): 2.00  Memory (GB/s): 18.90

I’m not sure why the 512 version performs poorly with VTCM, that’s something I might have to dig into a bit, but for 1024 case, it does seem to improve. Without VTCM allocation, I’m seeing similar results as what you’ve posted.

wen · February 28, 2024, 4:29am

Are you still working on hexagon? I don’t understand a bit. According to your script, will hvx-related code be generated through codegen? thanks @Hzfengsy

sanirudh · February 28, 2024, 5:20am

Hi @wen,

The Hexagon backend in TVM does use its LLVM based codegen, so any vectorized code would get lowered to vector instructions in LLVM IR and then the Hexagon LLVM backend would be used to generate HVX code.

So yeah, to answer your question, it will generate HVX code.

wen · February 28, 2024, 11:39am

hi @sanirudh Thank you for your reply. I feel very confused about the results obtained above, because I use the gemv interface of qblas in hexagon sdk. When n k = 512, I get about 70us, but in tvm, it can reach 7us when n k = 1024. It is about 200us. Although I used vtcm buffer, the speed is not much different from that without vtcm buffer, so I am not sure whether I used it successfully. The interface I tested is the qhblas_hvx_matrix_vector_mpy_ahf interface in hexagon sdk. Even if my results are not using vtcm, then qblas does not look as good as tvm

wen · February 28, 2024, 1:19pm

HI,@Hzfengsy On your code，I tried adding this statement like j0, j1, j2, j3 = sch.split(j, factors=[4, None, 8, vec_len]) Change the medium from 2 to 8 and then apply for vtcm buffer. I tried K N 1984 and the results are as follows. Time (ms): 0.1820 Total Bytes (MB): 7.52 Memory (GB/s): 41.29

It seems that the speed is not bad for the small llm model. If we continue to optimize, the 2048*2048 matrix speed can reach 100us, then the decoding speed is acceptable. After all, it does not occupy the gpu. But here, about how to use tvm Scheduling in gemv to perfectly use vtcm requires further experiments. I don’t know what you think.

wen · February 29, 2024, 3:03am

To add, when only the two-dimensional NK matrix is placed in vtcm and the vector is placed in dsp global ddr, the data obtained NK equals 2048 Time (ms): 0.0920 Total Bytes (MB): 8.01 Memory (GB/s): 87.04

sanirudh · February 29, 2024, 7:37am

If I understand your reply correctly, TVM is generating much better code compared the qhblas library. That is surprising. I don’t have much experience with qhblas implementations, so I’ll have to check that.

But in general, I think for op like GEMV, which is a relatively easy op to generate code for, I think TVM should be pretty good. Only when optimizing for complex ops with things like casting between int to float or ops involving divisions/exponentiation etc. are where I think qhblas is expected to be much better as that’s where hand-written code can take full advantage of the ASM. But for GEMV, if it’s slower, that is certainly a confusing thing and is definitely not expected.