Discussion of GPU auto-vectorization

Hi all,

As we all know, vectorization (using hfma2 / dp4a instructions) can provide a significant performance boost when programming NVIDIA GPU CUDA cores, and I previously browsed a post on our community:

It says the TVM auto-scheduler cannot use dp4a intrinsics to accelerate int8, and I think that is still the situation today: the kernel generated by the auto-scheduler doesn't contain any intrinsics.

But when I analyzed the performance of the generated code, I found that nvcc can partly handle the vectorization on its own: the SASS code compiled by nvcc contains dp4a or hfma2 instructions, which is interesting.

        /*0e40*/                   LDS.U8 R44, [R85+0x101] ;                          /* 0x00010100552c7984 */
                                                                                      /* 0x000fe20000000000 */
        /*0e50*/                   IDP.4A.S8.S8 R74, R46.reuse, R43, R74 ;            /* 0x0000002b2e4a7226 */
                                                                                      /* 0x040fe4000000064a */
        /*0e60*/                   IDP.4A.S8.S8 R73, R46.reuse, R42, R73 ;            /* 0x0000002a2e497226 */
                                                                                      /* 0x040fe20000000649 */
        /*0e70*/                   LDS.U8 R35, [R85+0x1] ;                            /* 0x0000010055237984 */
                                                                                      /* 0x000ea20000000000 */
        /*0e80*/                   IDP.4A.S8.S8 R39, R46, R31, R39 ;                  /* 0x0000001f2e277226 */
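
To be concrete about what those IDP.4A.S8.S8 instructions compute, here is a small Python emulation (the function name `dp4a_s8` is my own, not a CUDA API): each instruction treats two 32-bit registers as four signed 8-bit lanes each, multiplies the lanes pairwise, and adds the sum into a 32-bit accumulator.

```python
import struct

def dp4a_s8(a_word: int, b_word: int, acc: int) -> int:
    """Emulate IDP.4A.S8.S8: four-lane signed int8 dot product
    accumulated into a 32-bit integer."""
    a = struct.unpack("4b", struct.pack("<I", a_word & 0xFFFFFFFF))
    b = struct.unpack("4b", struct.pack("<I", b_word & 0xFFFFFFFF))
    return acc + sum(x * y for x, y in zip(a, b))

# Pack four int8 lanes into one 32-bit word each
a = struct.unpack("<I", struct.pack("4b", 1, -2, 3, 4))[0]
b = struct.unpack("<I", struct.pack("4b", 5, 6, -7, 8))[0]
print(dp4a_s8(a, b, 10))  # 1*5 + (-2)*6 + 3*(-7) + 4*8 + 10 = 14
```

So one dp4a replaces four multiplies and four adds per instruction, which is exactly why it matters for int8 GEMM throughput.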

Here is the code I customized from the TVM tutorial to auto-schedule a 1024x1024x1024 int8 GEMM on GPU. I saved the CUDA source of the best schedule into search.cu, then compiled it with nvcc and used cuobjdump to view the GPU SASS code:

python3 search.py
nvcc -o search.o search.cu -arch=sm_86 --cubin
cuobjdump -sass search.o | tee search.ptx

nvcc version: 11.1.

The interesting finding is that even though nvcc can generate dp4a intrinsics on its own, the performance of the kernel after about 10,000 trials is still not as good as baselines that inline the vector instructions explicitly, such as CUTLASS and AutoTVM.
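To make concrete what that vectorization does to the GEMM reduction, here is a pure-Python sketch (tiny sizes stand in for the 1024 case) checking that splitting the K-reduction into groups of four int8 products, one group per dp4a, gives the same int32 result as the naive loop. Explicit baselines like CUTLASS bake this grouping (plus the vectorized loads it requires) into the kernel, rather than hoping the compiler discovers it.

```python
import random

M = N = 4
K = 8  # must be a multiple of 4 for the 4-lane grouping

A = [[random.randint(-128, 127) for _ in range(K)] for _ in range(M)]
B = [[random.randint(-128, 127) for _ in range(N)] for _ in range(K)]

# Naive int8 GEMM with 32-bit accumulation
C_ref = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
         for i in range(M)]

# Same GEMM with the reduction split into groups of four lanes,
# each group corresponding to one dp4a instruction
C_dp4a = [[0] * N for _ in range(M)]
for i in range(M):
    for j in range(N):
        acc = 0
        for k0 in range(0, K, 4):
            acc += sum(A[i][k0 + l] * B[k0 + l][j] for l in range(4))
        C_dp4a[i][j] = acc

print(C_ref == C_dp4a)  # True
```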

Does anyone have experience with this? Is nvcc's auto-vectorization strategy simply not good enough?

I suppose MetaSchedule has dp4a support.