Auto scheduler performance on AMDGPU: the first attempt

@merrymercy @comaniac

I ran tuning again on rocm with the NHWC layout and Winograd enabled, using 25000 trials. Other changes include:

  • VerifyGPUCode is now run properly
  • max_registers = 65536
  • max_threads_per_block = 1024 (previously 256)
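The hardware-related settings above could be expressed as an explicit auto scheduler configuration. This is a minimal, hypothetical sketch: only max_threads_per_block = 1024 and the 25000 trials come from this run, while the remaining HardwareParams values and the log file name are assumptions (max_registers is checked by VerifyGPUCode rather than being a HardwareParams field):

```python
from tvm import auto_scheduler

# Hypothetical sketch of the hardware parameters for this run.
# Only max_threads_per_block=1024 is stated in the post; the rest are assumed.
hardware_params = auto_scheduler.HardwareParams(
    num_cores=-1,
    vector_unit_bytes=16,
    cache_line_bytes=64,
    max_shared_memory_per_block=65536,
    max_local_memory_per_block=2147483647,  # effectively unlimited
    max_threads_per_block=1024,             # previously 256
    max_vthread_extent=8,
    warp_size=64,                           # wavefront size on GCN GPUs
)

tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=25000,  # total trials across all tasks, as in this run
    measure_callbacks=[auto_scheduler.RecordToFile("rocm_tuning.json")],
)
```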

The estimated total latency improved from 10.148 ms to 7.426 ms, and the final time evaluator measurement is 7.84 ms. Here is the log at the end:

----------------------------------------------------------------------
|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.022 |           0.19 |    128 |
|    1 |        0.281 |          14.58 |    896 |
|    2 |        0.018 |          -0.00 |     64 |
|    3 |        0.125 |         824.07 |    448 |
|    4 |        0.262 |         543.33 |   2560 |
|    5 |        0.177 |         582.20 |   1152 |
|    6 |        0.125 |         824.60 |    832 |
|    7 |        0.104 |         496.07 |    448 |
|    8 |        0.108 |         958.30 |    384 |
|    9 |        0.154 |         746.20 |   3008 |
|   10 |        0.130 |         790.86 |   2112 |
|   11 |        0.103 |         996.54 |   1664 |
|   12 |        0.083 |         621.97 |    320 |
|   13 |        0.099 |        1044.90 |    384 |
|   14 |        0.138 |         917.40 |   1856 |
|   15 |        0.104 |         989.16 |   1024 |
|   16 |        0.096 |        1072.63 |    960 |
|   17 |        0.071 |         722.58 |    320 |
|   18 |        0.095 |        1109.23 |    320 |
|   19 |        0.122 |        1051.61 |   1216 |
|   20 |        0.099 |        1040.23 |    640 |
|   21 |        0.089 |        1162.63 |    640 |
|   22 |        0.031 |         842.31 |    128 |
|   23 |        0.040 |          54.81 |    192 |
|   24 |        0.197 |        1203.66 |    640 |
|   25 |        0.086 |        1197.56 |    320 |
|   26 |        0.198 |        1038.55 |    640 |
|   27 |        0.222 |         926.45 |    768 |
|   28 |        0.269 |         764.47 |    896 |
-------------------------------------------------

Estimated total latency: 7.426 ms       Trials: 24960   Used time : 49879 s     Next ID: 1
...
Compile...
Evaluate inference time cost...
Mean inference time (std dev): 7.84 ms (0.03 ms)
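For readers wondering how the "Estimated total latency" relates to the per-task latencies in the table: it is a weighted sum, where each task's best latency is multiplied by the number of times that subgraph appears in the network. A toy illustration in plain Python (the weights below are hypothetical, not the real ones from this network):

```python
# Toy illustration of the "Estimated total latency" accounting.
# Latencies are taken from three rows of the table above; the occurrence
# counts (weights) are hypothetical.
best_latency_ms = {0: 0.022, 9: 0.154, 11: 0.103}
task_weight = {0: 1, 9: 4, 11: 3}

estimated_total = sum(best_latency_ms[t] * task_weight[t] for t in best_latency_ms)
print(f"Estimated total latency: {estimated_total:.3f} ms")  # 0.947 ms
```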

I said earlier that the AutoTVM result was 6.45 ms, but that number was obtained two years ago. To be sure, I also ran the AutoTVM relay tutorial on today’s TVM and rocm.

Surprisingly, there is a big regression compared to the result from two years ago, probably on AMD’s side (my GPU is from 2015, fairly old): the current AutoTVM result is 8.08 ms.

Here is the output log:

[Task  1/24]  Current/Best:  182.66/ 419.17 GFLOPS | Progress: (816/2000) | 1454.43 s Done.
[Task  2/24]  Current/Best:  513.47/ 706.49 GFLOPS | Progress: (1128/2000) | 2509.76 s Done.
[Task  3/24]  Current/Best:  825.40/1011.09 GFLOPS | Progress: (2000/2000) | 6889.62 s Done.
[Task  4/24]  Current/Best: 1111.02/1378.50 GFLOPS | Progress: (1452/2000) | 5051.27 s Done.
[Task  5/24]  Current/Best:  779.76/ 873.93 GFLOPS | Progress: (936/2000) | 2951.39 s Done.
[Task  6/24]  Current/Best:  864.52/1013.70 GFLOPS | Progress: (1548/2000) | 5934.36 s Done.
[Task  7/24]  Current/Best: 1497.05/2184.88 GFLOPS | Progress: (1224/2000) | 3759.74 s Done.
[Task  8/24]  Current/Best: 1056.79/1234.30 GFLOPS | Progress: (1128/2000) | 3775.86 s Done.
[Task  9/24]  Current/Best: 1055.22/1203.91 GFLOPS | Progress: (936/2000) | 3012.56 s Done.
[Task 10/24]  Current/Best:  504.19/ 640.72 GFLOPS | Progress: (912/2000) | 3009.30 s Done.
[Task 11/24]  Current/Best:    4.03/ 813.25 GFLOPS | Progress: (684/2000) | 2403.45 s Done.
[Task 12/24]  Current/Best: 1718.92/2001.00 GFLOPS | Progress: (792/2000) | 2469.34 s Done.
[Task 13/24]  Current/Best:  746.74/1068.61 GFLOPS | Progress: (612/2000) | 2099.67 s Done.
[Task 14/24]  Current/Best:  963.92/1126.56 GFLOPS | Progress: (1188/2000) | 4144.49 s Done.
[Task 15/24]  Current/Best:  222.68/ 487.16 GFLOPS | Progress: (768/2000) | 1786.05 s Done.
[Task 16/24]  Current/Best:  354.85/ 589.89 GFLOPS | Progress: (1128/2000) | 3175.10 s Done.
[Task 17/24]  Current/Best: 1100.63/1906.40 GFLOPS | Progress: (996/2000) | 1942.92 s Done.
[Task 18/24]  Current/Best:  541.85/ 795.74 GFLOPS | Progress: (1476/2000) | 4162.37 s Done.
[Task 19/24]  Current/Best:  750.30/ 889.98 GFLOPS | Progress: (612/2000) | 1687.54 s Done.
[Task 20/24]  Current/Best:  104.64/ 233.15 GFLOPS | Progress: (732/2000) | 1447.95 s Done.
[Task 22/24]  Current/Best: 1356.81/1746.76 GFLOPS | Progress: (924/2000) | 2712.59 s Done.
[Task 23/24]  Current/Best:  225.49/ 557.91 GFLOPS | Progress: (960/2000) | 1927.89 s Done.
[Task 24/24]  Current/Best:  301.74/ 702.27 GFLOPS | Progress: (1428/2000) | 2814.64 s Done.
Compile...
Cannot find config for target=rocm -keys=rocm,gpu -max_num_threads=256 -mcpu=gfx803 -model=unknown -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=64, workload=('dense.rocm', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Evaluate inference time cost...
Mean inference time (std dev): 8.08 ms (0.16 ms)

In summary, the time evaluator measurements after tuning with the auto scheduler and AutoTVM, on current TVM and rocm:

  • Auto sch: 7.84 ms (stddev 0.03 ms)
  • AutoTVM: 8.08 ms (stddev 0.16 ms)

It’s great to see auto sch matching and slightly outperforming AutoTVM! Note that the final dense layer is very slow with auto sch (0.281 ms at 14.58 GFLOPS, the slowest of all layers), so a convolution-only measurement would look even better for auto sch.

What I don’t understand is that the AutoTVM tuning log shows much higher GFLOPS numbers than the auto scheduler’s, even though its final time evaluator measurement is slower. Do AutoTVM and Ansor compute GFLOPS differently, making a direct comparison meaningless?
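One way the numbers can diverge: GFLOPS is just a FLOP count divided by measured time, so two tools only produce comparable numbers if they attribute the same FLOP count to the same measurement. AutoTVM reports per-operator tasks while the auto scheduler reports fused subgraphs, and Winograd changes the actual FLOP count relative to a direct convolution, so the same wall-clock time can show up with very different GFLOPS. A toy illustration (all FLOP counts hypothetical):

```python
# Toy illustration: the same measured time yields different GFLOPS
# depending on how many FLOPs the tool attributes to the kernel.
def gflops(flop_count, time_s):
    """GFLOPS = floating-point operations / time / 1e9."""
    return flop_count / time_s / 1e9

time_s = 0.000125  # same measured kernel time: 0.125 ms

flops_tool_a = 231e6  # hypothetical: FLOPs of a direct convolution
flops_tool_b = 103e6  # hypothetical: FLOPs after Winograd reduction

print(gflops(flops_tool_a, time_s))  # ~1848 GFLOPS
print(gflops(flops_tool_b, time_s))  # ~824 GFLOPS
```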
