Auto scheduler performance on AMDGPU: the first attempt

@merrymercy @comaniac

Updated by @merrymercy: see post20 for the new results

I tried running the relay auto scheduler tutorial on my Radeon R9 Nano (8 TFLOPS peak) via the rocm backend. It didn't work out of the box, but after a simple fix I got the following result on resnet-50. It uses the NCHW layout, since the rocm backend currently doesn't support NHWC.
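
For context, the setup is essentially the tune_network_cuda tutorial with the target switched to rocm. Below is a minimal sketch of the task-extraction part, assuming the relay.testing resnet-50 workload in place of the tutorial's get_network helper:

import tvm
from tvm import relay, auto_scheduler
from tvm.relay import testing

# ResNet-50, batch 1; NCHW is the default layout of this workload helper
target = tvm.target.Target("rocm")
mod, params = testing.resnet.get_workload(num_layers=50, batch_size=1)

# Extract the tunable subgraphs (the 29 tasks shown in the table below)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)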

|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.023 |           0.17 |     64 |
|    1 |        0.312 |          13.14 |    192 |
|    2 |        0.014 |          -0.00 |     64 |
|    3 |        0.148 |         699.13 |    128 |
|    4 |        0.381 |         607.11 |    512 |
|    5 |        0.195 |         528.04 |    320 |
|    6 |        0.134 |         770.08 |    192 |
|    7 |        0.285 |         180.70 |    128 |
|    8 |        0.132 |         783.08 |     64 |
|    9 |        0.254 |         911.56 |    704 |
|   10 |        0.174 |         590.92 |    448 |
|   11 |        0.112 |         922.53 |    320 |
|   12 |        0.096 |         534.17 |    128 |
|   13 |        0.117 |         891.38 |    128 |
|   14 |        0.223 |        1036.67 |    448 |
|   15 |        0.121 |         852.42 |    192 |
|   16 |        0.121 |         853.46 |    192 |
|   17 |        0.074 |         692.40 |    128 |
|   18 |        0.117 |         896.98 |    128 |
|   19 |        0.221 |        1046.90 |    384 |
|   20 |        0.113 |         914.47 |    128 |
|   21 |        0.113 |         917.25 |    128 |
|   22 |        0.032 |         810.28 |     64 |
|   23 |        0.042 |          52.90 |     64 |
|   24 |        0.224 |        1062.46 |    128 |
|   25 |        0.116 |         882.35 |    128 |
|   26 |        0.224 |         916.39 |    128 |
|   27 |        0.285 |         721.63 |    128 |
|   28 |        0.423 |         485.46 |    256 |
-------------------------------------------------
Estimated total latency: 10.148 ms      Trials: 6016    Used time : 13537 s     Next ID: 4

So auto sch performance on NCHW resnet-50 is about 10.2 ms. For comparison, the AutoTVM performance for the same model on the same GPU is 6.45 ms.

Performance comparison

  • Auto sch: 10.2 ms (done last week)
  • Auto TVM: 6.45 ms (done two years ago)
  • TVM + MIOpen: 6.15 ms (done two years ago)

Even though there is a sizable gap compared to the AutoTVM result, I'd say getting this number without a manual template is already impressive!!

Here are my questions:

  • Has anybody tried Ansor on AMDGPU?
  • Does the above result look reasonable?
  • How can we close the gap with AutoTVM? Would introducing NHWC support help?

I think the rocm backend would be interesting for Ansor because it is the only well-supported backend that does GPU codegen via LLVM.


Glad to know that you are impressed by Ansor :slight_smile: Here are my two cents, and I believe @merrymercy can comment more.

  • I don't recall anyone having tried Ansor on an AMD GPU. Hopefully someone who has played with it will see this post.
  • The above result looks reasonable to me, considering you only tuned for about 3-4 hours. It would be interesting to see the performance comparison after tuning for a longer time (e.g., 24 hours). To me, 24 hours is still fast, because you may need ~2 hours for each task with AutoTVM, which would be >48 hours in total. Please note that you can directly use the current log file to continue tuning (a sketch follows this list).
  • In addition to the tuning time, we first need to identify where the gap is. Introducing an NHWC compute may help.
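
On reusing the log file, here is a minimal sketch of resuming a run; the log file name and runner settings are placeholders rather than the tutorial's exact values:

from tvm import auto_scheduler

log_file = "resnet-50-rocm.json"  # placeholder name for the existing log

# load_log_file resumes from the earlier measurement records, and
# RecordToFile keeps appending new measurements to the same log.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=20000,  # total trials across all tasks
    runner=auto_scheduler.LocalRunner(repeat=10, min_repeat_ms=300, timeout=10),
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)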

Also cc @FrozenGene

We only enabled Winograd for the NHWC layout. Why can't you try the NHWC layout? We don't need anything else; we just use topi.nn.conv2d_nhwc. You can convert the layout of your model with https://tvm.apache.org/docs/dev/convert_layout.html
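
For reference, a minimal sketch of converting a Relay module to NHWC with that pass, where mod is the Relay module from the tutorial:

import tvm
from tvm import relay

# Convert conv2d to NHWC; "default" keeps the default kernel layout for it.
desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
seq = tvm.transform.Sequential(
    [
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
    ]
)
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)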

Yes, I stopped tuning after 3-4 hours because I thought the tuning had mostly converged at that point: I only saw improvements on the order of 0.001 ms in the later trials.

One thing I found, compared to tuning on an NVIDIA GPU, is that far more errors are generated during AMD tuning. The error is always due to invalid kernels that use more shared memory than is available. On AMD GCN, shared memory per block is capped at 64 KB, which I don't think is too different from NVIDIA GPUs.

For resnet-50, 7000 trials is not enough. I will go for 20000 or 25000.

This is simply because rocm op strategy doesn’t support NHWC layout at the moment. I hit the error at

https://github.com/apache/tvm/blob/main/python/tvm/relay/op/strategy/rocm.py#L92

But yes, since rocm backend just uses CUDA topi, there is really no reason rocm cannot support NHWC.

Hi,

I think the gap comes from two reasons:

  1. Winograd conv2d is not used. You can copy https://github.com/apache/tvm/blob/0d46cf7d15ba1494f302084715db24441aae953f/python/tvm/relay/op/strategy/cuda.py#L232-L239 and https://github.com/apache/tvm/blob/0d46cf7d15ba1494f302084715db24441aae953f/python/tvm/relay/op/strategy/cuda.py#L167-L171 to the rocm op strategy and give it another try (a rough sketch follows this list).

  2. The number of tuning trials is not enough.
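
Purely as an illustration (not the actual patch), a rough sketch of the kind of NHWC registrations that would be copied into conv2d_strategy_rocm in python/tvm/relay/op/strategy/rocm.py; the helper names and import paths (wrap_compute_conv2d, naive_schedule, is_auto_scheduler_enabled) are assumed from the generic strategy module and may differ slightly between TVM versions:

from tvm import topi
from tvm.relay.op.strategy.generic import (
    is_auto_scheduler_enabled,
    naive_schedule,
    wrap_compute_conv2d,
)

def add_nhwc_conv2d_implementations(strategy):
    """Hypothetical helper: register NHWC conv2d computes for the auto-scheduler."""
    if is_auto_scheduler_enabled():
        # Plain NHWC compute; Ansor generates the schedule itself, so the
        # schedule argument is only a placeholder.
        strategy.add_implementation(
            wrap_compute_conv2d(topi.nn.conv2d_nhwc),
            naive_schedule,
            name="conv2d_nhwc.no_schedule",
        )
        # NHWC winograd compute, registered at a higher priority level so
        # it is preferred where it applies.
        strategy.add_implementation(
            wrap_compute_conv2d(topi.nn.conv2d_winograd_nhwc),
            naive_schedule,
            name="conv2d_nhwc.winograd",
            plevel=15,
        )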

I believe Ansor should at least match the performance of AutoTVM with 20000 trials, because AutoTVM just uses the CUDA templates for ROCm.

Is it worth continuing to tune even if, after about 7000 trials, the improvement from each round is almost negligible (say only 0.001 ms faster)? How do you decide whether tuning has converged?

You cannot rely on just a few rounds to decide convergence; the search can get stuck sometimes. If you use the latest version, a file total_latency.tsv will be dumped into the same folder for easier monitoring of the "Estimated Latency". You can stop if the "Estimated Latency" does not improve after 20 or 30 rounds.

Also, the “Estimated Latency” is not very accurate. After you finish the tuning, you can go back and replay the log to get a more accurate tuning curve.
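
For completeness, replaying a log here just means compiling the network with the best records found so far and measuring the real end-to-end latency, as the tutorial does (graph_executor was named graph_runtime in older TVM versions; mod, params, target and log_file are from the earlier steps):

import numpy as np
import tvm
from tvm import auto_scheduler, relay
from tvm.contrib import graph_executor

# Apply the best schedules from the log while building the whole network
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)

dev = tvm.rocm(0)
module = graph_executor.GraphModule(lib["default"](dev))
ftimer = module.module.time_evaluator("run", dev, repeat=3, min_repeat_ms=500)
print("Mean inference time: %.2f ms" % (np.mean(ftimer().results) * 1000))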

Does the output of this line (https://github.com/apache/tvm/blob/0d46cf7d15ba1494f302084715db24441aae953f/tutorials/auto_scheduler/tune_network_cuda.py#L293) match the “EstimatedLatency”?

Yes, mostly. The output of that line is a bit slower because (I think) it includes measurements for all layers, including the ones that were not tuned and are therefore not part of the Estimated Latency.


OK, thanks @merrymercy @comaniac. I will try NHWC + winograd + 20000 trials on rocm and see how much it improves.


Do you set the correct hardware parameters?

Can you print these values and let me double check?

Yes that’s a good point. On my GPU (R9 Nano, gfx803), device_api reports the following:

  • kMaxSharedMemoryPerBlock: 65536
  • kMaxRegistersPerBlock: 0 (!!)
  • kMaxThreadsPerBlock: 1024
  • kWarpSize: 64

kMaxRegistersPerBlock == 0 doesn't make any sense, so I set it to 256 or 1024. This choice didn't make a big difference in the tuning results.

Although the max threads per block is reported as 1024, based on my previous experience, using 1024 threads on my GPU has been unstable, so I set it to 256. But it is definitely worth trying 1024 threads per block. The previous AutoTVM result was also obtained with max 256 threads, if I remember correctly.
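
For reference, these limits can be printed directly from the ROCm device, since they are exposed as standard tvm.runtime.Device properties:

import tvm

dev = tvm.rocm(0)
print("max_shared_memory_per_block:", dev.max_shared_memory_per_block)
print("max_threads_per_block:", dev.max_threads_per_block)
print("warp_size:", dev.warp_size)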

The parameters look good to me. Actually, the auto-scheduler only uses kMaxSharedMemoryPerBlock, kMaxThreadsPerBlock and kWarpSize. It does not use kMaxRegistersPerBlock, so your settings and experience are fine.

Why do you get more errors than on an NVIDIA GPU? We use the VerifyGPUCode pass to make sure the shared memory usage is below kMaxSharedMemoryPerBlock; otherwise, we won't send the candidates for measurement. Is there anything wrong here?

Yes, this is also what puzzled me. During AMD tuning, especially at the beginning, I get many errors like error: local memory limit exceeded x in default_kernel0, where x is always larger than 65536, often much larger, and "local memory" here refers to shared memory (in NVIDIA terminology). This error seems to come from the rocm runtime, not TVM.

So it seems Ansor is trying many kernels that are way beyond the hardware limit. I have no idea what is happening or why VerifyGPUCode doesn't seem to be working. I didn't see shared memory usage errors during NVIDIA tuning. I can dig into this issue deeper.

Sorry, I checked the code and found kMaxRegistersPerBlock is actually used here.

It is passed to max_local_memory_per_block. But I think this is a bug: the local_memory_per_block in VerifyGPUCode is not the same thing as registers.

For CUDA, it does not matter, because kMaxRegistersPerBlock returns a very large value similar to kMaxSharedMemoryPerBlock, so this check effectively does nothing. For your AMD GPU, I suggest setting it to 65536 (the same as kMaxSharedMemoryPerBlock). If you use a value that is too small, such as 1024 in your case, VerifyGPUCode will filter out many good candidates.
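
As a hedged illustration of the check being discussed (a toy kernel, not the auto-scheduler's internal code), tvm.tir.analysis.verify_gpu_code takes a constraint map in which max_local_memory_per_block is a separate budget from max_shared_memory_per_block, which is why feeding the register count into it behaves badly:

import tvm
from tvm import te

# A trivial GPU kernel: one thread block of 256 threads, no shared memory.
n = 4096
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=256)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
mod = tvm.lower(s, [A, B])

constraints = {
    "max_shared_memory_per_block": 65536,
    "max_local_memory_per_block": 65536,  # the value suggested above
    "max_threads_per_block": 1024,
}
# This tiny kernel is well within the budgets, so the check should pass.
valid = tvm.tir.analysis.verify_gpu_code(mod["main"], constraints)
print(valid)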

To summarize, we can

  • use NHWC layout with winograd by copying op strategy from CUDA.
  • use n_trials > 20000
  • set kMaxRegistersPerBlock to 65536
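
A hedged sketch of setting these limits explicitly: the keyword names follow auto_scheduler.HardwareParams in recent TVM (older versions detect them from the device API instead), and the CPU-related fields are placeholder values:

from tvm import auto_scheduler

hw_params = auto_scheduler.HardwareParams(
    num_cores=4,                        # placeholder, unused for GPU targets
    vector_unit_bytes=16,               # placeholder
    cache_line_bytes=64,                # placeholder
    max_shared_memory_per_block=65536,
    max_local_memory_per_block=65536,   # the slot that kMaxRegistersPerBlock feeds
    max_threads_per_block=1024,
    max_vthread_extent=8,
    warp_size=64,
)
tasks, task_weights = auto_scheduler.extract_tasks(
    mod["main"], params, target, hardware_params=hw_params
)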

@merrymercy I found the reason why I get many invalid kernel errors on AMD: VerifyGPUCode only runs for the CUDA target (kDLGPU)!

Fix sent to https://github.com/apache/tvm/pull/7038 (including NHWC + wino on rocm)

cc @jcf94, who experimented on a Mac's AMD GPU using OpenCL.

@merrymercy @comaniac

I ran tuning again with NHWC and wino enabled on rocm, using 25000 trials. Other changes include

  • VerifyGPUCode is run properly
  • using max_registers = 65536
  • max_threads_per_block = 1024 (previously 256)

The Estimated Latency improved from 10.148 ms to 7.426 ms, and the final time evaluator measurement is 7.84 ms. Here is the log at the end.

----------------------------------------------------------------------
|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.022 |           0.19 |    128 |
|    1 |        0.281 |          14.58 |    896 |
|    2 |        0.018 |          -0.00 |     64 |
|    3 |        0.125 |         824.07 |    448 |
|    4 |        0.262 |         543.33 |   2560 |
|    5 |        0.177 |         582.20 |   1152 |
|    6 |        0.125 |         824.60 |    832 |
|    7 |        0.104 |         496.07 |    448 |
|    8 |        0.108 |         958.30 |    384 |
|    9 |        0.154 |         746.20 |   3008 |
|   10 |        0.130 |         790.86 |   2112 |
|   11 |        0.103 |         996.54 |   1664 |
|   12 |        0.083 |         621.97 |    320 |
|   13 |        0.099 |        1044.90 |    384 |
|   14 |        0.138 |         917.40 |   1856 |
|   15 |        0.104 |         989.16 |   1024 |
|   16 |        0.096 |        1072.63 |    960 |
|   17 |        0.071 |         722.58 |    320 |
|   18 |        0.095 |        1109.23 |    320 |
|   19 |        0.122 |        1051.61 |   1216 |
|   20 |        0.099 |        1040.23 |    640 |
|   21 |        0.089 |        1162.63 |    640 |
|   22 |        0.031 |         842.31 |    128 |
|   23 |        0.040 |          54.81 |    192 |
|   24 |        0.197 |        1203.66 |    640 |
|   25 |        0.086 |        1197.56 |    320 |
|   26 |        0.198 |        1038.55 |    640 |
|   27 |        0.222 |         926.45 |    768 |
|   28 |        0.269 |         764.47 |    896 |
-------------------------------------------------

Estimated total latency: 7.426 ms       Trials: 24960   Used time : 49879 s     Next ID: 1
...
Compile...
Evaluate inference time cost...
Mean inference time (std dev): 7.84 ms (0.03 ms)

I said earlier that the AutoTVM result is 6.45 ms. That result was obtained two years ago. To be sure, I also ran the AutoTVM relay tutorial on today's TVM and rocm.

Surprisingly, there is a big regression compared to the result from two years ago, probably AMD's fault (my GPU is from 2015, fairly old): the current AutoTVM result is 8.08 ms.

Here is the output log:

[Task  1/24]  Current/Best:  182.66/ 419.17 GFLOPS | Progress: (816/2000) | 1454.43 s Done.
[Task  2/24]  Current/Best:  513.47/ 706.49 GFLOPS | Progress: (1128/2000) | 2509.76 s Done.
[Task  3/24]  Current/Best:  825.40/1011.09 GFLOPS | Progress: (2000/2000) | 6889.62 s Done.
[Task  4/24]  Current/Best: 1111.02/1378.50 GFLOPS | Progress: (1452/2000) | 5051.27 s Done.
[Task  5/24]  Current/Best:  779.76/ 873.93 GFLOPS | Progress: (936/2000) | 2951.39 s Done.
[Task  6/24]  Current/Best:  864.52/1013.70 GFLOPS | Progress: (1548/2000) | 5934.36 s Done.
[Task  7/24]  Current/Best: 1497.05/2184.88 GFLOPS | Progress: (1224/2000) | 3759.74 s Done.
[Task  8/24]  Current/Best: 1056.79/1234.30 GFLOPS | Progress: (1128/2000) | 3775.86 s Done.
[Task  9/24]  Current/Best: 1055.22/1203.91 GFLOPS | Progress: (936/2000) | 3012.56 s Done.
[Task 10/24]  Current/Best:  504.19/ 640.72 GFLOPS | Progress: (912/2000) | 3009.30 s Done.
[Task 11/24]  Current/Best:    4.03/ 813.25 GFLOPS | Progress: (684/2000) | 2403.45 s Done.
[Task 12/24]  Current/Best: 1718.92/2001.00 GFLOPS | Progress: (792/2000) | 2469.34 s Done.
[Task 13/24]  Current/Best:  746.74/1068.61 GFLOPS | Progress: (612/2000) | 2099.67 s Done.
[Task 14/24]  Current/Best:  963.92/1126.56 GFLOPS | Progress: (1188/2000) | 4144.49 s Done.
[Task 15/24]  Current/Best:  222.68/ 487.16 GFLOPS | Progress: (768/2000) | 1786.05 s Done.
[Task 16/24]  Current/Best:  354.85/ 589.89 GFLOPS | Progress: (1128/2000) | 3175.10 s Done.
[Task 17/24]  Current/Best: 1100.63/1906.40 GFLOPS | Progress: (996/2000) | 1942.92 s Done.
[Task 18/24]  Current/Best:  541.85/ 795.74 GFLOPS | Progress: (1476/2000) | 4162.37 s Done.
[Task 19/24]  Current/Best:  750.30/ 889.98 GFLOPS | Progress: (612/2000) | 1687.54 s Done.
[Task 20/24]  Current/Best:  104.64/ 233.15 GFLOPS | Progress: (732/2000) | 1447.95 s Done.
[Task 22/24]  Current/Best: 1356.81/1746.76 GFLOPS | Progress: (924/2000) | 2712.59 s Done.
[Task 23/24]  Current/Best:  225.49/ 557.91 GFLOPS | Progress: (960/2000) | 1927.89 s Done.
[Task 24/24]  Current/Best:  301.74/ 702.27 GFLOPS | Progress: (1428/2000) | 2814.64 s Done.
Compile...
Cannot find config for target=rocm -keys=rocm,gpu -max_num_threads=256 -mcpu=gfx803 -model=unknown -mtriple=amdgcn-amd-amdhsa-hcc -thread_warp_size=64, workload=('dense.rocm', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Evaluate inference time cost...
Mean inference time (std dev): 8.08 ms (0.16 ms)

So in summary, the time evaluator measurements after tuning with the auto scheduler and AutoTVM, on the current TVM and rocm:

  • Auto sch: 7.84 ms (stddev 0.03 ms)
  • AutoTVM: 8.08 ms (stddev 0.16 ms)

It's great to see auto sch matching and even slightly outperforming AutoTVM! Note that the final dense layer is very slow with auto sch (0.281 ms at 14.58 GFLOPS, the slowest of all layers), so a convolution-only measurement would look even better for auto sch.

What I don't understand is that the tuning log from AutoTVM shows much bigger GFLOPS numbers than those from auto sch, while the final time evaluator measurement is slower. Do AutoTVM and Ansor compute GFLOPS differently, so that comparing them is not meaningful?
