Auto-tuning with TVM ROCm

TVM version: 0.8.dev0

Target: ROCm 4.0 (Vega20) with MIOpen and rocBLAS

  1. Compile without auto-tuning

    tvmc compile --target "rocm" --output resnet50-v2-7-tvm.tar resnet50-v2-7.onnx

    tvmc run --inputs imagenet_cat.npz --output predictions.npz resnet50-v2-7-tvm.tar --device 'rocm' --print-time --repeat 100

Result:

    Execution time summary:
    mean (ms)   max (ms)   min (ms)   std (ms)
      6.15       6.21       6.11       0.02

  2. Compile with auto-tuning

    tvmc tune --target "rocm" --output resnet50-v2-7-autotuner_records.json resnet50-v2-7.onnx

[Task 1/28] Current/Best: 76.92/ 76.92 GFLOPS | Progress: (1/35) | 4.28 s Done.

[Task 2/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

Memory access fault by GPU node-1 (Agent handle: 0x2b6ddc0364b0) on address 0x2b6fbdd40000. Reason: Page not present or supervisor privilege.

[Task 2/28] Current/Best: 56.41/ 571.30 GFLOPS | Progress: (35/35) | 64.37 s Done.

[Task 3/28] Current/Best: 22.56/ 119.80 GFLOPS | Progress: (35/35) | 65.39 s Done.

[Task 4/28] Current/Best: 16.81/1796.01 GFLOPS | Progress: (35/35) | 69.57 s Done.

[Task 5/28] Current/Best: 2.06/ 355.35 GFLOPS | Progress: (32/35) | 61.96 s

Memory access fault by GPU node-1 (Agent handle: 0x2b62fc037fb0) on address 0x2b645e68b000. Reason: Page not present or supervisor privilege.

[Task 5/28] Current/Best: 0.00/ 355.35 GFLOPS | Progress: (35/35) | 79.58 s Done.

[Task 6/28] Current/Best: 9.15/ 44.48 GFLOPS | Progress: (35/35) | 67.24 s Done.

[Task 7/28] Current/Best: 0.29/ 486.41 GFLOPS | Progress: (35/35) | 101.46 s Done.

[Task 8/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

:0:rocdevice.cpp :2303: 446421582070 us: Device::callbackQueue aborting with status: 0x1008

:0:rocdevice.cpp :2303: 446460034184 us:Device::callbackQueue aborting with status: 0x1008

[Task 8/28] Current/Best: 0.71/ 212.18 GFLOPS | Progress: (35/35) | 88.81 s Done.

[Task 9/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

:0:rocdevice.cpp :2303: 446507392266 us: Device::callbackQueue aborting with status: 0x1008

[Task 9/28] Current/Best: 1.37/ 429.31 GFLOPS | Progress: (35/35) | 112.89 s Done.

[Task 10/28] Current/Best: 0.00/1486.56 GFLOPS | Progress: (35/35) | 102.14 s Done.

[Task 11/28] Current/Best: 1.38/ 185.23 GFLOPS | Progress: (35/35) | 82.62 s Done.

[Task 12/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

Memory access fault by GPU node-1 (Agent handle: 0x2ac284020c50) on address 0x2ac3f7c86000. Reason: Page not present or supervisor privilege.

Memory access fault by GPU node-1 (Agent handle: 0x2ac284073020) on address 0x2ac3f69b3000. Reason: Page not present or supervisor privilege.

[Task 12/28] Current/Best: 31.10/ 149.32 GFLOPS | Progress: (35/35) | 94.10 s Done.

[Task 13/28] Current/Best: 360.41/ 529.70 GFLOPS | Progress: (35/35) | 59.31 s Done.

[Task 14/28] Current/Best: 4.82/ 142.41 GFLOPS | Progress: (35/35) | 85.46 s Done.

[Task 15/28] Current/Best: 0.65/ 495.70 GFLOPS | Progress: (35/35) | 110.99 s Done.

[Task 16/28] Current/Best: 70.22/1925.67 GFLOPS | Progress: (35/35) | 100.45 s Done.

[Task 17/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

:0:rocdevice.cpp :2303: 447279818746 us: Device::callbackQueue aborting with status: 0x1008

[Task 17/28] Current/Best: 3.14/ 392.10 GFLOPS | Progress: (35/35) | 72.59 s Done.

[Task 18/28] Current/Best: 1.22/ 249.17 GFLOPS | Progress: (35/35) | 87.31 s Done.

[Task 19/28] Current/Best: 34.31/ 297.76 GFLOPS | Progress: (35/35) | 97.42 s Done.

[Task 20/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

:0:rocdevice.cpp :2303: 447529128310 us: Device::callbackQueue aborting with status: 0x1008

[Task 20/28] Current/Best: 74.97/ 942.04 GFLOPS | Progress: (35/35) | 98.24 s Done.

[Task 21/28] Current/Best: 6.11/ 671.28 GFLOPS | Progress: (35/35) | 110.77 s Done.

[Task 22/28] Current/Best: 33.95/2877.30 GFLOPS | Progress: (35/35) | 109.99 s Done.

[Task 23/28] Current/Best: 161.38/ 443.22 GFLOPS | Progress: (35/35) | 102.36 s Done.

[Task 24/28] Current/Best: 5.16/ 752.48 GFLOPS | Progress: (35/35) | 94.05 s Done.

[Task 25/28] Current/Best: 1.11/ 385.02 GFLOPS | Progress: (35/35) | 61.26 s Done.

[Task 26/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

:0:rocdevice.cpp :2303: 448097712968 us: Device::callbackQueue aborting with status: 0x1008

[Task 26/28] Current/Best: 0.25/ 768.77 GFLOPS | Progress: (32/35) | 65.07 s

:0:rocdevice.cpp :2303: 448152967789 us: Device::callbackQueue aborting with status: 0x1008

[Task 26/28] Current/Best: 0.00/ 768.77 GFLOPS | Progress: (35/35) | 70.36 s Done.

[Task 27/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

:0:rocdevice.cpp :2303: 448184487043 us: Device::callbackQueue aborting with status: 0x1008

[Task 27/28] Current/Best: 99.42/ 257.84 GFLOPS | Progress: (35/35) | 58.12 s Done.

[Task 28/28] Current/Best: 0.00/ 0.00 GFLOPS | Progress: (0/35) | 0.00 s

Memory access fault by GPU node-1 (Agent handle: 0x2b2e9c0706b0) on address 0x2b2fc3b25000. Reason: Page not present or supervisor privilege.

[Task 28/28] Current/Best: 6.60/ 87.15 GFLOPS | Progress: (35/35) | 57.99 s Done.

`tvmc compile --target "rocm" --tuning-records resnet50-v2-7-autotuner_records.json  --output resnet50-v2-7-tvm_autotuned.tar resnet50-v2-7.onnx`

`tvmc run --inputs imagenet_cat.npz --output predictions.npz resnet50-v2-7-tvm_autotuned.tar --device 'rocm' --print-time --repeat 100`

Result:

    Execution time summary:
    mean (ms)   max (ms)   min (ms)   std (ms)
     26.86      27.11      26.72       0.07

Why did the performance get worse after auto-tuning (mean 6.15 ms without tuning versus 26.86 ms with tuning)? Thanks!

@tqchen I have encountered the same problem. Could you give me an answer?

I ran into similar problems with CUDA a year ago and eventually solved them. My suggestion is to change some of the tuning options: increasing `number`, `timeout`, or `min_repeat_ms` may help.
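A minimal sketch of that suggestion, reusing the `tvmc tune` command from the question. The values are illustrative assumptions, not settings verified on ROCm, and the output file name is hypothetical. `--trials` is not part of the suggestion above; it is included only as an assumption, because the log shows just 35 measurements per task. Check `tvmc tune --help` to confirm which flags your installed version accepts.

```shell
# Illustrative values only (assumptions, not tested on ROCm).
NUMBER=20            # runs per repeat
REPEAT=3             # repeats per measurement
TIMEOUT=30           # timeout, in seconds
MIN_REPEAT_MS=1000   # minimum run time per trial, in milliseconds
TRIALS=2000          # maximum number of tuning trials

# The output file name is hypothetical; it keeps the new records
# separate from the original resnet50-v2-7-autotuner_records.json.
tvmc tune --target "rocm" \
  --number "$NUMBER" \
  --repeat "$REPEAT" \
  --timeout "$TIMEOUT" \
  --min-repeat-ms "$MIN_REPEAT_MS" \
  --trials "$TRIALS" \
  --output resnet50-v2-7-autotuner_records_retuned.json \
  resnet50-v2-7.onnx
```

After tuning, recompile with `--tuning-records` pointing at the new file, rerun `tvmc run` with `--print-time`, and compare the result against the 6.15 ms mean of the untuned build.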
