Updated by @merrymercy: see post20 for the new results
I tried running the Relay auto-scheduler tutorial on my Radeon R9 Nano (8 TFLOPS peak) via the rocm backend. It didn’t work out of the box, but after a simple fix, I got the following result on resnet50. It uses the NCHW layout, since the rocm backend currently doesn’t support NHWC.
| ID | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
| 0 | 0.023 | 0.17 | 64 |
| 1 | 0.312 | 13.14 | 192 |
| 2 | 0.014 | -0.00 | 64 |
| 3 | 0.148 | 699.13 | 128 |
| 4 | 0.381 | 607.11 | 512 |
| 5 | 0.195 | 528.04 | 320 |
| 6 | 0.134 | 770.08 | 192 |
| 7 | 0.285 | 180.70 | 128 |
| 8 | 0.132 | 783.08 | 64 |
| 9 | 0.254 | 911.56 | 704 |
| 10 | 0.174 | 590.92 | 448 |
| 11 | 0.112 | 922.53 | 320 |
| 12 | 0.096 | 534.17 | 128 |
| 13 | 0.117 | 891.38 | 128 |
| 14 | 0.223 | 1036.67 | 448 |
| 15 | 0.121 | 852.42 | 192 |
| 16 | 0.121 | 853.46 | 192 |
| 17 | 0.074 | 692.40 | 128 |
| 18 | 0.117 | 896.98 | 128 |
| 19 | 0.221 | 1046.90 | 384 |
| 20 | 0.113 | 914.47 | 128 |
| 21 | 0.113 | 917.25 | 128 |
| 22 | 0.032 | 810.28 | 64 |
| 23 | 0.042 | 52.90 | 64 |
| 24 | 0.224 | 1062.46 | 128 |
| 25 | 0.116 | 882.35 | 128 |
| 26 | 0.224 | 916.39 | 128 |
| 27 | 0.285 | 721.63 | 128 |
| 28 | 0.423 | 485.46 | 256 |
-------------------------------------------------
Estimated total latency: 10.148 ms Trials: 6016 Used time : 13537 s Next ID: 4
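For anyone who wants to slice this log further, here is a minimal sketch that parses the rows above, checks that the per-task trial counts add up to the reported total, and lists the tasks with the highest latency. The names `LOG`, `ROW`, and `parse_rows` are just made up for this example and are not part of TVM; the rows are copied from the table as printed.

```python
import re

# Rows copied verbatim from the tuning log above (header and rule lines omitted).
LOG = """\
| 0 | 0.023 | 0.17 | 64 |
| 1 | 0.312 | 13.14 | 192 |
| 2 | 0.014 | -0.00 | 64 |
| 3 | 0.148 | 699.13 | 128 |
| 4 | 0.381 | 607.11 | 512 |
| 5 | 0.195 | 528.04 | 320 |
| 6 | 0.134 | 770.08 | 192 |
| 7 | 0.285 | 180.70 | 128 |
| 8 | 0.132 | 783.08 | 64 |
| 9 | 0.254 | 911.56 | 704 |
| 10 | 0.174 | 590.92 | 448 |
| 11 | 0.112 | 922.53 | 320 |
| 12 | 0.096 | 534.17 | 128 |
| 13 | 0.117 | 891.38 | 128 |
| 14 | 0.223 | 1036.67 | 448 |
| 15 | 0.121 | 852.42 | 192 |
| 16 | 0.121 | 853.46 | 192 |
| 17 | 0.074 | 692.40 | 128 |
| 18 | 0.117 | 896.98 | 128 |
| 19 | 0.221 | 1046.90 | 384 |
| 20 | 0.113 | 914.47 | 128 |
| 21 | 0.113 | 917.25 | 128 |
| 22 | 0.032 | 810.28 | 64 |
| 23 | 0.042 | 52.90 | 64 |
| 24 | 0.224 | 1062.46 | 128 |
| 25 | 0.116 | 882.35 | 128 |
| 26 | 0.224 | 916.39 | 128 |
| 27 | 0.285 | 721.63 | 128 |
| 28 | 0.423 | 485.46 | 256 |
"""

# One row: | task id | latency in ms | speed in GFLOPS | number of trials |
ROW = re.compile(r"^\|\s*(\d+)\s*\|\s*([\d.]+)\s*\|\s*(-?[\d.]+)\s*\|\s*(\d+)\s*\|$")


def parse_rows(text):
    """Return a list of (task_id, latency_ms, gflops, trials) tuples."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append((int(m.group(1)), float(m.group(2)),
                         float(m.group(3)), int(m.group(4))))
    return rows


rows = parse_rows(LOG)

# The per-task trial counts should add up to the "Trials" figure in the summary line.
total_trials = sum(r[3] for r in rows)

# Three tasks with the highest single-run latency.
slowest = sorted(rows, key=lambda r: r[1], reverse=True)[:3]

# Average wall-clock tuning cost per trial, using the "Used time" from the summary line.
seconds_per_trial = 13537 / total_trials

print("total trials:", total_trials)
print("slowest task IDs:", [r[0] for r in slowest])
print("seconds per trial: %.2f" % seconds_per_trial)
```

Note that the plain sum of the Latency column comes out well below the estimated total of 10.148 ms. My understanding is that the total weights each task by how often it appears in the network, so the two numbers are not expected to match.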
So auto-scheduler performance on NCHW resnet50 is about 10.2 ms. For comparison, AutoTVM performance on the same model on the same GPU is 6.45 ms.
Performance comparison
- Auto-scheduler: 10.2 ms (done last week)
- AutoTVM: 6.45 ms (done two years ago)
- TVM + MIOpen: 6.15 ms (done two years ago)
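To put the gap in relative terms, the small sketch below computes how much slower the auto-scheduler number is than each of the two older results. The latencies are the ones listed above; the dictionary and function names are only illustrative.

```python
# End-to-end resnet50 latencies in milliseconds, taken from the list above.
results_ms = {
    "auto-scheduler": 10.2,
    "AutoTVM": 6.45,
    "TVM + MIOpen": 6.15,
}


def slowdown(candidate_ms, reference_ms):
    """How many times slower the candidate is than the reference."""
    return candidate_ms / reference_ms


vs_autotvm = slowdown(results_ms["auto-scheduler"], results_ms["AutoTVM"])
vs_miopen = slowdown(results_ms["auto-scheduler"], results_ms["TVM + MIOpen"])

print("vs AutoTVM:      %.2fx" % vs_autotvm)
print("vs TVM + MIOpen: %.2fx" % vs_miopen)
```

So the auto-scheduler result is roughly 1.6x slower than both older numbers, keeping in mind that those were measured two years ago and possibly on a different software stack.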
Even though there is still a big gap compared with the AutoTVM result, I’d say getting this number without a manual template is already impressive!
Here are my questions:
- Has anybody tried Ansor on AMDGPU?
- Does the above result look reasonable?
- How can we close the gap with AutoTVM? Would introducing NHWC support help?
I think the rocm backend would be interesting for Ansor because it is the only well-supported backend that does GPU codegen via LLVM.