Autotune error with Qualcomm Adreno

themachine013 · February 13, 2019, 6:26am

Hi there,

I am trying to autotune my model using android_rpc on my Android phone (Snapdragon 845).

My backend is OpenCL. Here is my OpenCL device info:

PlatformName: QUALCOMM Snapdragon™
Device: QUALCOMM Adreno™
1.1 Hardware version: OpenCL 2.0 Adreno™ 630
1.2 Software version: OpenCL 2.0 QUALCOMM build: commit #78d547b changeid #I4ca2995ce0 Date: 04/11/18 Wed Local Branch: Remote Branch: refs/tags/AU_LINUX_ANDROID_LA.UM.6.3.R1.08.00.00.301.091 Compiler E031.35.02.06
1.3 OpenCL C version: OpenCL C 2.0 Adreno™ 630
1.4 Parallel compute units: 2

I can successfully compile and run my model over Android RPC, but inference is very slow (~4 s), so I am trying autotuning.

But after a few initial runs, I always see "0.00/0.00 GFLOPS". With debug logging on, I see three types of errors (error_no=1, 4, and 7), like the following:

DEBUG:autotvm:No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(RuntimeError(‘Except caught from RPC call: [14:34:04] /tvm/apps/android_rpc/app/src/main/jni/…/…/…/…/…/…/include/…/src/runtime/module_util.cc:53: Check failed: ret == 0 (-1 vs. 0) [14:34:04]
/tvm/apps/android_rpc/app/src/main/jni/…/…/…/…/…/…/include/…/src/runtime/opencl/opencl_module.cc:216: OpenCL build error for device=0x6fb89f2068Pass’,),), error_no=4, all_cost=2.9775819778442383, timestamp=1550039644.957301) [(‘tile_b’, [16, 1, 1, 1]), (‘tile_y’, [4, 8, 4, 2]), (‘tile_x’, [2, 14, 4, 21]), (‘tile_rc’, [256, 1]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 1)],winograd,None,6412458

DEBUG:autotvm:No: 13 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=5, timestamp=1550039770.987435) [(‘tile_b’, [16, 1, 1, 1]), (‘tile_y’, [8, 1, 1, 32]), (‘tile_x’, [6, 4, 7, 14]), (‘tile_rc’, [256, 1]), (‘auto_unroll_max_step’, 128), (‘unroll_explicit’, 0)],winograd,None,2241835

DEBUG:autotvm:No: 24 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(InstantiationError(‘Skipped because of invalid gpu kernel’,),), error_no=1, all_cost=0.045018911361694336, timestamp=1550039771.325883) [(‘tile_b’, [16, 1, 1, 1]), (‘tile_y’, [8, 4, 8, 1]), (‘tile_x’, [1, 168, 7, 2]), (‘tile_rc’, [4, 64]), (‘auto_unroll_max_step’, 0), (‘unroll_explicit’, 0)],winograd,None,1445426

My target and host settings are:
target = 'opencl'
target_host = 'llvm -target=arm64-linux-android'

Any help?

Additional info:
If I comment out the tuning step and go directly to the compile, upload, and time-evaluation stage, the time cost is measured correctly.

If I change the target to a similar one, 'opencl -device mali', the code also runs correctly without autotuning.
With autotuning, though, the only error left is error_no=7:
DEBUG:autotvm:No: 24 GFLOPS: 0.00/0.00 result: MeasureResult(costs=(’’,), error_no=7, all_cost=5, timestamp=1550056627.234274) [(‘tile_bna’, 2), (‘tile_bnb’, 2), (‘tile_t1’, [256, 1]), (‘tile_t2’, [128, 2]), (‘c_unroll’, [32, 8]), (‘yt’, 32)],winograd,None,37006

I also noticed that the Android RPC app often goes back to its main page and then returns to the stop_rpc page.
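For readers triaging a similar log, the error codes in the lines above can be tallied with a short script. This is only a sketch: the name table reflects autotvm's `MeasureErrorNo` enumeration as I understand it, so check it against your TVM version, and the helper names (`tally_errors`, `summarize`) and the abbreviated sample lines are made up for illustration.

```python
import re
from collections import Counter

# Names as in autotvm's MeasureErrorNo (assumption: verify for your TVM version).
ERROR_NAMES = {
    0: "NO_ERROR",
    1: "INSTANTIATION_ERROR",
    2: "COMPILE_HOST",
    3: "COMPILE_DEVICE",
    4: "RUNTIME_DEVICE",
    5: "WRONG_ANSWER",
    6: "BUILD_TIMEOUT",
    7: "RUN_TIMEOUT",
    8: "UNKNOWN_ERROR",
}

ERROR_NO_PATTERN = re.compile(r"error_no=(\d+)")


def tally_errors(log_lines):
    """Count how often each error_no value appears in autotvm debug lines."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_NO_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def summarize(counts):
    """Map numeric codes to readable names (hypothetical helper)."""
    return {ERROR_NAMES.get(code, "UNRECOGNIZED"): n
            for code, n in sorted(counts.items())}


# Abbreviated stand-ins for the three log lines quoted in the post.
sample = [
    "DEBUG:autotvm:No: 4 GFLOPS: 0.00/0.00 result: MeasureResult(..., error_no=4, all_cost=2.97)",
    "DEBUG:autotvm:No: 13 GFLOPS: 0.00/0.00 result: MeasureResult(..., error_no=7, all_cost=5)",
    "DEBUG:autotvm:No: 24 GFLOPS: 0.00/0.00 result: MeasureResult(..., error_no=1, all_cost=0.045)",
]
summary = summarize(tally_errors(sample))
print(summary)
```

Running the same tally over a full tuning log gives a quick view of whether timeouts, build failures, or invalid kernels dominate.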

eqy · February 14, 2019, 2:50am

To rule out OpenCL issues, did you use OpenCL before trying autotuning? Aside from that, error_no=7 corresponds to a timeout error, so you may have better luck increasing the timeout of RPCRunner in your tuning script.

Finally, the closing and re-opening of the RPC page is normal, as this is the mechanism we use to enforce process isolation and timeouts in the app. Unfortunately, this is the best solution to ensure that the running application gets the highest execution priority on the Android operating system (by being visible).
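As a rough illustration of the suggestion to raise the RPCRunner timeout, the measure options in a tuning script might look like the sketch below. The device key, tracker host and port, `number`, and the timeout of 50 seconds are all placeholder assumptions, not values from this thread. TVM is imported inside the function so that the settings can be inspected on a machine without TVM installed.

```python
# Placeholder settings: adjust every value to your own tracker and device.
RUNNER_SETTINGS = {
    "key": "android",   # device key registered with the RPC tracker (assumed)
    "host": "0.0.0.0",  # RPC tracker host (assumed)
    "port": 9190,       # RPC tracker port (assumed)
    "number": 5,        # runs averaged per measurement (assumed)
    "timeout": 50,      # per-measurement run timeout in seconds (assumed)
}


def make_measure_option(settings):
    """Build autotvm measure options from a settings dict.

    Requires TVM with autotvm; build_func="ndk" assumes the Android NDK
    toolchain is configured as in the android_rpc setup.
    """
    from tvm import autotvm  # lazy import: only needed when tuning

    return autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="ndk"),
        runner=autotvm.RPCRunner(**settings),
    )


print("run timeout (s):", RUNNER_SETTINGS["timeout"])
```

The returned object would then be passed as `measure_option` to the tuner's `tune` call in the tuning script.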

themachine013 · February 14, 2019, 3:45am

Thanks for your kind reply.

I increased my timeout from 5 to 500. It works now in the initial runs: the results are no longer all 0.00/0.00 GFLOPS. But I still hit timeout errors very often.

DEBUG:autotvm:No: 1 GFLOPS: 12.48/12.48 result: MeasureResult(costs=(0.8895479372,), error_no=0, all_cost=32.59685492515564, timestamp=1550115478.580121) [(‘tile_bna’, 1), (‘tile_bnb’, 8), (‘tile_t1’, [4, 64]), (‘tile_t2’, [256, 1]), (‘c_unroll’, [128, 2]), (‘yt’, 4)],winograd,None,14565
DEBUG:autotvm:No: 2 GFLOPS: 12.47/12.48 result: MeasureResult(costs=(0.8897763851,), error_no=0, all_cost=25.782068967819214, timestamp=1550115471.623367) [(‘tile_bna’, 4), (‘tile_bnb’, 8), (‘tile_t1’, [8, 32]), (‘tile_t2’, [8, 32]), (‘c_unroll’, [256, 1]), (‘yt’, 8)],winograd,None,20342
DEBUG:autotvm:No: 3 GFLOPS: 37.17/37.17 result: MeasureResult(costs=(0.298543953,), error_no=0, all_cost=43.18819999694824, timestamp=1550115488.919654)[(‘tile_bna’, 4), (‘tile_bnb’, 4), (‘tile_t1’, [8, 32]), (‘tile_t2’, [16, 16]), (‘c_unroll’, [32, 8]), (‘yt’, 32)],winograd,None,37737
.
.
.
DEBUG:autotvm:No: 22 GFLOPS: 0.00/37.17 result: MeasureResult(costs=(’’,), error_no=7, all_cost=500, timestamp=1550115811.343114) [(‘tile_bna’, 1), (‘tile_bnb’, 2), (‘tile_t1’, [8, 32]), (‘tile_t2’, [8, 32]), (‘c_unroll’, [32, 8]), (‘yt’, 16)],winograd,None,31530
DEBUG:autotvm:No: 23 GFLOPS: 0.00/37.17 result: MeasureResult(costs=(’’,), error_no=7, all_cost=500, timestamp=1550115811.343155) [(‘tile_bna’, 1), (‘tile_bnb’, 4), (‘tile_t1’, [32, 8]), (‘tile_t2’, [32, 8]), (‘c_unroll’, [128, 2]), (‘yt’, 4)],winograd,None,15085
DEBUG:autotvm:No: 24 GFLOPS: 0.00/37.17 result: MeasureResult(costs=(’’,), error_no=7, all_cost=500, timestamp=1550115811.343188) [(‘tile_bna’, 2), (‘tile_bnb’, 16), (‘tile_t1’, [4, 64]), (‘tile_t2’, [2, 128]), (‘c_unroll’, [64, 4]), (‘yt’, 4)],winograd,None,17571

Is that normal? I further increased the timeout to 50000, and it still shows many error_no=7 results.

eqy · February 14, 2019, 7:49pm

You should not increase the timeout once any results come back with error_no=0 (that is, once the results are no longer 0.00/0.00). Early on in tuning, timeouts are normal, and the number of timeouts should decrease as the cost model becomes more accurate. Note that increasing the timeout can actually slow down the tuning process, because we use the timeout as a way to prune the obviously bad configurations.
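One way to check whether timeouts are in fact tapering off is to compute the share of error_no=7 trials in consecutive windows of the tuning log. The sketch below assumes the error codes have already been extracted in trial order. The function name and the sample sequence are made up for illustration and are not taken from the logs in this thread.

```python
def timeout_rate_per_window(error_codes, window):
    """Fraction of run-timeout trials (error_no=7) in each consecutive
    window of `window` trials; the last window may be shorter."""
    if window <= 0:
        raise ValueError("window must be positive")
    rates = []
    for start in range(0, len(error_codes), window):
        chunk = error_codes[start:start + window]
        timeouts = sum(1 for code in chunk if code == 7)
        rates.append(timeouts / len(chunk))
    return rates


# Made-up sequence of error_no values in trial order.
codes = [7, 7, 0, 7, 0, 7, 0, 7, 0, 0, 7, 0]
rates = timeout_rate_per_window(codes, 4)
print(rates)
```

A falling sequence of rates is consistent with the cost model learning to avoid slow configurations; a flat, high rate may point to a device or runtime problem rather than a timeout that is too short.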