Hello TVM community!
Recently I have been trying to deploy a transformer-based model, LightGlue-ONNX, on an RK3588 SoC running Linux.
I exported the second-stage model (LightGlue) with ONNX opset 17, fixed all the dynamic input shapes, and removed the postprocessing nodes (to avoid dynamic output shapes).
Although the RK3588 SoC has a built-in NPU for DNN inference workloads and its advertised GFLOPS look pretty promising, the model runs very slowly on it.
After that I tried to run the model on the GPU instead. The GPU is a Mali-G610 MP4 with the Valhall architecture (not officially supported by TVM?).
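For anyone who wants to reproduce that step, pinning the dynamic dims can be done with something like the following (a minimal sketch using the onnx Python API; the file names and the fixed size of 192 are just placeholders matching the shapes shown in the profile below, not the actual export script):

import onnx

# Sketch: pin every symbolic input dimension to a fixed value.
m = onnx.load("superpoint_lightglue.onnx")
for inp in m.graph.input:
    for dim in inp.type.tensor_type.shape.dim:
        if dim.dim_param:            # a symbolic / dynamic dimension
            dim.dim_value = 192      # overwrite with a fixed size (placeholder)
onnx.checker.check_model(m)
onnx.save(m, "modified_modified_superpoint_lightglue.c_192.onnx")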
Compiling and running the model directly on the board works fine:
> python -m tvm.driver.tvmc compile --target="opencl" -v --mixed-precision --output modified_modified_superpoint_lightglue.c_192.tar ./modified_modified_superpoint_lightglue.c_192.onnx
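(For reference, this compile step should be roughly equivalent to the following tvmc Python-API sketch; the --mixed-precision handling is omitted here and exact keyword names may differ between TVM versions:)

from tvm.driver import tvmc

# Load the ONNX model and compile it for the OpenCL target on the board.
model = tvmc.load("./modified_modified_superpoint_lightglue.c_192.onnx")
package = tvmc.compile(
    model,
    target="opencl",
    package_path="modified_modified_superpoint_lightglue.c_192.tar",
)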
> python -m tvm.driver.tvmc run \
--print-time \
--device cl \
--repeat 20 \
--profile \
modified_modified_superpoint_lightglue.c_192.tar
2023-12-15 00:34:21.563 INFO load_module /tmp/tmpr4ua8xzi/mod.so
arm_release_ver: g13p0-01eac0, rk_so_ver: 10
Name Duration (us) Percent Device Count Argument Shapes Hash
tvmgen_default_fused_nn_dense_3 308,732.24 32.63 opencl0 36 float16[192, 512], float16[512, 512], float16[192, 512] 267c1c897c522d1b
tvmgen_default_fused_nn_dense_2 204,685.15 21.63 opencl0 74 float16[192, 256], float16[256, 256], float16[192, 256] 7b2bcc6362278ac4
tvmgen_default_fused_nn_dense_4 154,909.36 16.37 opencl0 36 float16[192, 512], float16[256, 512], float16[192, 256] a4b21dd8f9fe18a8
tvmgen_default_fused_nn_dense 149,910.48 15.84 opencl0 18 float16[192, 256], float16[768, 256], float16[192, 768] 5368e38e23fa2487
tvmgen_default_fused_nn_batch_matmul_1 74,367.29 7.86 opencl0 36 float16[4, 192, 192], float16[4, 64, 192], float16[4, 192, 64] cedb205ddae014df
tvmgen_default_fused_nn_batch_matmul 25,580.32 2.70 opencl0 36 float16[4, 192, 64], float16[4, 192, 64], float16[4, 192, 192] 06a857f19a78affd
tvmgen_default_fused_nn_softmax 11,022.54 1.16 opencl0 36 float32[1, 4, 192, 192], float32[1, 4, 192, 192] 2d2cd8e20e98cc5b
tvmgen_default_fused_nn_batch_matmul_2 2,835.36 0.30 opencl0 1 float16[1, 192, 256], float16[1, 192, 256], float16[1, 192, 192] 3090cc1a0295312a
tvmgen_default_fused_subtract_add_rsqrt_multiply_multiply_add_divide_erf_add_multiply_multiply__91861172529e67a9_ 1,793.33 0.19 opencl0 36 float32[1, 192, 512], float32[1, 192, 1], float32[1, 192, 1], float32[512], float32[512], float16[192, 512] 99f46fc7dfbc6bfd
tvmgen_default_fused_mean 1,782.77 0.19 opencl0 36 float32[1, 192, 512], float32[1, 192, 1] 04198f5971a3f5c1
tvmgen_default_fused_variance 1,765.75 0.19 opencl0 36 float32[1, 192, 512], float32[1, 192, 1], float32[1, 192, 1] 56ecc0ed1803987c
tvmgen_default_fused_reshape_multiply_add_cast 1,714.21 0.18 opencl0 36 float16[4, 192, 192], float16[], float16[192, 192], float32[1, 4, 192, 192] 2ebfaeb1e38d2cbc
tvmgen_default_fused_reshape_add_reshape_transpose 1,019.59 0.11 opencl0 18 float16[192, 768], float16[768], float16[1, 4, 192, 64, 3] 4969482d21594d05
tvmgen_default_fused_take_reshape_transpose 946.05 0.10 opencl0 18 float16[1, 4, 192, 64, 3], int64[], float16[4, 64, 192] 3d1440e284e116e3
tvmgen_default_fused_reshape_cast 841.56 0.09 opencl0 36 float32[1, 4, 192, 192], float16[4, 192, 192] 0fc15a56f1397bb2
tvmgen_default_fused_reshape_add_cast 733.13 0.08 opencl0 36 float16[192, 512], float16[512], float32[1, 192, 512] 50f61d6cfd9655c4
tvmgen_default_fused_take 456.43 0.05 opencl0 36 float16[1, 4, 192, 64, 3], int64[], float16[1, 4, 192, 64] bdcb7d6eb780ae63
tvmgen_default_fused_nn_dense_1 403.41 0.04 opencl0 2 float16[192, 2], float16[32, 2], float16[192, 32] 36dc743c363e1b2f
tvmgen_default_fused_reshape_add_concatenate_reshape 384.28 0.04 opencl0 34 float16[192, 256], float16[256], float16[1, 192, 256], float16[192, 512] 88a2658d4b438265
tvmgen_default_fused_reshape_add_reshape_transpose_reshape_transpose 372.48 0.04 opencl0 18 float16[192, 256], float16[256], float16[4, 64, 192] ff2e08744e10d0ed
tvmgen_default_fused_reshape_add_add 311.74 0.03 opencl0 34 float16[192, 256], float16[256], float16[1, 192, 256], float16[1, 192, 256] d13ea778ab1d0b22
tvmgen_default_fused_multiply_reshape_take_negative_expand_dims_take_expand_dims_concatenate_re_56c26eeeac2acb52_ 287.16 0.03 opencl0 18 float16[1, 4, 192, 64], float16[1, 1, 192, 64], int64[], int64[], float16[1, 1, 192, 64], float16[4, 192, 64] 7bfb4b2f58b72290
tvmgen_default_fused_multiply_reshape_take_negative_expand_dims_take_expand_dims_concatenate_re_b37aa94ae7f51895_ 286.21 0.03 opencl0 18 float16[1, 4, 192, 64], float16[1, 1, 192, 64], int64[], int64[], float16[1, 1, 192, 64], float16[4, 192, 64] 200c573badd5d6aa
tvmgen_default_fused_reshape_transpose_reshape 220.83 0.02 opencl0 36 float16[4, 192, 64], float16[192, 256] e40d55a93d0386e1
tvmgen_default_fused_transpose_reshape_transpose 173.43 0.02 opencl0 18 float16[1, 192, 4, 64], float16[4, 192, 64] ca2ea32624b573a5
tvmgen_default_fused_transpose_reshape 171.46 0.02 opencl0 18 float16[1, 192, 4, 64], float16[4, 192, 64] 5e8056f97a04dc2c
tvmgen_default_fused_nn_softmax_log_add_add_sigmoid_log_add_cast_add 164.23 0.02 opencl0 1 float32[1, 192, 192], float32[1, 192, 192], float16[1], float16[1, 192, 1], float16[1, 1, 192], float32[1, 192, 192] 390ec76a5e8af46c
tvmgen_default_fused_reshape_add_reshape 128.73 0.01 opencl0 18 float16[192, 256], float16[256], float16[1, 192, 4, 64] 4ceac3de29a171f2
tvmgen_default_fused_nn_softmax_log 120.99 0.01 opencl0 1 float32[1, 192, 192], float32[1, 192, 192] 45ca279ac9a1fd79
tvmgen_default_fused_nn_dense_5 30.22 0.00 opencl0 2 float16[192, 256], float16[1, 256], float16[192, 1] d13b9546ba2def27
tvmgen_default_fused_reshape_add_cast_concatenate_reshape_cast 25.33 0.00 opencl0 2 float16[192, 256], float16[256], float32[1, 192, 256], float16[192, 512] cb50c2a98e9cf1e6
tvmgen_default_fused_cast 21.32 0.00 opencl0 2 float32[1, 192, 256], float16[1, 192, 256] 76ad18e0bd1e222b
tvmgen_default_fused_reshape_add_add_reshape 18.72 0.00 opencl0 2 float16[192, 256], float16[256], float16[1, 192, 256], float16[192, 256] f6ca35b33932779b
tvmgen_default_fused_reshape_cos_expand_dims_sin_expand_dims_concatenate_expand_dims_concatenat_925c438b4a75d8e6_ 17.84 0.00 opencl0 2 float16[192, 32], float16[2, 1, 1, 192, 64] 3e538fe9e3e636bc
tvmgen_default_fused_reshape_add_divide 17.52 0.00 opencl0 2 float16[192, 256], float16[256], float16[], float16[1, 192, 256] bcfdc9f621039655
tvmgen_default_fused_reshape_cast_1 16.77 0.00 opencl0 1 float16[1, 192, 192], float32[1, 192, 192] bee91663beffe750
tvmgen_default_fused_take_1 16.35 0.00 opencl0 4 float16[2, 1, 1, 192, 64], int64[], float16[1, 1, 192, 64] e438271edd0ffa9e
tvmgen_default_fused_cast_reshape 4.08 0.00 opencl0 2 float32[1, 192, 2], float16[192, 2] 5e9a837516eecf71
tvmgen_default_fused_reshape_add_sigmoid_log_transpose 3.52 0.00 opencl0 1 float16[192, 1], float16[1], float16[1, 1, 192] 04bf883b5fa48a9b
__nop 0.00 0.00 opencl0 36 float16[1, 192, 256], float16[192, 256] 046b5744937869f9
__nop 0.00 0.00 opencl0 1 float16[192, 1], float16[1, 192, 1] 679c7c526f85cf65
----------
Sum 946,292.17 100.00 834
Total 946,292.17 opencl0 1
Configuration
-------------
Number of threads: 4
Executor: Graph
Execution time summary:
mean (ms) median (ms) max (ms) min (ms) std (ms)
944.6400 944.6661 945.1847 943.6985 0.3578
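As a rough sanity check on how far this is from the GPU's advertised throughput, here is a back-of-envelope estimate for the hottest kernel, computed from the profile numbers above (my own arithmetic, not TVM output):

# GFLOPS estimate for the hottest kernel (fused_nn_dense_3), from the profile above.
m, k, n = 192, 512, 512              # float16[192,512] x float16[512,512]
flops_per_call = 2 * m * k * n       # one multiply-accumulate counted as 2 FLOPs
calls = 36
total_seconds = 308_732.24e-6        # total duration across all 36 calls
gflops = flops_per_call * calls / total_seconds / 1e9
print(f"{gflops:.1f} GFLOPS")        # ~11.7 GFLOPS, far below the advertised peak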
Then I tried to tune it with the auto-scheduler by running the following on the host:
tvmc -v tune --timeout 25 --rpc-tracker 127.0.0.1:9190 --rpc-key rk3588 --target="opencl -device=mali" --target-host="llvm -mtriple=aarch64-linux-gnu" --mixed-precision --tuning-records lightglue-192-addout-tvm-mixprec-cl-ansor.json --output lightglue-192-addout-tvm-mixprec-cl-ansor.json --enable-autoscheduler --log-estimated-latency --trials 10000 --early-stopping 700 ./modified_modified_superpoint_lightglue.c_192.onnx > lightglue-192-addout-tvm-mixprec-cl-ansor.log 2>lightglue-192-addout-tvm-mixprec-cl-ansor.err.log
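For reference, I believe this corresponds roughly to the following tvmc Python-API call (a sketch; keyword names, especially the RPC-related ones, may differ slightly between TVM versions):

from tvm.driver import tvmc

model = tvmc.load("./modified_modified_superpoint_lightglue.c_192.onnx")
# Auto-scheduler (Ansor) tuning over the RPC tracker set up as above.
tvmc.tune(
    model,
    target="opencl -device=mali",
    target_host="llvm -mtriple=aarch64-linux-gnu",
    enable_autoscheduler=True,
    tuning_records="lightglue-192-addout-tvm-mixprec-cl-ansor.json",
    trials=10000,
    early_stopping=700,
    timeout=25,
    rpc_key="rk3588",
    hostname="127.0.0.1",
    port=9190,
)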
However, the summary shows that only 4 out of 14 tasks were tuned successfully.
| ID | Task Description | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
| 0 | vm_mod_fused_variance | - | - | 192 |
| 1 | vm_mod_fused_nn_dense | - | - | 64 |
| 2 | vm_mod_fused_nn_softmax_log_add_add_sigmoid_log_add_cast_add | - | - | 64 |
| 3 | vm_mod_fused_nn_softmax_log | - | - | 64 |
| 4 | vm_mod_fused_nn_batch_matmul_1 | 0.380 | 49.61 | 64 |
| 5 | vm_mod_fused_nn_softmax | 1.061 | 0.56 | 64 |
| 6 | vm_mod_fused_nn_batch_matmul | 0.459 | 41.08 | 64 |
| 7 | vm_mod_fused_nn_dense_5 | - | - | 64 |
| 8 | vm_mod_fused_nn_batch_matmul_2 | 0.429 | 43.95 | 64 |
| 9 | vm_mod_fused_nn_dense_4 | - | - | 192 |
| 10 | vm_mod_fused_nn_dense_3 | - | - | 192 |
| 11 | vm_mod_fused_nn_dense_2 | - | - | 320 |
| 12 | vm_mod_fused_nn_dense_1 | - | - | 64 |
| 13 | vm_mod_fused_mean | - | - | 128 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: - ms Trials: 1408 Used time : 10564 s Next ID: 13
The other tasks just errored out with the following errors:
A.
results: MeasureResult(error_type:InstantiationError, error_msg:Traceback (most recent call last):
File "/home/zt/.conda/envs/py3.10/lib/python3.10/site-packages/tvm/auto_scheduler/measure.py", line 619, in _local_build_worker
sch, args = task.compute_dag.apply_steps_from_state(
File "/home/zt/.conda/envs/py3.1
...
1: _ZZN3tvm3tir11ExprFunctorI
0: tvm::auto_scheduler::IndexRewriter::VisitExpr_(tvm::tir::ProducerLoadNode const*)
File "/workspace/tvm/src/auto_scheduler/compute_dag.cc", line 764
InternalError: Check failed: (name_it != name_to_arg.end()) is false:
Looks like this bug has been in the codebase for a long time. See: apache/tvm/issues/10369
B.
results: MeasureResult(error_type:RuntimeDeviceError, error_msg:Traceback (most recent call last):
File "/home/zt/.conda/envs/py3.10/lib/python3.10/site-packages/tvm/auto_scheduler/measure.py", line 1145, in _rpc_run
costs = time_f(*loc_args).results
File "/home/zt/.conda/envs/py3.10/lib/python3.10/site-package
...
vm/src/runtime/rpc/rpc_endpoint.cc", line 390
RPCError: Error caught from RPC call:
[22:49:25] /home/firefly/tvm/src/runtime/opencl/opencl_common.h:518: InternalError: Check failed: (e == CL_SUCCESS) is false: OpenCL Error, code=-6: CL_OUT_OF_HOST_MEMORY
I’m not sure whether this is a TVM bug or a bug in the Mali OpenCL driver. The error is also accompanied by a Linux kernel error:
[16474.201042] mali fb000000.gpu: Ctx 186878_753 Group 1 CSG 0 CSI: 0
CS_FATAL.EXCEPTION_TYPE: 0x40 (CS_CONFIG_FAULT)
CS_FATAL.EXCEPTION_DATA: 0x0
CS_FATAL_INFO.EXCEPTION_DATA: 0x0
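To help narrow down whether the driver is simply hitting a resource limit, one thing that can be checked on the board is the device limits reported by the OpenCL driver, e.g. with pyopencl (assuming it is installed; this is just a diagnostic sketch, not something TVM requires):

import pyopencl as cl

# Dump the limits the Mali OpenCL driver reports, to compare against the
# resources the failing tuning candidates try to use.
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name)
        print("  global mem size   :", dev.global_mem_size)
        print("  max alloc size    :", dev.max_mem_alloc_size)
        print("  local mem size    :", dev.local_mem_size)
        print("  max work-group    :", dev.max_work_group_size)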
TVM was built from the latest code on the git main branch.
The log files and ONNX model can be downloaded here:
Thanks for all the effort that has gone into this great software!