Hello TVM community!
Recently I have been trying to deploy a transformer-based model, LightGlue-ONNX, on an RK3588 SoC running Linux.
I exported the second-stage model (LightGlue) with ONNX opset 17, fixed all the dynamic input shapes, and removed the postprocessing nodes (to avoid dynamic output shapes).
Although the RK3588 has a built-in NPU for DNN inference workloads and its advertised GFLOPS look pretty promising, the model runs very slowly on it.
After that I tried running the model on the GPU instead. The GPU is a Mali-G610 MP4 with the Valhall architecture (not officially supported by TVM?).
Compiling and running the model directly on the board works fine:
python -m tvm.driver.tvmc compile --target="opencl" -v --mixed-precision --output modified_modified_superpoint_lightglue.c_192.tar ./modified_modified_superpoint_lightglue.c_192.onnx
python -m tvm.driver.tvmc run \
    --print-time \
    --device cl \
    --repeat 20 \
    --profile \
    modified_modified_superpoint_lightglue.c_192.tar
2023-12-15 00:34:21.563 INFO load_module /tmp/tmpr4ua8xzi/mod.so
arm_release_ver: g13p0-01eac0, rk_so_ver: 10
Name                                                                                                               Duration (us)  Percent   Device  Count                                                                                                       Argument Shapes              Hash  
tvmgen_default_fused_nn_dense_3                                                                                       308,732.24    32.63  opencl0     36                                                               float16[192, 512], float16[512, 512], float16[192, 512]  267c1c897c522d1b  
tvmgen_default_fused_nn_dense_2                                                                                       204,685.15    21.63  opencl0     74                                                               float16[192, 256], float16[256, 256], float16[192, 256]  7b2bcc6362278ac4  
tvmgen_default_fused_nn_dense_4                                                                                       154,909.36    16.37  opencl0     36                                                               float16[192, 512], float16[256, 512], float16[192, 256]  a4b21dd8f9fe18a8  
tvmgen_default_fused_nn_dense                                                                                         149,910.48    15.84  opencl0     18                                                               float16[192, 256], float16[768, 256], float16[192, 768]  5368e38e23fa2487  
tvmgen_default_fused_nn_batch_matmul_1                                                                                 74,367.29     7.86  opencl0     36                                                        float16[4, 192, 192], float16[4, 64, 192], float16[4, 192, 64]  cedb205ddae014df  
tvmgen_default_fused_nn_batch_matmul                                                                                   25,580.32     2.70  opencl0     36                                                        float16[4, 192, 64], float16[4, 192, 64], float16[4, 192, 192]  06a857f19a78affd  
tvmgen_default_fused_nn_softmax                                                                                        11,022.54     1.16  opencl0     36                                                                      float32[1, 4, 192, 192], float32[1, 4, 192, 192]  2d2cd8e20e98cc5b  
tvmgen_default_fused_nn_batch_matmul_2                                                                                  2,835.36     0.30  opencl0      1                                                      float16[1, 192, 256], float16[1, 192, 256], float16[1, 192, 192]  3090cc1a0295312a  
tvmgen_default_fused_subtract_add_rsqrt_multiply_multiply_add_divide_erf_add_multiply_multiply__91861172529e67a9_       1,793.33     0.19  opencl0     36           float32[1, 192, 512], float32[1, 192, 1], float32[1, 192, 1], float32[512], float32[512], float16[192, 512]  99f46fc7dfbc6bfd  
tvmgen_default_fused_mean                                                                                               1,782.77     0.19  opencl0     36                                                                              float32[1, 192, 512], float32[1, 192, 1]  04198f5971a3f5c1  
tvmgen_default_fused_variance                                                                                           1,765.75     0.19  opencl0     36                                                          float32[1, 192, 512], float32[1, 192, 1], float32[1, 192, 1]  56ecc0ed1803987c  
tvmgen_default_fused_reshape_multiply_add_cast                                                                          1,714.21     0.18  opencl0     36                                           float16[4, 192, 192], float16[], float16[192, 192], float32[1, 4, 192, 192]  2ebfaeb1e38d2cbc  
tvmgen_default_fused_reshape_add_reshape_transpose                                                                      1,019.59     0.11  opencl0     18                                                            float16[192, 768], float16[768], float16[1, 4, 192, 64, 3]  4969482d21594d05  
tvmgen_default_fused_take_reshape_transpose                                                                               946.05     0.10  opencl0     18                                                               float16[1, 4, 192, 64, 3], int64[], float16[4, 64, 192]  3d1440e284e116e3  
tvmgen_default_fused_reshape_cast                                                                                         841.56     0.09  opencl0     36                                                                         float32[1, 4, 192, 192], float16[4, 192, 192]  0fc15a56f1397bb2  
tvmgen_default_fused_reshape_add_cast                                                                                     733.13     0.08  opencl0     36                                                                 float16[192, 512], float16[512], float32[1, 192, 512]  50f61d6cfd9655c4  
tvmgen_default_fused_take                                                                                                 456.43     0.05  opencl0     36                                                            float16[1, 4, 192, 64, 3], int64[], float16[1, 4, 192, 64]  bdcb7d6eb780ae63  
tvmgen_default_fused_nn_dense_1                                                                                           403.41     0.04  opencl0      2                                                                     float16[192, 2], float16[32, 2], float16[192, 32]  36dc743c363e1b2f  
tvmgen_default_fused_reshape_add_concatenate_reshape                                                                      384.28     0.04  opencl0     34                                              float16[192, 256], float16[256], float16[1, 192, 256], float16[192, 512]  88a2658d4b438265  
tvmgen_default_fused_reshape_add_reshape_transpose_reshape_transpose                                                      372.48     0.04  opencl0     18                                                                  float16[192, 256], float16[256], float16[4, 64, 192]  ff2e08744e10d0ed  
tvmgen_default_fused_reshape_add_add                                                                                      311.74     0.03  opencl0     34                                           float16[192, 256], float16[256], float16[1, 192, 256], float16[1, 192, 256]  d13ea778ab1d0b22  
tvmgen_default_fused_multiply_reshape_take_negative_expand_dims_take_expand_dims_concatenate_re_56c26eeeac2acb52_         287.16     0.03  opencl0     18         float16[1, 4, 192, 64], float16[1, 1, 192, 64], int64[], int64[], float16[1, 1, 192, 64], float16[4, 192, 64]  7bfb4b2f58b72290  
tvmgen_default_fused_multiply_reshape_take_negative_expand_dims_take_expand_dims_concatenate_re_b37aa94ae7f51895_         286.21     0.03  opencl0     18         float16[1, 4, 192, 64], float16[1, 1, 192, 64], int64[], int64[], float16[1, 1, 192, 64], float16[4, 192, 64]  200c573badd5d6aa  
tvmgen_default_fused_reshape_transpose_reshape                                                                            220.83     0.02  opencl0     36                                                                                float16[4, 192, 64], float16[192, 256]  e40d55a93d0386e1  
tvmgen_default_fused_transpose_reshape_transpose                                                                          173.43     0.02  opencl0     18                                                                           float16[1, 192, 4, 64], float16[4, 192, 64]  ca2ea32624b573a5  
tvmgen_default_fused_transpose_reshape                                                                                    171.46     0.02  opencl0     18                                                                           float16[1, 192, 4, 64], float16[4, 192, 64]  5e8056f97a04dc2c  
tvmgen_default_fused_nn_softmax_log_add_add_sigmoid_log_add_cast_add                                                      164.23     0.02  opencl0      1  float32[1, 192, 192], float32[1, 192, 192], float16[1], float16[1, 192, 1], float16[1, 1, 192], float32[1, 192, 192]  390ec76a5e8af46c  
tvmgen_default_fused_reshape_add_reshape                                                                                  128.73     0.01  opencl0     18                                                               float16[192, 256], float16[256], float16[1, 192, 4, 64]  4ceac3de29a171f2  
tvmgen_default_fused_nn_softmax_log                                                                                       120.99     0.01  opencl0      1                                                                            float32[1, 192, 192], float32[1, 192, 192]  45ca279ac9a1fd79  
tvmgen_default_fused_nn_dense_5                                                                                            30.22     0.00  opencl0      2                                                                   float16[192, 256], float16[1, 256], float16[192, 1]  d13b9546ba2def27  
tvmgen_default_fused_reshape_add_cast_concatenate_reshape_cast                                                             25.33     0.00  opencl0      2                                              float16[192, 256], float16[256], float32[1, 192, 256], float16[192, 512]  cb50c2a98e9cf1e6  
tvmgen_default_fused_cast                                                                                                  21.32     0.00  opencl0      2                                                                            float32[1, 192, 256], float16[1, 192, 256]  76ad18e0bd1e222b  
tvmgen_default_fused_reshape_add_add_reshape                                                                               18.72     0.00  opencl0      2                                              float16[192, 256], float16[256], float16[1, 192, 256], float16[192, 256]  f6ca35b33932779b  
tvmgen_default_fused_reshape_cos_expand_dims_sin_expand_dims_concatenate_expand_dims_concatenat_925c438b4a75d8e6_          17.84     0.00  opencl0      2                                                                           float16[192, 32], float16[2, 1, 1, 192, 64]  3e538fe9e3e636bc  
tvmgen_default_fused_reshape_add_divide                                                                                    17.52     0.00  opencl0      2                                                      float16[192, 256], float16[256], float16[], float16[1, 192, 256]  bcfdc9f621039655  
tvmgen_default_fused_reshape_cast_1                                                                                        16.77     0.00  opencl0      1                                                                            float16[1, 192, 192], float32[1, 192, 192]  bee91663beffe750  
tvmgen_default_fused_take_1                                                                                                16.35     0.00  opencl0      4                                                            float16[2, 1, 1, 192, 64], int64[], float16[1, 1, 192, 64]  e438271edd0ffa9e  
tvmgen_default_fused_cast_reshape                                                                                           4.08     0.00  opencl0      2                                                                                   float32[1, 192, 2], float16[192, 2]  5e9a837516eecf71  
tvmgen_default_fused_reshape_add_sigmoid_log_transpose                                                                      3.52     0.00  opencl0      1                                                                       float16[192, 1], float16[1], float16[1, 1, 192]  04bf883b5fa48a9b  
__nop                                                                                                                       0.00     0.00  opencl0     36                                                                               float16[1, 192, 256], float16[192, 256]  046b5744937869f9  
__nop                                                                                                                       0.00     0.00  opencl0      1                                                                                   float16[192, 1], float16[1, 192, 1]  679c7c526f85cf65  
----------                                                                                                                                                                                                                                                                                         
Sum                                                                                                                   946,292.17   100.00             834                                                                                                                                          
Total                                                                                                                 946,292.17           opencl0      1                                                                                                                                          
Configuration
-------------
Number of threads: 4
Executor: Graph
Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
  944.6400     944.6661     945.1847     943.6985      0.3578                  
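As a sanity check on these numbers, the achieved throughput of the heaviest kernel can be estimated from the shapes and times in the profile table above (a rough back-of-the-envelope sketch; the shapes and timings are copied from the `tvmgen_default_fused_nn_dense_3` row):

```python
# Rough GFLOPS estimate for the heaviest kernel in the profile above.
# From the tvmgen_default_fused_nn_dense_3 row:
#   data float16[192, 512] x weight float16[512, 512], 36 calls, 308,732.24 us total.
M, K, N = 192, 512, 512        # dense: [M, K] x [N, K]^T -> [M, N]
calls = 36
total_us = 308_732.24

flops_per_call = 2 * M * K * N                  # one multiply + one add per MAC
total_flops = flops_per_call * calls
gflops = total_flops / (total_us * 1e-6) / 1e9  # time in seconds, result in GFLOPS
print(f"achieved ~{gflops:.1f} GFLOPS")         # roughly 11-12 GFLOPS
```

That is far below what the GPU should be capable of on fp16 GEMMs, which suggests the untuned default schedules are the bottleneck rather than the hardware.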
Then I tried to auto-schedule it by running the following on the host machine:
tvmc -v tune --timeout 25 --rpc-tracker 127.0.0.1:9190 --rpc-key rk3588 --target="opencl -device=mali" --target-host="llvm -mtriple=aarch64-linux-gnu" --mixed-precision --tuning-records lightglue-192-addout-tvm-mixprec-cl-ansor.json --output lightglue-192-addout-tvm-mixprec-cl-ansor.json --enable-autoscheduler --log-estimated-latency --trials 10000 --early-stopping 700 ./modified_modified_superpoint_lightglue.c_192.onnx > lightglue-192-addout-tvm-mixprec-cl-ansor.log 2>lightglue-192-addout-tvm-mixprec-cl-ansor.err.log
However, the summary shows that only 4 of the 14 tasks were tuned successfully:
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                         vm_mod_fused_variance |            - |              - |    192 |
|    1 |                                         vm_mod_fused_nn_dense |            - |              - |     64 |
|    2 |  vm_mod_fused_nn_softmax_log_add_add_sigmoid_log_add_cast_add |            - |              - |     64 |
|    3 |                                   vm_mod_fused_nn_softmax_log |            - |              - |     64 |
|    4 |                                vm_mod_fused_nn_batch_matmul_1 |        0.380 |          49.61 |     64 |
|    5 |                                       vm_mod_fused_nn_softmax |        1.061 |           0.56 |     64 |
|    6 |                                  vm_mod_fused_nn_batch_matmul |        0.459 |          41.08 |     64 |
|    7 |                                       vm_mod_fused_nn_dense_5 |            - |              - |     64 |
|    8 |                                vm_mod_fused_nn_batch_matmul_2 |        0.429 |          43.95 |     64 |
|    9 |                                       vm_mod_fused_nn_dense_4 |            - |              - |    192 |
|   10 |                                       vm_mod_fused_nn_dense_3 |            - |              - |    192 |
|   11 |                                       vm_mod_fused_nn_dense_2 |            - |              - |    320 |
|   12 |                                       vm_mod_fused_nn_dense_1 |            - |              - |     64 |
|   13 |                                             vm_mod_fused_mean |            - |              - |    128 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: - ms	Trials: 1408	Used time : 10564 s	Next ID: 13	
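To see exactly which measurements failed without grepping the full log, the tuning-records file can be filtered line by line. This is a sketch assuming the standard Ansor JSON log layout, where each record's "r" field is [costs, error_no, all_cost, timestamp] and error_no == 0 means success; the two sample lines below are fabricated for illustration:

```python
import json

# Fabricated sample lines in the Ansor tuning-records format (for illustration only).
sample_log = [
    '{"i": [["task_a"], []], "r": [[0.00038], 0, 1.2, 1702600000], "v": "v0.6"}',
    '{"i": [["task_b"], []], "r": [[1e10], 2, 0.5, 1702600001], "v": "v0.6"}',
]

ok, failed = 0, 0
for line in sample_log:
    record = json.loads(line)
    error_no = record["r"][1]  # nonzero means the measurement errored out
    if error_no == 0:
        ok += 1
    else:
        failed += 1

print(f"succeeded: {ok}, failed: {failed}")
```

In practice one would iterate over the real `lightglue-192-addout-tvm-mixprec-cl-ansor.json` file instead of the in-memory samples.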
The other tasks simply errored out with the following errors:
A.
results: MeasureResult(error_type:InstantiationError, error_msg:Traceback (most recent call last):
  File "/home/zt/.conda/envs/py3.10/lib/python3.10/site-packages/tvm/auto_scheduler/measure.py", line 619, in _local_build_worker
    sch, args = task.compute_dag.apply_steps_from_state(
  File "/home/zt/.conda/envs/py3.1
...
1: _ZZN3tvm3tir11ExprFunctorI
  0: tvm::auto_scheduler::IndexRewriter::VisitExpr_(tvm::tir::ProducerLoadNode const*)
  File "/workspace/tvm/src/auto_scheduler/compute_dag.cc", line 764
InternalError: Check failed: (name_it != name_to_arg.end()) is false: 
It looks like this bug has been in the codebase for a long time. See apache/tvm/issues/10369.
B.
results: MeasureResult(error_type:RuntimeDeviceError, error_msg:Traceback (most recent call last):
  File "/home/zt/.conda/envs/py3.10/lib/python3.10/site-packages/tvm/auto_scheduler/measure.py", line 1145, in _rpc_run
    costs = time_f(*loc_args).results
  File "/home/zt/.conda/envs/py3.10/lib/python3.10/site-package
...
vm/src/runtime/rpc/rpc_endpoint.cc", line 390
RPCError: Error caught from RPC call:
[22:49:25] /home/firefly/tvm/src/runtime/opencl/opencl_common.h:518: InternalError: Check failed: (e == CL_SUCCESS) is false: OpenCL Error, code=-6: CL_OUT_OF_HOST_MEMORY
I'm not sure whether this is a TVM bug or a bug in the Mali OpenCL driver. The error is accompanied by a Linux kernel error:
[16474.201042] mali fb000000.gpu: Ctx 186878_753 Group 1 CSG 0 CSI: 0
               CS_FATAL.EXCEPTION_TYPE: 0x40 (CS_CONFIG_FAULT)
               CS_FATAL.EXCEPTION_DATA: 0x0
               CS_FATAL_INFO.EXCEPTION_DATA: 0x0
TVM was built from the latest code on the git main branch.
The log files and ONNX model can be downloaded here:
Thanks for all the effort put into this great software!