Auto-tuning my own ONNX model with Android RPC fails at the "Evaluate" step

I was trying to speed up my model on an Android device, but the run failed at the “Evaluate” step.

First, I tested the tune_relay_arm.py script following the tutorial instructions. (I changed “n_trial” to 1 because I just wanted a quick check that the whole pipeline works.)
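
For reference, my tuning options looked roughly like this (a sketch based on the TVM 0.7 tutorial defaults; the log file name and device key are placeholders for my setup):

from tvm import autotvm

tuning_option = {
    "log_filename": "mymodel.log",  # placeholder log file
    "tuner": "xgb",
    "n_trial": 1,  # reduced from the tutorial default for a quick check
    "early_stopping": None,
    "measure_option": autotvm.measure_option(
        # cross-compile with the Android NDK toolchain
        builder=autotvm.LocalBuilder(build_func="ndk"),
        # run measurements on the phone through the RPC tracker
        runner=autotvm.RPCRunner(
            "android",  # placeholder device key registered with the tracker
            host="0.0.0.0",
            port=9190,
            number=5,
            timeout=10,
        ),
    ),
}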

Tuning...
[Task  1/28]  Current/Best:    2.21/   2.21 GFLOPS | Progress: (1/1) | 1.15 s Done.
[Task  2/28]  Current/Best:    3.09/   3.09 GFLOPS | Progress: (1/1) | 1.09 s Done.
[Task  3/28]  Current/Best:   16.69/  16.69 GFLOPS | Progress: (1/1) | 1.13 s Done.
[Task  4/28]  Current/Best:    1.59/   1.59 GFLOPS | Progress: (1/1) | 1.45 s Done.
[Task  5/28]  Current/Best:    6.57/   6.57 GFLOPS | Progress: (1/1) | 1.00 s Done.
[Task  6/28]  Current/Best:    3.51/   3.51 GFLOPS | Progress: (1/1) | 1.09 s Done.
[Task  7/28]  Current/Best:   15.27/  15.27 GFLOPS | Progress: (1/1) | 0.82 s Done.
[Task  8/28]  Current/Best:    3.67/   3.67 GFLOPS | Progress: (1/1) | 1.08 s Done.
[Task  9/28]  Current/Best:   11.11/  11.11 GFLOPS | Progress: (1/1) | 1.26 s Done.
[Task 10/28]  Current/Best:   16.60/  16.60 GFLOPS | Progress: (1/1) | 1.11 s Done.
[Task 11/28]  Current/Best:   25.53/  25.53 GFLOPS | Progress: (1/1) | 2.65 s Done.
[Task 12/28]  Current/Best:   18.67/  18.67 GFLOPS | Progress: (1/1) | 3.23 s Done.
[Task 13/28]  Current/Best:   13.43/  13.43 GFLOPS | Progress: (1/1) | 1.67 s Done.
[Task 14/28]  Current/Best:   23.99/  23.99 GFLOPS | Progress: (1/1) | 0.88 s Done.
[Task 15/28]  Current/Best:    5.25/   5.25 GFLOPS | Progress: (1/1) | 1.71 s Done.
[Task 16/28]  Current/Best:   23.17/  23.17 GFLOPS | Progress: (1/1) | 2.63 s Done.
[Task 17/28]  Current/Best:   26.99/  26.99 GFLOPS | Progress: (1/1) | 1.11 s Done.
[Task 18/28]  Current/Best:   11.99/  11.99 GFLOPS | Progress: (1/1) | 1.41 s Done.
[Task 19/28]  Current/Best:    9.74/   9.74 GFLOPS | Progress: (1/1) | 0.95 s Done.
[Task 20/28]  Current/Best:   11.01/  11.01 GFLOPS | Progress: (1/1) | 2.07 s Done.
[Task 21/28]  Current/Best:   22.91/  22.91 GFLOPS | Progress: (1/1) | 2.91 s Done.
[Task 22/28]  Current/Best:    6.02/   6.02 GFLOPS | Progress: (1/1) | 2.94 s Done.
[Task 23/28]  Current/Best:    5.32/   5.32 GFLOPS | Progress: (1/1) | 2.58 s Done.
[Task 24/28]  Current/Best:   12.52/  12.52 GFLOPS | Progress: (1/1) | 1.05 s Done.
[Task 25/28]  Current/Best:    5.78/   5.78 GFLOPS | Progress: (1/1) | 2.29 s Done.
[Task 26/28]  Current/Best:   19.51/  19.51 GFLOPS | Progress: (1/1) | 3.51 s Done.
[Task 27/28]  Current/Best:    8.36/   8.36 GFLOPS | Progress: (1/1) | 1.45 s Done.
[Task 28/28]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (1/1) | 10.12 s Done.
Compile...
Cannot find config for target=llvm -keys=arm_cpu,cpu -device=arm_cpu -mtriple=aarch64-linux-android, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Upload...
Evaluate inference time cost...
Mean inference time (std dev): 165.64 ms (2.00 ms)

Then I redefined the function get_network(name, batch_size):

import onnx
from tvm import relay

def get_network(network_path):
    """Load an ONNX model and convert it to a Relay module."""
    input_shape = (1, 1, 125, 125)
    output_shape = (1, 1, 484, 484)

    input_name = "input.1"  # the name of the network's input tensor
    shape_dict = {input_name: input_shape}
    dtype_dict = {input_name: "float32"}
    try:
        onnx_model = onnx.load(network_path)
    except Exception:
        raise ValueError("Unsupported network: " + network_path)

    mod, params = relay.frontend.from_onnx(onnx_model, shape=shape_dict, dtype=dtype_dict)
    return mod, params, input_shape, output_shape
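
For context, this is roughly how I wire it into the tutorial's tune_and_evaluate() (a sketch; the model path is a placeholder and the op filter follows the tutorial):

import tvm
from tvm import autotvm, relay

# matches the target shown in the logs
target = tvm.target.Target("llvm -device=arm_cpu -mtriple=aarch64-linux-android")

mod, params, input_shape, _ = get_network("mymodel.onnx")  # placeholder path
tasks = autotvm.task.extract_from_program(
    mod["main"],
    target=target,
    params=params,
    ops=(relay.op.get("nn.conv2d"),),
)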

After that, the run failed with the errors below :sob::

Extract tasks...
Tuning...
[Task  1/ 9]  Current/Best:    3.79/   3.79 GFLOPS | Progress: (1/1) | 1.41 s Done.
[Task  2/ 9]  Current/Best:    1.93/   1.93 GFLOPS | Progress: (1/1) | 0.99 s Done.
[Task  3/ 9]  Current/Best:    0.98/   0.98 GFLOPS | Progress: (1/1) | 1.32 s Done.
[Task  4/ 9]  Current/Best:    0.82/   0.82 GFLOPS | Progress: (1/1) | 1.51 s Done.
[Task  5/ 9]  Current/Best:    0.87/   0.87 GFLOPS | Progress: (1/1) | 1.81 s Done.
[Task  6/ 9]  Current/Best:   16.13/  16.13 GFLOPS | Progress: (1/1) | 0.88 s Done.
[Task  7/ 9]  Current/Best:    0.89/   0.89 GFLOPS | Progress: (1/1) | 1.05 s Done.
[Task  8/ 9]  Current/Best:    3.90/   3.90 GFLOPS | Progress: (1/1) | 0.97 s Done.
[Task  9/ 9]  Current/Best:    2.96/   2.96 GFLOPS | Progress: (1/1) | 0.85 s Done.
Compile...
Cannot find config for target=llvm -keys=arm_cpu,cpu -device=arm_cpu -mtriple=aarch64-linux-android, workload=('conv2d_transpose_nchw.arm_cpu', ('TENSOR', (1, 32, 121, 121), 'float32'), ('TENSOR', (32, 1, 9, 9), 'float32'), (4, 4), (3, 3, 3, 3), 'float32', (1, 1)). A fallback configuration is used, which may bring great performance regression.
Upload...
(1, 1, 125, 125)
Evaluate inference time cost...
Traceback (most recent call last):
  File "Autotuningplay.py", line 184, in <module>
    tune_and_evaluate(tuning_option)
  File "Autotuningplay.py", line 173, in tune_and_evaluate
    prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
  File "/home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/python/tvm/runtime/module.py", line 226, in evaluator
    blob = feval(*args)
  File "/home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/build/libtvm.so(TVMFuncCall+0x65) [0x7ff8882659a5]
  [bt] (4) /home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/build/libtvm.so(+0x115a063) [0x7ff8882df063]
  [bt] (3) /home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/build/libtvm.so(tvm::runtime::RPCWrappedFunc::operator()(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*) const+0x3c5) [0x7ff8882e2255]
  [bt] (2) /home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/build/libtvm.so(tvm::runtime::RPCClientSession::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)> const&)+0x57) [0x7ff8882d4657]
  [bt] (1) /home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/build/libtvm.so(tvm::runtime::RPCEndpoint::CallFunc(void*, TVMValue const*, int const*, int, std::function<void (tvm::runtime::TVMArgs)>)+0x2a3) [0x7ff8882cc1e3]
  [bt] (0) /home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/build/libtvm.so(+0x1144182) [0x7ff8882c9182]
  File "/home/wenjun/Documents/Dev/apache-tvm-src-v0.7.0.rc0-incubating/src/runtime/rpc/rpc_endpoint.cc", line 807
TVMError: Check failed: code == RPCCode::kReturn: code=1
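
For reference, the failing line sits in the tutorial's standard evaluation block; in my script it looks roughly like this (a sketch adapted from tune_relay_arm.py for TVM 0.7, with device_key, tmp, filename, and target defined earlier as in the tutorial):

import numpy as np
import tvm
from tvm import autotvm
import tvm.contrib.graph_runtime as runtime

# upload the compiled library to the device through the RPC tracker
remote = autotvm.measure.request_remote(device_key, "0.0.0.0", 9190, timeout=10000)
remote.upload(tmp.relpath(filename))
rlib = remote.load_module(filename)

# create the graph runtime on the remote device and feed a random input
ctx = remote.context(str(target), 0)
module = runtime.GraphModule(rlib["default"](ctx))
data_tvm = tvm.nd.array(np.random.uniform(size=input_shape).astype("float32"))
module.set_input("input.1", data_tvm)

# time the "run" function remotely; this is the call that raises the RPC error
ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=10)
prof_res = np.array(ftimer().results) * 1000  # convert to millisecond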

Is there any difference between the “mod, params” generated by relay.frontend.from_onnx and those generated by relay.testing?
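
To compare the two, I'm printing both modules side by side (a minimal sketch; resnet-18 from relay.testing is just an example, and the model path is a placeholder):

from tvm.relay import testing

# module from relay.testing, as in the original tutorial
mod_t, params_t = testing.resnet.get_workload(num_layers=18, batch_size=1, dtype="float32")
print(mod_t["main"])

# module from the ONNX frontend via my redefined get_network() above
mod_o, params_o, _, _ = get_network("mymodel.onnx")  # placeholder path
print(mod_o["main"])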