LLVM error when deploying a MobileNet-based model on Raspberry Pi 3B

I am trying to benchmark and deploy a MobileFaceNet model trained with InsightFace on a Raspberry Pi 3B, but the following error occurs:

```
/Users/yujinke/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
[19:08:43] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.2.0. Attempting to upgrade...
[19:08:43] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 3, 112, 112, 'float32'), (64, 3, 3, 3, 'float32'), (2, 2), (1, 1), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 128, 28, 28, 'float32'), (64, 128, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 64, 28, 28, 'float32'), (128, 64, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 64, 28, 28, 'float32'), (256, 64, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 128, 14, 14, 'float32'), (256, 128, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 128, 14, 14, 'float32'), (512, 128, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 128, 7, 7, 'float32'), (256, 128, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 256, 7, 7, 'float32'), (128, 256, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('conv2d', (1, 128, 7, 7, 'float32'), (512, 128, 1, 1, 'float32'), (1, 1), (0, 0), (1, 1), 'NCHW', 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('depthwise_conv2d_nchw', (1, 64, 56, 56, 'float32'), (64, 1, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('depthwise_conv2d_nchw', (1, 128, 28, 28, 'float32'), (128, 1, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('depthwise_conv2d_nchw', (1, 256, 14, 14, 'float32'), (256, 1, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('depthwise_conv2d_nchw', (1, 256, 7, 7, 'float32'), (256, 1, 3, 3, 'float32'), (1, 1), (1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
WARNING:autotvm:Cannot find config for target=llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon, workload=('depthwise_conv2d_nchw', (1, 512, 7, 7, 'float32'), (512, 1, 7, 7, 'float32'), (1, 1), (0, 0), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
LLVM ERROR: Cannot select: 0x7fdadfc7b220: ch = br_cc 0x7fdae10204d0, setgt:ch, 0x7fdadfbc3070, 0x7fdadfbc1508, BasicBlock:ch<if_end 0x7fdadeaca780>
  0x7fdadfbc3070: v4f32 = fadd 0x7fdadfc7b4f8, 0x7fdadfc77700
    0x7fdadfc7b4f8: v4f32,ch = CopyFromReg 0x7fdae40036c0, Register:v4f32 %37
      0x7fdadfc927b8: v4f32 = Register %37
    0x7fdadfc77700: v4f32,ch = CopyFromReg 0x7fdae40036c0, Register:v4f32 %11
      0x7fdadfc9ba20: v4f32 = Register %11
  0x7fdadfbc1508: v4f32 = bitcast 0x7fdae1021308
    0x7fdae1021308: v4i32 = ARMISD::VMOVIMM TargetConstant:i32<0>
      0x7fdadfc9b3a0: i32 = TargetConstant<0>
In function: __tvm_parallel_lambda.63
```

My Code:

```python
import tvm
import nnvm.compiler
import nnvm.testing
import numpy as np
import mxnet as mx
from mxnet import ndarray as nd
from tvm import rpc
from tvm.contrib import util, graph_runtime as runtime
from tvm.contrib.util import tempdir
from util import get_network, print_progress

prefix, epoch = "model_mfn", 0
# prefix, epoch = "mneti", 0
sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
image_size = (112, 112)
# image_size = (-1, -1)
opt_level = 3
shape_dict = {'data': (1, 3, *image_size)}
# target = tvm.target.create("llvm -mcpu=broadwell")
target = tvm.target.arm_cpu('rasp3b')

nnvm_sym, nnvm_params = nnvm.frontend.from_mxnet(sym, arg_params, aux_params)
with nnvm.compiler.build_config(opt_level=opt_level):
    graph, lib, params = nnvm.compiler.build(nnvm_sym, target, shape_dict, params=nnvm_params)

lib.export_library("./deploy_lib.tar")
print('lib succeeded')
with open("deploy_graph.json", "w") as fo:
    fo.write(graph.json())
with open("deploy_param.params", "wb") as fo:
    fo.write(nnvm.compiler.save_param_dict(params))

local_demo = False
if local_demo:
    remote = rpc.LocalSession()
else:
    # The following is my environment; change this to the IP address of your target device
    host = '192.168.2.105'
    # host = '172.19.0.12'
    port = 9090
    remote = rpc.connect(host, port)

# upload the library to the remote device and load it
lib_fname = 'deploy_lib.tar'
remote.upload(lib_fname)
rlib = remote.load_module('deploy_lib.tar')

# upload the parameters (this may take a while)
ctx = remote.cpu(0)
rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}

# create the remote runtime module
module = runtime.create(graph, rlib, ctx)

# set parameters
module.set_input(**rparams)

# set input data and benchmark
network = "emore1"
print("load succeeded. input size [1 3 112 112]")
print("%-20s %-19s (%s)" % ("name", "mean", "-+"))

module.set_input('data', tvm.nd.array(
    np.zeros(shape=(1, 3, image_size[0], image_size[1]), dtype=np.float32)))
repeat = 10
print_progress("%-20s evaluating..." % network)
ftimer = module.module.time_evaluator("run", ctx, number=1, repeat=repeat)
prof_res = np.array(ftimer().results) * 1000  # multiply by 1000 to convert to milliseconds
print("%-20s %-19s (%s)" % (network, "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))
```

However, if I set opt_level to 0, it compiles and runs, but the model is extremely slow on the ARM device.

Both the MobileFaceNet and MobileNet models from InsightFace fail to compile with TVM for ARM devices.
Any help is appreciated!

When you set opt_level to 0, TVM performs no graph optimizations (such as operator fusion); it just runs inference on the unoptimized graph, which is why it is slow.
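For illustration, a minimal sketch against the script above, reusing the same `nnvm_sym`, `target`, and `shape_dict`:

```python
# opt_level=0: graph-level optimizations such as operator fusion are skipped,
# so the build succeeds but the resulting module runs slowly.
with nnvm.compiler.build_config(opt_level=0):
    graph, lib, params = nnvm.compiler.build(
        nnvm_sym, target, shape_dict, params=nnvm_params)

# opt_level=3: the full set of graph passes (fusion, scale-axis folding,
# layout transforms) is enabled, which is where the speedup comes from.
with nnvm.compiler.build_config(opt_level=3):
    graph, lib, params = nnvm.compiler.build(
        nnvm_sym, target, shape_dict, params=nnvm_params)
```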

The error means NNVM cannot lower the model to low-level code. You may try Relay instead.
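A minimal sketch of the same compile flow through Relay; exact `from_mxnet` / `build` signatures vary across TVM versions, so treat this as a starting point rather than a drop-in replacement:

```python
import tvm
import mxnet as mx
from tvm import relay

# Load the same MXNet checkpoint as in the NNVM script above.
sym, arg_params, aux_params = mx.model.load_checkpoint("model_mfn", 0)
shape_dict = {"data": (1, 3, 112, 112)}

# Depending on the TVM version, from_mxnet returns a Function or a Module.
net, relay_params = relay.frontend.from_mxnet(
    sym, shape=shape_dict, arg_params=arg_params, aux_params=aux_params)

target = tvm.target.arm_cpu("rasp3b")
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(net, target, params=relay_params)

lib.export_library("./deploy_lib_relay.tar")
```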

As for the warnings: they say AutoTVM cannot find a tuned config for the network's workloads, so it falls back to default schedules. You have to auto-tune the network yourself to get better performance.
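A hedged sketch of what that tuning loop looks like for the rasp3b target, following the pattern of the tune_relay_arm tutorial; the device key "rasp3b", tracker address, and port 9190 below are assumptions about your RPC setup:

```python
from tvm import autotvm, relay

# Extract tunable tasks from the Relay program built above.
tasks = autotvm.task.extract_from_program(
    net, target=target, params=relay_params, ops=(relay.op.nn.conv2d,))

# Measurements run on the real board through an RPC tracker; the key and
# tracker address are placeholders for your environment.
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("rasp3b", host="0.0.0.0", port=9190,
                             number=5, timeout=10))

log_file = "mfn_arm.log"
for tsk in tasks:
    tuner = autotvm.tuner.XGBTuner(tsk)
    tuner.tune(n_trial=min(1000, len(tsk.config_space)),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file(log_file)])

# Re-compile with the best configs found, so no fallback warnings remain.
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build(net, target, params=relay_params)
```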

When auto-tuning on an x86 CPU with the code from tune_relay_x86.py in the TVM tutorials: my TensorFlow inference model has ordinary conv2d layers and uses MobileNetV2 as the backbone, which relies on depthwise conv. I want to optimize both conv2d and depthwise conv. What should the ops argument be set to in tasks = autotvm.task.extract_from_program(net, target=target, params=params, ops=(relay.op.nn.conv2d,))? Thank you very much!

I got it. In relay_integration.py there is:

```python
OP2TOPI = {
    tvm.relay.op.nn.conv2d: [topi.nn.conv2d, topi.nn.depthwise_conv2d_nchw, ...],
    ...
}
```

so tvm.relay.op.nn.conv2d covers both topi.nn.conv2d and topi.nn.depthwise_conv2d_nchw.
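A quick way to confirm this is to list the workload names of the extracted tasks (net, target, and params as in the tutorial):

```python
from tvm import autotvm, relay

# Passing only relay.op.nn.conv2d still yields both normal and depthwise
# tasks, because OP2TOPI maps that op to several TOPI implementations.
tasks = autotvm.task.extract_from_program(
    net, target=target, params=params, ops=(relay.op.nn.conv2d,))
for tsk in tasks:
    print(tsk.workload[0])  # prints 'conv2d' or 'depthwise_conv2d_nchw'
```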
A follow-up question: in tune_kernels from tune_relay_x86.py there are the following lines:
```python
for i, tsk in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

    # converting conv2d tasks to conv2d_NCHWc tasks
    op_name = tsk.workload[0]
    if op_name == 'conv2d':
        func_create = 'topi_x86_conv2d_NCHWc'
    elif op_name == 'depthwise_conv2d_nchw':
        func_create = 'topi_x86_depthwise_conv2d_NCHWc_from_nchw'
    else:
        raise ValueError("Tuning {} is not supported on x86".format(op_name))
```

Does the else branch mean that, so far, only two ops (conv2d and depthwise_conv2d_nchw) support auto-tuning on x86 CPUs?

You can look in the x86 topi directory to see a comprehensive list of operators that are overridden for autotuning. Many ops do not need to be tuned because they are fused (e.g., batchnorm, relu) or because they have a very small impact on performance in many models (e.g., dense).

Thank you very much! Does TVM support auto-tuning DeepLabV3 to make inference faster than TensorRT? DeepLabV3 uses dilated convolutions and Atrous Spatial Pyramid Pooling (ASPP). If it does, how should the ops argument of tasks = autotvm.task.extract_from_program(net, target=target, params=params, ops=(relay.op.nn.conv2d,)) be set? Thanks again!