The current version of TVM cannot find the configuration for conv2d

Hello.

On an RK3399 board, I found a performance regression when running inference with the VGG-16 model.

Performance was measured using the test code below.

import tvm
import tvm.relay as relay
from tvm.contrib import graph_runtime
import numpy as np
import tvm.relay.testing  # provides relay.testing.vgg

target_arm_cpu = tvm.target.create('llvm -device=arm_cpu -target=aarch64-linux-gnu')
ctx_arm_cpu = tvm.cpu()
dtype='float32'
batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)
mod, paramsO = relay.testing.vgg.get_workload(
    num_layers=16, batch_size=batch_size, image_shape=image_shape)
opt_level = 3

# arm_cpu
with relay.build_config(opt_level=opt_level):
    graph, lib, params = relay.build_module.build(mod, target_arm_cpu, params=paramsO)

data = tvm.nd.array(np.random.uniform(-1, 1, size=data_shape).astype("float32"), ctx_arm_cpu)
module = graph_runtime.create(graph, lib, ctx_arm_cpu)
module.set_input("data", data)
module.set_input(**params)
module.run()
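
The script above only builds the model and calls module.run() once; the mean/std numbers below can be reproduced with a time evaluator along the lines of the sketch below (the number/repeat values here are assumptions, not necessarily what was used for the measurements):

# Time the whole graph with the graph runtime's time evaluator
# (number/repeat are assumed values for illustration).
ftimer = module.module.time_evaluator("run", ctx_arm_cpu, number=1, repeat=10)
prof_res = np.array(ftimer().results) * 1000  # seconds -> milliseconds
print("Mean inference time (std dev): %.2f ms (%.2f ms)" % (np.mean(prof_res), np.std(prof_res)))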

When running VGG-16 on the ARM CPU with the current version of TVM, the performance is as follows.

Mean inference time (std dev): 1892.25 ms (2.20 ms)

and with the old TVM version it is

Mean inference time (std dev): 989.96 ms (0.80 ms)

The performance gap between the new and old versions is too large, nearly 2x slower.

I suspect the new version of TVM does not find the pre-tuned config for VGG-16. Below is the log from compiling the VGG-16 model with Relay on the current version.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 112, 112), 'float32'), ('TENSOR', (128, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 112, 112), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (256, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (512, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (1000, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (4096, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 25088), 'float32'), ('TENSOR', (4096, 25088), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.

And below is the log from compiling VGG-16 with the old TVM version.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 4096, 'float32'), (1000, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 4096, 'float32'), (4096, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 25088, 'float32'), (4096, 25088, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.

As you can see from the logs, the old version of TVM produces no fallback warnings for conv2d, while the new version falls back for every conv2d workload.

I think the current version of TVM cannot find the conv2d configs, and that seems to be the cause of the performance degradation. Is this intended, or is it an internal TVM problem?
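
As a possible workaround until this is resolved, the conv2d tasks can be tuned locally with AutoTVM and the resulting log applied at build time. Below is a rough sketch following the standard AutoTVM flow; the log file name "vgg16-arm.log" is hypothetical, and the tuning loop itself (e.g. with a tuner such as XGBTuner) is omitted:

from tvm import autotvm

# Extract the tunable conv2d/dense tasks from the VGG-16 module.
tasks = autotvm.task.extract_from_program(mod["main"], target=target_arm_cpu, params=paramsO)
for task in tasks:
    print(task.name, task.workload)  # should match the workloads in the warnings above

# After tuning the tasks and writing the best configs to "vgg16-arm.log" (hypothetical name),
# apply them so the fallback warnings disappear at build time.
with autotvm.apply_history_best("vgg16-arm.log"):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build_module.build(mod, target_arm_cpu, params=paramsO)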