[ Bug ] ARM CPU performance of the new version of TVM is much lower than the old version

Hello!

I am using two rk3399 boards to measure VGG-16 inference performance, one with an old version of TVM and one with the current version. The setups are:

rk3399 device 1 -> old version of TVM, Ubuntu 16.04 + LLVM 8.0.0
rk3399 device 2 -> new version of TVM, Ubuntu 18.04 + LLVM 8.0.0

I tested both devices with the same code:

import tvm
import tvm.relay as relay
#from tvm.contrib import graph_runtime
from tvm.contrib.debugger import debug_runtime as graph_runtime
import numpy as np
import topi
from tvm.relay.testing.temp_op_attr import TempOpAttr

target_arm_cpu = tvm.target.create('llvm -device=arm_cpu -target=aarch64-linux-gnu')
ctx_arm_cpu =  tvm.cpu()
dtype='float32'
batch_size = 1
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)
mod, paramsO = relay.testing.vgg.get_workload(
    num_layers=16, batch_size=batch_size, image_shape=image_shape)
opt_level = 3

# Build VGG-16 for the ARM CPU target
with relay.build_config(opt_level=opt_level):
    graph, lib, params = relay.build_module.build(mod, target_arm_cpu, params=paramsO)

# Random input tensor placed on the CPU context
data = tvm.nd.array(np.random.uniform(-1, 1, size=data_shape).astype(dtype), ctx_arm_cpu)
module = graph_runtime.create(graph, lib, ctx_arm_cpu)
module.set_input("data", data)
module.set_input(**params)
module.run()

The results are below.

rk3399 device 1 performance: Mean inference time (std dev): 989.96 ms (0.80 ms)
rk3399 device 2 performance: Mean inference time (std dev): 1961.32 ms (2.55 ms)
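
To put a number on the gap, here is a small sketch that computes the slowdown from the two mean times reported above. The variable names are only for illustration.

```python
# Mean inference times reported above, in milliseconds.
old_ms = 989.96   # rk3399 device 1, old TVM
new_ms = 1961.32  # rk3399 device 2, new TVM

# Ratio of new to old mean time; values above 1 mean the new build is slower.
slowdown = new_ms / old_ms
print(f"new TVM is {slowdown:.2f}x slower than old TVM")
```

So the new build takes roughly twice as long per inference on this workload.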

I think the new version of TVM fails to find the tuning configurations. Looking at the logs below, the new TVM and the old TVM report different missing configurations.

[New Version TVM when compile vgg-16]
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 224, 224), 'float32'), ('TENSOR', (64, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 64, 112, 112), 'float32'), ('TENSOR', (128, 64, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 112, 112), 'float32'), ('TENSOR', (128, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 128, 56, 56), 'float32'), ('TENSOR', (256, 128, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 56, 56), 'float32'), ('TENSOR', (256, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 256, 28, 28), 'float32'), ('TENSOR', (512, 256, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 28, 28), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_spatial_pack.arm_cpu', ('TENSOR', (1, 512, 14, 14), 'float32'), ('TENSOR', (512, 512, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (1000, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (4096, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 25088), 'float32'), ('TENSOR', (4096, 25088), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.

and old one is …

[ old version of tvm when compile vgg-16 ]
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 4096, 'float32'), (1000, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 4096, 'float32'), (4096, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.

Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 25088, 'float32'), (4096, 25088, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.

As the logs show, the fallback configuration warning for conv2d does not appear with the old version of TVM, but it does appear for every conv2d workload with the new version.

Is this an internal issue in TVM?
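
For comparing the two logs more systematically, a minimal sketch like the following could count the "Cannot find config" warnings per workload name. The helper name `count_fallbacks` and the regular expression are my own assumptions for illustration, not part of TVM; the sample lines are copied from the logs above.

```python
import re
from collections import Counter

# Captures the first quoted string after "workload=(", i.e. the workload name.
WORKLOAD_NAME = re.compile(r"workload=\('([^']+)'")

def count_fallbacks(log_text):
    """Count 'Cannot find config' warnings per workload name (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Cannot find config" not in line:
            continue
        match = WORKLOAD_NAME.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Three warnings copied from the logs above: two from new TVM, one from old TVM.
sample_log = """\
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('conv2d_nchw_winograd.arm_cpu', ('TENSOR', (1, 3, 224, 224), 'float32'), ('TENSOR', (64, 3, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense_nopack.x86', ('TENSOR', (1, 4096), 'float32'), ('TENSOR', (1000, 4096), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
Cannot find config for target=llvm -device=arm_cpu -target=aarch64-linux-gnu, workload=('dense', (1, 4096, 'float32'), (1000, 4096, 'float32'), 0, 'float32'). A fallback configuration is used, which may bring great performance regression.
"""

fallbacks = count_fallbacks(sample_log)
for name, n in sorted(fallbacks.items()):
    print(name, n)
```

Running this over each full compile log separately would make it easy to see which workload names only fall back in the new version.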


Hi there. I think your problem might be related to the one described here: [BUG][ARM] Significant performance degradation of execution times between TVM revisions

Maybe it’s too late to answer, but I hope your problem has already been solved!