https://github.com/apache/incubator-tvm/pull/5233 causes a performance regression on x86.
I created a minimal repro containing the first convolution layer of ResNet:
import numpy as np
import tvm
import tvm.relay.testing
from tvm import relay, autotvm
from tvm.contrib.debugger import debug_runtime

batch_size = 1
in_channel = 3
out_channel = 64
in_height, in_width = 224, 224
kernel_size = (7, 7)
padding = (3, 3, 3, 3)
strides = (2, 2)
dilation = (1, 1)
dshape = (batch_size, in_channel, in_height, in_width)
wshape = (out_channel, in_channel) + kernel_size
target = "llvm -mcpu=skylake-avx512"
dtype = "float32"

# Build a single conv2d (the first layer of ResNet) as a Relay workload.
data = relay.var("data", shape=dshape, dtype=dtype)
weight = relay.var("weight", shape=wshape, dtype=dtype)
net = relay.nn.conv2d(data, weight=weight, channels=out_channel,
                      kernel_size=kernel_size, strides=strides,
                      padding=padding, dilation=dilation)
out = relay.Function(relay.analysis.free_vars(net), net)
net, params = relay.testing.create_workload(out)

# Compile with fallback schedules (no autotvm tuning logs) at opt_level=3.
with autotvm.FallbackContext():
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build_module.build(
            net, target=target, params=params)

# Run under the debug runtime to get per-node timings.
ctx = tvm.cpu()
data_tvm = tvm.nd.array(np.random.uniform(size=dshape).astype(dtype))
module = debug_runtime.create(graph, lib, ctx)
module.set_input('data', data_tvm)
module.set_input(**params)
module.run()
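For reference, an end-to-end wall-clock number can be collected alongside the per-node profile. This is a minimal sketch, not part of the original repro; it uses graph_runtime's standard time_evaluator, and the number/repeat values are arbitrary choices:

from tvm.contrib import graph_runtime

# Measure end-to-end latency to corroborate the debug runtime's
# per-node numbers (which add per-node instrumentation overhead).
gmod = graph_runtime.create(graph, lib, ctx)
gmod.set_input('data', data_tvm)
gmod.set_input(**params)
ftimer = gmod.module.time_evaluator("run", ctx, number=100, repeat=3)
res_us = np.array(ftimer().results) * 1e6  # seconds -> microseconds
print("mean %.2f us, std %.2f us" % (res_us.mean(), res_us.std()))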
Before this PR, the debug runtime reports:
Node Name                      Ops                            Time(us)  Time(%)  Shape                 Inputs  Outputs
---------                      ---                            --------  -------  -----                 ------  -------
fused_layout_transform_1       fused_layout_transform_1       186.86    63.2     (1, 64, 112, 112)     1       1
fused_nn_contrib_conv2d_NCHWc  fused_nn_contrib_conv2d_NCHWc  99.964    33.81    (1, 2, 112, 112, 32)  2       1
fused_layout_transform_2       fused_layout_transform_2       8.84      2.99     (1, 1, 224, 224, 3)   1       1
Total_time                     -                              295.664   -        -                     -       -
After this PR:
Node Name                      Ops                            Time(us)  Time(%)  Shape                 Inputs  Outputs
---------                      ---                            --------  -------  -----                 ------  -------
fused_nn_contrib_conv2d_NCHWc  fused_nn_contrib_conv2d_NCHWc  281.12    58.841   (1, 2, 112, 112, 32)  2       1
fused_layout_transform_1       fused_layout_transform_1       187.974   39.345   (1, 64, 112, 112)     1       1
fused_layout_transform_2       fused_layout_transform_2       8.665     1.814    (1, 1, 224, 224, 3)   1       1
Total_time                     -                              477.759   -        -                     -       -
There is a clear drop for the conv2d op: 99.964 us before the PR vs. 281.12 us after, roughly 2.8x slower. I checked the lowered IR before and after the PR and it is identical. Is there anything during codegen that breaks the performance?
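Since the lowered IR matches, one way to narrow this down (a sketch, assuming the script above; get_source is the standard runtime.Module API) is to dump what the LLVM backend actually generated at both commits and diff the outputs:

# Dump the compiled module's LLVM IR and target assembly, so the
# codegen output of the two commits can be compared directly.
print(lib.get_source("ll"))   # LLVM IR
print(lib.get_source("asm"))  # x86 assembly for -mcpu=skylake-avx512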