[Resolved][Performance Regression] "Migrate all low-level passes to the Pass Manager" PR causes a regression

https://github.com/apache/incubator-tvm/pull/5233 causes a performance regression on x86.

I created a minimal sample containing the first layer of ResNet:

import numpy as np
import tvm
import topi
import tvm.relay.testing

from tvm import relay, autotvm
from tvm.contrib import graph_runtime
from tvm.contrib.debugger import debug_runtime

# First convolution layer of ResNet: 7x7 kernel, stride 2, 3 -> 64 channels.
batch_size = 1
in_channel = 3
out_channel = 64
in_height, in_width = 224, 224
kernel_size = (7, 7)
padding = (3, 3, 3, 3)
strides = (2, 2)
dilation = (1, 1)


dshape = (batch_size, in_channel, in_height, in_width)
wshape = (out_channel, in_channel) + kernel_size
target = "llvm -mcpu=skylake-avx512"
dtype = "float32"

# Build a single-conv2d Relay function and turn it into a workload with
# randomly initialized parameters.
data = relay.var("data", shape=dshape, dtype=dtype)
weight = relay.var("weight", shape=wshape, dtype=dtype)
net = relay.nn.conv2d(data, weight=weight, channels=out_channel, kernel_size=kernel_size,
                      strides=strides, padding=padding, dilation=dilation)
out = relay.Function(relay.analysis.free_vars(net), net)
net, params = relay.testing.create_workload(out)

# Compile with the untuned fallback schedules at opt_level=3.
with tvm.autotvm.FallbackContext():
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build_module.build(
            net, target=target, params=params)

# Run once under the debug runtime to get the per-node time breakdown below.
ctx = tvm.cpu()
data_tvm = tvm.nd.array(np.random.uniform(size=dshape).astype(dtype))
module = debug_runtime.create(graph, lib, ctx)
module.set_input('data', data_tvm)
module.set_input(**params)
module.run()
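
For a more stable end-to-end number (the per-node debug timings can be noisy), a rough sketch using the plain graph runtime with time_evaluator could look like the following; the number/repeat values are only illustrative:

# Sketch (not part of the original repro): measure end-to-end latency with the
# plain graph runtime instead of the debug runtime.
module = graph_runtime.create(graph, lib, ctx)
module.set_input('data', data_tvm)
module.set_input(**params)
ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
prof_res = np.array(ftimer().results) * 1e6  # seconds -> microseconds
print("Mean inference time: %.2f us (std %.2f us)" % (np.mean(prof_res), np.std(prof_res)))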

Before this PR, the performance is:

Node Name                      Ops                            Time(us)  Time(%)  Shape                 Inputs  Outputs  
---------                      ---                            --------  -------  -----                 ------  -------  
fused_layout_transform_1       fused_layout_transform_1       186.86    63.2     (1, 64, 112, 112)     1       1        
fused_nn_contrib_conv2d_NCHWc  fused_nn_contrib_conv2d_NCHWc  99.964    33.81    (1, 2, 112, 112, 32)  2       1        
fused_layout_transform_2       fused_layout_transform_2       8.84      2.99     (1, 1, 224, 224, 3)   1       1        
Total_time                     -                              295.664   -        -                     -       - 

After this PR:

Node Name                      Ops                            Time(us)  Time(%)  Shape                 Inputs  Outputs  
---------                      ---                            --------  -------  -----                 ------  -------  
fused_nn_contrib_conv2d_NCHWc  fused_nn_contrib_conv2d_NCHWc  281.12    58.841   (1, 2, 112, 112, 32)  2       1        
fused_layout_transform_1       fused_layout_transform_1       187.974   39.345   (1, 64, 112, 112)     1       1        
fused_layout_transform_2       fused_layout_transform_2       8.665     1.814    (1, 1, 224, 224, 3)   1       1        
Total_time                     -                              477.759   -        -                     -       - 

There is a clear performance drop for the conv2d op. I checked the lowered IR before and after the PR, and it is identical. Is there anything during codegen that breaks the performance?
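
One way to do such a comparison (a rough sketch, assuming the llvm target so the built module exposes get_source; the output file name is arbitrary) is to dump the generated assembly under each commit and diff the two files offline:

# Sketch: dump the generated assembly so it can be diffed between commits.
# Assumes `lib` is the module returned by relay.build above (llvm target).
with open("conv2d_asm.txt", "w") as f:
    f.write(lib.get_source("asm"))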

@tqchen @haichen @zhiics

@kevinthesun Can you check the PrimFunc fed into the codegen phase? In particular, please also check the attributes (e.g. noalias). Note that this specific PR does not change the codegen part (it was already refactored in a previous PR).
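
One quick spot check along these lines (a rough sketch, not an exact recipe; assumes `lib` from the build above with an llvm target) is to grep the generated LLVM IR for noalias on the kernel arguments:

# Sketch: look for "noalias" in the generated LLVM IR; if the attribute was
# dropped somewhere before codegen, it will be missing on the buffer arguments.
ll = lib.get_source("ll")
hits = [line for line in ll.splitlines() if "noalias" in line]
print("\n".join(hits[:10]) if hits else "no noalias attributes found")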

Please see if https://github.com/apache/incubator-tvm/pull/5258 fixes the issue.