Performance has been too slow since the TVM update

Hello!

When I updated TVM to version 0.7.dev1, running a module takes far too long. I am using a Titan Xp GPU with CUDA 8.0 and LLVM 6.0.1.

I measured the performance with the code below.

import numpy as np

from tvm import relay
from tvm.relay import testing
import tvm
from tvm import te
from tvm.contrib import graph_runtime
import time


batch_size = 100
num_class = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_class)

mod, params = relay.testing.vgg.get_workload(
    num_layers=16, batch_size=batch_size, image_shape=image_shape)


opt_level = 3
target = tvm.target.cuda()
with relay.build_config(opt_level=opt_level):
    graph, lib, params = relay.build_module.build(
        mod, target, params=params)

ctx = tvm.gpu()
data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
# create module
module = graph_runtime.create(graph, lib, ctx)
# set input and parameters
module.set_input("data", data)
module.set_input(**params)

# time one full run; repeat=1 gives a single measurement (hence a 0.00 ms std dev)
ev = module.module.time_evaluator("run", ctx, number=1, repeat=1)
prof_res = np.array(ev().results) * 1000  # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
              (np.mean(prof_res), np.std(prof_res)))

The reported result is “Mean inference time (std dev): 50430.87 ms (0.00 ms)”.

But with the old TVM version, the same code reports “Mean inference time (std dev): 495.78 ms (0.00 ms)”.

As shown above, the performance difference is too big. Is there a problem with the code? Or is the current version unstable?


I’ve experimented a bit, and even a simple model shows a drastic drop in performance once the batch size is 2 or more.
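A minimal sketch of such a sweep, using resnet-18 from relay.testing as a stand-in for a simple model:

import numpy as np
import tvm
from tvm import relay
from tvm.relay import testing
from tvm.contrib import graph_runtime

target = tvm.target.cuda()
ctx = tvm.gpu()

# Measure the same network at several batch sizes to see where the
# slowdown starts.
for batch_size in (1, 2, 4):
    mod, params = relay.testing.resnet.get_workload(
        num_layers=18, batch_size=batch_size, image_shape=(3, 224, 224))
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build_module.build(mod, target, params=params)
    module = graph_runtime.create(graph, lib, ctx)
    data = np.random.uniform(
        -1, 1, size=(batch_size, 3, 224, 224)).astype("float32")
    module.set_input("data", data)
    module.set_input(**params)
    ev = module.module.time_evaluator("run", ctx, number=1, repeat=10)
    res = np.array(ev().results) * 1000  # convert to milliseconds
    print("batch %d: %.2f ms" % (batch_size, np.mean(res)))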

Could you use the debug graph runtime to see where the performance bottleneck is?
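Something along these lines should work (a minimal sketch that reuses the graph, lib, ctx, data and params from the code above; the dump_root path is arbitrary):

from tvm.contrib.debugger import debug_runtime

# Create the debug runtime instead of the regular graph runtime.
debug_module = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
debug_module.set_input("data", data)
debug_module.set_input(**params)
# run() collects and prints per-operator timings, which should show
# where the time is going.
debug_module.run()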

Yes.

The result is shown below.

This is the result of running the VGG-16 network with batch size 1, and the measured time is “Mean inference time (std dev): 3.38 ms (0.01 ms)”.

And when I set batch_size to 2,

the result is “Mean inference time (std dev): 357.77 ms (0.19 ms)”.

I know that convolution with the Winograd algorithm is fast, but the existing direct conv2d seems to run far too slowly.

In addition, the results below are from the older version of TVM.

When batch_size is 1,

the inference time is “Mean inference time (std dev): 3.41 ms (0.01 ms)”.

When batch_size is 2,

the inference time is “Mean inference time (std dev): 15.43 ms (0.01 ms)”.

The new version of TVM does not seem to optimize well when the batch size is 2 or more.

Thanks for the clear info. It looks like conv2d with batch size 2 doesn’t get a good schedule config from TopHub, but this needs further investigation.
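In the meantime, one possible workaround is to tune the conv2d tasks locally with AutoTVM instead of relying on the TopHub fallback. A rough sketch, assuming the mod, target and params from the code above are in scope (the log file name is arbitrary, and the exact task-extraction call varies slightly between versions):

from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

# Extract the tunable conv2d tasks from the Relay module built above.
tasks = autotvm.task.extract_from_program(
    mod["main"], target=target, params=params,
    ops=(relay.op.get("nn.conv2d"),))

log_file = "vgg16_conv2d.log"  # arbitrary name for the tuning log
for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")
    tuner.tune(
        n_trial=min(200, len(task.config_space)),
        measure_option=autotvm.measure_option(
            builder=autotvm.LocalBuilder(timeout=10),
            runner=autotvm.LocalRunner(number=10, repeat=1, timeout=4)),
        callbacks=[autotvm.callback.log_to_file(log_file)])

# Build with the tuned configs instead of the TopHub fallback.
with autotvm.apply_history_best(log_file):
    with relay.build_config(opt_level=3):
        graph, lib, params = relay.build_module.build(mod, target, params=params)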

cc @haichen for any suggestions.

Could it be due to the fallback mechanism not working properly? @haichen

I found out that the reason is that the Winograd implementation has a higher priority than the direct one, and Winograd performs poorly when the batch size is large.

I was thinking that too, but the screenshot for batch size 2 still shows fused_nn_conv2d... instead of fused_nn_contrib_conv2d_winograd..., so I am not sure whether it uses Winograd for batch size 2 or not.
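One quick way to check, as a sketch: the graph JSON string returned by relay.build lists the fused node names, so you can print the conv2d nodes and see which variant was picked (assuming the graph variable from the code above):

import json

# The fused node names reveal which conv2d implementation was chosen,
# e.g. fused_nn_conv2d... vs fused_nn_contrib_conv2d_winograd...
graph_json = json.loads(graph)
for node in graph_json["nodes"]:
    if "conv2d" in node["name"]:
        print(node["name"])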

This PR should fix your problem, @CASS_choi.


Okay, thanks to all of you!