[AutoTVM] GFLOPS is great in auto-tuning, but the op is very time-consuming at runtime

I printed the per-op log with the debug graph runtime and found that fused_nn_contrib_conv2d_NCHWc_9 takes about ten times as long as any other op, even though its GFLOPS looked great during auto-tuning.

Node Name                                                                                                  Ops                                                                                                       Time(us)   Time(%)  Shape                 Inputs  Outputs  
---------                                                                                                  ---                                                                                                       --------   -------  -----                 ------  -------  
fused_nn_contrib_conv2d_NCHWc_9                                                                            fused_nn_contrib_conv2d_NCHWc_9                                                                           6791.89    12.747   (1, 6, 71, 71, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_5                                                                            fused_nn_contrib_conv2d_NCHWc_5                                                                           2484.37    4.663    (1, 48, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_11                                                                           fused_nn_contrib_conv2d_NCHWc_11                                                                          2468.05    4.632    (1, 8, 147, 147, 8)   2       1        
fused_nn_contrib_conv2d_NCHWc_12                                                                           fused_nn_contrib_conv2d_NCHWc_12                                                                          1185.52    2.225    (1, 4, 147, 147, 8)   2       1        
fused_nn_contrib_conv2d_NCHWc_381                                                                          fused_nn_contrib_conv2d_NCHWc_38                                                                          810.38     1.521    (1, 24, 8, 8, 16)     2       1        
fused_nn_contrib_conv2d_NCHWc_38                                                                           fused_nn_contrib_conv2d_NCHWc_38                                                                          803.933    1.509    (1, 24, 8, 8, 16)     2       1        
fused_nn_contrib_conv2d_NCHWc_331                                                                          fused_nn_contrib_conv2d_NCHWc_33                                                                          711.435    1.335    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_32                                                                           fused_nn_contrib_conv2d_NCHWc_32                                                                          695.281    1.305    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_322                                                                          fused_nn_contrib_conv2d_NCHWc_32                                                                          680.986    1.278    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_33                                                                           fused_nn_contrib_conv2d_NCHWc_33                                                                          679.579    1.275    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_323                                                                          fused_nn_contrib_conv2d_NCHWc_32                                                                          658.919    1.237    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_333                                                                          fused_nn_contrib_conv2d_NCHWc_33                                                                          654.632    1.229    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_321                                                                          fused_nn_contrib_conv2d_NCHWc_32                                                                          653.408    1.226    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_332                                                                          fused_nn_contrib_conv2d_NCHWc_33                                                                          651.086    1.222    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_16                                                                           fused_nn_contrib_conv2d_NCHWc_16                                                                          630.064    1.183    (1, 12, 35, 35, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_162                                                                          fused_nn_contrib_conv2d_NCHWc_16                                                                          614.8      1.154    (1, 12, 35, 35, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_161                                                                          fused_nn_contrib_conv2d_NCHWc_16                                                                          612.939    1.15     (1, 12, 35, 35, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_421                                                                          fused_nn_contrib_conv2d_NCHWc_42                                                                          587.995    1.104    (1, 28, 8, 8, 16)     2       1        
fused_layout_transform_transpose_multiply_add_nn_relu_transpose                                            fused_layout_transform_transpose_multiply_add_nn_relu_transpose                                           561.455    1.054    (1, 192, 71, 71)      3       1        
fused_nn_contrib_conv2d_NCHWc_271                                                                          fused_nn_contrib_conv2d_NCHWc_27                                                                          538.25     1.01     (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_27                                                                           fused_nn_contrib_conv2d_NCHWc_27                                                                          528.974    0.993    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_30                                                                           fused_nn_contrib_conv2d_NCHWc_30                                                                          519.549    0.975    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_301                                                                          fused_nn_contrib_conv2d_NCHWc_30                                                                          515.873    0.968    (1, 24, 17, 17, 8)    2       1        
fused_layout_transform_transpose_multiply_add_nn_relu_transpose_1                                          fused_layout_transform_transpose_multiply_add_nn_relu_transpose_1                                         508.914    0.955    (1, 64, 147, 147)     3       1        
fused_nn_contrib_conv2d_NCHWc_4111                                                                         fused_nn_contrib_conv2d_NCHWc_41                                                                          496.091    0.931    (1, 12, 8, 8, 32)     2       1        
fused_nn_contrib_conv2d_NCHWc_14                                                                           fused_nn_contrib_conv2d_NCHWc_14                                                                          480.63     0.902    (1, 4, 35, 35, 16)    2       1        
fused_nn_contrib_conv2d_NCHWc_142                                                                          fused_nn_contrib_conv2d_NCHWc_14                                                                          478.135    0.897    (1, 4, 35, 35, 16)    2       1        
fused_nn_contrib_conv2d_NCHWc_141                                                                          fused_nn_contrib_conv2d_NCHWc_14                                                                          474.557    0.891    (1, 4, 35, 35, 16)    2       1        
fused_nn_contrib_conv2d_NCHWc_22                                                                           fused_nn_contrib_conv2d_NCHWc_22                                                                          471.106    0.884    (1, 24, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_311                                                                          fused_nn_contrib_conv2d_NCHWc_31                                                                          470.568    0.883    (1, 20, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_282                                                                          fused_nn_contrib_conv2d_NCHWc_28                                                                          468.939    0.88     (1, 20, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_31                                                                           fused_nn_contrib_conv2d_NCHWc_31                                                                          443.32     0.832    (1, 20, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_312                                                                          fused_nn_contrib_conv2d_NCHWc_31                                                                          442.435    0.83     (1, 20, 17, 17, 8)    2       1        
fused_nn_avg_pool2d_10                                                                                     fused_nn_avg_pool2d_10                                                                                    442.288    0.83     (1, 288, 35, 35)      1       1        
fused_nn_contrib_conv2d_NCHWc_313                                                                          fused_nn_contrib_conv2d_NCHWc_31                                                                          442.159    0.83     (1, 20, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_28                                                                           fused_nn_contrib_conv2d_NCHWc_28                                                                          433.852    0.814    (1, 20, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_281                                                                          fused_nn_contrib_conv2d_NCHWc_28                                                                          432.233    0.811    (1, 20, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_283                                                                          fused_nn_contrib_conv2d_NCHWc_28                                                                          431.392    0.81     (1, 20, 17, 17, 8)    2       1        
fused_nn_contrib_conv2d_NCHWc_1                                                                            fused_nn_contrib_conv2d_NCHWc_1                                                                           412.48     0.774    (1, 10, 8, 8, 32)     2       1        
fused_nn_contrib_conv2d_NCHWc_25                                                                           fused_nn_contrib_conv2d_NCHWc_25                                                                          410.248    0.77     (1, 24, 17, 17, 8)    2       1        
fused_nn_avg_pool2d_9                                                                                      fused_nn_avg_pool2d_9                                                                                     393.421    0.738    (1, 256, 35, 35)      1       1        
fused_nn_contrib_conv2d_NCHWc_39                                                                           fused_nn_contrib_conv2d_NCHWc_39                                                                          371.392    0.697    (1, 14, 8, 8, 32)     2       1        
fused_nn_contrib_conv2d_NCHWc_291                                                                          fused_nn_contrib_conv2d_NCHWc_29                                                                          370.567    0.695    (1, 10, 17, 17, 16)   2       1        
fused_nn_contrib_conv2d_NCHWc_292                                                                          fused_nn_contrib_conv2d_NCHWc_29                                                                          365.842    0.687    (1, 10, 17, 17, 16)   2       1        
fused_nn_contrib_conv2d_NCHWc_293                                                                          fused_nn_contrib_conv2d_NCHWc_29                                                                          363.063    0.681    (1, 10, 17, 17, 16)   2       1        
fused_nn_contrib_conv2d_NCHWc_29                                                                           fused_nn_contrib_conv2d_NCHWc_29                                                                          359.889    0.675    (1, 10, 17, 17, 16)   2       1        
fused_nn_contrib_conv2d_NCHWc_173                                                                          fused_nn_contrib_conv2d_NCHWc_17                                                                          342.677    0.643    (1, 6, 35, 35, 16)    2       1        
fused_nn_contrib_conv2d_NCHWc_172                                                                          fused_nn_contrib_conv2d_NCHWc_17                                                                          335.739    0.63     (1, 6, 35, 35, 16)    2       1        
fused_nn_contrib_conv2d_NCHWc_17                                                                           fused_nn_contrib_conv2d_NCHWc_17                                                                          329.798    0.619    (1, 6, 35, 35, 16)    2       1        
fused_nn_contrib_conv2d_NCHWc_171                                                                          fused_nn_contrib_conv2d_NCHWc_17                                                                          329.348    0.618    (1, 6, 35, 35, 16)    2       1        
fused_nn_contrib_conv2d_NCHWc_410                                                                          fused_nn_contrib_conv2d_NCHWc_4                                                                           328.689    0.617    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_43                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           322.785    0.606    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_49                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           322.402    0.605    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_411                                                                          fused_nn_contrib_conv2d_NCHWc_4                                                                           315.062    0.591    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_44                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           311.779    0.585    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_4                                                                            fused_nn_contrib_conv2d_NCHWc_4                                                                           311.495    0.585    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_41                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           310.259    0.582    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_45                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           309.839    0.582    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_42                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           309.302    0.581    (1, 6, 17, 17, 32)    2       1        
fused_nn_avg_pool2d_113                                                                                    fused_nn_avg_pool2d_11                                                                                    302.062    0.567    (1, 768, 17, 17)      1       1        
fused_nn_contrib_conv2d_NCHWc_48                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           300.728    0.564    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_47                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           300.177    0.563    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_46                                                                           fused_nn_contrib_conv2d_NCHWc_4                                                                           299.445    0.562    (1, 6, 17, 17, 32)    2       1        
fused_nn_contrib_conv2d_NCHWc_261                                                                          fused_nn_contrib_conv2d_NCHWc_26                                                                          294.235    0.552    (1, 16, 17, 17, 8)    2       1        
fused_nn_avg_pool2d_8                                                                                      fused_nn_avg_pool2d_8                                                                                     294.226    0.552    (1, 192, 35, 35)      1       1        
fused_nn_contrib_conv2d_NCHWc_431                                                                          fused_nn_contrib_conv2d_NCHWc_43                                                                          292.24     0.548    (1, 6, 8, 8, 32)      2       1        
fused_nn_contrib_conv2d_NCHWc_26                                                                           fused_nn_contrib_conv2d_NCHWc_26                                                                          291.928    0.548    (1, 16, 17, 17, 8)    2       1        
......                                                                                                                                                     
  1. The best configuration is obtained by auto-tuning first, and then graph optimization is performed. However, that best configuration is not necessarily still suitable after graph optimization. Isn’t that a problem?

  2. I want to optimize an op individually and then modify the best configuration from auto-tuning. Are there any tutorials?

@comaniac @haichen Can you give me a suggestion? Thanks in advance!

Could you share more information about your model and the shapes of these conv2d ops? You can try to build a simple program that has only conv2d_NCHWc_9, for example, to see if it is still as slow as in the entire model.

Good idea! I will try to build a model that has only conv2d_NCHWc_9 and apply auto-tuning.

My model is from InceptionV3:

@haichen I don’t know how to get the pre-fusion ops. Is there any way to know which ops were fused into fused_nn_contrib_conv2d_NCHWc_9?

I printed the graph after build_module.build; it only shows nodes:

{
  "nodes": [
    {
      "op": "null", 
      "name": "input", 
      "inputs": []
    }, 
    {
      "op": "tvm_op", 
      "name": "fused_transpose_layout_transform", 
      "attrs": {
        "func_name": "fused_transpose_layout_transform", 
        "flatten_data": "0", 
        "num_inputs": "1", 
        "num_outputs": "1"
      }, 
.......
{
      "op": "tvm_op", 
      "name": "fused_nn_contrib_conv2d_NCHWc_9", 
      "attrs": {
        "func_name": "fused_nn_contrib_conv2d_NCHWc_9", 
        "flatten_data": "0", 
        "num_inputs": "2", 
        "num_outputs": "1"
      }, 
      "inputs": [
        [
          23, 
          0, 
          0
        ], 
        [
          24, 
          0, 
          0
        ]
      ]
    }, 
    {
      "op": "null", 
      "name": "p13", 
      "inputs": []
    }, 
    {
      "op": "null", 
      "name": "p14", 
      "inputs": []
    }, 
......
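One thing you can read off this JSON directly is which nodes feed the fused op: each entry in a node’s "inputs" list is [node_index, output_index, version], and the first element indexes back into the "nodes" array. A self-contained sketch (the graph below is a made-up miniature, not the real InceptionV3 graph):

```python
import json

# Made-up miniature of the graph JSON printed above, just to show the walk.
graph_json = """
{
  "nodes": [
    {"op": "null",   "name": "input",                           "inputs": []},
    {"op": "tvm_op", "name": "fused_transpose_layout_transform", "inputs": [[0, 0, 0]]},
    {"op": "null",   "name": "p13",                              "inputs": []},
    {"op": "tvm_op", "name": "fused_nn_contrib_conv2d_NCHWc_9",
     "inputs": [[1, 0, 0], [2, 0, 0]]}
  ]
}
"""

def producers(graph_str, node_name):
    """Return the names of the nodes whose outputs feed `node_name`.

    Each "inputs" entry is [node_index, output_index, version]; the first
    element indexes back into the "nodes" list.
    """
    nodes = json.loads(graph_str)["nodes"]
    target = next(n for n in nodes if n["name"] == node_name)
    return [nodes[idx]["name"] for idx, _, _ in target["inputs"]]

print(producers(graph_json, "fused_nn_contrib_conv2d_NCHWc_9"))
# -> ['fused_transpose_layout_transform', 'p13']
```

Note this only tells you the upstream nodes; which primitive ops were merged *inside* the fused node is reflected in its func_name (here, just the conv itself).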

I printed mod[‘main’], but it only shows nn.conv2d, and I don’t know which conv2d becomes conv2d_NCHWc_9. Is it the ninth one? It looks like %49 = nn.conv2d…, because fused_nn_contrib_conv2d_NCHWc_9 reports an output shape of (1, 6, 71, 71, 32), which is (1, 192, 71, 71) split into 6 blocks of 32 channels.

....
%2 = nn.conv2d(%0, %1, strides=[2, 2], padding=[0, 0, 0, 0], channels=32, kernel_size=[3, 3]) /* ty=Tensor[(1, 32, 149, 149), float32] */;
%13 = nn.conv2d(%11, %12, padding=[0, 0, 0, 0], channels=32, kernel_size=[3, 3]) /* ty=Tensor[(1, 32, 147, 147), float32] */;
%24 = nn.conv2d(%22, %23, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 147, 147), float32] */;
%38 = nn.conv2d(%36, %37, padding=[0, 0, 0, 0], channels=80, kernel_size=[1, 1]) /* ty=Tensor[(1, 80, 73, 73), float32] */;
%49 = nn.conv2d(%47, %48, padding=[0, 0, 0, 0], channels=192, kernel_size=[3, 3]) /* ty=Tensor[(1, 192, 71, 71), float32] */;
%63 = nn.conv2d(%61, %62, padding=[0, 0, 0, 0], channels=64, kernel_size=[1, 1]) /* ty=Tensor[(1, 64, 35, 35), float32] */;
%74 = nn.conv2d(%72, %73, padding=[0, 0, 0, 0], channels=48, kernel_size=[1, 1]) /* ty=Tensor[(1, 48, 35, 35), float32] */;
%85 = nn.conv2d(%83, %84, padding=[2, 2, 2, 2], channels=64, kernel_size=[5, 5]) /* ty=Tensor[(1, 64, 35, 35), float32] */;
%96 = nn.conv2d(%94, %95, padding=[0, 0, 0, 0], channels=64, kernel_size=[1, 1]) /* ty=Tensor[(1, 64, 35, 35), float32] */;
%107 = nn.conv2d(%105, %106, padding=[1, 1, 1, 1], channels=96, kernel_size=[3, 3]) /* ty=Tensor[(1, 96, 35, 35), float32] */;
%118 = nn.conv2d(%116, %117, padding=[1, 1, 1, 1], channels=96, kernel_size=[3, 3]) /* ty=Tensor[(1, 96, 35, 35), float32] */;
......

Do you use graph, lib, params = relay.build_module.build(mod["main"], target=target, params=params) instead of graph, lib, params = relay.build_module.build(mod, target=target, params=params) when applying the graph-best or history log? In my experience, the latter fails to apply the history log.

@boood15 Whether I use mod[‘main’] or mod, the result is the same and the model’s runtime is the same. When I apply the graph-best log, the model runs faster, so it works.

My problem is that some op optimizations are not good enough when using auto-tuning. I want to optimize an op individually and then modify the best log from auto-tuning. Do you have any suggestions?

When you say optimize an op individually, do you mean you want to optimize how it is calculated? You can compare the GFLOPS number reported when tuning this op to the theoretical GFLOPS of your device. Sometimes an op’s schedule in the TVM repo may not work so well, and you can implement your own schedule to exploit more performance.

I have two ideas:

  1. I want to search the parameters for one op myself and replace the corresponding entry in the best log, because sometimes the auto-tuning search is not comprehensive. If I set a large n_trial (etc.), it takes a lot of time, so I want to auto-tune a single op individually. Do you have any relevant experience or advice?

  2. I also want to implement my own schedule to exploit more performance, but I’m a novice at this, and it’s too difficult for me to write a good schedule right now. Do you know of any tutorials or books?

I have another question to ask:

As mentioned above, the fused_nn_contrib_conv2d_NCHWc_9 op takes 12.7% of the total time. It’s a fused op; how do I get the pre-fusion ops? Is there any way to know which ops were fused into fused_nn_contrib_conv2d_NCHWc_9?

Thanks in advance!

  1. The function autotvm.task.extract_from_program extracts tasks from your network and returns a list of tasks; you can remove some of them and tune only the rest. Which tuner did you use? Some of the TVM tutorials use the RandomTuner, which may yield bad performance. I use the XGBTuner and it works better than the RandomTuner. You can also try other tuners, such as the GATuner.

  2. The TVM docs provide some schedule examples; you can take a look at those.

  3. In my limited experience, a fused op’s name may look like fused_reshape_add_multiply_divide_erf_add_multiply_reshape; although your op’s name contains ‘fused’, it may not actually be fused with any other op. You can also lower the opt_level in relay.build_config(opt_level=3) to disable op fusion, then run the benchmark again to see how much time this op costs on its own.