Difference in profiler outputs

Hi @tkonolige, sorry for the delay in responding. I changed the target to “llvm -mcpu=cascadelake” to match the target machine and re-ran the tuning. I now get a much better inference time of < 100 ms from both benchmark and VirtualMachineProfiler, but a roughly 4x discrepancy remains between the outputs of the two profilers (VirtualMachineProfiler and debug_executor). The outputs are attached below ([1]). I also tried ResNet-18 and observe the same discrepancy there.
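For reference, the size of the gap can be computed from the Sum and Total rows of outputs [1](a) and [1](b) and the mean in [1](c). The sketch below simply redoes that arithmetic with the numbers copied from those outputs. `parse_us` is a hypothetical helper, not a TVM function. The debug_executor prints durations with Indian digit grouping (e.g. `3,83,036.30`), so the commas are stripped rather than interpreted.

```python
def parse_us(text: str) -> float:
    """Parse a profiler duration in microseconds, e.g. '99,441.43' or '3,83,036.30'."""
    # Commas are only digit-group separators (Western or Indian grouping), so drop them.
    return float(text.replace(",", ""))


# Numbers copied from the Sum/Total rows of outputs [1](a) and [1](b), and the mean of [1](c).
vm_sum_us, vm_total_us = parse_us("98,275.48"), parse_us("99,441.43")
debug_sum_us, debug_total_us = parse_us("3,82,269.07"), parse_us("3,83,036.30")
benchmark_mean_ms = 95.1157

ratio = debug_total_us / vm_total_us
print(f"profiler_vm total:    {vm_total_us / 1000:7.1f} ms")
print(f"debug_executor total: {debug_total_us / 1000:7.1f} ms")
print(f"benchmark mean:       {benchmark_mean_ms:7.1f} ms")
print(f"debug_executor / profiler_vm: {ratio:.2f}x")

# Total - Sum: time not attributed to any listed kernel call.
print(f"unattributed: vm={vm_total_us - vm_sum_us:.2f} us, "
      f"debug={debug_total_us - debug_sum_us:.2f} us")
```

The ratio comes out at about 3.85x, and in both profilers only around 1 ms is unattributed (Total minus Sum), so the extra debug_executor time appears inside the per-kernel rows rather than outside them.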

When running without graph tuning, I observe almost no discrepancy. Interestingly, the debug_executor’s total inference time gets worse when I enable graph tuning, while that of the other two improves. The outputs are attached below ([2]). I haven’t yet been able to get access to another system to run these experiments on; I will update this thread as soon as I do.
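One way to see where the debug_executor time goes in outputs [1] is to match the rows of the two tables on their Hash column, since the same fused kernel gets a different name in each profiler. A minimal sketch in plain Python, using a handful of rows copied by hand from the tables; the `vm_us`/`debug_us` dictionaries and the `rank_by_excess` helper are illustrative, not TVM APIs.

```python
# Per-kernel durations (us) copied by hand from outputs [1](a) and [1](b). Rows are
# keyed by the Hash column, because the two profilers name the same fused kernel
# differently (debug_executor adds a tvmgen_default_ prefix and a different suffix).
vm_us = {
    "6fb734c77ed64bde": 3069.46,   # fused_nn_contrib_conv2d_NCHWc_add
    "10a40e9231ff15a6": 2898.89,   # fused_nn_contrib_conv2d_NCHWc_add_nn_relu
    "7661eb48c0b8a7e6": 1382.10,   # fused_nn_contrib_conv2d_NCHWc
    "0f7bbb0e363c360c": 330.20,    # fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1
    "83e0f5d1673ff2ae": 8034.25,   # fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2
    "efb9044cdd43e0b8": 15909.52,  # fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8
}
debug_us = {
    "6fb734c77ed64bde": 139559.24,  # tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_3
    "10a40e9231ff15a6": 118024.98,  # tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12
    "7661eb48c0b8a7e6": 23051.66,   # tvmgen_default_fused_nn_contrib_conv2d_NCHWc
    "0f7bbb0e363c360c": 13159.49,   # tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11
    "83e0f5d1673ff2ae": 13328.36,   # tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9
    "efb9044cdd43e0b8": 15185.61,   # tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3
}


def rank_by_excess(vm, debug):
    """Return (hash, vm_us, debug_us, excess_us, ratio) rows, largest excess first."""
    rows = []
    for h in sorted(vm.keys() & debug.keys()):
        excess = debug[h] - vm[h]
        rows.append((h, vm[h], debug[h], excess, debug[h] / vm[h]))
    return sorted(rows, key=lambda row: row[3], reverse=True)


for h, v, d, excess, r in rank_by_excess(vm_us, debug_us):
    print(f"{h}  vm={v:>10,.2f}  debug={d:>11,.2f}  excess={excess:>11,.2f}  ({r:.1f}x)")
```

On these rows, the kernels with hashes 6fb734c77ed64bde (a 1x1 convolution on the 56x56 NCHW32c input) and 10a40e9231ff15a6 (the first 7x7 convolution) are about 45x and 41x slower under debug_executor, and together account for roughly 250 ms of the ~284 ms gap between the two Total rows.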


Outputs:

[1] With Graph Tuning

(a) profiler_vm

Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Name                                                    Duration (us)  Percent  Count  out_layout  Device  data_layout  kernel_layout              Hash   layout                                                                                                                                                       Argument Shapes  dst_layout  weight_layout  src_layout  
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8                 15,909.52    16.00      6     NCHW16c    cpu0      NCHW16c     OIHW16i16o  efb9044cdd43e0b8                                                                float32[1, 16, 14, 14, 16], float32[16, 16, 3, 3, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5                 10,522.82    10.58      4     NCHW32c    cpu0       NCHW8c      OIHW8i32o  0d551fd3800939e1                                                                     float32[1, 16, 28, 28, 8], float32[4, 16, 3, 3, 8, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11                 9,095.54     9.15      3     NCHW16c    cpu0      NCHW16c     OIHW16i16o  68695c5cd347ce57                                                                    float32[1, 32, 7, 7, 16], float32[32, 32, 3, 3, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2                  8,034.25     8.08      3     NCHW32c    cpu0      NCHW64c     OIHW64i32o  83e0f5d1673ff2ae                                                                     float32[1, 1, 56, 56, 64], float32[2, 1, 3, 3, 64, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9                  6,451.60     6.49      5     NCHW16c    cpu0      NCHW16c     OIHW16i16o  c8d2fb74508242fa                                                                float32[1, 64, 14, 14, 16], float32[16, 64, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_2                          6,219.45     6.25      5     NCHW16c    cpu0       NCHW4c      OIHW4i16o  991e77362efe315d                                                                float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_1                          4,069.38     4.09      3     NCHW64c    cpu0      NCHW32c     OIHW32i64o  b8f45dade76ef8ee                                                                   float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 28, 28, 64]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6                  3,627.03     3.65      3     NCHW32c    cpu0      NCHW64c     OIHW64i32o  435cfe42fcb8d0b0                                                                     float32[1, 8, 28, 28, 64], float32[4, 8, 1, 1, 64, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]                                         
fused_nn_contrib_conv2d_NCHWc_add                            3,069.46     3.09      2     NCHW16c    cpu0      NCHW32c     OIHW32i16o  6fb734c77ed64bde                                                                float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 56, 56, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu                    2,898.89     2.92      1     NCHW16c    cpu0       NCHW3c      OIHW3i16o  10a40e9231ff15a6                                                                   float32[1, 1, 224, 224, 3], float32[4, 1, 7, 7, 3, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 112, 112, 16]                                         
fused_nn_contrib_conv2d_NCHWc_3                              2,659.84     2.67      1     NCHW16c    cpu0      NCHW16c     OIHW16i16o  9c3ea371f8ec4054                                                                                          float32[1, 64, 14, 14, 16], float32[128, 64, 1, 1, 16, 16], float32[1, 128, 7, 7, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12                 2,592.41     2.61      2     NCHW16c    cpu0      NCHW16c     OIHW16i16o  1cc8a4dccc794a64                                                                  float32[1, 128, 7, 7, 16], float32[32, 128, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_3                          2,587.97     2.60      2     NCHW16c    cpu0      NCHW32c     OIHW32i16o  528b9cb523882d7e                                                                 float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]                                         
fused_nn_contrib_conv2d_NCHWc_1                              2,568.47     2.58      1     NCHW64c    cpu0      NCHW16c     OIHW16i64o  9b9c1d5fc56b0353                                                                                            float32[1, 16, 56, 56, 16], float32[8, 16, 1, 1, 16, 64], float32[1, 8, 28, 28, 64]                                         
fused_nn_contrib_conv2d_NCHWc_2                              2,560.30     2.57      1     NCHW16c    cpu0      NCHW64c     OIHW64i16o  371a9e61ecaeecce                                                                                            float32[1, 8, 28, 28, 64], float32[64, 8, 1, 1, 64, 16], float32[1, 64, 14, 14, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3                  2,393.13     2.41      2     NCHW64c    cpu0      NCHW16c     OIHW16i64o  850ecaa157c95aac                                                                   float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]                                         
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu                1,519.12     1.53      1     NCHW16c    cpu0      NCHW32c     OIHW32i16o  abe40a1f08b34bad                                      float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]                                         
fused_nn_contrib_conv2d_NCHWc                                1,382.10     1.39      1     NCHW16c    cpu0      NCHW16c     OIHW16i16o  7661eb48c0b8a7e6                                                                                            float32[1, 4, 56, 56, 16], float32[16, 4, 1, 1, 16, 16], float32[1, 16, 56, 56, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1              1,319.25     1.33      1     NCHW64c    cpu0      NCHW32c     OIHW32i64o  88bbb32f8f542f98                                          float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]                                         
fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2              1,299.49     1.31      1     NCHW16c    cpu0       NCHW4c      OIHW4i16o  c7b912640028a9e2                                      float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu       1,252.95     1.26      1     NCHW16c    cpu0      NCHW32c     OIHW32i16o  21cb6d538731ba92           float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]                                         
fused_add_nn_relu                                              823.04     0.83      2                cpu0                              e907ce81104cda7a                                                                                               float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10                   759.67     0.76      1     NCHW16c    cpu0      NCHW16c     OIHW16i16o  8d07031ff51d0737                                                                  float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]                                         
fused_nn_contrib_dense_pack_add                                710.21     0.71      1                cpu0                              7641a0cce9852143                                                                                                    float32[1, 2048], float32[40, 2048, 25], float32[1, 1000], float32[1, 1000]                      NC25n              
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7                    693.32     0.70      1     NCHW16c    cpu0      NCHW64c     OIHW64i16o  dc31662fedbb8185                                                                  float32[1, 8, 28, 28, 64], float32[16, 8, 1, 1, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4                    656.53     0.66      1    NCHW128c    cpu0      NCHW16c    OIHW16i128o  9b01f6479b89fd68                                                                float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 128], float32[1, 1, 1, 1, 128], float32[1, 1, 28, 28, 128]                                         
fused_add_nn_relu_1                                            631.05     0.63      3                cpu0                              0e82013d73aa68c1                                                                                                  float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]                                         
fused_add_nn_relu_2                                            542.71     0.55      5                cpu0                              f12067172f61c850                                                                                               float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]                                         
fused_nn_max_pool2d_add_nn_relu                                364.59     0.37      1                cpu0                              6f701a4fa071030f  NCHW16c                                                                                       float32[1, 4, 112, 112, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]                                         
fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1                    330.20     0.33      1     NCHW64c    cpu0      NCHW16c     OIHW16i64o  0f7bbb0e363c360c                                                                     float32[1, 4, 56, 56, 16], float32[1, 4, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]                                         
fused_layout_transform_1                                       188.19     0.19      3                cpu0                              b8cbb72b4035894d                                                                                                                           float32[1, 4, 28, 28, 32], float32[1, 16, 28, 28, 8]      NCHW8c                    NCHW32c  
fused_layout_transform_2                                       172.13     0.17      6                cpu0                              f5e631fb93d23d4d                                                                                                                          float32[1, 16, 14, 14, 16], float32[1, 64, 14, 14, 4]      NCHW4c                    NCHW16c  
fused_add_nn_relu_3                                            106.41     0.11      2                cpu0                              5d16c15878cc73d4                                                                                                float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]                                         
fused_add_layout_transform                                      96.21     0.10      1                cpu0                              69355d3cc810f874                                                                                                          float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3]      NCHW3c                       NCHW  
fused_nn_global_avg_pool2d                                      56.33     0.06      1                cpu0                              f18307e2786f4cb3  NCHW16c                                                                                                                  float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]                                         
fused_layout_transform                                          52.16     0.05      1                cpu0                              2c5d64d5f9faa001                                                                                                                          float32[1, 1, 28, 28, 128], float32[1, 16, 28, 28, 8]      NCHW8c                   NCHW128c  
fused_layout_transform_3                                        48.26     0.05      3                cpu0                              add43c0d2d8a8a3c                                                                                                                             float32[1, 32, 7, 7, 16], float32[1, 16, 7, 7, 32]     NCHW32c                    NCHW16c  
fused_nn_softmax                                                 9.76     0.01      1                cpu0                              ca61e79ea24e53f0                                                                                                                                             float32[1, 1000], float32[1, 1000]                                         
fused_layout_transform_nn_batch_flatten                          1.73     0.00      1                cpu0                              2db99463d18696a4                                                                                                                                    float32[1, 128, 1, 1, 16], float32[1, 2048]        NCHW                    NCHW16c  
----------                                                                                                                                                                                                                                                                                                                                                                     
Sum                                                         98,275.48    98.83     84                                                                                                                                                                                                                                                                                          
Total                                                       99,441.43               1                cpu0                                                                                                          

(b) debug_executor

Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Name                                                                   Duration (us)  Percent  Count  out_layout  Device  data_layout  kernel_layout              Hash   layout                                                                                                                                                       Argument Shapes  dst_layout  weight_layout  src_layout  
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_3                       1,39,559.24    36.43      2     NCHW16c    cpu0      NCHW32c     OIHW32i16o  6fb734c77ed64bde                                                                float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 56, 56, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_12              1,18,024.98    30.81      1     NCHW16c    cpu0       NCHW3c      OIHW3i16o  10a40e9231ff15a6                                                                   float32[1, 1, 224, 224, 3], float32[4, 1, 7, 7, 3, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 112, 112, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc                               23,051.66     6.02      1     NCHW16c    cpu0      NCHW16c     OIHW16i16o  7661eb48c0b8a7e6                                                                                            float32[1, 4, 56, 56, 16], float32[16, 4, 1, 1, 16, 16], float32[1, 16, 56, 56, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3                 15,185.61     3.96      6     NCHW16c    cpu0      NCHW16c     OIHW16i16o  efb9044cdd43e0b8                                                                float32[1, 16, 14, 14, 16], float32[16, 16, 3, 3, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_9                 13,328.36     3.48      3     NCHW32c    cpu0      NCHW64c     OIHW64i32o  83e0f5d1673ff2ae                                                                     float32[1, 1, 56, 56, 64], float32[2, 1, 3, 3, 64, 32], float32[1, 2, 1, 1, 32], float32[1, 2, 56, 56, 32]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_11                13,159.49     3.44      1     NCHW64c    cpu0      NCHW16c     OIHW16i64o  0f7bbb0e363c360c                                                                     float32[1, 4, 56, 56, 16], float32[1, 4, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_6                 10,205.32     2.66      4     NCHW32c    cpu0       NCHW8c      OIHW8i32o  0d551fd3800939e1                                                                     float32[1, 16, 28, 28, 8], float32[4, 16, 3, 3, 8, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu                    7,727.92     2.02      3     NCHW16c    cpu0      NCHW16c     OIHW16i16o  68695c5cd347ce57                                                                    float32[1, 32, 7, 7, 16], float32[32, 32, 3, 3, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_1                          5,840.79     1.52      5     NCHW16c    cpu0       NCHW4c      OIHW4i16o  991e77362efe315d                                                                float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 14, 14, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_4                  5,746.35     1.50      5     NCHW16c    cpu0      NCHW16c     OIHW16i16o  c8d2fb74508242fa                                                                float32[1, 64, 14, 14, 16], float32[16, 64, 1, 1, 16, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_2                          3,745.35     0.98      3     NCHW64c    cpu0      NCHW32c     OIHW32i64o  b8f45dade76ef8ee                                                                   float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 28, 28, 64]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_7                  3,425.00     0.89      3     NCHW32c    cpu0      NCHW64c     OIHW64i32o  435cfe42fcb8d0b0                                                                     float32[1, 8, 28, 28, 64], float32[4, 8, 1, 1, 64, 32], float32[1, 4, 1, 1, 32], float32[1, 4, 28, 28, 32]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_2                              2,508.48     0.65      1     NCHW16c    cpu0      NCHW64c     OIHW64i16o  371a9e61ecaeecce                                                                                            float32[1, 8, 28, 28, 64], float32[64, 8, 1, 1, 64, 16], float32[1, 64, 14, 14, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_10                 2,400.83     0.63      2     NCHW64c    cpu0      NCHW16c     OIHW16i64o  850ecaa157c95aac                                                                   float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 64], float32[1, 1, 1, 1, 64], float32[1, 1, 56, 56, 64]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_1                              2,396.47     0.63      1     NCHW64c    cpu0      NCHW16c     OIHW16i64o  9b9c1d5fc56b0353                                                                                            float32[1, 16, 56, 56, 16], float32[8, 16, 1, 1, 16, 64], float32[1, 8, 28, 28, 64]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1                  2,271.00     0.59      2     NCHW16c    cpu0      NCHW16c     OIHW16i16o  1cc8a4dccc794a64                                                                  float32[1, 128, 7, 7, 16], float32[32, 128, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add                            2,260.06     0.59      2     NCHW16c    cpu0      NCHW32c     OIHW32i16o  528b9cb523882d7e                                                                 float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 7, 7, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_3                              2,240.88     0.59      1     NCHW16c    cpu0      NCHW16c     OIHW16i16o  9c3ea371f8ec4054                                                                                          float32[1, 64, 14, 14, 16], float32[128, 64, 1, 1, 16, 16], float32[1, 128, 7, 7, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_2              1,401.25     0.37      1     NCHW16c    cpu0      NCHW32c     OIHW32i16o  abe40a1f08b34bad                                      float32[1, 2, 56, 56, 32], float32[16, 2, 1, 1, 32, 16], float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu_1              1,249.88     0.33      1     NCHW64c    cpu0      NCHW32c     OIHW32i64o  88bbb32f8f542f98                                          float32[1, 4, 28, 28, 32], float32[8, 4, 1, 1, 32, 64], float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_multiply_add_nn_relu       1,220.86     0.32      1     NCHW16c    cpu0      NCHW32c     OIHW32i16o  21cb6d538731ba92           float32[1, 16, 7, 7, 32], float32[128, 16, 1, 1, 32, 16], float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_add_nn_relu                1,160.16     0.30      1     NCHW16c    cpu0       NCHW4c      OIHW4i16o  c7b912640028a9e2                                      float32[1, 64, 14, 14, 4], float32[64, 64, 1, 1, 4, 16], float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_8                    599.86     0.16      1    NCHW128c    cpu0      NCHW16c    OIHW16i128o  9b01f6479b89fd68                                                                float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16, 128], float32[1, 1, 1, 1, 128], float32[1, 1, 28, 28, 128]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_5                    579.10     0.15      1     NCHW16c    cpu0      NCHW64c     OIHW64i16o  dc31662fedbb8185                                                                  float32[1, 8, 28, 28, 64], float32[16, 8, 1, 1, 64, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 14, 14, 16]                                         
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2                    571.03     0.15      1     NCHW16c    cpu0      NCHW16c     OIHW16i16o  8d07031ff51d0737                                                                  float32[1, 64, 14, 14, 16], float32[32, 64, 1, 1, 16, 16], float32[1, 32, 1, 1, 16], float32[1, 32, 7, 7, 16]                                         
tvmgen_default_fused_add_nn_relu_3                                            519.35     0.14      2                cpu0                              e907ce81104cda7a                                                                                               float32[1, 16, 56, 56, 16], float32[1, 16, 1, 1, 16], float32[1, 16, 56, 56, 16]                                         
tvmgen_default_fused_nn_contrib_dense_pack_add                                488.16     0.13      1                cpu0                              7641a0cce9852143                                                                                                    float32[1, 2048], float32[40, 2048, 25], float32[1, 1000], float32[1, 1000]                      NC25n              
tvmgen_default_fused_add_nn_relu_2                                            360.30     0.09      3                cpu0                              0e82013d73aa68c1                                                                                                  float32[1, 8, 28, 28, 64], float32[1, 8, 1, 1, 64], float32[1, 8, 28, 28, 64]                                         
tvmgen_default_fused_nn_max_pool2d_add_nn_relu                                342.65     0.09      1                cpu0                              6f701a4fa071030f  NCHW16c                                                                                       float32[1, 4, 112, 112, 16], float32[1, 4, 1, 1, 16], float32[1, 4, 56, 56, 16]                                         
tvmgen_default_fused_add_nn_relu_1                                            291.22     0.08      5                cpu0                              f12067172f61c850                                                                                               float32[1, 64, 14, 14, 16], float32[1, 64, 1, 1, 16], float32[1, 64, 14, 14, 16]                                         
tvmgen_default_fused_layout_transform_2                                       106.25     0.03      3                cpu0                              b8cbb72b4035894d                                                                                                                           float32[1, 4, 28, 28, 32], float32[1, 16, 28, 28, 8]      NCHW8c                    NCHW32c  
tvmgen_default_fused_add_layout_transform                                      76.45     0.02      1                cpu0                              69355d3cc810f874                                                                                                          float32[1, 3, 224, 224], float32[3, 1, 1], float32[1, 1, 224, 224, 3]      NCHW3c                       NCHW  
tvmgen_default_fused_layout_transform_1                                        68.45     0.02      6                cpu0                              f5e631fb93d23d4d                                                                                                                          float32[1, 16, 14, 14, 16], float32[1, 64, 14, 14, 4]      NCHW4c                    NCHW16c  
tvmgen_default_fused_add_nn_relu                                               51.61     0.01      2                cpu0                              5d16c15878cc73d4                                                                                                float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16], float32[1, 128, 7, 7, 16]                                         
tvmgen_default_fused_nn_global_avg_pool2d                                      46.06     0.01      1                cpu0                              f18307e2786f4cb3  NCHW16c                                                                                                                  float32[1, 128, 7, 7, 16], float32[1, 128, 1, 1, 16]                                         
tvmgen_default_fused_layout_transform_3                                        36.66     0.01      1                cpu0                              2c5d64d5f9faa001                                                                                                                          float32[1, 1, 28, 28, 128], float32[1, 16, 28, 28, 8]      NCHW8c                   NCHW128c  
tvmgen_default_fused_layout_transform                                          11.41     0.00      3                cpu0                              add43c0d2d8a8a3c                                                                                                                             float32[1, 32, 7, 7, 16], float32[1, 16, 7, 7, 32]     NCHW32c                    NCHW16c  
tvmgen_default_fused_nn_softmax                                                 9.50     0.00      1                cpu0                              ca61e79ea24e53f0                                                                                                                                             float32[1, 1000], float32[1, 1000]                                         
tvmgen_default_fused_layout_transform_nn_batch_flatten                          1.05     0.00      1                cpu0                              2db99463d18696a4                                                                                                                                    float32[1, 128, 1, 1, 16], float32[1, 2048]        NCHW                    NCHW16c  
----------                                                                                                                                                                                                                                                                                                                                                                                    
Sum                                                                      3,82,269.07    99.80     84                                                                                                                                                                                                                                                                                          
Total                                                                    3,83,036.30               1                cpu0                                                                                           

(c) benchmark

Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_nopack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu -link-params=0 -mcpu=cascadelake, workload=('dense_pack.x86', ('TENSOR', (1, 2048), 'float32'), ('TENSOR', (1000, 2048), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
Evaluate inference time cost...
Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
  95.1157      95.0706      95.2259      95.0505       0.0784