Sparse_conv2d runs slower than normal conv2d

In Sparse Conv2d Implementation for 3x3 kernels, the conclusion is that we can get a speedup at every sparsity rate, as shown in the picture below:

But when I try it on my own simple model like this:

import torch.nn as nn

class LeNet(nn.Module):

    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 1, 3, padding=1, bias=False)

    def forward(self, x):
        x = self.conv1(x)
        return x

My CPU info: Intel® Core™ i5-2500 CPU @ 3.30GHz
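
For context, I get the Relay module `mod` roughly like this, via the standard PyTorch frontend (a sketch; the input name "input0" is just a placeholder I chose):

import torch
from tvm import relay

model = LeNet().eval()
example = torch.randn(1, 1, 28, 28)
scripted = torch.jit.trace(model, example)

# Import the traced model into Relay.
mod, params = relay.frontend.from_pytorch(scripted, [("input0", (1, 1, 28, 28))])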

I apply the transform below to turn it into nn.sparse_conv2d for 3x3 kernels:

newfunc = relay.data_dep_optimization.utils._run_opt_pass(
   mod,
   # layout, kernel_size, bsr_height, bsr_width, sparsity_threshold
   relay.transform._ffi_api.Conv2dToSparse2("NCHW", 3, 1, 1, 0.4)
)
sp_mod = tvm.ir.IRModule.from_expr(newfunc)

I found that once the input size is larger than (1, 1, 28, 28), no matter how sparsely I (randomly) prune the weights, the runtime of sparse_conv2d is always larger than that of normal conv2d.
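
By "randomly prune" I mean essentially this (a sketch; the sparsity value is whatever rate I am testing):

import torch

def random_prune_(weight, sparsity):
    # Zero out a random fraction `sparsity` of the elements in place.
    mask = torch.rand_like(weight) >= sparsity
    weight.data *= mask

random_prune_(model.conv1.weight, 0.8)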

I traced the code to the part that performs sparse_conv2d for 3x3 kernels. It uses the im2col method, which creates a big tensor that stores the input window of every convolution step column by column, and then does a matrix multiplication with the kernel.

What confuses me is that as the input size gets larger, the time spent loading and storing this big tensor should also grow, so it would not be faster than the original conv2d.

How does the result in Sparse Conv2d Implementation for 3x3 kernels still show a speedup?

@comaniac I just finished listening to your talk at our seminar at NTHU. I hope you can give me some advice, or correct me if I have any misunderstanding. Thanks!

Thanks for attending, but unfortunately I’m not familiar with this implementation. I believe @jcf94 knows the details, as he is one of the main contributors of the sparse support. @jcf94 could you help clarify?

Thanks! @comaniac @karta1485825

I’m sorry, but I have not tested this implementation either. Maybe we can call in the author of that PR for help. @Tantalus13A98B5F

Hi @karta1485825, thank you for trying out this feature! I can see a few possible reasons for your unexpected performance result:

  1. Our tests basically used a larger data volume, from (10, 64, 56, 56) to (10, 512, 7, 7). Although a small data size leads to faster execution, it does not necessarily lead to acceleration, since many optimizations work by making things more complex first. I’d encourage you to try a larger data size.

  2. For the sparsity pattern, we recommend 1xN BSR, where N is picked to fill up the vectorization unit you are using (see the sketch right after this list). IMO acceleration on 1x1 BSR or CSR is hard, if not impossible.
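
To make the 1xN BSR recommendation concrete, here is a small illustration with scipy (scipy is only for illustration; TVM keeps the same three arrays, data/indices/indptr, as separate parameters, and the bsr_height/bsr_width arguments may map onto this transposed):

import numpy as np
from scipy.sparse import bsr_matrix

# A flattened conv weight of shape (out_channels, in_channels * kh * kw),
# pruned so that zeros come in aligned 1x16 blocks.
w = np.zeros((64, 576), dtype="float32")
w[0, :16] = 1.0  # one surviving 1x16 block

sp = bsr_matrix(w, blocksize=(1, 16))
print(sp.data.shape)    # (1, 1, 16): only nonzero blocks are stored
print(sp.indices)       # block-column index of each stored block
print(sp.indptr.shape)  # (65,): per-block-row pointers

With N=16, a whole block of float32 fills one 512-bit vector operation, while with 1x1 blocks (or CSR) every stored element pays its index overhead individually; that is why acceleration there is hard.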

Also, I’d like to explain im2col a bit. Although it creates what looks like a big tensor, in TVM we have schedules, and multi-level tiling can make it fit into the cache. In the case of convolution, im2col is a common trick to improve data continuity, which facilitates further optimizations such as vectorization, Winograd, and, here, sparse computation.
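
As a concrete picture of the idea (a minimal numpy sketch of im2col itself, not of the TVM schedule):

import numpy as np

def conv2d_3x3_im2col(x, w):
    # x: (C, H, W) input; w: (F, C, 3, 3) kernel; stride 1, 'same' padding.
    C, H, W = x.shape
    F = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    # One column per output pixel, holding that pixel's 3x3xC input patch.
    cols = np.empty((C * 9, H * W), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            cols[:, i * W + j] = xp[:, i:i + 3, j:j + 3].ravel()
    # The kernel flattens to (F, C*9); if it is sparse, this matmul is a
    # sparse-matrix x dense-matrix product.
    return (w.reshape(F, -1) @ cols).reshape(F, H, W)

The point is that `cols` need not be materialized whole: the schedule tiles the matmul so each tile of `cols` is produced and consumed while it is still in cache.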


@comaniac @jcf94 Thanks!

And @Tantalus13A98B5F, thanks for your advice. That really helps me a lot.

I’ll give it some tries following this advice, thanks!

@Tantalus13A98B5F Sorry to bother you. I’ve taken your advice to try a larger data size and also the 1xN BSR pattern, but the result still shows that sparse_conv2d is slower.

I first prune the weights randomly, then do the transformation below:

newfunc = relay.data_dep_optimization.utils._run_opt_pass(
   mod,
   # layout, kernel_size, bsr_height, bsr_width, sparsity_threshold
   relay.transform._ffi_api.Conv2dToSparse2("NCHW", 3, 16, 1, 0)
)
sp_mod = tvm.ir.IRModule.from_expr(newfunc)
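
Concretely, the pruning step can be sketched like this (I zero out random whole blocks so that the BSR conversion finds structured sparsity; the (16, 1) block shape mirrors the bsr_height/bsr_width arguments above, and whether it should be transposed is an assumption of mine):

import torch

def block_prune_(conv_weight, sparsity, block=(16, 1)):
    # Zero random (bh, bw) blocks of the flattened (out, in*kh*kw) weight.
    # Assumes out_channels is divisible by bh, as in resnet18.
    out_c = conv_weight.shape[0]
    w = conv_weight.data.view(out_c, -1)  # view: edits happen in place
    bh, bw = block
    mask = torch.rand(out_c // bh, w.shape[1] // bw) >= sparsity
    # Expand the per-block keep/drop decisions back to per-element.
    mask = mask.repeat_interleave(bh, 0).repeat_interleave(bw, 1)
    w *= mask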

I use resnet18 as my model and (10, 3, 224, 224) as my input size, which is almost the same setup as in the PR.

I also tried several different values of N for the 1xN BSR, and the result shows it is about 10 or more times slower than the original model without the sparse_conv2d transform.

So I am still confused about how to gain the speedup. Do I need to do the scheduling or tiling work myself to gain the speedup, or am I misunderstanding something?

Another thing that confuses me is the first point in your reply. In my opinion, a speedup on a large data size should build on a speedup on a small data size, so I cannot understand why you encourage me to try a big data size directly to determine whether there is acceleration or not.

A few questions:

  1. What is the sparsity of your model? If it is below 40%, it is expected to be slower than the dense version.
  2. Is your dense baseline multithreaded? The PR aims to improve the single-threaded performance.

"So I am still confused about how to gain the speedup. Do I need to do the scheduling or tiling work myself to gain the speedup, or am I misunderstanding something?"

No. You just go through auto-tuning and the ordinary build process for TVM models; nothing beyond these is required. Scheduling is handled inside the build pipeline. You may also want to share some of your auto-tuning results.
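
For reference, the standard AutoTVM loop looks roughly like this (a sketch of the usual workflow; target, trial counts, and file names are placeholders, with sp_mod/params as in your snippet):

from tvm import autotvm

tasks = autotvm.task.extract_from_program(sp_mod["main"], target="llvm", params=params)

for tsk in tasks:
    tuner = autotvm.tuner.XGBTuner(tsk)
    tuner.tune(
        n_trial=min(1000, len(tsk.config_space)),
        measure_option=autotvm.measure_option(
            builder="local",
            runner=autotvm.LocalRunner(timeout=20, min_repeat_ms=200),
        ),
        callbacks=[autotvm.callback.log_to_file("test_sparse.log")],
    )

# Keep only the best record per workload.
autotvm.record.pick_best("test_sparse.log", "test_sparse.best.log")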

"a speedup on a large data size should build on a speedup on a small data size"

Not necessarily. For example, vectorization on x86 chips comes with performance penalties; if the data volume is not large enough, the acceleration is then diminished by those penalties.

@Tantalus13A98B5F Thanks for the reply.

  1. Yes, I pruned it to 80% sparsity.
  2. Yes, I have tried that, but it is still about 3~5 times slower than the dense version.

Sorry for being unclear on the concept of auto-tuning. I use the same tuning method as in the sample usage, and I wonder whether the cell shown below is meant to be skipped? The tuning part seems to be there, but it is skipped in the sample usage.

Sorry for the confusion. I ran this cell the first time, got the test_sparse.best.log file, and skipped it for all later runs. Have you got that file with meaningful content? For the same op, the same shape, and the same target, you only need to do the auto-tuning once.
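
In code, reusing the log on every later build looks like this (a sketch; sp_mod/params as defined earlier):

import tvm
from tvm import autotvm, relay

with autotvm.apply_history_best("test_sparse.best.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(sp_mod, target="llvm", params=params)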

For performance diagnostics, you may want to share your tuning results and the per-op timing breakdown given by the debug executor.
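
To get that breakdown (a sketch with the debug executor, continuing from the build above; the input name "input0" is an assumption):

import numpy as np
import tvm
from tvm.contrib.debugger import debug_executor

dev = tvm.cpu()
m = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev)
m.set_input("input0", np.random.rand(10, 3, 224, 224).astype("float32"))
m.set_input(**lib.get_params())
m.run()  # runs with per-op timing and dumps a breakdown table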

@Tantalus13A98B5F My computer froze the first time I ran the auto-tuning cell, and I didn’t get the log file. So I adjusted nsamples and the LocalRunner timeout to half of the sample usage’s values, like this:

opts = autotvm.measure_option(
    builder='local',
    runner=autotvm.LocalRunner(timeout=10, min_repeat_ms=100),
)

nsamples = min(500, len(tsk.config_space))

Then I ran it again and checked the result after a night. Though it still seems to have frozen my computer, I got test_sparse.best.log this time. Do I need to share the log file for diagnostics?

After tuning, I tested the runtime and compared it with the dense version. This time I got a better result, just a little bit slower than the dense version. Does that mean I can get the speedup if I use a larger nsamples for tuning?

And here is my tuning result:

[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:103: Iteration: 0
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #0 tvmgen_default_fused_layout_transform: 536.916 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #1 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu: 107332 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #2 tvmgen_default_fused_nn_max_pool2d: 2605.97 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #3 tvmgen_default_fused_layout_transform_1: 1879.01 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #4 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu: 109441 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #5 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu: 109260 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #6 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_1: 106321 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #7 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu1: 109353 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #8 tvmgen_default_fused_layout_transform_2: 1130 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #9 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1: 32443.6 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #10 tvmgen_default_fused_layout_transform_3: 829.159 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #11 tvmgen_default_fused_layout_transform_4: 781.847 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #12 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add: 3953.99 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #13 tvmgen_default_fused_layout_transform_5: 658.711 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #14 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_1: 104458 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #15 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_2: 105394 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #16 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_2: 100436 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #17 tvmgen_default_fused_layout_transform_6: 402.577 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #18 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2: 31527.2 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #19 tvmgen_default_fused_layout_transform_7: 184.404 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #20 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_1: 3710.73 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #21 tvmgen_default_fused_layout_transform_71: 121.917 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #22 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_3: 103666 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #23 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_3: 103309 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #24 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_4: 103671 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #25 tvmgen_default_fused_layout_transform_8: 218.042 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #26 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3: 31511.9 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #27 tvmgen_default_fused_layout_transform_9: 67.794 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #28 tvmgen_default_fused_layout_transform_10: 232.413 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #29 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_2: 3610.98 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #30 tvmgen_default_fused_layout_transform_91: 57.2552 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #31 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_5: 309960 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #32 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_4: 307926 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #33 tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_51: 298952 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #34 tvmgen_default_fused_nn_adaptive_avg_pool2d: 91.7728 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #35 reshape_nop: 0.2193 us/iter
[12:57:28] /home/wsko/tvm/src/runtime/graph_executor/debug/graph_executor_debug.cc:108: Op #36 tvmgen_default_fused_nn_dense_add: 276.49 us/iter

Node Name                                                          Ops                                                               Time(us)     Time(%)  Shape                  Inputs  Outputs  
---------                                                          ---                                                               --------     -------  -----                  ------  -------  
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_5   tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_5  309960.0     14.113   (10, 512, 7, 7)        7       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_4       tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_4      307926.0     14.02    (10, 512, 7, 7)        6       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_51  tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_5  298952.0     13.612   (10, 512, 7, 7)        7       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu         tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu        109441.0     4.983    (10, 64, 56, 56)       6       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu1    tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu    109353.0     4.979    (10, 64, 56, 56)       7       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu     tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu    109260.0     4.975    (10, 64, 56, 56)       7       1        
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu           tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu          107332.0     4.887    (10, 2, 112, 112, 32)  3       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_1       tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_1      106321.0     4.841    (10, 64, 56, 56)       6       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_2       tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_2      105394.0     4.799    (10, 128, 28, 28)      6       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_1   tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_1  104458.0     4.756    (10, 128, 28, 28)      7       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_4   tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_4  103671.0     4.72     (10, 256, 14, 14)      7       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_3   tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_3  103666.0     4.72     (10, 256, 14, 14)      7       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_3       tvmgen_default_fused_nn_sparse_conv2d_multiply_add_nn_relu_3      103309.0     4.704    (10, 256, 14, 14)      6       1        
tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_2   tvmgen_default_fused_nn_sparse_conv2d_multiply_add_add_nn_relu_2  100436.0     4.573    (10, 128, 28, 28)      7       1        
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1         tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_1        32443.6      1.477    (10, 2, 28, 28, 64)    3       1        
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2         tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_2        31527.2      1.435    (10, 8, 14, 14, 32)    3       1        
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3         tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_nn_relu_3        31511.9      1.435    (10, 16, 7, 7, 32)     3       1        
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add                   tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add                  3953.99      0.18     (10, 4, 28, 28, 32)    3       1        
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_1                 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_1                3710.73      0.169    (10, 8, 14, 14, 32)    3       1        
tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_2                 tvmgen_default_fused_nn_contrib_conv2d_NCHWc_add_2                3610.98      0.164    (10, 16, 7, 7, 32)     3       1        
tvmgen_default_fused_nn_max_pool2d                                 tvmgen_default_fused_nn_max_pool2d                                2605.97      0.119    (10, 2, 56, 56, 32)    1       1        
tvmgen_default_fused_layout_transform_1                            tvmgen_default_fused_layout_transform_1                           1879.01      0.086    (10, 64, 56, 56)       1       1        
tvmgen_default_fused_layout_transform_2                            tvmgen_default_fused_layout_transform_2                           1130.0       0.051    (10, 2, 56, 56, 32)    1       1        
tvmgen_default_fused_layout_transform_3                            tvmgen_default_fused_layout_transform_3                           829.159      0.038    (10, 128, 28, 28)      1       1        
tvmgen_default_fused_layout_transform_4                            tvmgen_default_fused_layout_transform_4                           781.847      0.036    (10, 8, 56, 56, 8)     1       1        
tvmgen_default_fused_layout_transform_5                            tvmgen_default_fused_layout_transform_5                           658.711      0.03     (10, 128, 28, 28)      1       1        
tvmgen_default_fused_layout_transform                              tvmgen_default_fused_layout_transform                             536.916      0.024    (10, 1, 224, 224, 3)   1       1        
tvmgen_default_fused_layout_transform_6                            tvmgen_default_fused_layout_transform_6                           402.577      0.018    (10, 4, 28, 28, 32)    1       1        
tvmgen_default_fused_nn_dense_add                                  tvmgen_default_fused_nn_dense_add                                 276.49       0.013    (10, 1000)             3       1        
tvmgen_default_fused_layout_transform_10                           tvmgen_default_fused_layout_transform_10                          232.413      0.011    (10, 8, 14, 14, 32)    1       1        
tvmgen_default_fused_layout_transform_8                            tvmgen_default_fused_layout_transform_8                           218.042      0.01     (10, 1, 14, 14, 256)   1       1        
tvmgen_default_fused_layout_transform_7                            tvmgen_default_fused_layout_transform_7                           184.404      0.008    (10, 256, 14, 14)      1       1        
tvmgen_default_fused_layout_transform_71                           tvmgen_default_fused_layout_transform_7                           121.917      0.006    (10, 256, 14, 14)      1       1        
tvmgen_default_fused_nn_adaptive_avg_pool2d                        tvmgen_default_fused_nn_adaptive_avg_pool2d                       91.773       0.004    (10, 512, 1, 1)        1       1        
tvmgen_default_fused_layout_transform_9                            tvmgen_default_fused_layout_transform_9                           67.794       0.003    (10, 512, 7, 7)        1       1        
tvmgen_default_fused_layout_transform_91                           tvmgen_default_fused_layout_transform_9                           57.255       0.003    (10, 512, 7, 7)        1       1        
reshape_nop                                                        __nop                                                             0.219        0.0      (10, 512)              1       1        
Total_time                                                         -                                                                 2196311.897  -        -                      -       -

That is the common case, and you can try it out, keeping in mind that tuning a sparse operator is somewhat more time-consuming than a dense one. But looking at the timing:

It is a known drawback that NCHW does not perform well on shapes like these. For best performance, I recommend running NHWC sparse operators and NCHWc dense operators in a mixed way, which is also what the sample usage does (see the different-sparsity section).
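
On the code side, the NHWC part can be set up roughly like this (a sketch: ConvertLayout is the standard Relay pass, but note it converts every conv2d here, while the sample usage mixes layouts per-op so the dense convolutions can still become NCHWc during the build):

import tvm
from tvm import relay

# Move conv2d ops to NHWC before the sparse conversion.
seq = tvm.transform.Sequential([
    relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "default"]}),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)

newfunc = relay.data_dep_optimization.utils._run_opt_pass(
    mod,
    # layout, kernel_size, bsr_height, bsr_width, sparsity_threshold
    relay.transform._ffi_api.Conv2dToSparse2("NHWC", 3, 16, 1, 0.4)
)
sp_mod = tvm.ir.IRModule.from_expr(newfunc)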

@Tantalus13A98B5F I followed your recommendation and finally got a speedup. Thanks a lot!

But there is still something I want to ask: why do you focus on single-threaded performance? Given that this method currently shows no speedup in the multi-threaded case, is there some concept in this improvement that we can build on to improve the multi-threaded case?

It is simply not enabled. You can try it out yourself by playing with the scheduling code :slight_smile:

To start with, we assume that sparsity comes into play where computing resources are limited, i.e. where threading may not be feasible. Also, given the reduced amount of arithmetic, it is hard to say whether, and to what extent, threading would still be beneficial. But anyway, I think it is worth trying.