Auto-tuning the VGG-16 network on V100: performance is not comparable with cuDNN

I used the tuning script to tune the VGG-16 network on a V100.
After the tuning process finished, I evaluated the average inference time. With batch size 128, the inference time is 123 ms, but when I use cuDNN as the inference backend it is 88 ms. Is this reasonable?
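Roughly, this is how I measure the end-to-end time (a sketch; `mod` and `params` come from the Relay frontend importer, and the exact script may differ):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_runtime

# Build the tuned module for CUDA (mod/params are the Relay module and
# weights produced by the frontend importer).
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target="cuda", params=params)

ctx = tvm.gpu(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("data", np.random.uniform(size=(128, 3, 224, 224)).astype("float32"))

# Average wall-clock time over repeated runs, converted to milliseconds.
ftimer = module.module.time_evaluator("run", ctx, number=10, repeat=3)
prof_res = np.array(ftimer().results) * 1000
print("Mean inference time: %.2f ms (std %.2f ms)" % (prof_res.mean(), prof_res.std()))
```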
I also profiled the network layer by layer (sketch below): some layers cost 6-24 ms each, and the expensive layers use the winograd algorithm. So has TVM implemented winograd kernels for int8 inference?
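For the per-layer numbers I use the debug runtime, which prints a per-operator time breakdown after `run()` (a sketch along the same lines as above, reusing `graph`, `lib`, and `params`):

```python
import numpy as np
import tvm
from tvm.contrib.debugger import debug_runtime

# Same graph/lib/params as in the timing sketch above; run() prints a
# table of per-layer execution times and dumps traces under dump_root.
ctx = tvm.gpu(0)
m = debug_runtime.create(graph, lib, ctx, dump_root="/tmp/tvmdbg")
m.set_input(**params)
m.set_input("data", np.random.uniform(size=(128, 3, 224, 224)).astype("float32"))
m.run()
```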
The following is my tuning log:

{"input": ["cuda -model=unknown", "dense_large_batch.cuda", [["TENSOR", [128, 4096], "float32"], ["TENSOR", [1000, 4096], "float32"], null, "float32"], {}], "config": {"index": 771496, "cod
e_hash": null, "entity": [["tile_x", "sp", [-1, 2, 4, 1]], ["tile_y", "sp", [-1, 2, 4, 1]], ["tile_k", "sp", [-1, 8, 2]]]}, "result": [[0.000854268221590909], 0, 1.6328115463256836, 1584199
993.2121232], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "dense_small_batch.cuda", [["TENSOR", [128, 4096], "float32"], ["TENSOR", [1000, 4096], "float32"], null, "float32"], {}], "config": {"index": 5, "code_has
h": null, "entity": [["tile_k", "sp", [-1, 32]]]}, "result": [[0.0023226218153846153], 0, 1.3888931274414062, 1584201048.038974], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "dense_large_batch.cuda", [["TENSOR", [128, 4096], "float32"], ["TENSOR", [4096, 4096], "float32"], null, "float32"], {}], "config": {"index": 2023458, "co
de_hash": null, "entity": [["tile_x", "sp", [-1, 8, 4, 1]], ["tile_y", "sp", [-1, 4, 4, 1]], ["tile_k", "sp", [-1, 2, 8]]]}, "result": [[0.0010716840142857142], 0, 1.9444470405578613, 15842
02598.4111032], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "dense_small_batch.cuda", [["TENSOR", [128, 4096], "float32"], ["TENSOR", [4096, 4096], "float32"], null, "float32"], {}], "config": {"index": 9, "code_has
h": null, "entity": [["tile_k", "sp", [-1, 512]]]}, "result": [[0.00977129945], 0, 1.507237195968628, 1584203706.1801147], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "dense_large_batch.cuda", [["TENSOR", [128, 25088], "float32"], ["TENSOR", [4096, 25088], "float32"], null, "float32"], {}], "config": {"index": 7647258, "
code_hash": null, "entity": [["tile_x", "sp", [-1, 8, 4, 1]], ["tile_y", "sp", [-1, 4, 4, 1]], ["tile_k", "sp", [-1, 1, 16]]]}, "result": [[0.0065043505625], 0, 2.2584891319274902, 15842047
03.5166712], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "dense_small_batch.cuda", [["TENSOR", [128, 25088], "float32"], ["TENSOR", [4096, 25088], "float32"], null, "float32"], {}], "config": {"index": 22, "code_
hash": null, "entity": [["tile_k", "sp", [-1, 896]]]}, "result": [[0.06021745125], 0, 4.591594219207764, 1584206158.1411397], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 512, 14, 14], "int8"], ["TENSOR", [512, 512, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "c
onfig": {"index": 3764134, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 8]], ["tile_x", "sp", [-1, 2, 16, 2]], ["tile_rc", "sp", [-1, 16]], ["a
uto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 0]]}, "result": [[0.00474776134375], 0, 1.705704927444458, 1584209204.3883464], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_NCHWc_int8.cuda", [["TENSOR", [128, 512, 14, 14], "int8"], ["TENSOR", [512, 512, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "int32"], {}
], "config": {"index": 13276589889, "code_hash": null, "entity": [["tile_n", "sp", [-1, 2, 2, 1]], ["tile_f", "sp", [-1, 1, 4, 2]], ["tile_y", "sp", [-1, 2, 1, 1]], ["tile_x", "sp", [-1, 1,
 14, 1]], ["fuse_yx", "ot", 1], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 1]], ["reorder_inner", "re", [1, 0, 2]], ["AA_double_buffer", "ot", 1], ["WW_d
ouble_buffer", "ot", 0], ["auto_unroll_max_step", "ot", 1500]]}, "result": [[0.002698205803571429], 0, 1.6985657215118408, 1584212343.3395205], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 512, 28, 28], "int8"], ["TENSOR", [512, 512, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "c
onfig": {"index": 11647419, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 16]], ["tile_x", "sp", [-1, 4, 16, 1]], ["tile_rc", "sp", [-1, 16]], [
"auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}, "result": [[0.0186427012], 0, 2.7832956314086914, 1584216768.1764288], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_NCHWc_int8.cuda", [["TENSOR", [128, 512, 28, 28], "int8"], ["TENSOR", [512, 512, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "int32"], {}], "config": {"index": 62328976330, "code_hash": null, "entity": [["tile_n", "sp", [-1, 4, 2, 1]], ["tile_f", "sp", [-1, 2, 4, 1]], ["tile_y", "sp", [-1, 2, 1, 1]], ["tile_x", "sp", [-1, 1, 4, 1]], ["fuse_yx", "ot", 1], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 1]], ["reorder_inner", "re", [0, 1, 2]], ["AA_double_buffer", "ot", 1], ["WW_double_buffer", "ot", 1], ["auto_unroll_max_step", "ot", 512]]}, "result": [[0.0094804296], 0, 2.0820913314819336, 1584220699.3217854], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 256, 28, 28], "int8"], ["TENSOR", [512, 256, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "config": {"index": 10679419, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 16]], ["tile_x", "sp", [-1, 4, 16, 1]], ["tile_rc", "sp", [-1, 16]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}, "result": [[0.0101890489], 0, 2.056152820587158, 1584223572.0578058], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_NCHWc_int8.cuda", [["TENSOR", [128, 256, 28, 28], "int8"], ["TENSOR", [512, 256, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "int32"], {}], "config": {"index": 87419540302, "code_hash": null, "entity": [["tile_n", "sp", [-1, 2, 8, 1]], ["tile_f", "sp", [-1, 1, 4, 2]], ["tile_y", "sp", [-1, 2, 1, 1]], ["tile_x", "sp", [-1, 1, 4, 1]], ["fuse_yx", "ot", 0], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["reorder_inner", "re", [0, 2, 1]], ["AA_double_buffer", "ot", 1], ["WW_double_buffer", "ot", 1], ["auto_unroll_max_step", "ot", 1500]]}, "result": [[0.00475415871875], 0, 2.022735834121704, 1584230640.6202025], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 256, 56, 56], "int8"], ["TENSOR", [256, 256, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "c
onfig": {"index": 7646572, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 16]], ["tile_x", "sp", [-1, 4, 16, 1]], ["tile_rc", "sp", [-1, 8]], ["a
uto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 0]]}, "result": [[0.0130612322], 0, 3.975372552871704, 1584235044.9110425], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_NCHWc_int8.cuda", [["TENSOR", [128, 256, 56, 56], "int8"], ["TENSOR", [256, 256, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "int32"], {}
], "config": {"index": 237551421360, "code_hash": null, "entity": [["tile_n", "sp", [-1, 1, 1, 1]], ["tile_f", "sp", [-1, 1, 16, 1]], ["tile_y", "sp", [-1, 2, 1, 4]], ["tile_x", "sp", [-1,
1, 8, 1]], ["fuse_yx", "ot", 0], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["reorder_inner", "re", [2, 1, 0]], ["AA_double_buffer", "ot", 0], ["WW_
double_buffer", "ot", 1], ["auto_unroll_max_step", "ot", 1500]]}, "result": [[0.00923989125], 0, 1.872157096862793, 1584241014.4781325], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 128, 56, 56], "int8"], ["TENSOR", [256, 128, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "c
onfig": {"index": 15632531, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 4, 8, 4]], ["tile_x", "sp", [-1, 4, 16, 1]], ["tile_rc", "sp", [-1, 8]], ["a
uto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[0.00784787015], 0, 3.945775032043457, 1584243125.3562002], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 128, 112, 112], "int8"], ["TENSOR", [128, 128, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "config": {"index": 18803361, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 4, 8, 4]], ["tile_x", "sp", [-1, 4, 16, 1]], ["tile_rc", "sp", [-1, 8]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[0.0166608162], 0, 4.897033214569092, 1584250572.05043], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_NCHWc_int8.cuda", [["TENSOR", [128, 128, 112, 112], "int8"], ["TENSOR", [128, 128, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "int32"], {}], "config": {"index": 390429495240, "code_hash": null, "entity": [["tile_n", "sp", [-1, 1, 1, 1]], ["tile_f", "sp", [-1, 1, 8, 1]], ["tile_y", "sp", [-1, 4, 1, 2]], ["tile_x", "sp", [-1, 1, 16, 1]], ["fuse_yx", "ot", 0], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["reorder_inner", "re", [0, 2, 1]], ["AA_double_buffer", "ot", 0], ["WW_double_buffer", "ot", 1], ["auto_unroll_max_step", "ot", 1500]]}, "result": [[0.009707265900000001], 0, 2.054020404815674, 1584254153.2729812], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 64, 112, 112], "int8"], ["TENSOR", [128, 64, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "config": {"index": 2641779, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 16, 8]], ["tile_x", "sp", [-1, 4, 16, 1]], ["tile_rc", "sp", [-1, 64]], ["auto_unroll_max_step", "ot", 0], ["unroll_explicit", "ot", 0]]}, "result": [[0.0108518186], 0, 3.3802666664123535, 1584259767.1725311], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_NCHWc_int8.cuda", [["TENSOR", [128, 64, 112, 112], "int8"], ["TENSOR", [128, 64, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "int32"], {}], "config": {"index": 227896873596, "code_hash": null, "entity": [["tile_n", "sp", [-1, 1, 1, 2]], ["tile_f", "sp", [-1, 4, 4, 1]], ["tile_y", "sp", [-1, 1, 1, 2]], ["tile_x", "sp", [-1, 1, 16, 1]], ["fuse_yx", "ot", 0], ["tile_rc", "sp", [-1, 1]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 1]], ["reorder_inner", "re", [0, 2, 1]], ["AA_double_buffer", "ot", 1], ["WW_double_buffer", "ot", 1], ["auto_unroll_max_step", "ot", 512]]}, "result": [[0.00540158721875], 0, 2.051076650619507, 1584263114.3485994], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 64, 224, 224], "int8"], ["TENSOR", [64, 64, 3, 3], "int8"], [1, 1], [1, 1, 1, 1], [1, 1], "int32"], {}], "config": {"index": 14594989, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 8]], ["tile_x", "sp", [-1, 8, 8, 1]], ["tile_rc", "sp", [-1, 8]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 1]]}, "result": [[0.02666643035], 0, 4.514398574829102, 1584269255.3867397], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw_winograd.cuda", [["TENSOR", [128, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 3785449, "code_hash": null, "entity": [["tile_b", "sp", [-1, 1, 1, 1]], ["tile_y", "sp", [-1, 1, 8, 8]], ["tile_x", "sp", [-1, 16, 32, 1]], ["tile_rc", "sp", [-1, 1]], ["auto_unroll_max_step", "ot", 128], ["unroll_explicit", "ot", 1]]}, "result": [[0.01378827145], 0, 3.3212788105010986, 1584277466.8299236], "version": 0.2, "tvm_version": "0.7.dev1"}
{"input": ["cuda -model=unknown", "conv2d_nchw.cuda", [["TENSOR", [128, 3, 224, 224], "float32"], ["TENSOR", [64, 3, 3, 3], "float32"], [1, 1], [1, 1, 1, 1], [1, 1], "float32"], {}], "config": {"index": 199135959, "code_hash": null, "entity": [["tile_f", "sp", [-1, 4, 4, 1]], ["tile_y", "sp", [-1, 2, 2, 2]], ["tile_x", "sp", [-1, 1, 32, 1]], ["tile_rc", "sp", [-1, 3]], ["tile_ry", "sp", [-1, 3]], ["tile_rx", "sp", [-1, 3]], ["auto_unroll_max_step", "ot", 1500], ["unroll_explicit", "ot", 1]]}, "result": [[0.0024414313263157895], 0, 2.1771702766418457, 1584280846.783762], "version": 0.2, "tvm_version": "0.7.dev1"}

Sorry, I made a mistake: the cuDNN performance I gave was not correct. Inference with cuDNN costs 88 ms on V100, so TVM is a little slower than cuDNN, but this is reasonable.

Hi xiaocenxiaocen,

I have tuned VGG16 on V100 with Tensor Core support from [RFC][Tensor Core] Optimization of CNNs on Tensor Core.

But the results show the following:

Network     TF        TF+XLA    TVM_TensorCore
VGG16       29.87 ms  18.56 ms  28.06 ms
Resnet50    12 ms     10.8 ms   10.5 ms

All networks are run with [data_type: FP16, data_format: NHWC, batch_size: 32]. The backend of TF and TF+XLA is cuDNN.

The results show that TVM_TensorCore performs quite badly compared with TF+XLA on VGG16, while the Resnet50 performance is comparable with TF+XLA+cuDNN. This confuses me a lot.

Do you have any idea why the VGG16 performance is so poor?

Maybe the convolution layers of TF+XLA+cuDNN in VGG16 use winograd convolution, while the TVM Tensor Core schedules don't implement the winograd algorithm. You can use nvprof to check whether cuDNN used the winograd implementations, like:

nvprof python3 network-run-script.py
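For example, to pick out just the winograd kernels (nvprof prints its summary to stderr, so redirect it before filtering):

nvprof python3 network-run-script.py 2>&1 | grep -i winograd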

Thanks for your reply.

I used the environment variable TF_ENABLE_WINOGRAD_NONFUSED=0 to disable the non-fused winograd convolution algorithm, and it made no difference to the VGG16 performance.
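For reference, that corresponds to prefixing the benchmark command (the same one shown in the profiling log below) with the variable:

TF_ENABLE_WINOGRAD_NONFUSED=0 python tf_cnn_benchmarks.py --batch_size=32 --data_format=NHWC --device=gpu --forward_only=true --model=vgg16 --use_fp16=true --xla=true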

Below is the nvprof log:

==10817== Profiling application: python tf_cnn_benchmarks.py --batch_size=32 --data_format=NHWC --device=gpu --forward_only=true --model=vgg16 --use_fp16=true --xla=true
==10817== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   32.91%  843.88ms       776  1.0875ms  557.98us  1.7684ms  volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc_tn_v1
                   11.96%  306.74ms       222  1.3817ms  1.3033ms  1.4598ms  Volta_hmma_implicit_gemm_fprop_fp32_nhwc_256x128x32x1_1x3x3x0x1
                   11.64%  298.48ms     87808  3.3990us  1.2800us  5.6000us  [CUDA memcpy HtoD]
                    8.45%  216.68ms       117  1.8520ms  1.1833ms  26.497ms  void cudnn::detail::convolve_common_engine_float_NHWC<__half, __half, int=1024, int=5, int=5, int=3, int=3, int=3, bool=1, bool=1, bool=0, bool=0, bool=0>(int, int, int, __half const *, __half const , int, __half*, conv_kernel_common_params, int, float, float, int, __half, __half const * const *)
                    7.51%  192.58ms       222  867.49us  377.02us  1.3728ms  Volta_hmma_implicit_gemm_fprop_fp32_nhwc_128x128x64x1_1x3x3x0x1
                    6.49%  166.39ms        24  6.9330ms  3.0450ms  9.9014ms  void cudnn::detail::convolve_common_engine_float_NHWC<__half, __half, int=512, int=6, int=8, int=3, int=3, int=5, bool=1, bool=1, bool=0, bool=0, bool=0>(int, int, int, __half const *, __half const , int, __half*, conv_kernel_common_params, int, float, float, int, __half, __half const * const *)
                    3.41%  87.389ms       111  787.29us  771.64us  800.15us  volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc_tn_v1
                    3.23%  82.830ms       110  753.00us  750.68us  756.57us  convert_70
                    2.14%  54.946ms      1280  42.926us  41.535us  48.800us  redzone_checker
                    2.12%  54.471ms         6  9.0785ms  6.1164ms  12.043ms  void cudnn::detail::convolve_common_engine_float_NHWC<__half, __half, int=1024, int=6, int=7, int=3, int=3, int=5, bool=1, bool=1, bool=0, bool=0, bool=0>(int, int, int, __half const *, __half const , int, __half*, conv_kernel_common_params, int, float, float, int, __half, __half const * const *)
                    1.80%  46.047ms       110  418.61us  415.58us  422.24us  fusion_26
                    1.53%  39.307ms       226  173.92us  64.575us  681.72us  volta_fp16_s884gemm_fp16_128x64_ldg8_f2f_nn

==8400== Profiling result:
==8400== Metric result:
Invocations                               Metric Name                           Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: convert_155
        110           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: convert_83
        110           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: convert_85
        110           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: volta_fp16_s884cudnn_fp16_128x128_ldg8_relu_f2f_exp_small_nhwc_tn_v1
          1           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    High (7)    High (7)    High (7)
    Kernel: volta_fp16_sgemm_fp16_32x128_nn
          6           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: volta_fp16_sgemm_fp16_128x32_sliced1x4_nn
          6           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<Eigen::half, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_left<Eigen::half, Eigen::half, Eigen::internal::scalar_sum_op<Eigen::half, Eigen::half>, bool=0>, Eigen::TensorMap<Eigen::Tensor<Eigen::half const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(Eigen::half, int=1)
          1           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: void cudnn::detail::convolve_common_engine_float_NHWC<__half, __half, int=1024, int=6, int=7, int=3, int=3, int=5, bool=1, bool=1, bool=0, bool=0, bool=0>(int, int, int, __half const *, __half const , int, __half*, conv_kernel_common_params, int, float, float, int, __half, __half const * const *)
        116           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: volta_fp16_s884gemm_fp16_64x128_ldg8_f2f_nn
          4           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Mid (5)    High (7)     Mid (5)
    Kernel: volta_fp16_sgemm_fp16_64x32_sliced1x4_nn
          8           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: volta_fp16_s884cudnn_fp16_256x64_ldg8_relu_f2f_exp_small_nhwc_tn_v1
        226           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    High (7)    High (8)    High (7)
    Kernel: void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<Eigen::half, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<Eigen::half, Eigen::half, Eigen::internal::scalar_product_op<Eigen::half, Eigen::half>, bool=0>, Eigen::TensorMap<Eigen::Tensor<Eigen::half const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(Eigen::half, int=1)
          1           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization    Idle (0)    Idle (0)    Idle (0)
    Kernel: volta_fp16_s884gemm_fp16_256x64_ldg8_f2f_nn
          4           tensor_precision_fu_utilization   Tensor-Precision Function Unit Utilization     Low (3)    High (7)     Mid (5)
    Kernel: convert_163

It looks like the convolution layers of TF+XLA+cuDNN in VGG16 don't use winograd convolution.

Hi keai007,

We keep working on optimizing TVM on Tensor Core GPUs. Could you please share your code and models? We will look into this issue.

Thanks,
Shawn Wu

Hi xiaocenxiaocen,

Currently TVM does not support int8 inference on Tensor Cores. We have made progress internally on Tensor Core int8 inference, and we will check this issue.

Best wishes,
Shawn Wu

Hi keai007,

Updated VGG16 performance on V100 with Tensor Core and winograd: 16.23 ms. Please enable winograd when running VGG16, since it contains many 3x3 convolutions.
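As a rough illustration of what "enabling winograd" means in the stock AutoTVM flow (a sketch, not our internal Tensor Core code; `mod`/`params` are your Relay module and weights): when conv2d tasks are extracted for CUDA, winograd templates such as conv2d_nchw_winograd.cuda appear alongside the direct schedules, and after tuning the compiler picks whichever config is fastest for each layer.

```python
from tvm import autotvm, relay

# Extract conv2d tuning tasks; on CUDA the task list includes winograd
# templates (e.g. conv2d_nchw_winograd.cuda) as alternatives to the
# direct conv2d schedules, so tuning them lets the compiler choose.
tasks = autotvm.task.extract_from_program(
    mod["main"], target="cuda", params=params,
    ops=(relay.op.get("nn.conv2d"),),
)
```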

Best wishes, Shawn


Hi Shawn_Inspur,

Method                    VGG-16
TVM+TensorCore            23.64 ms
TVM+TensorCore+Winograd   18.00 ms
TF                        27.30 ms
TF+XLA                    17.16 ms

batchSize=32, dtype=float16, data_format=NHWC

Thanks for your efforts! The winograd schedule really helps a lot. The performance of VGG16 tuned by AutoTVM is now comparable with TF+XLA+cuDNN.
