You are right. Thank you for figuring out the bug.
That would be my fault: I focused on the classical workloads (e.g. ResNet) but forgot to test large shapes. It’s easy to fix. Can you please create a PR?
Thanks for your efforts on supporting TensorCore in TVM.
I have tuned TensorCore kernels on classical networks such as ResNet-50 and VGG-16 (batch size 32), and the tensor_precision_fu_utilization metric reported by nvprof shows Mid/Low utilization of the Tensor Cores:
Metric: tensor_precision_fu_utilization (Tensor-Precision Function Unit Utilization)

Kernel                                   Invocations   Min        Max        Avg
fused_nn_conv2d_add_nn_relu_2_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_softmax_kernel3                 2             Idle (0)   Idle (0)   Idle (0)
fused_nn_conv2d_add_nn_relu_3_kernel0    4             Mid (4)    Mid (4)    Mid (4)
fused_nn_conv2d_add_nn_relu_4_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_batch_flatten_kernel0           2             Idle (0)   Idle (0)   Idle (0)
fused_nn_conv2d_add_nn_relu_5_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_conv2d_add_nn_relu_6_kernel0    2             Mid (4)    Mid (4)    Mid (4)
fused_nn_dense_add_kernel0               2             Low (2)    Low (2)    Low (2)
fused_nn_conv2d_add_nn_relu_7_kernel0    2             Low (3)    Low (3)    Low (3)
fused_nn_conv2d_add_nn_relu_8_kernel0    2             Idle (0)   Idle (0)   Idle (0)
fused_nn_conv2d_add_nn_relu_kernel0      (values missing from the paste)
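(These numbers come from nvprof --metrics tensor_precision_fu_utilization; the three value columns are the Min/Max/Avg across each kernel's invocations.)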
But when I use cuDNN as the backend, the utilization is always High.
It seems there is still a lot of room for further optimization. Do you have any idea how to get higher utilization of the Tensor Cores?
Yes, I agree that TVM on Tensor Core GPUs still has a lot of room to optimize. Currently we are optimizing the data path between global memory and registers, which we believe is a major bottleneck, and we are experimenting with different layouts for both feature maps and weights. We have found that weights in the ‘HWOI’ layout, as suggested by @Hzfengsy, do improve performance for int8 inference on Tensor Cores.
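If anyone wants to try the same layout experiment, a minimal sketch with Relay's ConvertLayout pass looks roughly like this (the toy int8 conv2d is just for illustration, and whether a given kernel layout is supported may depend on your TVM version):

```python
import tvm
from tvm import relay

# Toy int8 conv2d in the default NCHW/OIHW layout; a stand-in for a real model.
data = relay.var("data", shape=(1, 16, 56, 56), dtype="int8")
weight = relay.var("weight", shape=(32, 16, 3, 3), dtype="int8")
conv = relay.nn.conv2d(data, weight, channels=32, kernel_size=(3, 3),
                       padding=(1, 1), out_dtype="int32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Rewrite conv2d to NHWC activations with HWOI weights; layout_transform ops
# are inserted at the boundaries automatically.
desired_layouts = {"nn.conv2d": ["NHWC", "HWOI"]}
with tvm.transform.PassContext(opt_level=3):
    mod = relay.transform.ConvertLayout(desired_layouts)(mod)
print(mod)
```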
I am not sure whether my post here is meaningful, but here are the conclusions from my recent tests:
1. fp16 is not always faster than fp32. With some models fp16 is faster, but there are also models where fp32 runs faster than fp16 (both after tuning, with 2000 trials per task).
2. TVM fp16 inference is slower than TensorRT fp16 inference. On my platform, my model achieves about 35 fps with TVM but 58 fps with TensorRT.
I don't know what I missed. I am looking forward to tutorials on using TVM for fp16 inference.
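In case it is useful for comparison, this is roughly the kind of conversion I mean by "fp16 mode" (a sketch assuming a TVM build that ships the ToMixedPrecision pass; the toy dense module is just a placeholder for a real model imported via relay.frontend.from_onnx):

```python
import tvm
from tvm import relay

# Toy fp32 dense, standing in for an imported ONNX model.
x = relay.var("x", shape=(1, 1024), dtype="float32")
w = relay.var("w", shape=(1000, 1024), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

# Rewrite eligible ops to fp16 (accumulation stays fp32 by default).
with tvm.transform.PassContext(opt_level=3):
    mod = relay.transform.ToMixedPrecision("float16")(mod)
print(mod)
```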
If you use TVM to compile operators whose shapes meet the TensorCore requirements, then TensorCore kernels should be selected automatically on T4. So I guess you used some operators whose shapes don't qualify (like batch=1)?
Yes, I used batch size 1, and I compiled my model through ONNX rather than as a single operator. I compiled the same model with TensorRT, and the TensorRT speed is much faster than TVM (fp16 mode). Maybe we should wait for more TVM updates and optimizations.
OK, in the batch=1 case, due to the current TVM TOPI implementation, the operator cannot be optimized with TensorCore. Other libraries (such as TensorRT) can still use TensorCore in this case through im2col.
You can make your conv2d operators meet the shape conditions sketched below to get TensorCore optimization in TVM.
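For reference, the shape conditions are along these lines (a paraphrase of the check in topi/cuda/conv2d_nhwc_tensorcore.py; the helper name is mine, and the exact condition may differ between TVM versions):

```python
def can_use_tensorcore(batch, in_channels, out_channels):
    # The (batch, in_channels, out_channels) dims must tile into one of the
    # wmma fragment shapes: 16x16x16, 8x16x32, or 32x16x8.
    return (
        (batch % 16 == 0 and in_channels % 16 == 0 and out_channels % 16 == 0)
        or (batch % 8 == 0 and in_channels % 16 == 0 and out_channels % 32 == 0)
        or (batch % 32 == 0 and in_channels % 16 == 0 and out_channels % 8 == 0)
    )

print(can_use_tensorcore(1, 64, 256))   # False: batch=1 cannot tile a fragment
print(can_use_tensorcore(16, 64, 256))  # True
```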
The runtime will expect the input to match the compiled lib. Also, the compiler requires a constant batch size value.
Not sure if using the VM may be of help to you? I have not used it myself, and I have the impression that it may not address your problem, but it's worth looking into.
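Roughly, the idea would be something like the sketch below (untested on my side; the toy dense workload is just a placeholder):

```python
import tvm
from tvm import relay
from tvm.relay import vm

# Toy dense with a symbolic batch dimension, standing in for a real model.
batch = relay.Any()
x = relay.var("x", shape=(batch, 1024), dtype="float16")
w = relay.var("w", shape=(1000, 1024), dtype="float16")
mod = tvm.IRModule.from_expr(relay.Function([x, w], relay.nn.dense(x, w)))

# relay.build rejects symbolic shapes; the Relay VM compiler accepts them.
vm_exec = vm.compile(mod, target="cuda")
```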
Hi, I am interested in TVM's performance for the conv2d operator on TensorCore. I experimented on V100 and T4 platforms using the schedule template in ‘topi/cuda/conv2d_nhwc_tensorcore’. Results show that AutoTVM never performs better than cuDNN on six commonly used shapes in float16 mode. In some cases (like conv2d_nhwc_32_56_56_256_3_3_64_1_0), AutoTVM's tuned results achieve only about 50% of cuDNN's performance. I wonder whether there exist some cases (shapes or data layouts) in which AutoTVM performs better than cuDNN? Or can the template be further optimized?
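For context, this is roughly how I create the tuning task for one of these shapes (the placeholder shapes and argument order follow the conv2d_nhwc_tensorcore.cuda registration as I understand it; please double-check against your TVM version):

```python
import tvm
from tvm import autotvm, te

# One representative NHWC fp16 workload (data in NHWC, kernel in HWIO).
data = te.placeholder((32, 56, 56, 64), name="data", dtype="float16")
kernel = te.placeholder((3, 3, 64, 256), name="kernel", dtype="float16")

# args = (data, kernel, strides, padding, dilation, out_dtype)
task = autotvm.task.create(
    "conv2d_nhwc_tensorcore.cuda",
    args=(data, kernel, (1, 1), (1, 1), (1, 1), "float16"),
    target="cuda",
)
print(task.config_space)
```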