I compared the performance of AutoTVM and cuDNN, and I found that AutoTVM gets better performance than cuDNN when batch size = 1, but cuDNN gets better performance than AutoTVM when batch size = 100.
I'm confused by this; can you explain it? Did you test AutoTVM and cuDNN performance for different batch sizes?
The current conv2d NCHW template on CUDA might not be optimized for large batch sizes. The conv2d_hwcn template can give better performance for large batches: https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/conv2d_hwcn.py However, this template hasn't been ported to the AutoTVM style yet.
So the NCHW layout template kernel is image-size aware, and the HWCN/CHWN layout template kernel is batch-size aware.
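For reference, here is a rough sketch of how one could time the conv2d_hwcn template directly through TOPI, based on the old tvm/topi Python API that the linked file belongs to. The workload shape, batch size, and number of timing runs are just illustrative assumptions, not a tuned benchmark:

```python
import numpy as np
import tvm
import topi
from topi.util import get_const_tuple

# Illustrative workload: batch=100, 56x56 feature map, 64->64 channels, 3x3 kernel.
batch, in_c, out_c = 100, 64, 64
size, kernel, stride, padding = 56, 3, 1, 1

# HWCN layout puts the batch dimension innermost, which the
# conv2d_hwcn schedule exploits for large batch sizes.
A = tvm.placeholder((size, size, in_c, batch), name="A")
W = tvm.placeholder((kernel, kernel, in_c, out_c), name="W")

with tvm.target.cuda():
    B = topi.nn.conv2d_hwcn(A, W, stride, padding)
    s = topi.cuda.schedule_conv2d_hwcn([B])

func = tvm.build(s, [A, W, B], "cuda")

ctx = tvm.gpu(0)
a = tvm.nd.array(np.random.uniform(size=get_const_tuple(A.shape)).astype("float32"), ctx)
w = tvm.nd.array(np.random.uniform(size=get_const_tuple(W.shape)).astype("float32"), ctx)
b = tvm.nd.array(np.zeros(get_const_tuple(B.shape), dtype="float32"), ctx)

# Average the kernel time over several runs.
evaluator = func.time_evaluator(func.entry_name, ctx, number=20)
print("conv2d_hwcn: %.3f ms" % (evaluator(a, w, b).mean * 1e3))
```

Running the same measurement with batch = 1 vs batch = 100, and against the NCHW template or the cuDNN path, may show the batch-size effect described above.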