Different TE formulations when using Ansor to tune conv2d

Recently I was using Ansor to tune a conv2d written in Tensor Expression (TE), and different TE code gives different execution times after 200 trials. The parameters are:

data NCHW (1,3,224,224), weight OIHW (64,3,7,7), stride=2, pad=3

Here is my code snippet:

import numpy as np
import tvm
from tvm import te, auto_scheduler, relay, topi
from tvm.contrib import graph_runtime

# conv2d_1: three computes - pad (and unfold into a 6-D layout), elementwise multiply, then reduction
def conv2d_1():
	data = te.placeholder((1,3,224,224), name='data', dtype='float32')
	tensor_0 = te.compute((1,3,112,112,7,7), lambda n,c,h1,w1,kh,kw: te.if_then_else(te.all(-3+2*h1+kh>=0,-3+2*h1+kh<224,-3+kw+2*w1>=0,-3+kw+2*w1<224),data[n,c,-3+2*h1+kh,-3+kw+2*w1],tvm.tir.const(0, dtype='float32')),name = 'tensor_0')
	weight = te.placeholder((64,3,7,7), name='weight', dtype='float32')
	tensor_1 = te.compute((1,64,3,112,112,7,7), lambda n,oc,c,h1,w1,kh,kw: tensor_0[n,c,h1,w1,kh,kw] * weight[oc,c,kh,kw], name='tensor_1',)
	c = te.reduce_axis((0,3),name='c')
	kh = te.reduce_axis((0,7),name='kh')
	kw = te.reduce_axis((0,7),name='kw')
	tensor_2 = te.compute((1,64,112,112), lambda n,oc,h1,w1: te.sum(tensor_1[n,oc,c,h1,w1,kh,kw],axis = [c,kh,kw]), name='tensor_2')
	return[data,weight,tensor_2]

# conv2d_2: two computes - pad (into a 6-D layout), then fused multiply + reduction
def conv2d_2():
	data = te.placeholder((1,3,224,224), name='data', dtype='float32')
	tensor_0 = te.compute((1,3,112,112,7,7), lambda n,c,h1,w1,kh,kw: te.if_then_else(te.all(-3+2*h1+kh>=0,-3+2*h1+kh<224,-3+kw+2*w1>=0,-3+kw+2*w1<224),data[n,c,-3+2*h1+kh,-3+kw+2*w1],tvm.tir.const(0, dtype='float32')),name = 'tensor_0')
	weight = te.placeholder((64,3,7,7), name='weight', dtype='float32')
	c = te.reduce_axis((0,3),name='c')
	kh = te.reduce_axis((0,7),name='kh')
	kw = te.reduce_axis((0,7),name='kw')
	tensor_2 = te.compute((1,64,112,112), lambda n,oc,h1,w1: te.sum(tensor_0[n,c,h1,w1,kh,kw] * weight[oc,c,kh,kw],axis = [c,kh,kw]), name='tensor_2')
	return[data,weight,tensor_2]

# conv2d_3: a single compute - padding, multiply, and reduction fused together
def conv2d_3():
	data = te.placeholder((1,3,224,224), name='data', dtype='float32')
	weight = te.placeholder((64,3,7,7), name='weight', dtype='float32')
	c = te.reduce_axis((0,3),name='c')
	kh = te.reduce_axis((0,7),name='kh')
	kw = te.reduce_axis((0,7),name='kw')
	tensor_2 = te.compute((1,64,112,112), lambda n,oc,h1,w1: te.sum(te.if_then_else(te.all(-3+2*h1+kh>=0,-3+2*h1+kh<224,-3+kw+2*w1>=0,-3+kw+2*w1<224),data[n,c,-3+2*h1+kh,-3+kw+2*w1],tvm.tir.const(0, dtype='float32')) * weight[oc,c,kh,kw],axis = [c,kh,kw]), name='tensor_2')
	return[data,weight,tensor_2]

# conv2d_pad: tutorial-style - one compute pads the input, the other does multiply + reduction
def conv2d_pad():
	data = te.placeholder((1,3,224,224), name='data', dtype='float32')
	paddedX = te.compute((1,3,230,230), lambda n,c,h1,w1: te.if_then_else(te.all(h1>=3,h1<227,w1>=3,w1<227),data[n,c,h1-3,w1-3],tvm.tir.const(0, dtype='float32')),name = 'tensor_0')  # valid region is [3, 227) so data indices stay within [0, 224)
	weight = te.placeholder((64,3,7,7), name='weight', dtype='float32')
	c = te.reduce_axis((0,3),name='c')
	kh = te.reduce_axis((0,7),name='kh')
	kw = te.reduce_axis((0,7),name='kw')
	tensor_2 = te.compute((1,64,112,112), lambda n,oc,h1,w1: te.sum(paddedX[n,c,h1*2+kh,w1*2+kw] * weight[oc,c,kh,kw],axis = [c,kh,kw]), name='tensor_2')
	return[data,weight,tensor_2]
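
For reference, here is a minimal sketch of the kind of tuning driver I used for the 200 trials; the registered workload name, log file name, and measurement settings below are only illustrative, following the standard auto_scheduler API on a CUDA target:

# Sketch: register one of the functions above and tune it with Ansor.
# The workload name "conv2d_2_nchw" and the log file name are illustrative.
target = tvm.target.Target("cuda")
auto_scheduler.register_workload("conv2d_2_nchw", f=conv2d_2)
task = auto_scheduler.SearchTask(func="conv2d_2_nchw", args=(), target=target)

log_file = "conv2d_2.json"
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,
    runner=measure_ctx.runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)
task.tune(tune_option)

# Apply the best schedule found and build the kernel.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)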

Basically, the conv2d computation combines padding/assignment (te.if_then_else), multiply, and sum:

  1. ‘conv2d_pad’ is similar to the tutorial: the first compute does the padding, and the second does the multiply and sum (a topi-based sketch of this formulation follows this list).
  2. ‘conv2d_1’: the first compute does padding and a layout transform, the second does the multiply, and the third does the sum.
  3. ‘conv2d_2’ combines the last two computes.
  4. ‘conv2d_3’ combines all three computes.
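
For comparison, the tutorial-style formulation (a pad compute followed by multiply and sum) can also be obtained directly from topi; this is only a reference sketch, assuming topi.nn.conv2d_nchw is available in this build:

# Sketch: tutorial-style conv2d built from topi; it pads internally ("pad_temp")
# and then does the multiply + reduction, much like conv2d_pad above.
def conv2d_topi():
	data = te.placeholder((1, 3, 224, 224), name='data', dtype='float32')
	weight = te.placeholder((64, 3, 7, 7), name='weight', dtype='float32')
	out = topi.nn.conv2d_nchw(data, weight, 2, 3, 1)  # stride=2, padding=3, dilation=1
	return [data, weight, out]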

After using Ansor (200 trials) to get a schedule, the best results come from ‘conv2d_pad’ and ‘conv2d_2’. Some of the search trials show error messages; an error occurred in ‘conv2d_3’, which reports the following:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [19:05:26] /home/sun/gitDownload/tvm/src/runtime/cuda/cuda_module.cc:61: CUDAError: cuModuleUnload(module_[i]) failed with error: CUDA_ERROR_MISALIGNED_ADDRESS

Here are my questions and thoughts:

  1. How should I handle this error?
  2. If I increase the number of trials, can all four functions converge to the same schedule? (They should at least compute the same result; a small numerical check is sketched after this list.)
  3. If not, combining computes must have some effect. Are the computes executed sequentially? (For ‘conv2d_1’, the multiply is executed and then the sum.) Can Ansor automatically eliminate the intermediate variable (‘tensor_1’ in ‘conv2d_1’)?
  4. Any suggestions about which function to choose to compute conv2d (basically ‘conv2d_pad’ or ‘conv2d_2’)? What is the difference between ‘conv2d_pad’ and ‘conv2d_2’?
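
To confirm that the four variants really describe the same computation, here is a small numerical check against a numpy reference, built with a naive schedule on the CPU. This is only a sketch; the helper name check_variant is illustrative, and it assumes a recent TVM where NDArray has .numpy() (older versions use .asnumpy()):

import tvm.topi.testing

# Sketch: build each variant with a default schedule and compare against numpy.
def check_variant(workload_fn):
	data, weight, out = workload_fn()
	s = te.create_schedule(out.op)
	f = tvm.build(s, [data, weight, out], target="llvm")

	data_np = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")
	weight_np = np.random.uniform(size=(64, 3, 7, 7)).astype("float32")
	ref_np = tvm.topi.testing.conv2d_nchw_python(data_np, weight_np, 2, 3)

	dev = tvm.cpu()
	out_tvm = tvm.nd.array(np.zeros((1, 64, 112, 112), dtype="float32"), dev)
	f(tvm.nd.array(data_np, dev), tvm.nd.array(weight_np, dev), out_tvm)
	np.testing.assert_allclose(out_tvm.numpy(), ref_np, rtol=1e-4)

for fn in (conv2d_1, conv2d_2, conv2d_3, conv2d_pad):
	check_variant(fn)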

Thanks a lot!

Ansor will run analyses on the tensor expressions. I think conv2d_2 is the most friendly to these analyses, so it is recommended.
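
One way to see what these analyses work with is to print the ComputeDAG that Ansor builds for each variant; a quick sketch:

# Sketch: print the compute DAG that Ansor analyzes for each formulation.
# Comparing the printed DAGs shows how the separate or fused computes are
# represented before any schedule search happens.
for fn in (conv2d_1, conv2d_2, conv2d_3, conv2d_pad):
	dag = auto_scheduler.ComputeDAG(fn())
	print("=====", fn.__name__, "=====")
	print(dag)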