Hi there,
I am a first-time TVM user. I am also not very familiar with CUDA GPU programming. I want to write a tensor convolution routine using tvm. I wish the routine to be at least somewhat optimized. Here is a specification of my problem.
1. All the tensors are in NCHW form.
2. Vertical and horizontal padding and stride are required.
3. The code will be executed on an NVIDIA GPU with CUDA, cudnn, cublas etc. support.
I am testing a new way of doing convolution, so I really need a base implementation to run my experiment. Built-in convolution routines such as cuDNN are too optimized and are not good baselines for me. I tried to use cutlass but it is too complicated.
By reading online tutorials like How to optimize convolution on GPU and Tuning High Performance Convolution on NVIDIA GPUs, I assemble the following code.
import numpy as np
import tvm
“”" The computatin we are trying to support is:
void* tensorConvolution(void* input, void* filter,
int vertical_pad, int horizontal_pad,
int vertical_stride, int horizontal_stride,
int conv_mode, int conv_groups); “”"
########################
# Step 1. Describe the computation
########################
# Tensor input and filter dimensions
input_batch = tvm.var(“input_batch”)
input_channel = tvm.var(“input_channel”)
input_width = tvm.var(“input_width”)
input_height = tvm.var(“input_height”)
filter_num = tvm.var(“filter_num”)
filter_size = tvm.var(“filter_size”)
# Additional parameters
# conv_mode is always 1 and not used. I currently does not handle conv_group.
vertical_pad = tvm.var(“vertical_pad”)
horizontal_pad = tvm.var(“horizontal_pad”)
vertical_stride = tvm.var(“vertical_stride”)
horizontal_stride = tvm.var(“horizontal_stride”)
# Input data matrix and filter matrix. NCHW
InputDataMatrix = tvm.placeholder((input_batch, input_channel, input_height, input_width),
name=“InputDataMatrix”)
FilterMatrix = tvm.placeholder((filter_num, input_channel, filter_size, filter_size),
name=“FilterMatrix”)
# Output dimensions
output_width = (input_width - filter_size + 2 * horizontal_pad) // horizontal_stride + 1
output_height = (input_height - filter_size + 2 * vertical_pad) // vertical_stride + 1
# Define padding
InputDataMatrixPad = tvm.compute(
(input_batch, input_channel, output_height + 2 * vertical_pad, output_width + 2 * horizontal_pad),
lambda nidx, cidx, hidx, widx: tvm.if_then_else(
tvm.all(hidx - vertical_pad >= 0, hidx - vertical_pad < input_height,
widx - horizontal_pad >= 0, widx - horizontal_pad < input_width),
InputDataMatrix[nidx, cidx, hidx - vertical_pad, widx - horizontal_pad], tvm.const(0., “float32”)
), name=“InputDataMatrixPad”
)
# Create reduction variables
rc = tvm.reduce_axis((0, input_channel), name=‘rc’)
rh = tvm.reduce_axis((0, filter_size), name=‘rh’)
rw = tvm.reduce_axis((0, filter_size), name=‘rw’)
# Compute the convolution
OutputMatrix = tvm.compute(
(input_batch, filter_num, output_height, output_width),
lambda nidx, fidx, hidx, widx: tvm.sum(
InputDataMatrixPad[nidx, rc, hidx * vertical_stride + rh, widx * horizontal_stride + rw] *
FilterMatrix[fidx, rc, rh, rw],
axis=[rc, rh, rw]
), name=“OutputMatrix”
)
########################
# Step 2. Design the memory hierarchy
########################
# Schedule. Do padding inline.
schedule = tvm.create_schedule(OutputMatrix.op)
schedule[InputDataMatrixPad].compute_inline()
# Memory read and write
InputDataMatrixPadInShared = schedule.cache_read(InputDataMatrixPad, ‘shared’, [OutputMatrix])
FilterMatrixInShared = schedule.cache_read(FilterMatrix, “shared”, [OutputMatrix])
InputDataMatrixPadInLocal = schedule.cache_read(InputDataMatrixPadInShared, “local”,
[OutputMatrix])
FilterMatrixInLocal = schedule.cache_read(FilterMatrixInShared, “local”, [OutputMatrix])
OutputMatrixInLocal = schedule.cache_write(OutputMatrix, “local”)
########################
# Step 3. Blocking
########################
# For each input image x num filter x image channel, a block is created.
nidx, fidx, hidx, widx = schedule[OutputMatrix].op.axis
schedule[OutputMatrix].bind(nidx, tvm.thread_axis(“blockIdx.z”))
schedule[OutputMatrix].bind(fidx, tvm.thread_axis(“blockIdx.y”))
rcidx, rhidx, rwidx = schedule[OutputMatrix].op.reduce_axis
schedule[OutputMatrix].bind(rcidx, tvm.thread_axis(“blockIdx.x”))
# Then, we are left with 2D convolution
# Input[height * weight] conv. Filter[size * size] = Output[hidx * widx]
# We split the workload by a factor of 32
hwidxSplitFactor = 32
hwidxFused = schedule[OutputMatrix].fuse(hidx, widx)
hwidxFusedSplit, _ = schedule[OutputMatrix].split(hwidxFused, factor=hwidxSplitFactor)
schedule[OutputMatrix].bind(hwidxFusedSplit, tvm.thread_axis(“threadIdx.x”))
########################
# Step 4. Memory Fetching
########################
schedule[OutputMatrixInLocal].compute_at(schedule[OutputMatrix], hwidxFusedSplit)
rcidx, rhidx, rwidx = schedule[OutputMatrixInLocal].op.reduce_axis
rhrwFused = schedule[OutputMatrixInLocal].fuse(rhidx, rwidx)
schedule[InputDataMatrixPadInShared].compute_at(schedule[OutputMatrixInLocal], rhrwFused)
schedule[InputDataMatrixPadInLocal].compute_at(schedule[OutputMatrixInLocal], rhrwFused)
schedule[FilterMatrixInShared].compute_at(schedule[OutputMatrixInLocal], rhrwFused)
schedule[FilterMatrixInLocal].compute_at(schedule[OutputMatrixInLocal], rhrwFused)
########################
# Step 5. Testing
########################
func = tvm.build(schedule, [InputDataMatrix, FilterMatrix, vertical_pad, horizontal_pad,
vertical_stride, horizontal_stride, OutputMatrix], ‘cuda’)
ctx = tvm.gpu(0)
input_batch_v = 2
input_channel_v = 3
input_width_v = 3
input_height_v = 3
filter_num_v = 3
filter_size_v = 3
vertical_pad_v = 1
horizontal_pad_v = 1
vertical_stride_v = 1
horizontal_stride_v = 1
# Input
a_np = np.random.uniform(size=(input_batch_v, input_channel_v, input_height_v,
input_width_v)).astype(InputDataMatrix.dtype)
w_np = np.random.uniform(size=(filter_num_v, input_channel_v, filter_size_v,
filter_size_v)).astype(FilterMatrix.dtype)
a = tvm.nd.array(a_np, ctx)
w = tvm.nd.array(w_np, ctx)
# Output
output_width_v = (input_width_v - filter_size_v + 2 * horizontal_pad_v) // horizontal_stride_v + 1
output_height_v = (input_height_v - filter_size_v + 2 * vertical_pad_v) // vertical_stride_v + 1
b = tvm.nd.array(np.zeros((input_batch_v, filter_num_v, output_height_v, output_width_v),
dtype=OutputMatrix.dtype), ctx)
func(a, w, b, vertical_pad_v, horizontal_pad_v, vertical_stride_v, horizontal_stride_v)
evaluator = func.time_evaluator(func.entry_name, ctx, number=1)
print(‘Convolution: %f ms’ % (evaluator(a, w, b).mean * 1e3))
However, they don’t quite work. The first error I am getting is:
Traceback (most recent call last):
File “matrixConvolution.py”, line 99, in
rcidx, rhidx, rwidx = schedule[OutputMatrix].op.reduce_axis
ValueError: not enough values to unpack (expected 3, got 0)
Please help me with the implementation. Also, if you know a better tutorial, either in TVM or CUDA, please share it with me. Thank you.