Tuning convolution with auto scheduler and OpenCL backend fails to compile programs

richard-wwu · July 3, 2021, 6:21am

I am trying to use TVM’s auto scheduler in combination with the OpenCL target to tune convolutions on Intel CPU. The problem I am facing is that all tested programs fail with a compilation error:

No: 1   GFLOPS: 0.00 / 0.00     results: MeasureResult(error_type:RuntimeDeviceError, error_msg:Traceback (most recent call last):
  File ".../python/tvm/auto_scheduler/measure.py", line 1112, in _timed_rpc_run
    func.entry_func(*args)
  File "...
...
 1) * 1024)) + (xx_inner * 64)) + ((((int)get_group_id(0)) & 1) * 32)) + ((((int)get_local_id(0)) & 3) * 8)) + ff_inner) + 512))] = conv_local[((((((nn_inner * 256) + (yy_inner * 64)) + (xx_inner * 8)) + ff_inner) + 512))];
          ^
Compilation failed

I have defined the convolution like this:

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, P, Q, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, H, W, CI), name="data", dtype="float32")
    kernel = te.placeholder((CO, KH, KW, CI), name="kernel", dtype="float32")

    # Create reduction variables
    rc = te.reduce_axis((0, CI), name="rc")
    ry = te.reduce_axis((0, KH), name="ry")
    rx = te.reduce_axis((0, KW), name="rx")
    # Compute the convolution
    conv = te.compute(
        (N, P, Q, CO),
        lambda nn, yy, xx, ff: te.sum(
            data[nn, yy * stride[0] + ry, xx * stride[1] + rx, rc] * kernel[ff, ry, rx, rc], axis=[ry, rx, rc]
        ),
        name="conv",
    )
    return [data, kernel, conv]

This is how I start the tuning:

dev = tvm.opencl(0)
target = tvm.target.Target("opencl")
max_shared_memory_per_block = dev.max_shared_memory_per_block
max_local_memory_per_block = 2147483647 # INT32_MAX
max_threads_per_block = dev.max_threads_per_block
max_vthread_extent = int(dev.warp_size / 4) if int(dev.warp_size / 4) > 1 else dev.warp_size
warp_size = dev.warp_size
hardware_params = auto_scheduler.HardwareParams(-1, 16, 64,
                                                max_shared_memory_per_block, max_local_memory_per_block,
                                                max_threads_per_block, max_vthread_extent, warp_size)
task = auto_scheduler.SearchTask(func=conv2d_layer, args=(N, H, W, P, Q, CO, CI, KH, KW, strides, padding), target=target, hardware_params=hardware_params)
measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=trials,
    measure_callbacks=[auto_scheduler.RecordToFile(logfile)],
    verbose=2,
    runner=measure_ctx.runner
)
task.tune(tune_option)

With the CUDA backend, this code works just fine.

Many thanks for your help!

merrymercy · July 9, 2021, 10:19am

Can you set this value to a larger number?

github.com

apache/tvm/blob/0fa4396129f242a32a7f0ccb30e4a09a818cc522/python/tvm/auto_scheduler/utils.py#L249




    try:
        children = parent.children(recursive=True)
        for process in children:
            process.send_signal(sig)
    except psutil.NoSuchProcess:
        return




# The maximum length of traceback information
MAX_TRACEBACK_INFO_LEN = 512




def make_traceback_info():
    """Get the error message from traceback."""
    info = str(traceback.format_exc())
    if len(info) > MAX_TRACEBACK_INFO_LEN:
        info = (
            info[: MAX_TRACEBACK_INFO_LEN // 2] + "\n...\n" + info[-MAX_TRACEBACK_INFO_LEN // 2 :]
        )
    return info

so we can see the whole error message.

richard-wwu · July 12, 2021, 8:53am

I increased the stack trace length and was able to solve the problem myself. I named the second argument to my conv operator “kernel” which clashes with OpenCL’s “kernel” keyword. Naming the placeholder anything else fixes the error and the tuning works.

Many thanks for your help!

Panda6 · June 10, 2023, 5:14am

@merrymercy Could you give some ideas about how to set a suitable value for max_local_memory_per_block when use tvm auto scheduler. Thanks.