Best way to deal with kernel layout?

Well, I have a special BYOC dense kernel that uses a kernel layout different from the default topi.nn implementation.

The default implementation has a weight tensor with shape [out_dim, in_dim], while I need [in_dim, out_dim].

Two questions here:

  1. How can I change the default behavior of the dense op so its kernel input assumes the transposed layout? I tried modifying include/tvm/topi/nn/dense.h and python/tvm/topi/nn/dense.py to reverse the layout, but it doesn’t work. Where is the code that controls the default kernel layout for the dense op?
  2. If I don’t want to change the default behavior but instead add an alternative implementation targeting my BYOC backend, what’s the best way to do that?

Thanks in advance.

You should not change Relay/TOPI for your target. Instead, use a preprocessing optimization pass to convert the layout before your codegen.

Example:
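A rough sketch of the idea, with a hypothetical target name "my_byoc" and an OHWI kernel layout chosen purely for illustration: convert the layout up front, fold the inserted layout_transforms into the constants, and only then run the usual BYOC partitioning passes, so the external codegen only ever sees the layout it expects.

import tvm
from tvm import relay

def partition_for_my_byoc(mod):
    """Preprocess kernel layouts, then partition for the external codegen."""
    seq = tvm.transform.Sequential([
        # Preprocessing: rewrite kernel layouts before anything is offloaded.
        relay.transform.ConvertLayout({"nn.conv2d": ["NCHW", "OHWI"]}),
        relay.transform.FoldConstant(),
        # Standard BYOC partitioning flow.
        relay.transform.AnnotateTarget("my_byoc"),
        relay.transform.MergeCompilerRegions(),
        relay.transform.PartitionGraph(),
    ])
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)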

@comaniac, thanks for the prompt reply. I’ll look into it.

But still, where in Relay are the default param/kernel layouts controlled?

@jinchenglee thanks for asking this question, I’m facing a similar issue

@comaniac I was not aware of this data layout transformation process. It would be useful for me as well, as I’m targeting a microcontroller with an embedded accelerator, which requires a very specific data layout. I’m not using BYOC, though (our approach is covered in this recent post). I thought I would have to write separate operator implementations with data layout preparation steps for each operator in the Relay strategy, like here (ARM seems to have different TOPI operators for different data layouts, but that seems to contradict your answer above):

I find it a bit confusing that it is apparently possible to account for different data layouts in different parts of the stack.

Also, I find it a bit odd that this data layout pass would be implemented as a Relay pass, since I thought Relay passes are supposed to be hardware-independent. Then again, such a data transformation pass wouldn’t be useful for devices that don’t expect the accelerator’s data layout, so it would have to be hardware-dependent, right?

How do you think I should proceed? Is your answer different if we don’t use BYOC? And is there some documentation on data layout transformations along the TVM stack perhaps, besides the example you showed us? Thanks!


The answer is definitely different if you are not using BYOC. Without BYOC, every backend is handled by the TVM compilation pipeline, which means every operator has to have a corresponding TOPI implementation. Since the data layout affects the TE compute semantics, an op with a different data layout is treated as a different operator in TE/TIR. The Relay op strategy is in charge of selecting the correct TE/TIR op for a Relay op.

For example, a Relay conv2d op has a data layout attribute, so the Relay op strategy will select the TOPI implementation of conv2d_nchw, conv2d_nhwc, or conv2d_hwcn accordingly, as in the link you pointed out. Of course, some data layouts may be missing for some targets, so you may encounter an error if you specify a data layout that doesn’t have a corresponding TOPI implementation for your target (e.g., arm_cpu).

In short, if you are not using BYOC and require a special data layout for a certain op, you need to 1) register your backend like the other targets (e.g., x86, cuda, arm_cpu, etc.), 2) add the corresponding TOPI implementations for your backend, and 3) register the Relay op strategy to correctly lower a Relay graph to TE/TIR for your backend.
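As a hypothetical sketch of steps 1) to 3) for a made-up target key "my_accel": a TE compute for dense whose weight is [in_dim, out_dim], a trivial placeholder schedule, and the Relay op strategy registration that selects them. All names here are illustrative, not existing TVM code.

from tvm import te
from tvm.relay.op import op as _op
from tvm.relay.op.strategy import dense_strategy
from tvm.relay.op.strategy.generic import wrap_compute_dense, wrap_topi_schedule


def dense_transposed_weight(data, weight, bias=None, out_dtype=None):
    """Dense with the weight laid out as [in_dim, out_dim]."""
    if out_dtype is None:
        out_dtype = data.dtype
    batch, in_dim = data.shape
    _, out_dim = weight.shape
    k = te.reduce_axis((0, in_dim), name="k")
    return te.compute(
        (batch, out_dim),
        lambda i, j: te.sum(
            data[i, k].astype(out_dtype) * weight[k, j].astype(out_dtype), axis=k
        ),
        name="dense_transposed_weight",
    )


def schedule_dense_transposed_weight(outs):
    """Placeholder schedule; a real backend would do better than the default."""
    return te.create_schedule([x.op for x in outs])


@dense_strategy.register("my_accel")  # the target key must exist for your target
def dense_strategy_my_accel(attrs, inputs, out_type, target):
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_dense(dense_transposed_weight),
        wrap_topi_schedule(schedule_dense_transposed_weight),
        name="dense_transposed_weight.my_accel",
    )
    return strategy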


Hi @comaniac. I looked into your example and did a simple experiment similar to it.

My example network is imported into Relay as below:

#[version = "0.0.5"]
def @main(%input.1: Tensor[(1, 1, 32, 16), float32], %conv.0.bias: Tensor[(1), float32], %conv.0.weight: Tensor[(1, 1, 3, 3), float32], %fc.0.weight: Tensor[(30, 14), float32]) {
  %0 = reshape(%input.1, newshape=[1, 1, -1, 16]);
  %1 = nn.conv2d(%0, %conv.0.weight, padding=[0, 0, 0, 0], kernel_size=[3, 3]);
  %2 = nn.bias_add(%1, %conv.0.bias);
  %3 = nn.relu(%2);
  %4 = reshape(%3, newshape=[-1, 14]);
  %5 = transpose(%fc.0.weight, axes=[1, 0]);
  %6 = transpose(%5, axes=[1, 0]);
  %7 = nn.dense(%4, %6, units=None);
  nn.relu(%7)
}

I applied the kernel layout conversion pass as below:

import tvm
from tvm import relay

desired_layouts = {'nn.dense': ['NHWC', 'OHWI'],
                   'nn.conv2d': ['NCHW', 'OHWI']}
seq = tvm.transform.Sequential([relay.transform.ConvertLayout(desired_layouts),
                                relay.transform.FoldConstant()])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)

The outcome is as below:

#[version = "0.0.5"]
def @main(%input.1: Tensor[(1, 1, 32, 16), float32], %conv.0.bias: Tensor[(1), float32], %conv.0.weight: Tensor[(1, 1, 3, 3), float32], %fc.0.weight: Tensor[(30, 14), float32]) -> Tensor[(30, 30), float32] {
  %0 = reshape(%input.1, newshape=[1, 1, -1, 16]) /* ty=Tensor[(1, 1, 32, 16), float32] */;
  %1 = layout_transform(%conv.0.weight, src_layout="OIHW", dst_layout="OHWI") /* ty=Tensor[(1, 3, 3, 1), float32] */;
  %2 = nn.conv2d(%0, %1, padding=[0, 0, 0, 0], kernel_size=[3, 3], kernel_layout="OHWI") /* ty=Tensor[(1, 1, 30, 14), float32] */;
  %3 = expand_dims(%conv.0.bias, axis=1, num_newaxis=2) /* ty=Tensor[(1, 1, 1), float32] */;
  %4 = add(%2, %3) /* ty=Tensor[(1, 1, 30, 14), float32] */;
  %5 = nn.relu(%4) /* ty=Tensor[(1, 1, 30, 14), float32] */;
  %6 = reshape(%5, newshape=[-1, 14]) /* ty=Tensor[(30, 14), float32] */;
  %7 = transpose(%fc.0.weight, axes=[1, 0]) /* ty=Tensor[(14, 30), float32] */;
  %8 = transpose(%7, axes=[1, 0]) /* ty=Tensor[(30, 14), float32] */;
  %9 = nn.dense(%6, %8, units=None) /* ty=Tensor[(30, 30), float32] */;
  nn.relu(%9) /* ty=Tensor[(30, 30), float32] */
}

The kernel layout fed into nn.conv2d is changed successfully, but nn.dense is unchanged.

This might be a dumb question: what do I need to add to Relay so that the layout-conversion pass can change the nn.dense kernel layout?

I see there are different conv2d implementations for different layout formats, but there’s only one for nn.dense, and it doesn’t use the kernel layout I’m expecting. Since I’m using BYOC, according to what you’ve described above, those strategy-related implementations don’t affect me anyway. So where and what should I change to enable the nn.dense kernel layout change? Thank you.

There’s no change for nn.dense because it doesn’t have the version you want, as you already pointed out.

If you’re using BYOC, then there is a trick you can play at this moment. Since the preprocessing still has to maintain the types, you cannot simply transpose the weight from [N, C] to [C, N]. On the other hand, BYOC allows you to manipulate constants when initializing the runtime engine, so you can reverse the weight layout in the tensor data while pretending its shape is still [N, C]. In short, it looks like the following at runtime:

  1. When initializing the engine, you transpose the weight order to be [C, N].
  2. The TVM host module runs to the nn.dense, whose input shape is still [N, C] in the graph.
  3. Since nn.dense has been offloaded to your module, TVM host module calls your module with the input data entry IDs.
  4. The data order in the data entry for the weight is already [C, N], so you can access it correctly.

This is more of a hack, though :slight_smile:
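To make steps 1-4 concrete, here is a toy numpy-only illustration of the trick (not actual BYOC code): the graph keeps declaring the weight as [N, C], but the bytes behind the data entry are stored in [C, N] order, and only the external kernel knows that.

import numpy as np

N, C = 30, 14
w_nc = np.random.rand(N, C).astype("float32")       # weight as Relay declares it: [N, C]

# 1. At engine-init time, store the constant's data in [C, N] order.
w_bytes = np.ascontiguousarray(w_nc.T).reshape(-1)   # physical order is [C, N]

# 2./3. The graph (and hence the data entry shape) still says [N, C]; TVM just
#       hands your module the entry ID of this buffer when nn.dense is reached.

# 4. The external kernel knows the real order and reads the buffer as [C, N].
w_seen_by_kernel = w_bytes.reshape(C, N)
assert np.allclose(w_seen_by_kernel, w_nc.T)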

If I want to do it in Relay, I should add a version of nn.dense (say, name it nn.dense_transposed_kernel) and then register a function convert_dense(…) with register_convert_op_layout("nn.dense"), right?
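Something roughly shaped like the existing convert_conv2d handler in python/tvm/relay/op/nn/_nn.py, I suppose. To be clear, both the "IO" layout tag and relay.nn.dense_transposed_kernel below are made up; the new op and its Python wrapper would have to be added first.

from tvm import relay
from tvm.relay.op import op as reg


@reg.register_convert_op_layout("nn.dense")
def convert_dense(attrs, inputs, tinfos, desired_layouts):
    """Switch nn.dense to a transposed-kernel variant when requested."""
    data, weight = inputs
    desired_kernel_layout = str(desired_layouts[1])
    if desired_kernel_layout == "IO":  # hypothetical tag for [in_dim, out_dim]
        # This op does not exist in mainline TVM; it is the variant proposed above.
        return relay.nn.dense_transposed_kernel(data, weight, attrs.units)
    # Any other request: keep the stock op and layout.
    return relay.nn.dense(data, weight, attrs.units)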

If you really want to add an op, I’d just call it matmul. An even better version would be a matmul with all 4 possible transpose combinations, with dense being just one of them, but that needs many changes in the code base.

cc @tqchen

Okay cool, then I was on the right track after all :smile: Thanks for the quick clarification @comaniac !

Thanks for the suggestion, @comaniac. Adding a matmul operator with implementations for all combinations of input layouts seems like overkill to me. Instead, adding a target-specific Relay pass to handle this target-specific case would be a better solution: it is lightweight and orthogonal to the main TVM passes.
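A minimal sketch of such a pass, under the assumption that the BYOC codegen wants the weight as [in_dim, out_dim]: it exposes that view as an explicit transpose and transposes back so nn.dense still type-checks; the BYOC pattern matcher can then absorb the transpose pair together with the dense into one composite. The pass name is made up.

import tvm
from tvm import relay


@relay.transform.function_pass(opt_level=0, name="ExposeTransposedDenseWeight")
class ExposeTransposedDenseWeight:
    """Target-specific pass that makes the [in_dim, out_dim] weight view explicit."""

    def transform_function(self, func, mod, ctx):
        class Rewriter(relay.ExprMutator):
            def visit_call(self, call):
                new_call = super().visit_call(call)
                if isinstance(new_call.op, tvm.ir.Op) and new_call.op.name == "nn.dense":
                    data, weight = new_call.args
                    # weight_io has the layout the accelerator expects.
                    weight_io = relay.transpose(weight, axes=[1, 0])
                    # Transpose back so nn.dense still sees [out_dim, in_dim];
                    # the BYOC pattern/codegen can later consume weight_io directly.
                    weight_oi = relay.transpose(weight_io, axes=[1, 0])
                    return relay.nn.dense(data, weight_oi, new_call.attrs.units)
                return new_call

        return Rewriter().visit(func)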