Quantized models and legalization pass

Hi all,

I am trying to improve quantized performance for memory-bound operators (e.g., depthwise or 1x1 convolutions with small shapes).

Bottom line question

Is there any way we can know the strategy picked by the autotuner during the legalization pass of a quantized convolution (qnn.conv2d)?

Long story

In general, for any int8->int32 convolution there are two strategies to follow:

  • Convert to int16, subtract the offset, and then execute conv2d + requantization
  • Stay in int8 and use some magic instruction to compute the int8->int32 convolution. This requires evaluating 4 terms: Term 1 is the core conv2d (int8->int32), and Terms 2-4 are the offset contributions (see here)

In theory, the int8 approach should outperform the int16 one, but for memory-bound operators the additional Terms 2-4 might hurt performance (I have situations where Term 2 takes the same time as nn.conv2d). To get the best of both worlds, we should implement both strategies and try them both.
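
For reference, this is the identity behind the four terms, checked with NumPy on a single output element (all shapes and zero points below are made up for illustration):

    import numpy as np

    # One output element of a convolution is a dot product over the
    # receptive field, so the identity can be checked there directly.
    rng = np.random.default_rng(0)
    K = 3 * 3 * 8                                      # KH*KW*CI elements
    qx = rng.integers(0, 256, K).astype(np.int32)      # uint8 data, widened
    qw = rng.integers(-128, 128, K).astype(np.int32)   # int8 weights, widened
    zx, zw = 7, 3                                      # made-up zero points

    # int16-style strategy: subtract the offsets first, then accumulate
    ref = np.dot(qx - zx, qw - zw)

    # int8-style strategy: four terms, only Term 1 is a real convolution
    term1 = np.dot(qx, qw)   # core int8->int32 conv
    term2 = zw * qx.sum()    # data offset contribution
    term3 = zx * qw.sum()    # kernel offset contribution
    term4 = zx * zw * K      # constant contribution
    assert ref == term1 - term2 - term3 + term4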

At the moment, this is (I think) not possible in TVM. Indeed, the decision to convert to int16 (and then subtract the offsets) happens during the legalization pass, i.e., when the qnn.conv2d is lowered to a normal nn.conv2d.

So, back to the main question: is there a way to know the auto-tuner strategy during the legalization pass? The ideal code would be:

@qnn_conv2d_legalize.register("arm_cpu")
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
    if strategy == "conv2d_int16":
        convert_to_int16(data)
    else:
        convert_to_int8(data)

cc @anijain2305 @ramana-arm @FrozenGene (we had this discussion before)

Does the code in alter_op_layout work?

    best_plevel_impl, outs = relay.backend.compile_engine.select_implementation(
        relay.op.get("nn.conv2d"), attrs, tinfos, out_type, target)
    if best_plevel_impl.name == "conv2d_int16":

Thanks for the reply, @FrozenGene!

The signatures of the two functions are:

def _alter_conv2d_layout(attrs, inputs, types, out_type):
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):

While they look similar, the inputs in _alter_conv2d_layout contain actual Tensors, while the inputs in _qnn_conv2d_legalize_arm_cpu are of class tvm.relay.expr.Var.

Is there a way to get the tensors during legalization from variables of type tvm.relay.expr.Var?

Thanks a lot,

Giuseppe

@giuseros I haven't run it, but to my understanding these two functions' inputs should be the same type (tvm.relay.expr). For example, inside the alter_op_layout function we have this logic:

# HWIO -> OIHW
kernel_transform = relay.transpose(inputs[1], axes=[3, 2, 0, 1])
# alpha, alpha, CO, CI
weight = relay.nn.contrib_conv2d_winograd_weight_transform(kernel_transform,
                                                            tile_size=tile_size)

relay.transpose requires its input to be a tvm.relay.Expr.
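
For instance, this standalone snippet (with made-up shapes) applies the same transform to a plain relay Var:

    from tvm import relay

    # A relay Var standing in for an HWIO kernel; shape and dtype are made up
    w = relay.var("w", shape=(3, 3, 8, 16), dtype="int8")
    # HWIO -> OIHW, as in the alter_op_layout logic above
    wt = relay.transpose(w, axes=[3, 2, 0, 1])
    print(wt)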

The doc of conv2d_alter_layout says it requires tvm.relay.Expr too:

@tvm.target.generic_func
def conv2d_alter_layout(attrs, inputs, tinfos, out_type):
    """Change Conv2D layout.

    Parameters
    ----------
    attrs : tvm.ir.Attrs
        Attributes of current convolution
    inputs : tvm.relay.Expr
        Grouped input symbols

I got a bit confused above, sorry. It is not about the inputs but about the tinfos.

Just to avoid any additional confusion, I tried to print the types of the interesting variables:

conv2d_alter_op(attrs, inputs, tinfos, out_type)

print(type(inputs[0]))
# <class 'tvm.relay.expr.Var'>

print(type(tinfos[0]))
# <class 'tvm.te.tensor.Tensor'>

_qnn_conv2d_legalize_arm_cpu(attrs, inputs, types)

print(type(inputs[0]))
# <class 'tvm.relay.expr.Var'>
print(type(types[0]))
# <class 'tvm.ir.tensor_type.TensorType'>

The call to relay.backend.compile_engine.select_implementation requires an array of tvm.te.Tensors as its third argument, while during legalization I only have an array of tvm.ir.TensorTypes.

So, the question still remains: is there a way to get a tensor during legalization?

Sorry for the late reply. Can you try this? A tinfo is nothing but a te placeholder.

diff --git a/python/tvm/relay/qnn/op/legalizations.py b/python/tvm/relay/qnn/op/legalizations.py
index 50e5a02f8..8add434c1 100644
--- a/python/tvm/relay/qnn/op/legalizations.py
+++ b/python/tvm/relay/qnn/op/legalizations.py
@@ -295,6 +295,10 @@ def _qnn_dense_legalize_arm_cpu(attrs, inputs, types):
 @qnn_conv2d_legalize.register("cpu")
 def _qnn_conv2d_legalize_intel_cpu(attrs, inputs, types):
     # The VNNI transformations prefer uint8 x int8 datatypes.
+    data_type = types[0]
+    data_tinfo = tvm.te.placeholder(shape=data_type.shape, dtype=data_type.dtype)
+    print(data_tinfo)
+
     if is_fast_int8_on_intel():
         return helper_change_dtypes_to_uint8_int8(attrs, inputs, types, relay.qnn.op.conv2d)
     return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.nn.conv2d)

If not, we can change Legalize to accept one more argument, but that's more invasive.

@giuseros @anijain2305 Let us accept one more argument, like alter_op_layout does:

@tvm.target.generic_func
def conv2d_alter_layout(attrs, inputs, tinfos, out_type):

@tvm.target.generic_func
def qnn_conv2d_legalize(attrs, inputs, types):
    """Default legalization is None."""
    return None

Then we could leverage relay.backend.compile_engine.select_implementation.
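
Concretely, inside the legalize hook that could look roughly like this (an untested sketch: it reuses the placeholder trick from the diff above, and assumes `attrs`, `target`, and a derived `out_type` for the corresponding nn.conv2d are in scope):

    # Rebuild te placeholders from the relay TensorTypes we get in legalize
    tinfos = [tvm.te.placeholder(t.shape, dtype=t.dtype) for t in types]
    best_impl, outs = relay.backend.compile_engine.select_implementation(
        relay.op.get("nn.conv2d"), attrs, tinfos, out_type, target
    )
    if best_impl.name == "conv2d_int16":
        ...  # legalize to the int16 lowering (subtract offsets, then nn.conv2d)
    else:
        ...  # stay in int8 and emit the four-term lowering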

Hi @FrozenGene, @anijain2305
I can confirm that this works :partying_face:! Very good! Now we can implement algorithms like QNNPACK's and let the tuner try them together! Thanks, both!

As for the API change, I agree with @FrozenGene that it would probably be cleaner to add tinfos to the qnn_conv2d_legalize signature.

I have a related question that I always meant to ask: in conv2d_alter_layout, we don't execute the function if the current configuration is a fallback. Do you guys know why? Also, what should the behavior be in legalize? I am referring to this code:

    _, outs = relay.backend.compile_engine.select_implementation(
        relay.op.get("nn.conv2d"), attrs, tinfos, out_type, target
    )
    workload = autotvm.task.get_workload(outs)
    if workload is None:
        # The best implementation is not an AutoTVM template,
        # we then assume it's not necessary to alter this op.
        return None
    cfg = dispatch_ctx.query(target, workload)
    if cfg.is_fallback:  # if is fallback, clear query cache and return None
        autotvm.task.clear_fallback_cache(target, workload)
        return None

    topi_tmpl = workload[0]

In theory, we could gather topi_tmpl directly from the first value returned by relay.backend.compile_engine.select_implementation. But instead, conv2d_alter_layout queries the dispatch_ctx for the current configuration and, if it is a fallback, returns None. To sum up, the follow-up questions are:

  • Why is this behavior there?
  • What should we do in legalize? Simply return a default legalization?

Thanks once more for your help!

For alter_op_layout, we alter the weight layout: normally we change it to 5D, where the last dim is queried from our AutoTVM log file. For example:

    if topi_tmpl == "conv2d_nchw_spatial_pack.arm_cpu":
        assert data_layout == "NCHW" and kernel_layout == "OIHW"
        N, CI, H, W = get_const_tuple(data.shape)
        CO, _, KH, KW = get_const_tuple(kernel.shape)
        VC = cfg['tile_co'].size[-1]

If there is no workload, we don't want to change the layout. You could argue we could set one fixed value like 8, but if you do this, you need to change the compute logic of conv2d too (like def conv2d_spatial_pack_nchw): there, VC comes from cfg['tile_co'], not 8.
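
To make the dependency concrete: the 5D repack bakes the tuned VC into the kernel's innermost dimension, so a different VC yields a physically different weight tensor. A NumPy illustration with made-up sizes (not the actual TVM code):

    import numpy as np

    # OIHW kernel with made-up sizes: CO=16, CI=8, KH=KW=3
    kernel = np.zeros((16, 8, 3, 3), dtype="int8")
    CO, CI, KH, KW = kernel.shape
    VC = 4  # stands in for cfg['tile_co'].size[-1]

    # Split CO into (CO//VC, VC) and move VC innermost: 4D -> 5D
    packed = kernel.reshape(CO // VC, VC, CI, KH, KW).transpose(0, 2, 3, 4, 1)
    print(packed.shape)  # (4, 8, 3, 3, 4) -- the last dim is the tuned VC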

A default legalization would make sense.
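
In the legalize hook that might translate to something like this (a sketch mirroring the fallback handling quoted earlier; helper_no_fast_int8_hw_legalization is the existing default helper visible in the diff above):

    _, outs = relay.backend.compile_engine.select_implementation(
        relay.op.get("nn.conv2d"), attrs, tinfos, out_type, target
    )
    workload = autotvm.task.get_workload(outs)
    if workload is None:
        # Not an AutoTVM template: just emit the default legalization
        return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.nn.conv2d)

    cfg = autotvm.task.DispatchContext.current.query(target, workload)
    if cfg.is_fallback:
        # Untuned workload: the picked config is not trustworthy, so also
        # fall back to the default legalization instead of returning None
        autotvm.task.clear_fallback_cache(target, workload)
        return helper_no_fast_int8_hw_legalization(attrs, inputs, types, relay.nn.conv2d)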

Hi @FrozenGene, I think I see why we don't want to change the layout when there is no workload (no workload means we don't even know the strategy, I think). What I am missing is why we don't want to change the layout when cfg.is_fallback: in that case the strategy is defined, so we know how the weights should be reshaped (and likewise, during legalization, we know which data type we need).

Entering the fallback configuration means we didn't find this workload's configuration in the tuning log. So, as I replied before, even if I know this is conv2d_nchw_spatial_pack.arm_cpu, I cannot get cfg['tile_co'].size[-1].

Maybe I am wrong, but are you sure that when cfg.is_fallback, parameters like cfg['tile_co'] are not defined? We usually set them to some default values (I think). But even if we don't set them, IIUC they will get "some" value among the possible ones. Am I missing something?

Ah… you are right, @giuseros, sorry I misled you. I remembered wrong before. We do have a default value; it is 1 if I remember correctly. But even though we have a value, it cannot be trusted, because we haven't tuned it. We could arguably fix it to 4 or 8, but I don't think that brings much benefit: once we enter the fallback path, the performance is not guaranteed anyway. If you really want to, you could set it to 4 or 8 on fallback as I said, but it doesn't mean much.