Motivation
Although TVM provides a quantization flow for pre-quantized models, we find that some developers prefer to use their own quantization flow for their accelerators, since they may have specialized calibration and quantization flows other than TVM QNN. However, the current BYOC flow has limited support for this scenario. One current workaround involves two compilation passes. In the first pass, we partition the graph and run it through the graph runtime to collect the calibration data. In the second pass, the calibration results are used along with the BYOC flow to generate the final quantized code for the accelerator.
Proposal
In this RFC, we want to provide a clean and easy-to-use interface for developers to collect the calibration data that feeds their calibration and quantization flows. With this interface, they can obtain the calibration data, along with the subgraph information needed for the final code generation, through a single API.
Programming Model
from tvm import relay
from tvm.relay import analysis, testing, transform

mod, params = relay.testing.mobilenet.get_workload(...)
# passes for generating partitioned graphs
mod = transform.AnnotateTarget(["dnnl"])(mod)
mod = transform.MergeCompilerRegions()(mod)
mod = transform.PartitionGraph()(mod)
# proposed calibration flow and API
i_data = ...  # the input data to be calibrated
calib_data = analysis.calibrate_partition_graph(mod, i_data, params)
# pass the calibration data to the external codegen and build the program
with transform.PassContext(opt_level=3, config={'calib_data': calib_data}):
    relay.build(mod, ...)
We propose a new analysis API, calibrate_partition_graph (better names are welcome), that takes three inputs: the partitioned module, the input data to be calibrated, and the parameters. It returns the calibration data, which is a mapping from each subgraph name to all of its input and output values. Below we show a synthetic example.
The Relay graph after partitioning:
def @dnnl0(%dnnl0_i0: Tensor[(3, 3), float32], %dnnl0_i1: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
add(%dnnl0_i0, %dnnl0_i1)
}
def @dnnl1(%dnnl1_i0: Tensor[(3, 3), float32], %dnnl1_i1: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
sub(%dnnl1_i0, %dnnl1_i1)
}
def @main(%data0: Tensor[(3, 3), float32], %data1: Tensor[(3, 3), float32], %data2: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
%0 = @dnnl0(%data0, %data1)
@dnnl1(%0, %data2)
}
Then this will be the calibration data we get:
{"main":  {"inputs":  [data0, data1, data2],
           "outputs": [output]},
 "dnnl0": {"inputs":  [data0, data1],
           "outputs": [%0]},
 "dnnl1": {"inputs":  [%0, data2],
           "outputs": [output]}}
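As an illustration of how a vendor flow might consume this data, the following sketch reduces each collected tensor to a per-tensor scale. This is a minimal sketch; compute_scales and the symmetric min/max scheme are purely illustrative and not part of this RFC:

import numpy as np

# Illustrative only: reduce each collected tensor to a symmetric int8 scale.
# A real vendor flow would plug in its own calibration algorithm
# (e.g., KL divergence or percentile-based).
def compute_scales(calib_data):
    scales = {}
    for subgraph, data in calib_data.items():
        scales[subgraph] = {
            "inputs": [np.abs(t).max() / 127.0 for t in data["inputs"]],
            "outputs": [np.abs(t).max() / 127.0 for t in data["outputs"]],
        }
    return scales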
Note that if we have multiple sets of data to be calibrated, the final results will be a list of lists. Finally, to use the calibration data during code generation, we send it to the external codegen through the PassContext, as shown in the sketch below.
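On the codegen side, the config entry can be read back from the current PassContext. A minimal sketch, assuming the 'calib_data' key has been registered as a pass config option (e.g., with TVM_REGISTER_PASS_CONFIG_OPTION on the C++ side); my_codegen is a hypothetical external codegen entry point:

import tvm

def my_codegen(func):
    # Read the calibration data registered in the surrounding PassContext.
    ctx = tvm.transform.PassContext.current()
    calib_data = ctx.config["calib_data"]
    # ... use calib_data to emit quantized code for `func` ...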
Implementation Details
We implement two passes to get the calibration results. The first pass removes all backend-specific attributes and marks every intermediate tensor as an output. Then we use the graph runtime to get the tensor values. The second pass builds the mapping between the subgraph names and the tensor values. Finally, we perform some post-processing to get the calibration data in the form shown above. A sketch of the collection step follows.
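A minimal sketch of the collection step, assuming flattened_mod is the module produced by the first pass (all intermediate tensors exposed as outputs); the variable and input names here are illustrative, not the actual POC code:

import tvm
from tvm import relay
from tvm.contrib import graph_runtime

lib = relay.build(flattened_mod, target="llvm", params=params)
runtime = graph_runtime.GraphModule(lib["default"](tvm.cpu()))
runtime.set_input("data", i_data)  # input name is illustrative
runtime.run()
# Every output now corresponds to one tensor in the original graph; the
# second pass maps these values back to subgraph inputs and outputs.
tensors = [runtime.get_output(i).asnumpy() for i in range(runtime.get_num_outputs())]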
The POC branch is available here.