Motivation
Although TVM provides a quantization flow for pre-quantized models, we find that some developers prefer to use their own quantization flow for their accelerators, since they may have specialized calibration and quantization flows other than TVM QNN. However, the current BYOC flow has limited support for this scenario. One current workaround involves two compilation passes. In the first pass, we partition the graph and run it through the graph runtime to collect the calibration data. In the second pass, the calibration results are used along with the BYOC flow to generate the final quantized code for the accelerator.
Proposal
In this RFC, we want to provide a clean and easy-to-use interface for developers to collect the calibration data that feeds their calibration and quantization flows. With this interface, they can obtain the calibration data, along with the subgraph information needed for the final code generation, through a single API.
Programming Model
from tvm import relay
from tvm.relay import analysis, testing, transform

mod, params = relay.testing.mobilenet.get_workload(...)
# passes for generating partitioned graphs
mod = transform.AnnotateTarget(["dnnl"])(mod)
mod = transform.MergeCompilerRegions()(mod)
mod = transform.PartitionGraph()(mod)
# proposed calibration flow and API
i_data = ...  # the input data to be calibrated
calib_data = analysis.calibrate_partition_graph(mod, i_data, params)
# pass the calibration data to the external codegen and build the program
with transform.PassContext(opt_level=3, config={'calib_data': calib_data}):
    relay.build(mod, ...)
We propose a new analysis API, calibrate_partition_graph (better names are welcome), that takes three inputs: the partitioned module, the input data to be calibrated, and the parameters. It returns the calibration data, which is a mapping from each subgraph name to all of its input and output values. Below we show a synthetic example.
The Relay graph after partitioning:
def @dnnl0(%dnnl0_i0: Tensor[(3, 3), float32], %dnnl0_i1: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
add(%dnnl0_i0, %dnnl0_i1)
}
def @dnnl1(%dnnl1_i0: Tensor[(3, 3), float32], %dnnl1_i1: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
sub(%dnnl1_i0, %dnnl1_i1)
}
def @main(%data0: Tensor[(3, 3), float32], %data1: Tensor[(3, 3), float32], %data2: Tensor[(3, 3), float32]) -> Tensor[(3, 3), float32] {
%0 = @dnnl0(%data0, %data1)
@dnnl1(%0, %data2)
}
Then this will be the calibration data we get:
{"main":  {"inputs":  [data0, data1, data2],
           "outputs": [output]},
 "dnnl0": {"inputs":  [data0, data1],
           "outputs": [%0]},
 "dnnl1": {"inputs":  [%0, data2],
           "outputs": [output]}}
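As an illustration of how a vendor flow might consume this data, the following sketch reduces each collected tensor to a per-tensor scale. This is a minimal sketch; compute_scales and the symmetric min/max scheme are purely illustrative and not part of this RFC:

import numpy as np

# Illustrative only: reduce each collected tensor to a symmetric int8 scale.
# A real vendor flow would plug in its own calibration algorithm
# (e.g., KL divergence or percentile-based).
def compute_scales(calib_data):
    scales = {}
    for subgraph, data in calib_data.items():
        scales[subgraph] = {
            "inputs": [np.abs(t).max() / 127.0 for t in data["inputs"]],
            "outputs": [np.abs(t).max() / 127.0 for t in data["outputs"]],
        }
    return scales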
Note that if we have multiple sets of data to be calibrated, the final results will be a list of lists. Finally, to use the calibration data during code generation, we send it to the external codegen through the PassContext, as shown in the sketch below.
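On the codegen side, the config entry can be read back from the current PassContext. A minimal sketch, assuming the 'calib_data' key has been registered as a pass config option (e.g., with TVM_REGISTER_PASS_CONFIG_OPTION on the C++ side); my_codegen is a hypothetical external codegen entry point:

import tvm

def my_codegen(func):
    # Read the calibration data registered in the surrounding PassContext.
    ctx = tvm.transform.PassContext.current()
    calib_data = ctx.config["calib_data"]
    # ... use calib_data to emit quantized code for `func` ...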
Implementation Details
We implement two passes to get the calibration results. The first pass removes all backend-specific attributes and marks every intermediate tensor as an output. Then we use the graph runtime to get the tensor values. The second pass builds the mapping between the subgraph names and the tensor values. Finally, we perform some post-processing to get the calibration data in the form shown above. A sketch of the collection step follows.
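A minimal sketch of the collection step, assuming flattened_mod is the module produced by the first pass (all intermediate tensors exposed as outputs); the variable and input names here are illustrative, not the actual POC code:

import tvm
from tvm import relay
from tvm.contrib import graph_runtime

lib = relay.build(flattened_mod, target="llvm", params=params)
runtime = graph_runtime.GraphModule(lib["default"](tvm.cpu()))
runtime.set_input("data", i_data)  # input name is illustrative
runtime.run()
# Every output now corresponds to one tensor in the original graph; the
# second pass maps these values back to subgraph inputs and outputs.
tensors = [runtime.get_output(i).asnumpy() for i in range(runtime.get_num_outputs())]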
The POC branch is available here.