Status on quantization in TVM

Hi all,

I’m working together with @wiebevr on using TVM for an embedded SoC with a neural network accelerator. We are currently looking into how we can best add quantization support to our TVM fork.

We’re looking at both pre-quantized models and quantization done in TVM itself, whichever would be easiest. However, it’s quite hard to find information on the best way to get started with this. We have already found some implementations scattered throughout this forum.

I also found the tutorials in the gallery/how_to/deploy_models folder (deploy_quantized.py, deploy_prequantized.py, deploy_prequantized_tflite.py), but I had some issues getting them to work, and was wondering if they were still being kept up to date.

We’re also looking into int2 and int4 deployment; would the BYOD tutorial be a good fit for this?

This search has been a bit overwhelming, any help is greatly appreciated!

Thank you in advance!

Josse and Wiebe


Tutorials are run on CI, so if you use the same framework versions as CI, they should work. I’m aware that deploy_prequantized.py didn’t work with recent PyTorch versions, but I’ve just merged the fix this week to make it work on PyTorch 1.10.

We are not ready for sub-byte e2e deployment. There are some int4 CUDA kernels in topi, but they don’t work with Relay yet. For example, constant folding doesn’t work with sub-byte types or bf16.

I know; I was in the same situation when I started investigating the quantization situation in late 2019. We should have good support for prequantized TFLite and PyTorch models by now, and this year we began adding prequantized ONNX support. Quantized ONNX models can be created via the tf2onnx tool or using the quantization tool in ONNXRuntime.
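As a minimal sketch (not from this thread; file names are placeholders), post-training dynamic quantization with the ONNXRuntime quantization tool looks roughly like this:

```python
# Sketch: post-training dynamic quantization with ONNXRuntime's quantization tool.
# "model_fp32.onnx" / "model_int8.onnx" are placeholder file names; the fp32 model
# could come from tf2onnx, for example.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,  # quantize weights to signed int8
)
```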

Quantization by TVM itself, however, is not in a good state. First of all, we should acknowledge that quantization is a very hard problem in its own right. The automatic quantization that’s already in TVM was implemented around 2018, and the consensus seems to be that it doesn’t work beyond its demo (ImageNet). As you found in the forum, earlier this year there was a proposal to introduce a new quantization system, but as far as I know there has been no real development on that RFC. Also, quantization-aware training (QAT) is often required to preserve sane accuracy for some int8 models (BERT, MobileNet, EfficientNet-type models, etc.) and most int4 models. Since we don’t yet support training even fp32 models, I don’t expect QAT in TVM to happen any time soon.
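For reference, that existing automatic quantization pass is invoked roughly as follows; this is only a sketch, assuming a float32 Relay module `mod` with `params`, and the exact qconfig options vary by TVM version:

```python
# Sketch: TVM's existing automatic quantization pass (relay.quantize), for reference.
# `mod` and `params` are assumed to be a float32 Relay module and its parameters.
from tvm import relay

with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    quantized_mod = relay.quantize.quantize(mod, params)
```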

My recommendation for a quantization flow is to quantize models using the DL frameworks and rely on TVM’s prequantized model support (in practice, this is the only reasonable way). Both TF and PyTorch have reasonable support for quantization (both PTQ and QAT). Several companies are using this flow in production (Arm, EdgeCortix).
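A minimal sketch of that recommended flow, assuming PyTorch and a prequantized torchvision model (essentially what deploy_prequantized.py does; names and shapes are illustrative):

```python
# Sketch: quantize in the framework (here a ready-made int8 torchvision model),
# then import the result into TVM via the PyTorch frontend.
import torch
import torchvision
from tvm import relay

# Prequantized ResNet-18 from torchvision.
model = torchvision.models.quantization.resnet18(pretrained=True, quantize=True).eval()

# TorchScript the model so the TVM PyTorch frontend can consume it.
inp = torch.rand(1, 3, 224, 224)
scripted = torch.jit.trace(model, inp).eval()

# Import into Relay; the quantized layers are lowered to QNN ops.
mod, params = relay.frontend.from_pytorch(scripted, [("input", (1, 3, 224, 224))])
```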

Outside of the DL frameworks’ built-in quantization functionality, there are also many third-party quantization tools that take TF/PT models, quantize them via PTQ or QAT, and output the quantized model in ONNX. I believe those tools exist because the frameworks’ built-in quantization support may not be sufficient for many HW vendors. If you are serious about quantization, I encourage you to take a look at them. Since they output ONNX models, those tools are compatible with TVM. I’ve added some links below.
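As a sketch, importing such a quantized ONNX model into TVM looks roughly like this (file and input names are placeholders):

```python
# Sketch: importing a quantized ONNX model (e.g. produced by one of the tools above)
# into TVM. The file name and input name/shape are placeholders.
import onnx
from tvm import relay

onnx_model = onnx.load("model_int8.onnx")
shape_dict = {"input": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
```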


The TFLite frontend of TVM isn’t yet fully featured enough to cover all of the quantized behaviour in TFLite. Additionally, there are open tickets on adding more quantized operator support to the TFLite frontend. See this tracking ticket.

Ramana
