I’m exploring quantization in TVM. According to the quantization tutorial, TVM supports power2 quantization (i.e. quantized weights are rounded to the nearest power of 2).
This can be enabled with, for example:

```python
with relay.quantize.qconfig(
    calibrate_mode="global_scale",
    global_scale=8.0,
    weight_scale="power2",
):
    mod = relay.quantize.quantize(mod, params)
```
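For clarity, my understanding of what `weight_scale="power2"` should do is round each scale to the nearest power of two. A minimal sketch of that rounding in plain Python (my own helper, not TVM code):

```python
import math

def round_to_power2(x: float) -> float:
    """Round a positive scale to the nearest power of two
    (my understanding of the intent of weight_scale="power2")."""
    return 2.0 ** round(math.log2(x))

print(round_to_power2(6.3))   # -> 8.0
print(round_to_power2(0.11))  # -> 0.125
```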
However, I am getting the same inference time as with regular int8 quantization (TVM v0.8.0).
Inspecting the generated code (see the snippet below for how I extract the CUDA source), I observe that my power2-quantized models still use multiplications rather than bitshifts.

Is there something I’m missing with regard to replacing these multiplies with bitshifts? A missing pass, perhaps? I don’t see any dedicated compute/schedule definitions for this case in TOPI, and searching the repo hasn’t turned up anything either.
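For context, this is the strength reduction I expected to see in the generated code, sketched in plain Python (not TVM code): when the scale is an exact power of two, a multiply can be lowered to a left shift.

```python
def mul_by_pow2(x: int, scale: int) -> int:
    """Multiply x by scale via a shift, assuming scale is a power of two."""
    # A positive power of two has exactly one bit set.
    assert scale > 0 and (scale & (scale - 1)) == 0, "scale is not a power of two"
    shift = scale.bit_length() - 1  # e.g. 8 -> shift by 3
    return x << shift

print(mul_by_pow2(37, 8), 37 * 8)  # -> 296 296
```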
```python
model = relay.create_executor("vm", mod, dev, target)
model._make_executor()
# imported_modules is a list; the first entry is the CUDA module
cuda = model.executable.lib.imported_modules[0]
print(cuda.get_source())
```
I also found that the power2 quantization pass does not actually work properly: the generated weights are not powers of 2. In addition, data-aware quantization is currently broken for a wide range of models I have tried.
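For reference, this is roughly how I checked the extracted weight values (a plain-Python check of my own, not TVM code):

```python
import math

def is_power2(x: float) -> bool:
    """True if |x| is an exact power of two."""
    if x == 0:
        return False
    # math.frexp gives x = m * 2**e with 0.5 <= m < 1;
    # exact powers of two have mantissa m == 0.5.
    m, _ = math.frexp(abs(x))
    return m == 0.5

print([is_power2(v) for v in (0.25, -2.0, 3.0, 0.0)])
# -> [True, True, False, False]
```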