Quantized Transformer


I’m trying to use TVM’s stack to deploy INT8-quantized Transformer-based models.

I tried Relay + Ansor (AutoScheduler) on a single-layer Transformer, and the results were not encouraging:

Time (ms)                                Original  Quantized
PyTorch                                  20
TVM (Relay, optimized)                   130       120
TVM (Relay, optimized), Ansor (it=20k)   17        44
  • Number of runs: 100
  • The standard deviation was very small.
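
For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a timing loop that reports mean and standard deviation over 100 runs. It is not the script used for the numbers above; `benchmark`, `dummy_workload`, and the warm-up count are hypothetical names and values chosen for illustration, and the workload is a stand-in for a real inference call.

```python
import statistics
import time


def benchmark(fn, n_runs=100, n_warmup=10):
    """Time fn() n_runs times and return (mean_ms, stdev_ms)."""
    # Warm-up calls are not timed (assumption: the first calls may
    # include one-off costs such as lazy initialization).
    for _ in range(n_warmup):
        fn()

    samples_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    return statistics.mean(samples_ms), statistics.stdev(samples_ms)


def dummy_workload():
    # Hypothetical stand-in for a single model inference.
    return sum(i * i for i in range(10_000))


mean_ms, stdev_ms = benchmark(dummy_workload)
print(f"mean = {mean_ms:.3f} ms, stdev = {stdev_ms:.3f} ms over 100 runs")
```

On a GPU target the callable would also have to wait for the device to finish before returning, otherwise only the kernel launch is timed.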

In your opinion, what would be the best next steps? Could you recommend a good starting point or useful references?


First of all, Ansor is not a good fit for int8, since it cannot use fast int8 hardware instructions (VNNI on CPU, Tensor Cores on GPU) at all.
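
To make "fast int8 hardware" a bit more concrete, here is a rough pure-Python emulation of what one 32-bit lane of a VNNI-style dot product computes: four unsigned-8-bit by signed-8-bit products summed into an int32 accumulator in a single step. This is only an illustrative sketch; `vnni_dot4` and `int8_dot` are hypothetical helper names, not TVM or intrinsic APIs, and real hardware processes many such lanes per instruction.

```python
def vnni_dot4(acc, a_u8, b_s8):
    """Emulate one lane: acc + sum of four (u8 * s8) products, wrapped to int32."""
    assert len(a_u8) == len(b_s8) == 4
    assert all(0 <= x <= 255 for x in a_u8)      # unsigned 8-bit operands
    assert all(-128 <= y <= 127 for y in b_s8)   # signed 8-bit operands
    total = acc + sum(x * y for x, y in zip(a_u8, b_s8))
    # Wrap around to signed 32-bit (assumption: non-saturating variant).
    return (total + 2**31) % 2**32 - 2**31


def int8_dot(a_u8, b_s8):
    """Dot product built from 4-element steps (length must be a multiple of 4)."""
    assert len(a_u8) == len(b_s8) and len(a_u8) % 4 == 0
    acc = 0
    for i in range(0, len(a_u8), 4):
        acc = vnni_dot4(acc, a_u8[i:i + 4], b_s8[i:i + 4])
    return acc


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [1, -1, 1, -1, 2, -2, 2, -2]
print(int8_dot(a, b))  # -6
```

A schedule that does not map the inner loop onto such instructions ends up doing these multiply-accumulates one element at a time, which fits the quantized model showing no speedup over fp32 in the table above.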

  • How are you quantizing the model?
  • What backends are you interested in? CPU or GPU?

Thanks for the reply.

  • PyTorch → Relay → Ansor → TVM’s low-level code → LLVM/NVCC (LLVM was used above)
  • Both CPU and GPU (in particular, NVIDIA T4)