Hi,
I’m trying to use TVM’s stack to deploy INT8-quantized, Transformer-based models.
As a first experiment I ran Relay + Ansor (AutoScheduler) on a single-layer Transformer (a rough sketch of the flow is below the table), and the results weren’t great:
| Time (ms) | Original | Quantized |
| --- | --- | --- |
| PyTorch | 20 | – |
| TVM (Relay, optimized) | 130 | 120 |
| TVM (Relay, optimized) + Ansor (20k trials) | 17 | 44 |
- Each number is the mean of 100 runs; the standard deviation was very small.
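For reference, here is roughly the flow I used. This is a minimal sketch rather than my exact script: the model definition, input shape, the `global_scale` calibration setting, the `llvm` target string, and the log-file name are all placeholders.

```python
import torch
import tvm
from tvm import relay, auto_scheduler

# Placeholder single-layer Transformer; my real model differs, but the flow is the same.
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4).eval()
example = torch.randn(8, 1, 256)  # (seq_len, batch, d_model)
scripted = torch.jit.trace(model, example)
mod, params = relay.frontend.from_pytorch(scripted, [("input", tuple(example.shape))])

# INT8 quantization with Relay's built-in quantizer.
# "global_scale" calibration is an assumption; I'm not showing my real calibration setup.
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    mod = relay.quantize.quantize(mod, params)

# Extract Ansor (auto-scheduler) tasks and tune (~20k trials total in my experiment).
target = tvm.target.Target("llvm -mcpu=skylake-avx512")  # placeholder target
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=20000,
    measure_callbacks=[auto_scheduler.RecordToFile("transformer_int8.json")],
))

# Compile with the tuned schedules applied.
with auto_scheduler.ApplyHistoryBest("transformer_int8.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```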
In your opinion, what would be good next steps? Could you recommend a starting point or useful references?
Thanks,