Hi,
I’m trying to use TVM’s stack to deploy INT8-quantized, Transformer-based models.
As a first experiment I ran Relay + Ansor (AutoScheduler) on a single-layer Transformer (a rough sketch of the flow is below the table), and the results weren’t great:
| Time (ms) | Original | Quantized |
| --- | --- | --- |
| PyTorch | 20 | – |
| TVM (Relay, optimized) | 130 | 120 |
| TVM (Relay, optimized) + Ansor (20k trials) | 17 | 44 |
- Each number is the mean of 100 runs; the standard deviation was very small.
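For reference, here is roughly the flow I used. This is a minimal sketch rather than my exact script: the model definition, input shape, the `global_scale` calibration setting, the `llvm` target string, and the log-file name are all placeholders.

```python
import torch
import tvm
from tvm import relay, auto_scheduler

# Placeholder single-layer Transformer; my real model differs, but the flow is the same.
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4).eval()
example = torch.randn(8, 1, 256)  # (seq_len, batch, d_model)
scripted = torch.jit.trace(model, example)
mod, params = relay.frontend.from_pytorch(scripted, [("input", tuple(example.shape))])

# INT8 quantization with Relay's built-in quantizer.
# "global_scale" calibration is an assumption; I'm not showing my real calibration setup.
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    mod = relay.quantize.quantize(mod, params)

# Extract Ansor (auto-scheduler) tasks and tune (~20k trials total in my experiment).
target = tvm.target.Target("llvm -mcpu=skylake-avx512")  # placeholder target
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=20000,
    measure_callbacks=[auto_scheduler.RecordToFile("transformer_int8.json")],
))

# Compile with the tuned schedules applied.
with auto_scheduler.ApplyHistoryBest("transformer_int8.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```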
In your opinion, what would be good next steps? Could you recommend a starting point or useful references?
Thanks,