I would like to deploy transformer models to ARM CPUs with AutoTVM. Transformer models consist mainly of dense and matmul layers, and when the sequence length is small, the dense layers dominate the latency. However, I found that dense layers are not well optimized for ARM CPU in AutoTVM, which hurts end-to-end performance significantly. Is there a workaround (other than using AutoScheduler)?
You may wonder why I am not using AutoScheduler: I have added some custom operators to the transformer, and those custom operators are implemented ONLY with AutoTVM schedules.