Thank you for the detailed answer! I’ll try Relax in the courses, and I look forward to it being ready for production soon.
I still have some questions. I want to benchmark TVM on models such as ResNet and BERT on a T4 GPU.
I have tried the graph executor and the Relay VM with ResNet-18, and the performance is not good. I don’t know whether I am getting the best performance TVM can deliver.
- I didn’t try auto-tuning; I just set the target to `cuda -libs=cudnn`.
- I used fp32 and didn’t try fp16 (I don’t know how to do it).
What’s the right way to get the best performance on GPU: cuDNN or auto-tuning?
Is there a benchmark (a results sheet or a runnable demo) that I can compare against to make sure I am getting the best performance?
By the way, how well does TVM support fp16 and Tensor Cores? Is there a demo or an introduction?
Thank you!