TVM is a really handy ML compiler for model optimization. However, when it comes to deployment, cloning and building the whole TVM repository seems tedious and unnecessary for inference-only scenarios.
I am wondering what the lightest way is to run tvm_runtime in a Python-based pipeline: how can we minimize the requirements and build only what is strictly needed for inference? Going even further, can we output a completely dependency-free runtime .so that doesn't need any other library to run?
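For reference, the flow I have in mind is the usual graph-executor loading path, roughly as in the sketch below (the file name, input name, and shapes are just placeholders). The question is how little of TVM has to be installed or built for only this part to work.

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

# Load a previously compiled and exported model (path is a placeholder)
lib = tvm.runtime.load_module("deploy_lib.so")
dev = tvm.cuda(0)  # or tvm.cpu(0)

# Create the graph executor and run one inference
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data", tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"), dev))
module.run()
out = module.get_output(0).numpy()
```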
The other question is: since TVM already has a very well-developed codegen system, is it doable to generate single operators (e.g. CUDA kernels) instead of a complete IR or runtime for the whole model, so that we can plug the auto-optimized operators into frameworks like PyTorch or TensorRT?
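To make that concrete, I know a single operator can already be built and its CUDA source dumped through the TE API, roughly like the sketch below (the vector-add compute and the trivial schedule are only placeholders). What I am unsure about is the recommended way to hand such auto-tuned kernels over to PyTorch or TensorRT.

```python
import tvm
from tvm import te

# A toy single operator: element-wise add (placeholder for a real workload)
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# A simple CUDA schedule: bind the loop to blocks and threads
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))

# Build just this operator and dump the generated CUDA C source
mod = tvm.build(s, [A, B, C], target="cuda", name="vector_add")
print(mod.imported_modules[0].get_source())
```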
Thanks!