Hello! I have spent some time analyzing the inner workings of TVM and VTA, and I would like to know whether the following assumptions are correct. The VTA paper states:
"The runtime performs JIT compilation of the accelerator binaries and manages heterogeneous execution between the CPU and VTA."
Initially I understood this to mean that the JIT compiler, while executing code, might decide whether a given call (a matrix multiplication, for example) should run on the CPU, on the GPU, or on any other hardware in a heterogeneous system, and would then compile that call on the fly for the chosen target.
However, after checking the code and rereading the papers, it seems that when we compile for the VTA accelerator, lowering transforms some statements into VTA calls (example). When the CPU executing the code reaches one of those calls, it generates the VTA instructions on the fly using the runtime. Is this correct? If so, why isn't all the code generated in a single compilation step, with the VTA instructions stored in memory and sent to the accelerator when the call happens?
What would happen in a heterogeneous system with a CPU, VTA, and a GPU? Could we have a JIT runtime that decides, based on some heuristic, whether an operation should be executed on VTA or on the GPU?
I have sometimes read that certain layers in neural networks run faster on CPUs because they perform a very large number of memory accesses. If this is true, could TVM run some layers on the CPU and others on the GPU?
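To make the last two questions concrete, here is a minimal sketch of the kind of heuristic I have in mind. It is plain Python and does not use any TVM or VTA API; the names `Op`, `VTA_SUPPORTED`, and `choose_device`, as well as the threshold value, are hypothetical and made up purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical set of operator kinds the accelerator could handle.
VTA_SUPPORTED = {"conv2d", "dense"}


@dataclass
class Op:
    kind: str           # operator type, e.g. "conv2d"
    flops: float        # amount of arithmetic work
    bytes_moved: float  # amount of memory traffic


def choose_device(op: Op, intensity_threshold: float = 10.0) -> str:
    """Pick a device for one operation using a toy heuristic.

    Memory-bound operations (low FLOPs per byte) stay on the CPU.
    Compute-bound operations go to VTA if the kind is supported,
    otherwise to the GPU.
    """
    intensity = op.flops / op.bytes_moved
    if intensity < intensity_threshold:
        return "cpu"
    if op.kind in VTA_SUPPORTED:
        return "vta"
    return "gpu"


# A toy three-layer network placed layer by layer.
layers = [
    Op("conv2d", flops=1e9, bytes_moved=1e6),
    Op("relu", flops=1e6, bytes_moved=4e6),
    Op("softmax", flops=1e8, bytes_moved=1e6),
]
placement = [choose_device(op) for op in layers]
```

My question is whether this kind of decision can be made at run time by the JIT runtime, or whether the placement of each layer has to be fixed when the model is compiled.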
Thank you for your time.