In the recent TVM paper “Relax: Composable Abstractions for End-to-End Dynamic Machine Learning” (Section 5.1), the authors describe LLM support through the Relax IR. A notable claim states: “Importantly, Relax compiles models only once for arbitrary batch sizes and sequence lengths.” This is elaborated further: “More importantly, cross-level abstractions enable us to use compiler-optimized matrix-vector multiplication tensor programs at batch size 1, while being able to apply partial library lowering to leverage operator libraries for other batch sizes.”
Does this imply that the implementation works roughly as follows (see the sketch after this list):
- During compilation, a single dynamic-shaped tensor program is lowered into multiple binary artifacts, each optimized for different shape parameters, and
- At runtime, the virtual machine employs a dispatch mechanism that selects the appropriate precompiled binary based on the concrete input shapes?
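To make the question concrete, here is a minimal Python sketch of the kind of dispatch I have in mind. Everything in it is hypothetical: the function and kernel names are invented for illustration and are not actual Relax VM internals or TVM APIs.

```python
import numpy as np

def matmul_dispatch(x: np.ndarray, weight: np.ndarray, kernels: dict):
    """Hypothetical runtime dispatch over precompiled artifacts.

    `kernels` maps names to callables standing in for shape-specialized
    compiled binaries; the keys are made up for this example.
    """
    batch_size = x.shape[0]
    if batch_size == 1:
        # Compiler-optimized matrix-vector (GEMV) tensor program for batch size 1
        return kernels["gemv_batch1"](x, weight)
    # Partial library lowering: hand other batch sizes to an operator library
    # (e.g. a vendor GEMM such as cuBLAS)
    return kernels["library_gemm"](x, weight)
```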