[DISCUSS] TVM Community Strategy for Foundational Models

There are several key features that I believe are urgently needed to properly enable those workloads:

  • KVCache management, including appending to the KVCache, switching different KVCaches (chat histories) in and out, heterogeneous offloading of the KVCache between devices, etc.
  • Flexible integration of quantization algorithms, where each algorithm may have its own simulated data type (e.g. int4, NF4). Adding quantized operators the normal way across the TVM stack (which requires updating TE/TOPI/TIR/Relay/AutoTVM) becomes infeasible as soon as we aim to scale up.
  • Dynamic shape representation, management, and code generation across the stack, in particular tracking the input/output relationship of each dynamic-shape subgraph. For example, an LLM has two independent dynamic shape variables, n and m, which we should distinguish.
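
To make the KVCache point concrete, here is a rough sketch in plain Python of the three operations I mean: append, switch between chat histories, and offload. This is not an existing TVM API. `KVCache`, `KVCacheManager`, `switch_to`, and the "gpu"/"cpu" device strings are all hypothetical names, and "offloading" is modeled only as a tag change rather than a real device copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class KVCache:
    """Hypothetical per-conversation cache: one (key, value) pair per token."""
    keys: List[Vector] = field(default_factory=list)
    values: List[Vector] = field(default_factory=list)
    device: str = "gpu"  # assumed device tag, not a real TVM device object

    def append(self, key: Vector, value: Vector) -> None:
        # The sequence length grows by one on every decoded token,
        # which is where the dynamic shape comes from.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)


class KVCacheManager:
    """Hypothetical manager that keeps one cache per chat session."""

    def __init__(self) -> None:
        self._caches: Dict[str, KVCache] = {}
        self.active: Optional[str] = None

    def switch_to(self, session_id: str) -> KVCache:
        # Offload the previously active cache (modeled as a tag change).
        if self.active is not None and self.active != session_id:
            self._caches[self.active].device = "cpu"
        # Create the cache on first use, then bring it onto the device.
        cache = self._caches.setdefault(session_id, KVCache())
        cache.device = "gpu"
        self.active = session_id
        return cache


manager = KVCacheManager()
chat_a = manager.switch_to("chat_a")
chat_a.append([0.1, 0.2], [0.3, 0.4])
chat_a.append([0.5, 0.6], [0.7, 0.8])

chat_b = manager.switch_to("chat_b")  # chat_a is offloaded here
chat_b.append([1.0, 1.0], [2.0, 2.0])

print(len(chat_a), chat_a.device)  # 2 cpu
print(len(chat_b), chat_b.device)  # 1 gpu
```

Each session's cache has its own length, so a real implementation would need the compiler and runtime to agree on how that length is represented and updated, which ties back to the dynamic shape point above.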