There are several key features that I believe are urgently needed to properly enable those workloads:
- KVCache management, including appending to the KVCache, switching different KVCaches (chat histories) in and out, heterogeneous offloading of the KVCache between devices, etc.
- Flexible integration of quantization algorithms, where each algorithm may have its own simulated data type (e.g. int4, NF4). Adding quantized operators in the normal way across the TVM stack (which requires updating TE/TOPI/TIR/Relay/AutoTVM) becomes impractical as soon as we aim to scale up.
- Dynamic shape representation, management, and code generation across the stack, in particular tracking the input/output relationship of each dynamic-shape subgraph. For example, in an LLM there are two independent dynamic shape vars, n and m, which we should distinguish.
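
To make the KVCache item more concrete, here is a minimal pure-Python sketch of the three operations it names: appending, switching between chat histories, and offloading. Everything in it is hypothetical and for illustration only: `KVCache`, `KVCacheManager`, `switch_to`, and the `"gpu"`/`"cpu"` labels are assumed names, not TVM APIs, and "offloading" is modeled as nothing more than a change of device label.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Hypothetical per-chat cache: one key and one value entry per token."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    device: str = "gpu"  # assumed device label, not a real TVM device

    def append(self, key, value):
        # Appending grows the cache by one token position.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


class KVCacheManager:
    """Hypothetical manager that keeps one cache per chat history."""

    def __init__(self):
        self.caches = {}    # chat id -> KVCache
        self.active = None  # id of the chat currently resident on the "gpu"

    def switch_to(self, chat_id):
        # Offload the previously active cache (modeled as a label change).
        if self.active is not None and self.active != chat_id:
            self.caches[self.active].device = "cpu"
        # Bring the requested cache in, creating it on first use.
        cache = self.caches.setdefault(chat_id, KVCache())
        cache.device = "gpu"
        self.active = chat_id
        return cache
```

In a real system the label change would be a device-to-device copy, and the append would write into preallocated storage rather than a Python list; the sketch only shows the bookkeeping that the compiler and runtime would need to support.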
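
As an illustration of what a "simulated" data type can mean for the quantization item, the sketch below stores signed int4 values two per byte and recovers them again. It assumes one particular layout (low nibble first, two's complement); real quantization schemes differ in layout and add scales or zero points, and NF4 uses a lookup table instead. The helper names `pack_int4` and `unpack_int4` are hypothetical.

```python
def pack_int4(values):
    """Pack signed 4-bit ints (range -8..7) two per byte, low nibble first."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        for v in (lo, hi):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} does not fit in int4")
        # Masking with 0xF yields the 4-bit two's-complement encoding.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Nibbles 8..15 encode the negative values -8..-1.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values
```

Every operator that touches such a tensor has to agree on this encoding, which is why each new quantization algorithm otherwise means changes at every layer of the stack.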