Batched Inference for LLM

Hi, I’m trying to use TVM for LLM inference.

I followed the TVM docs on optimizing a large language model, and it works well for the single-batch case. (Optimize Large Language Model — tvm 0.18.dev0 documentation)

But I cannot find any documentation or reference on how to do batched inference.

Is there any way to run batched inference for an LLM?

Batched inference is more complicated. The PagedKVCache interface has support for it, and we also need to be able to dispatch to key kernels like CUTLASS/cuBLAS. You can check out https://github.com/mlc-ai/mlc-llm for a complete LLMEngine built on the TVM flow.
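For reference, here is a minimal sketch of what running a request through MLC LLM's Python engine looks like. The engine class and OpenAI-style call are taken from the mlc-llm repo, and the model id is a placeholder, so please verify the exact names against the current docs:

```python
# Sketch only: assumes the mlc_llm Python package and its OpenAI-style
# MLCEngine chat API; the model string is a placeholder, not from this thread.
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # hypothetical model id
engine = MLCEngine(model)

# Single request through the OpenAI-compatible chat interface.
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is TVM?"}],
    model=model,
    stream=False,
)
print(response.choices[0].message.content)

# Release the engine's resources when done.
engine.terminate()
```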


Does mlc-llm natively (without modifying the code) support batched inference?

Yes, MLC LLM supports continuous batching and the other features necessary for concurrent serving.
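To illustrate, a rough sketch of submitting several requests concurrently so the engine can batch them internally. AsyncMLCEngine and the OpenAI-style call below are assumptions based on the mlc-llm repo; check the current docs for the exact interface:

```python
# Sketch only: assumes mlc_llm's AsyncMLCEngine with an OpenAI-style
# chat.completions API; the model id below is a placeholder.
import asyncio
from mlc_llm import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # hypothetical model id

async def one_request(engine, prompt: str) -> str:
    # Each call is a separate request; the engine schedules them together.
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=False,
    )
    return response.choices[0].message.content

async def main():
    engine = AsyncMLCEngine(model)
    prompts = [
        "What is TVM?",
        "Explain the paged KV cache.",
        "What is continuous batching?",
    ]
    # Submitting the requests concurrently lets continuous batching kick in.
    outputs = await asyncio.gather(*(one_request(engine, p) for p in prompts))
    for prompt, out in zip(prompts, outputs):
        print(f"{prompt}\n-> {out}\n")
    engine.terminate()

asyncio.run(main())
```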
