Batched Inference for LLM

Hi, I’m trying to use TVM for LLM inference.

I followed the TVM docs on optimizing a large language model, and it works well for the single-batch case. (Optimize Large Language Model — tvm 0.18.dev0 documentation)

But I cannot find any documentation or reference on how to do batched inference.

Is there any way to run batched inference for an LLM?

Batched inference is more complicated. The PagedKVCache interface has support for it, and we also need to be able to dispatch to key kernels like CUTLASS/cuBLAS. You can check out https://github.com/mlc-ai/mlc-llm for a complete LLMEngine built on the TVM flow.
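For reference, here is a minimal sketch of what running a request through MLC LLM's Python engine looks like. The engine class and OpenAI-style call are taken from the mlc-llm repo, and the model id is a placeholder, so please verify the exact names against the current docs:

```python
# Sketch only: assumes the mlc_llm Python package and its OpenAI-style
# MLCEngine chat API; the model string is a placeholder, not from this thread.
from mlc_llm import MLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # hypothetical model id
engine = MLCEngine(model)

# Single request through the OpenAI-compatible chat interface.
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is TVM?"}],
    model=model,
    stream=False,
)
print(response.choices[0].message.content)

# Release the engine's resources when done.
engine.terminate()
```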


Does mlc-llm natively (without modifying the code) support batched inference?

Yes, MLC LLM supports continuous batching and the other features necessary for concurrent serving.
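To illustrate, a rough sketch of submitting several requests concurrently so the engine can batch them internally. AsyncMLCEngine and the OpenAI-style call below are assumptions based on the mlc-llm repo; check the current docs for the exact interface:

```python
# Sketch only: assumes mlc_llm's AsyncMLCEngine with an OpenAI-style
# chat.completions API; the model id below is a placeholder.
import asyncio
from mlc_llm import AsyncMLCEngine

model = "HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"  # hypothetical model id

async def one_request(engine, prompt: str) -> str:
    # Each call is a separate request; the engine schedules them together.
    response = await engine.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        stream=False,
    )
    return response.choices[0].message.content

async def main():
    engine = AsyncMLCEngine(model)
    prompts = [
        "What is TVM?",
        "Explain the paged KV cache.",
        "What is continuous batching?",
    ]
    # Submitting the requests concurrently lets continuous batching kick in.
    outputs = await asyncio.gather(*(one_request(engine, p) for p in prompts))
    for prompt, out in zip(prompts, outputs):
        print(f"{prompt}\n-> {out}\n")
    engine.terminate()

asyncio.run(main())
```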
