TVM RPC for concurrent inference? (e.g., multi-request server)

Hey folks,

I’ve been using TVM RPC to test models on remote devices (like a Raspberry Pi), and it’s great for single-stream debugging and benchmarking.

However, I hit a wall when trying to simulate a realistic server scenario. Imagine an inference server that needs to handle multiple concurrent requests (e.g., for an LLM API). The current RPC server seems to process requests sequentially in a single thread/queue. This makes it impossible to saturate the device’s compute (CPU/GPU cores) or measure true throughput under load.

My question is: Has anyone else run into this? Is there a known pattern or workaround to use TVM RPC for concurrent, multi-client inference on a single remote device?

Some thoughts:

  1. Would launching multiple RPC server processes on different ports and connecting to them through a client-side session pool be the right approach (rough sketch after this list)? It feels a bit hacky.
  2. Or is the intended production path to skip RPC entirely and embed the TVM runtime directly into a concurrent server (FastAPI/gRPC) on the remote device itself?
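For context, here's roughly what I had in mind for option 1. This is only a sketch under assumptions: one `tvm.exec.rpc_server` process per port on the device, a compiled artifact already exported as `model.tar`, and an input tensor named `data`; the host address, ports, and shapes are placeholders.

```python
# Sketch of option 1: a client-side pool over several RPC server processes,
# e.g. started on the device with `python -m tvm.exec.rpc_server --port 9090`
# (and 9091, 9092, ...). All names/addresses below are placeholders.
import queue
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import graph_executor

HOST = "192.168.1.42"               # remote device address (assumed)
PORTS = [9090, 9091, 9092, 9093]    # one rpc_server process per port

def make_session(port):
    sess = rpc.connect(HOST, port)
    sess.upload("model.tar")                  # compiled artifact (assumed)
    lib = sess.load_module("model.tar")
    dev = sess.cpu(0)
    return graph_executor.GraphModule(lib["default"](dev)), dev

# Borrow/return pool: a GraphModule is not thread-safe, so each in-flight
# request gets exclusive use of one session while it runs.
pool = queue.Queue()
for p in PORTS:
    pool.put(make_session(p))

def run_request(x):
    module, dev = pool.get()
    try:
        module.set_input("data", tvm.nd.array(x, dev))
        module.run()
        return module.get_output(0).numpy()
    finally:
        pool.put((module, dev))

with ThreadPoolExecutor(max_workers=len(PORTS)) as ex:
    inputs = [np.random.rand(1, 3, 224, 224).astype("float32") for _ in range(16)]
    outputs = list(ex.map(run_request, inputs))
```

It works for throughput experiments, but each connection is still a separate OS process on the device, which is why it feels like a workaround rather than a supported pattern.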

I’m curious about the community’s experience and any plans to make the RPC layer more “server-friendly” in the future. Thanks!

The RPC layer is mainly meant for benchmarking and debugging, not for concurrent serving.

Your option 2 is the better choice: skip RPC and embed the TVM runtime directly into a concurrent server on the device.
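A minimal sketch of what that can look like, assuming a compiled `model.so` already on the device, an input named `data`, and a fixed 1x3x224x224 shape (all placeholders):

```python
# Sketch of option 2: load the compiled library directly and serve it behind
# FastAPI on the device. Run with e.g. `uvicorn serve:app --port 8000`
# (assuming this file is serve.py).
import threading

import numpy as np
import tvm
from tvm.contrib import graph_executor
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

lib = tvm.runtime.load_module("model.so")   # compiled artifact (assumed)
dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
# A single GraphModule is not thread-safe; serialize access with a lock
# (or keep one module per worker thread for real parallelism).
lock = threading.Lock()

class PredictRequest(BaseModel):
    data: list  # flattened float input

@app.post("/predict")
def predict(req: PredictRequest):
    x = np.asarray(req.data, dtype="float32").reshape(1, 3, 224, 224)
    with lock:
        module.set_input("data", tvm.nd.array(x, dev))
        module.run()
        out = module.get_output(0).numpy()
    return {"output": out.tolist()}
```

With the lock, only one request executes at a time per process; to actually saturate the cores you would run several server workers or keep a pool of GraphModule instances, one per worker.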