Keeping TVM model resident on GPU

Hi,

In PyTorch, if I want to run a module multiple times, I can keep it resident on the GPU with module.cuda(). This saves the latency of transferring the model from CPU to GPU on every inference call.

Is there an equivalent behavior in TVM? (i.e. can I keep an evaluator pinned to a gpu so that multiple calls won’t suffer the transfer latency?)

I think this is already what TVM does. On the first mod.run(), all parameters are copied to the GPU; subsequent runs don't involve a memcpy of the weights.


I see - perhaps I am using the wrong API here then?

I am using the TVM VM to run a model that is imported from ONNX - essentially my code looks like this:


mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
executor = relay.create_executor("vm", mod=mod, target=target)
evaluator = executor.evaluate()
# run evaluator with some params
results = evaluator(**args)
results = evaluator(**args)
results = evaluator(**args)

However, when I profile the above code, TVM seems to spend a large amount of time doing HtoD transfers, even after warmup runs of the evaluator. Is there an alternate explanation for this?

I’m not exactly sure what evaluator(**args) does under the hood. Below is how I measure performance on VM. Can you try this API?