In PyTorch, if I want to run a module multiple times, I can keep it resident on the GPU by calling module.cuda(). This saves me the latency of transferring the model from CPU to GPU on every inference call.
Is there equivalent behavior in TVM? (i.e., can I keep an evaluator pinned to a GPU so that repeated calls don't suffer the transfer latency?)
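For concreteness, this is the PyTorch pattern I mean (a minimal sketch; the nn.Linear layer, shapes, and loop count are just placeholders): the weights are copied to the GPU once, and repeated forward calls reuse that resident copy.

import torch
import torch.nn as nn

# Stand-in module; the layer and shapes are placeholders.
model = nn.Linear(128, 64).cuda()          # weights moved to the GPU once
x = torch.randn(32, 128, device="cuda")    # input staged on the GPU as well

with torch.no_grad():
    for _ in range(3):
        y = model(x)                        # no per-call weight transfers here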
I see - perhaps I am using the wrong API here then?
I am using the TVM VM to run a model that is imported from ONNX - essentially my code looks like this:
from tvm import relay

# Import the ONNX model into Relay and create a VM-based executor.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
executor = relay.create_executor("vm", mod=mod, target=target)
evaluator = executor.evaluate()

# Run the evaluator repeatedly with the same arguments.
results = evaluator(**args)
results = evaluator(**args)
results = evaluator(**args)
However, when I profile the above code, TVM seems to spend a large amount of time doing HtoD (host-to-device) transfers, even when I include warm-up runs of the evaluator. Is there some other explanation for this?
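One thing I wondered about: could the transfers be coming from the inputs rather than the model itself? The sketch below is what I would try to rule that out (the device handle tvm.cuda(0), the input name "input_0", and the shape are assumptions, not my real model): convert each numpy input to a device NDArray once and reuse it across calls, instead of handing fresh numpy arrays to the evaluator every time.

import numpy as np
import tvm

dev = tvm.cuda(0)  # assumed device handle for the target GPU

# Hypothetical input; the name and shape stand in for my real ONNX inputs.
np_inputs = {"input_0": np.random.rand(1, 3, 224, 224).astype("float32")}

# Copy each input to the GPU once and reuse the resulting NDArrays across calls.
args = {name: tvm.nd.array(data, dev) for name, data in np_inputs.items()}

# `evaluator` is the callable returned by executor.evaluate() above.
results = evaluator(**args)
results = evaluator(**args)

Would that be expected to remove the per-call HtoD traffic, or are the transfers more likely coming from somewhere else?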