TVM inference performance almost 10x better than PyTorch

I was benchmarking TVM and PyTorch in terms of inference time, and I see that TVM's performance is almost 10x better than PyTorch's. I also compared the prediction probabilities for each inference, and they almost match between TVM and PyTorch. The inference time results look very suspicious to me, and I couldn't find any issue in my code. I thought it would be better to get the TVM experts' opinion on this.

Is it really possible to get this performance with TVM, or is something wrong in my code?

Here are the inference time results in ms for different batch sizes (bs):

                1bs        8bs        16bs       32bs       64bs
    PyTorch     5.038ms    0.7796ms   0.3248ms   0.1513ms   0.0786ms
    TVM         0.1976ms   0.0204ms   0.0099ms   0.0051ms   0.0025ms

I have shared my code here: https://github.com/manojec054/tvm-explore/blob/master/tvm_explore.py

Dataset: Animal classification with 3 classes

Image size: 224 x 224

TVM version: 0.9.dev0

module.run is non-blocking; you need to put the timer end after module.get_output.
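
In other words (a sketch), the original measurement only times the launch:

    start = time.time()
    module.run()       # non-blocking: returns once execution has been launched
    end = time.time()  # stops the clock before the actual work has finished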


Thanks for the quick reply.

That's a major mistake. But I'm just curious why it is not mentioned (that module.run is non-blocking) in the API reference here: tvm.contrib.graph_executor — tvm 0.9.dev0 documentation

I have corrected my code to include get_output in the timed region as well:

    start = time.time()
    module.run()
    tvm_output = module.get_output(0)
    end = time.time()

Here are my results: tvm_results

I still see TVM performance 10x better than PyTorch.

I agree the docs can be improved; a PR is welcome. Can you try running module.benchmark?

I am also running into problems with benchmarking. I am trying to measure inference in C++, and the problem is that there is no equivalent to module.benchmark. Is there a correct way to benchmark in C++ somewhere that we can replicate?
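
As far as I know there is no packaged C++ helper, but module.benchmark is a thin wrapper around the runtime's time_evaluator, which is the mechanism you would replicate. A minimal Python sketch (the parameters are illustrative); the underlying packed function lives in the C++ runtime (registered as runtime.RPCTimeEvaluator, if I remember correctly), so it should be reachable from C++ through the function registry:

    dev = tvm.cuda(0)
    # time_evaluator takes care of warm-up, repeats, and device synchronization
    timer = module.module.time_evaluator("run", dev, number=10, repeat=3, min_repeat_ms=500)
    print(timer())  # timing statistics over the repeats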

With module.benchmark I see totally different results:

    data_tvm = tvm.nd.array(data.cpu().numpy().astype('float32'))
    module.set_input("data", data_tvm)  # pass the TVM NDArray, not the torch tensor
    print(module.benchmark(device=tvm.cuda(0), func_name="run", repeat=3, min_repeat_ms=500, number=10))

Here are the results: tvm_results2

I just saw that you run on GPU. Shouldn't you perform a GPU synchronization to measure time correctly, like in this example done with Torch? For TVM inference you could try:

    start = time.time()
    module.run()
    tvm.cuda().sync()  # or something similar
    end = time.time()

And for PyTorch inference:

    start = time.time()
    _ = model(data)
    torch.cuda.synchronize()  # wait for the GPU to finish before stopping the timer
    end = time.time()

It took some time to evaluate the different APIs available for measuring inference time in PyTorch. It turns out that the measured time varies a lot depending on which API is used. I added https://github.com/manojec054/tvm-explore/blob/master/pytorch_benchmark_explore.py#L13 to time a matrix multiplication operation, and here are the results (the CUDA Events variant is sketched below the numbers):

Using TORCH Profile = 0.0522901119158268ms

Using CUDA Events = 0.03761523292819038ms

Using TIME API = 0.018018245697021484ms

Using TIME & SYNC API = 0.022334575653076172ms
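
For reference, the CUDA Events measurement is roughly the following (a sketch with illustrative matrix sizes; the exact code is in the linked script):

    import torch

    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")

    torch.matmul(a, b)        # warm-up
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()  # wait until the end event has been recorded
    print(start.elapsed_time(end), "ms")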

Any suggestions on which one is the most accurate?

It seems none of these results are accurate. If you run nsys nvprof python3 pytorch_benchmark_explore.py you will see the stats for the specific GEMM kernel.

Using the TIME API without SYNC is not correct. With TORCH Profile, I'm not sure whether we should sum all the events. Also, we usually need a large number of repeats inside the timing region.
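
Something along these lines should address both issues (a sketch with illustrative sizes and repeat counts):

    import time
    import torch

    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")

    for _ in range(10):       # warm-up
        torch.matmul(a, b)
    torch.cuda.synchronize()

    n = 1000
    start = time.time()
    for _ in range(n):        # a large number of repeats inside the timed region
        torch.matmul(a, b)
    torch.cuda.synchronize()  # make sure all queued GPU work has finished
    end = time.time()
    print((end - start) / n * 1000, "ms per matmul")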
