Performance Drop After Compiling Model with TVM Compared to ONNX Runtime

Hi everyone,

I’m compiling an object detection model with TVM and tuning it with the auto-scheduler. However, the compiled model runs slower than the original ONNX model executed with onnxruntime-gpu. Details below:


1. Compilation and Auto-Tuning
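
For reference, here is a simplified sketch of the compile-and-tune flow I’m using (the model path, input name/shape, trial count, and log file name below are placeholders, not my exact settings):

```python
# Sketch of the ONNX -> Relay -> auto-scheduler -> build pipeline (illustrative values).
import onnx
import tvm
from tvm import relay, auto_scheduler

target = tvm.target.Target("cuda")

# Import the ONNX model into Relay (input name and shape are placeholders).
onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 640, 640)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Extract tuning tasks and run the auto-scheduler.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=20000,  # total trials across all tasks (placeholder budget)
    runner=auto_scheduler.LocalRunner(repeat=10),
    measure_callbacks=[auto_scheduler.RecordToFile("tuning.log")],
)
tuner.tune(tune_option)

# Build with the tuned schedules applied.
with auto_scheduler.ApplyHistoryBest("tuning.log"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)

lib.export_library("compiled_model.so")
```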


2. Performance Measurements

  1. ONNX Runtime (GPU)

    • Mean inference time: 6.10 ms

  2. TVM Model (Compiled)

    • Script: performance.py

    • Mean inference time (before tuning): 8.83 ms

    • Mean inference time (after tuning): 7.16 ms

Even with TVM’s auto-scheduler optimization, the mean inference time (7.16 ms) is still slower than the ONNX Runtime result of 6.10 ms.
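Roughly, performance.py follows the standard graph-executor timing pattern with `time_evaluator`; a simplified sketch (file name, input name, and shape are illustrative):

```python
# Load the compiled library and time it with TVM's time_evaluator.
import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cuda(0)
lib = tvm.runtime.load_module("compiled_model.so")
module = graph_executor.GraphModule(lib["default"](dev))

# Random input with the same shape the model was compiled for (placeholder).
data = np.random.uniform(size=(1, 3, 640, 640)).astype("float32")
module.set_input("input", data)

# time_evaluator handles warm-up runs and GPU synchronization.
timer = module.module.time_evaluator("run", dev, number=100, repeat=3)
prof_res = np.array(timer().results) * 1000  # seconds -> milliseconds
print(f"TVM mean inference time: {prof_res.mean():.2f} ms")
```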


3. Environment

  • Docker Base Image: nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

  • CPU:

    • Intel 12th Gen Core i9-12900K (16 cores / 24 threads)
    • lscpu excerpt:
      • Architecture: x86_64
      • CPU op-mode(s): 32-bit, 64-bit
      • Address sizes: 46 bits physical, 48 bits virtual
  • GPU:

    • NVIDIA GeForce RTX 3090
    • Driver Version: 535.183.01
    • CUDA Version (driver-reported): 12.2 (container toolkit is 11.8 per the base image)
    • GPU Memory: 24 GB

Question

Is this performance difference (6.10 ms with ONNX Runtime vs. 7.16 ms with TVM after auto-tuning) expected, or might there be something wrong in my workflow? Any suggestions on how to further analyze and improve the performance of the compiled TVM model would be greatly appreciated.

If there’s any additional information or logs you’d like me to provide, feel free to let me know. Thank you in advance for your help!