Task scheduler total latency differs significantly from benchmark latency

Hi, everyone. I have been trying to do some benchmarking with MetaSchedule but am having trouble explaining the results I get.

The latency that task_scheduler reports during tuning differs significantly from the latency I measure when benchmarking the graph module. For example, task_scheduler reports a total latency of 9,835 us for MobileNet after 2000 trials, while profiling with GraphModuleDebug gives 16,134 us and benchmarking gives 27,350 us.
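
To make the size of the gap concrete, here is a minimal sketch that only restates the three numbers quoted above as ratios. The dictionary keys and the helper name `ratios_to_baseline` are my own labels for illustration, not TVM identifiers.

```python
# Latencies quoted in this post, in microseconds.
latencies_us = {
    "task_scheduler total": 9835,
    "GraphModuleDebug profile": 16134,
    "benchmark": 27350,
}


def ratios_to_baseline(latencies, baseline_key):
    """Return each latency divided by the latency stored under baseline_key."""
    baseline = latencies[baseline_key]
    return {name: value / baseline for name, value in latencies.items()}


ratios = ratios_to_baseline(latencies_us, "task_scheduler total")
for name, ratio in ratios.items():
    print(f"{name}: {latencies_us[name]} us ({ratio:.2f}x)")
```

So the profiled latency is roughly 1.6 times the task_scheduler total, and the benchmarked latency is roughly 2.8 times it.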

I am using an Intel Xeon Platinum 8368 CPU with TVM Unity, Relay, LLVM 14, and the following target string: llvm -num-cores 38 -mcpu=icelake-server -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512f

The graph module debugger reports that only a single thread is used, even though I specify 38 cores in the target string. I should note that I am running this on my university's HPC system, but I schedule the job on one node with an entire socket (38 cores). Am I specifying the target string incorrectly?
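
One thing that may be worth ruling out on the HPC node is whether the process itself is restricted to fewer CPUs than the job requested. As far as I know, the TVM runtime sizes its thread pool from the environment (for example `TVM_NUM_THREADS`) and the CPUs visible to the process rather than from the target string, but I have not confirmed that this is the cause here. The sketch below uses only the Python standard library; the function name `thread_environment` is hypothetical.

```python
import os


def thread_environment():
    """Collect settings that commonly limit a process's worker threads."""
    info = {
        # Unset variables are reported as None.
        "TVM_NUM_THREADS": os.environ.get("TVM_NUM_THREADS"),
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        # Logical CPUs on the machine, regardless of scheduler limits.
        "cpu_count": os.cpu_count(),
    }
    # sched_getaffinity is available on Linux but not on macOS.
    if hasattr(os, "sched_getaffinity"):
        info["allowed_cpus"] = len(os.sched_getaffinity(0))
    else:
        info["allowed_cpus"] = None
    return info


if __name__ == "__main__":
    for key, value in thread_environment().items():
        print(f"{key}: {value}")
```

If `allowed_cpus` is 1 inside the batch job, or one of the environment variables is set to 1, that would point at the job environment rather than the target string.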

I have also run some tests on an M3 Max Mac and observed similar behavior, with the task_scheduler latency below the benchmark latency, or with a drop in the task_scheduler latency not producing a decrease in the benchmark latency. The results are much closer overall on this system, and it uses all the available cores. However, flushing the cache in the local runner does not seem to work on Mac.

Overall, I would be very grateful for any insights into the cause of this. I can make my code and the logs available if desired.