How does TVM measure performance when tuning for a different target device?

As I understand it, specifying a different target only generates a binary for that target; it does not execute anything. If you try to run that binary locally, it will fail. That has been my experience with TIR modules. I think Relax modules use a separate runtime called relax_vm, and I'm not sure what happens there.

As for tuning for heterogeneous targets, I'm not sure whether there is a better way, but my approach was to limit parallelism and the available intrinsics so that the computer running TVM simulates the target device as closely as possible.
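A minimal sketch of that kind of setup, under a few assumptions: the local tuning target is an `llvm` target, an older `-mcpu` value is used to keep LLVM from emitting newer vector instructions, and the thread count is capped through the `-num-cores` target attribute and the `TVM_NUM_THREADS` environment variable. The helper name `constrained_host_setup` and the values `core2` and `4` are hypothetical; pick whatever approximates your device best.

```python
import os


def constrained_host_setup(num_cores, mcpu):
    """Build a restricted LLVM target string and matching environment overrides.

    num_cores: number of cores the simulated device should appear to have.
    mcpu: an LLVM CPU name with a reduced instruction set (an assumption;
          choose one whose features roughly match the real device).
    """
    target = f"llvm -mcpu={mcpu} -num-cores={num_cores}"
    # Assumed to cap the size of the TVM CPU runtime's thread pool.
    env = {"TVM_NUM_THREADS": str(num_cores)}
    return target, env


target_str, env = constrained_host_setup(num_cores=4, mcpu="core2")
os.environ.update(env)  # set this before the TVM runtime starts its threads
print(target_str)  # llvm -mcpu=core2 -num-cores=4
```

The resulting string would then be passed wherever the tuning target is specified. Timings obtained this way only approximate the real device, since cache sizes, memory bandwidth, and clock speed still belong to the host.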