Hello,
I tried running TVM on a two-socket system with Xeon E5-2680v3 CPUs (12 cores each, hyperthreading disabled), and the performance seems insufficient, especially compared to GPUs. I compared it against the equivalent matrix multiplication in numpy, using the same dimensions and data types, and numpy usually runs about 10x faster.
I tried `llvm`, `llvm -mcpu=core-avx2`, and `llvm -mcpu=haswell` as targets, and I also tried Auto-Scheduling and AutoTVM tuning (which often just fail with different error messages, see the addition below), but the performance does not change.
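For the Auto-Scheduler runs, the setup was roughly the following (a minimal sketch rather than my exact script; the workload registration, trial count, and log file name are simplified placeholders):

```python
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def dense_layer(batch, in_feat, out_feat, dtype):
    # Plain dense: out[i, j] = sum_k data[i, k] * weight[j, k]
    data = te.placeholder((batch, in_feat), name="data", dtype=dtype)
    weight = te.placeholder((out_feat, in_feat), name="weight", dtype=dtype)
    k = te.reduce_axis((0, in_feat), name="k")
    out = te.compute(
        (batch, out_feat),
        lambda i, j: te.sum(data[i, k] * weight[j, k], axis=k),
        name="dense",
    )
    return [data, weight, out]

# One of the targets I tried; plain "llvm" and "llvm -mcpu=core-avx2" give the same result
target = tvm.target.Target("llvm -mcpu=haswell")

task = auto_scheduler.SearchTask(
    func=dense_layer, args=(245, 12027, 27517, "float32"), target=target
)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # placeholder trial count
    measure_callbacks=[auto_scheduler.RecordToFile("dense_autoschedule.json")],
)
task.tune(tune_option)
sch, args = task.apply_best("dense_autoschedule.json")
func = tvm.build(sch, args, target)
```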
Example: dense layer, bound to a single CPU (12 cores of one NUMA node)
- Batch Size: 245
- Input Features: 12027
- Units: 27517
- Input Shape: (245, 12027)
- Weight Shape: (27517, 12027)
- DType: float32
Execution Time for this layer: 7 seconds
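For reference, the measurement looks roughly like this (a minimal sketch rather than my exact script; the random inputs and repeat counts are placeholders):

```python
import time
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

batch, in_feat, out_feat = 245, 12027, 27517

# Single dense layer as a Relay module
data = relay.var("data", shape=(batch, in_feat), dtype="float32")
weight = relay.var("weight", shape=(out_feat, in_feat), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], relay.nn.dense(data, weight)))

target = tvm.target.Target("llvm -mcpu=haswell")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)

dev = tvm.cpu(0)
rt = graph_executor.GraphModule(lib["default"](dev))
a = np.random.rand(batch, in_feat).astype("float32")
b = np.random.rand(out_feat, in_feat).astype("float32")
rt.set_input("data", a)
rt.set_input("weight", b)

# TVM execution time
ftimer = rt.module.time_evaluator("run", dev, number=1, repeat=5)
print("TVM:   %.3f s" % np.mean(ftimer().results))

# numpy baseline with the same shapes and dtype
start = time.perf_counter()
np.dot(a, b.T)
print("numpy: %.3f s" % (time.perf_counter() - start))
```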
While htop shows 100% utilization on the allocated cores, the CPU's power consumption is much lower than under a typical workload (80 W instead of 120 W).
I also tried random dense layer configurations and evaluated the resulting schedules with the PAPI profiler, measuring power consumption and the number of AVX instructions. It looks like TVM does not always generate AVX instructions.
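Independent of PAPI, a quick way to check for vector code is to grep the generated assembly (a sketch; it assumes the `lib` object from the snippet above and that the LLVM backend exposes the assembly via `get_source("asm")`):

```python
# Dump the x86 assembly TVM generated for the compiled module
# (`lib` is the relay.build result from the benchmark sketch above)
asm = lib.get_lib().get_source("asm")

# Packed AVX2/FMA math shows up as vfmadd...ps on ymm registers;
# if these counts are near zero, the kernel is essentially scalar
print("vfmadd occurrences:", asm.count("vfmadd"))
print("ymm occurrences:", asm.count("ymm"))
```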
Addition:
The AutoTVM error is:
Current/Best: 0.00/ 0.00 GFLOPS | Progress: (10/10) | 496.63 s
WARNING:root:Could not find any valid schedule for task Task(func_name=dense_pack.x86, args=(('TENSOR', (245, 12027), 'float32'), ('TENSOR', (27517, 12027), 'float32'), None, 'float32'), kwargs={}, workload=('dense_pack.x86', ('TENSOR', (245, 12027), 'float32'), ('TENSOR', (27517, 12027), 'float32'), None, 'float32')). A file containing the errors has been written to /tmp/tvm_tuning_errors_1_5yy4hr.log.
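For completeness, the AutoTVM call that produces this warning is roughly the following (a sketch; `mod` and `target` are the Relay module and target from the benchmark snippet above, and the log file name is a placeholder):

```python
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner

# Extract the dense task(s) from the Relay module defined above
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params={})

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=1, repeat=3),
)

for task in tasks:
    tuner = XGBTuner(task)
    tuner.tune(
        n_trial=10,  # matches the (10/10) progress in the log above
        measure_option=measure_option,
        callbacks=[
            autotvm.callback.progress_bar(10),
            autotvm.callback.log_to_file("dense_autotvm.log"),
        ],
    )
```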