Bad performance of Dense layers on Xeon E5-2680v3

Hello,

I tried running TVM on a two-socket system with Xeon E5-2680v3 CPUs (12 cores each, hyperthreading disabled), and the performance is disappointing (especially compared to GPUs). I compared against the same matrix multiplication in numpy, using the same dimensions and datatypes, and numpy usually runs about 10x faster.
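For reference, the numpy baseline I used looks roughly like this. This is a minimal sketch with scaled-down stand-in shapes so it runs instantly; the real measurement used the shapes listed further down:

```python
import time
import numpy as np

# Scaled-down stand-ins for the real shapes (245, 12027) x (27517, 12027)
batch, in_features, units = 8, 64, 32

x = np.random.rand(batch, in_features).astype("float32")
w = np.random.rand(units, in_features).astype("float32")  # weight: (units, in_features)

start = time.perf_counter()
y = x @ w.T  # dense layer: (batch, in_features) x (in_features, units)
elapsed = time.perf_counter() - start

print(y.shape)               # (8, 32)
print(f"{elapsed:.6f} s")
```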

I tried `llvm`, `llvm -mcpu=core-avx2`, and `llvm -mcpu=haswell` as targets, and I also tried Auto-Scheduling and Auto-Tuning (which often just fail with various error messages), but the performance does not change.

Example: Dense layer, bound to a single CPU (the 12 cores of one NUMA node)

  • Batch Size: 245
  • Input Features: 12027
  • Units: 27517
  • Input Shape: (245, 12027)
  • Weight Shape: (27517, 12027)
  • DType: float32

Execution Time for this layer: 7 seconds
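To put that number in context: a dense layer performs roughly 2 × batch × in_features × units floating-point operations (one multiply plus one add per MAC), so 7 seconds corresponds to only about 23 GFLOPS:

```python
batch, in_features, units = 245, 12027, 27517
seconds = 7.0

flops = 2 * batch * in_features * units  # one multiply + one add per MAC
gflops = flops / seconds / 1e9

print(f"{flops:.2e} FLOPs -> {gflops:.1f} GFLOPS")  # ~1.62e+11 FLOPs -> ~23.2 GFLOPS
```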

While htop shows 100% utilization on the allocated cores, the CPU's power consumption is much lower than under a typical workload (80 W instead of 120 W).
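For comparison, a rough single-socket peak estimate (assuming the nominal 2.5 GHz base clock and Haswell's two AVX2 FMA units per core, i.e. 2 ports × 8 fp32 lanes × 2 FLOPs per FMA; actual AVX clocks are typically lower):

```python
cores = 12
clock_hz = 2.5e9             # nominal base clock; AVX clocks run lower in practice
flops_per_cycle = 2 * 8 * 2  # 2 FMA ports x 8 fp32 lanes x (multiply + add)

peak_gflops = cores * clock_hz * flops_per_cycle / 1e9
print(f"~{peak_gflops:.0f} GFLOPS peak fp32 per socket")  # ~960

achieved = 23.2  # measured above: ~23 GFLOPS
print(f"achieved: {achieved / peak_gflops:.1%} of peak")  # ~2.4%
```

So the schedule is reaching only a few percent of the socket's theoretical fp32 peak, which is consistent with the low power draw.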

I also profiled random dense-layer configurations with the PAPI profiler, measuring power consumption and the number of AVX instructions executed. It looks like TVM does not always generate AVX instructions.

Addition:

The AutoTVM output is:

    Current/Best: 0.00/ 0.00 GFLOPS | Progress: (10/10) | 496.63 s
    WARNING:root:Could not find any valid schedule for task Task(func_name=dense_pack.x86, args=(('TENSOR', (245, 12027), 'float32'), ('TENSOR', (27517, 12027), 'float32'), None, 'float32'), kwargs={}, workload=('dense_pack.x86', ('TENSOR', (245, 12027), 'float32'), ('TENSOR', (27517, 12027), 'float32'), None, 'float32')).
    A file containing the errors has been written to /tmp/tvm_tuning_errors_1_5yy4hr.log.