Surprisingly low performance on CPU for a DenseNN

Hello,

I have been trying to optimize my fully connected neural network using autoTVM. I would like to compile 4 different models with 4 different input shapes (input: [bs, 1, 2] → output: [bs, 1, 8], with bs = 60, 200, 3000, and 3536). I am running on an Intel Xeon Platinum 8260L, which is a Cascade Lake architecture. This is my running configuration:

target='llvm -mcpu=cascadelake'
number = 30
repeat = 10
min_repeat_ms = 4000
timeout = 20
trials = 2000
early_stopping = (trials//2)+1
opt=4
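
Concretely, the options above plug into the tuning loop roughly as follows (a simplified sketch: the toy dense layer stands in for my real model, with the singleton middle dimension dropped, and the log-file name is just a placeholder):

import numpy as np
import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

# Toy stand-in for the real model: a single dense layer [bs, 2] -> [bs, 8]
bs = 60
data = relay.var("data", shape=(bs, 2), dtype="float32")
weight = relay.var("weight", shape=(8, 2), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], relay.nn.dense(data, weight)))
params = {"weight": tvm.nd.array(np.random.uniform(size=(8, 2)).astype("float32"))}

target = "llvm -mcpu=cascadelake"
log_file = "tuning_records.json"

# Measurement settings from the configuration above
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=20),
    runner=autotvm.LocalRunner(number=30, repeat=10, min_repeat_ms=4000, timeout=20),
)

# Extract the tunable tasks and tune each one with the XGBoost cost model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")
    trials = min(2000, len(task.config_space))
    tuner.tune(
        n_trial=trials,
        early_stopping=(trials // 2) + 1,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(log_file)],
    )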

I use the XGBoost tuner. AutoTVM reports at most 400 GFLOPS, on a CPU that should reach at least 1000 GFLOPS. When I export my model with:

# Build with the best schedules found during tuning, then export a shared library
with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=opt, config={}):
        lib = relay.build(mod, target=target, params=params)
        lib.export_library("compiledNN-bs" + str(bs) + ".so")
        print(f"Exported library with batch size {bs}")

And then, when I include it in my C++ project using the same approach as in the cpp_deploy tutorial, I get really poor performance (about 100x slower than TensorFlow/cppflow inference, which allows dynamic shapes). I really feel like I am missing one big thing about TVM or AutoTVM. I thought at first that the opt_level wasn't being taken into account, but it seems to be something else.

I tried with ‘llvm -mcpu=skylake-avx512’ but got the same kind of results. I also tried setting TVM_NUM_THREADS to 24, but that didn’t change anything. Also, I am running AutoTVM on the same CPU where I will be running the model; I am not using the RPC server (maybe I should?).

I would like to know if anyone has an idea of where/what I should be looking at; I can provide more code or more detailed information if needed.

PS: Thank you for all the effort and work; I think TVM is a great framework that serves a great purpose.

I changed the ‘target’ and added LLVM’s options directly inside it, so now I have:

target='llvm -mcpu=cascadelake -opt-level=3 -fast-math -num-cores=4 -fast-math-arcp -fast-math-contract -fast-math-nnan -fast-math-reassoc' 

The performance is much more acceptable now. I don’t understand why AutoTVM doesn’t try these configurations. I am guessing the scheduler only tries optimizations like loop tiling, memory layout, etc. Also, I don’t understand the difference between this opt-level and the opt_level from the PassContext in Python. If someone has an explanation, that would be great!

Hi, @Aympab.

TVM translates its internal IR to target code and uses an existing codegen toolchain such as LLVM. In this compilation flow, opt_level in PassContext is for TVM’s internal Relay/TIR passes, while opt-level in the target is for the codegen. Thus, the opt-level in your target configuration reflects the LLVM flag setting.
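
For example, both knobs can appear in the same build, but they steer different stages (a quick sketch; `mod` and `params` are whatever you already pass to relay.build):

import tvm
from tvm import relay

# -opt-level in the target string is forwarded to LLVM's code generator
target = "llvm -mcpu=cascadelake -opt-level=3"

# opt_level in PassContext selects TVM's own Relay/TIR passes
# (operator fusion, constant folding, etc.)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)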

AutoTVM simply relies on the target configuration provided by the user, assuming that the user provides a reasonable setting. I believe this is mainly to avoid an explosive search space. So, I wouldn’t be surprised to see cases where we can extract more speedup by tweaking the target configuration.
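
For example, if you want the tuning measurements themselves to be built with those LLVM flags, put them in the target string before extracting tasks (a sketch reusing the names from your script):

target = "llvm -mcpu=cascadelake -opt-level=3 -num-cores=4"
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)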


OK, I understand! Thanks a lot for these details, they are really useful to me.