Hello,
I have been trying to optimize my fully connected neural network using AutoTVM. I would like to compile 4 different models with 4 different input shapes (input [bs, 1, 2] → output [bs, 1, 8], with bs = 60, 200, 3000, and 3536). I am running on an Intel Xeon Platinum 8260L, which is a Cascade Lake architecture. This is my running configuration:
target='llvm -mcpu=cascadelake'
number = 30
repeat = 10
min_repeat_ms = 4000
timeout = 20
trials = 2000
early_stopping = (trials//2)+1
opt=4
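For context, this is roughly how I wire those values together, following the AutoTVM x86 tuning tutorial (a sketch: the records filename is illustrative, and I tune locally rather than through an RPC tracker):

```python
from tvm import autotvm

number = 30
repeat = 10
min_repeat_ms = 4000
timeout = 20
trials = 2000
early_stopping = (trials // 2) + 1

# Build and run each candidate schedule on the local machine,
# as in the AutoTVM tutorials (no RPC tracker).
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=timeout),
    runner=autotvm.LocalRunner(
        number=number,
        repeat=repeat,
        min_repeat_ms=min_repeat_ms,
        timeout=timeout,
    ),
)

tuning_option = {
    "tuner": "xgb",                          # XGBoost cost model
    "trials": trials,
    "early_stopping": early_stopping,
    "measure_option": measure_option,
    "tuning_records": "autotvm-records.json",  # illustrative filename
}
```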
I use the XGBoost tuner. AutoTVM reports at the very most 400 GFLOPS, on a CPU that should be capable of at least 1000 GFLOPS. When I export my model with:
with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=opt, config={}):
        lib = relay.build(mod, target=target, params=params)
lib.export_library("compiledNN-bs" + str(bs) + ".so")
print(f"Exported library with batch size {bs}")
and then include it in my C++ project, using the same approach as the cpp_deploy tutorial, I get really bad performance: about 100x slower than TensorFlow/cppflow inference (which also allows dynamic shapes). I really feel like I am missing one big thing about TVM or AutoTVM. At first I thought the opt_level wasn't being taken into account, but it seems to be something else.
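For scale, the per-inference FLOP budget of this network is tiny, which may matter when interpreting GFLOPS numbers (a sketch; I assume, for illustration only, a single dense layer mapping [bs, 1, 2] → [bs, 1, 8], counting 2 FLOPs per multiply-add and ignoring the bias):

```python
def dense_flops(bs, in_features=2, out_features=8):
    # 2 FLOPs (multiply + add) per MAC; bias add ignored.
    return 2 * bs * in_features * out_features

for bs in (60, 200, 3000, 3536):
    flops = dense_flops(bs)
    # Ideal execution time at 400 GFLOPS, in microseconds.
    us = flops / 400e9 * 1e6
    print(f"bs={bs}: {flops} FLOPs, ~{us:.3f} us at 400 GFLOPS")
```

Even at bs = 3536 that is only ~113k FLOPs per inference, so framework overhead could dominate the wall-clock time I am measuring.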
I tried with 'llvm -mcpu=skylake-avx512' but got the same kind of results. I also tried setting TVM_NUM_THREADS to 24, with no improvement. Also, I am running AutoTVM on the same CPU where I will be running the model; I am not using the RPC server (maybe I should?).
I would like to know if anyone has an idea of where or what I should be looking at. I can provide more code or more detailed information if needed.
PS: Thank you for all the effort and the work; I think TVM is a great framework with awesome goals.