Surprisingly low performance on CPU for a DenseNN

Hello,

I have been trying to optimize my fully connected neural network using autoTVM. I would like to compile 4 different models with 4 different input shapes (input: [bs, 1, 2] → output: [bs, 1, 8], with bs = 60, 200, 3000, and 3536). I am running on an Intel Xeon Platinum 8260L, which is a Cascade Lake architecture. This is my running configuration:

target='llvm -mcpu=cascadelake'
number = 30
repeat = 10
min_repeat_ms = 4000
timeout = 20
trials = 2000
early_stopping = (trials//2)+1
opt=4
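
Concretely, the options above plug into the tuning loop roughly as follows (a simplified sketch: the toy dense layer stands in for my real model, with the singleton middle dimension dropped, and the log-file name is just a placeholder):

import numpy as np
import tvm
from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

# Toy stand-in for the real model: a single dense layer [bs, 2] -> [bs, 8]
bs = 60
data = relay.var("data", shape=(bs, 2), dtype="float32")
weight = relay.var("weight", shape=(8, 2), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([data, weight], relay.nn.dense(data, weight)))
params = {"weight": tvm.nd.array(np.random.uniform(size=(8, 2)).astype("float32"))}

target = "llvm -mcpu=cascadelake"
log_file = "tuning_records.json"

# Measurement settings from the configuration above
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(timeout=20),
    runner=autotvm.LocalRunner(number=30, repeat=10, min_repeat_ms=4000, timeout=20),
)

# Extract the tunable tasks and tune each one with the XGBoost cost model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)
for task in tasks:
    tuner = XGBTuner(task, loss_type="rank")
    trials = min(2000, len(task.config_space))
    tuner.tune(
        n_trial=trials,
        early_stopping=(trials // 2) + 1,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(log_file)],
    )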

I use the XGBoost tuner. AutoTVM reports at most 400 GFLOPS, on a CPU that should reach at least 1000 GFLOPS. When I export my model with:

# Build with the best schedules found during tuning, then export a shared library
with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=opt, config={}):
        lib = relay.build(mod, target=target, params=params)
        lib.export_library("compiledNN-bs" + str(bs) + ".so")
        print(f"Exported library with batch size {bs}")

And then, when I include it in my C++ project using the same approach as in the cpp_deploy tutorial, I get really poor performance (about 100x slower than TensorFlow/cppflow inference, which allows dynamic shapes). I really feel like I am missing one big thing about TVM or AutoTVM. I thought at first that the opt_level wasn't being taken into account, but it seems to be something else.

I tried with ‘llvm -mcpu=skylake-avx512’ but got the same kind of results. I also tried setting TVM_NUM_THREADS to 24, but that didn’t change anything. Also, I am running AutoTVM on the same CPU where I will be running the model; I am not using the RPC server (maybe I should?).

I would like to know if anyone has an idea of where/what I should be looking at; I can provide more code or more detailed information if needed.

PS: Thank you for all the effort and work; I think TVM is a great framework that serves a great purpose.

I changed the ‘target’ and added LLVM’s options directly inside it, so now I have:

target='llvm -mcpu=cascadelake -opt-level=3 -fast-math -num-cores=4 -fast-math-arcp -fast-math-contract -fast-math-nnan -fast-math-reassoc' 

The performance is much more acceptable now. I don’t understand why AutoTVM doesn’t try these configurations. I am guessing the scheduler only tries optimizations like loop tiling, memory layout, etc. Also, I don’t understand the difference between this opt-level and the opt_level from the PassContext in Python. If someone has an explanation, that would be great!

Hi, @Aympab.

TVM translates its internal IR to target code and uses an existing codegen toolchain such as LLVM. In this compilation flow, opt_level in PassContext is for TVM’s internal Relay/TIR passes, while opt-level in the target is for the codegen. Thus, the opt-level in your target configuration reflects the LLVM flag setting.
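
For example, both knobs can appear in the same build, but they steer different stages (a quick sketch; `mod` and `params` are whatever you already pass to relay.build):

import tvm
from tvm import relay

# -opt-level in the target string is forwarded to LLVM's code generator
target = "llvm -mcpu=cascadelake -opt-level=3"

# opt_level in PassContext selects TVM's own Relay/TIR passes
# (operator fusion, constant folding, etc.)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)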

AutoTVM simply relies on the target configuration provided by the user, assuming that the user provides a reasonable setting. I believe this is mainly to avoid an explosive search space. So, I wouldn’t be surprised to see cases where we can extract more speedup by tweaking the target configuration.
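
For example, if you want the tuning measurements themselves to be built with those LLVM flags, put them in the target string before extracting tasks (a sketch reusing the names from your script):

target = "llvm -mcpu=cascadelake -opt-level=3 -num-cores=4"
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)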


OK, I understand! Thanks a lot for these details, they are really useful to me.