Good use of the C++ TVM Runtime

Hello,

I am using TVM to integrate a machine learning model into my code. I already tuned my model on my CPU using AutoTVM, and I integrated the tuned model into my C++ code following the cpp deploy tutorial.

The problem is that the performance is poor. I load my model as a dynamic library with the tvm::runtime::Module::LoadFromFile function. Inference takes about 2.5x longer than with cppflow, a C++ wrapper around the TensorFlow C API. In cppflow, the batch size (or input size) is set dynamically, whereas I compiled my models with static batch sizes, and I still get worse performance than with cppflow.

I have a few guesses :

  1. I am running on an Intel Xeon Platinum 8620L server CPU with a lot of RAM (3 TB). I didn’t find any Intel CPU benchmarks, but I did for AMD, ARM, and NVIDIA GPUs. Maybe I should be trying on AMD CPUs?
  2. My input sizes are 60, 200, 3000 and 3536, which may be too small to fully use my hardware’s capabilities. But I would expect smaller to be faster: if it can run a batch size of 2^20, it should be able to run a batch size of 3000 quickly?
  3. I am not sure what the difference is between LoadFromFile on a dynamic library and the (*tvm::runtime::Registry::Get("runtime.SystemLib"))(); method with a system library. In any case, I am not timing the loading, only the "run" function, tvm::runtime::PackedFunc run = gmod.GetFunction("run");, so it shouldn’t be a problem. But maybe TVM does some lazy initialization internally and only loads the model into memory when I call the "run" function?
  4. I tried rebuilding the TVM runtime library without the logging library, thinking that this might be the problem, but nothing changed.

These guesses are only intuition, to give you an idea of where I am stuck. If anyone has any idea where I should be looking, it would be a great help! I really feel like I just forgot one simple thing to get TVM fully working (maybe a compilation flag, or a function call somewhere?), because everything else works fine (AutoTVM, the integration, the NDArray structures, etc.).

Please do send me a message if you want to discuss this in private.

EDIT: I am going to recompile TVM with the MKL library. I’ll let you know if it works.

Hi, @Aympab

I’m not familiar with the C++ integration flow, but I think some pseudo code, the target, and the tuning configs would be helpful for a better discussion.

One question: have you compared the model performance before and after the C++ integration? If so, do you see any difference? I’m wondering whether AutoTVM is just bad at your model and target, or whether there is a more fundamental issue in the integration flow.

Hi @sunggg, thank you for your message.

This is my AutoTVM configuration :

  • target='llvm -mcpu=cascadelake -libs=cblas,mkl -opt-level=3 -fast-math -fast-math-arcp -fast-math-contract -fast-math-nnan -fast-math-reassoc'
  • number = 35
  • repeat = 10
  • min_repeat_ms = 20
  • timeout = 20
  • trials = 1500
  • early_stopping = (trials//2)+1
  • opt=4 #TVM opt level

To tune my model, I followed the “Compiling and Optimizing a Model with the Python Interface (AutoTVM)” tutorial and adapted the code to my CPU.
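
As a rough check of what these settings imply, the measurement budget per candidate can be worked out with plain arithmetic. This is only a sketch: the dictionary and variable names are illustrative and not part of the TVM API, and it assumes the usual AutoTVM meaning of number (timed runs averaged into one measurement) and repeat (measurements taken per candidate).

```python
# Tuning settings copied from the list above; the dict itself is only
# an illustrative container, not a TVM object.
settings = {
    "number": 35,         # timed runs averaged into one measurement
    "repeat": 10,         # measurements taken per candidate config
    "min_repeat_ms": 20,  # lower bound on the duration of one measurement
    "timeout": 20,        # seconds allowed per candidate
    "trials": 1500,       # upper bound on candidates tried per task
}

# Same formula as in the list: stop a task once this many trials
# pass without finding a better config.
early_stopping = settings["trials"] // 2 + 1

# Each candidate is executed at least number * repeat times
# (`number` may be raised so that one measurement lasts min_repeat_ms).
min_runs_per_candidate = settings["number"] * settings["repeat"]

# Lower bound on measured time per candidate from min_repeat_ms alone.
min_measured_ms = settings["repeat"] * settings["min_repeat_ms"]

print("early_stopping:", early_stopping)
print("min timed runs per candidate:", min_runs_per_candidate)
print("min measured time per candidate (ms):", min_measured_ms)
```

Under these assumptions, each candidate is timed at least 350 times, and tuning of a task stops early once 751 trials in a row bring no improvement.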

I just tried rebuilding TVM with MKL and I added the -libs=cblas,mkl flags to the target. The maximum throughput displayed by AutoTVM has increased, but in my C++ integration, I still have the same results.

This is pseudo code of how I integrate the model in C++:

//////////////////// INIT PHASE (done only once) ////////////////////
      // Tutorial stuff: load the compiled library and fetch the executor functions
      mod_factory = tvm::runtime::Module::LoadFromFile(lib_path);
      gmod = mod_factory.GetFunction("default")(dev);
      set_input = gmod.GetFunction("set_input");
      get_output = gmod.GetFunction("get_output");
      run = gmod.GetFunction("run");

      // Create the tensors in memory that will hold input and output values (NDArrays & DLManagedTensors)
      in_tensor = init();
      out_tensor = init();

//////////////////// ITERATION PHASE (called multiple times) ////////////////////
      in_tensor = fill_in_tensor();

      // Wrap the existing buffer as an NDArray
      tvm::runtime::NDArray x = tvm::runtime::NDArray::FromExternalDLTensor(in_tensor);

      set_input("input_1:0", x);  // 'input_1:0' is the name of the model's input
      run();                      // this is the only call I am timing

      out_tensor = get_output(0);

      // do stuff with out_tensor ...

// Do this for every batch size
// (I have 4 precompiled models with 4 different input sizes, the 'lib_path' value changes)

I am only timing the run(); call in the "Iteration phase". This is what takes very long, and I’m sure I’m just doing something wrong, maybe a bad linking of MKL or BLAS. I am not sure which low-level linear algebra library is the most efficient on Intel CPUs. The code works fine: I get the right results, with no compilation problems and no warnings.

I did not measure the throughput of cppflow’s model outside of the C++ integration, so I cannot say how it should theoretically perform, but I clearly don’t think my model precompiled with TVM should be orders of magnitude slower. I did not try on an AMD CPU either, because the code I am running has to run on Intel CPUs.

Maybe some compilation flags in TVM’s CMake config file? Although I just recompiled it with MKL on (and LLVM, of course).

Let me know if you need more info.

Thank you for the detailed info. This really helps to understand your situation better. Have you tried to profile the execution? If we can figure out which function takes much time, it might be easier to figure out the root cause.

I haven’t profiled the execution using TVM’s profiler, only with my PAPI timers in the code. I will look this up now and see if I can get more detailed performance numbers.

Meanwhile, I’m trying to rebuild TVM with other compilation flags.

I am having trouble profiling my model following the PAPI tutorial. As soon as I get it working, I’ll post the results here.

Any help is welcome

EDIT : Opening a new thread here for this profiling-specific problem

EDIT 2 :

I managed to get the profiler working. This is the first output for my NN:

Name                                                                                     Duration (us)  Percent  Device  Count                                                              Argument Shapes              Hash  VM::Argument Shapes  perf::CACHE-MISSES  perf::CYCLES  perf::INSTRUCTIONS  perf::STALLED-CYCLES-BACKEND  perf::STALLED-CYCLES-FRONTEND  weight_layout  
vm_mod_fused_reshape_add_exp_subtract_nn_relu_multiply_nn_relu_add_multiply_add_reshape       1,579.71    50.61    cpu0      5  float32[3000, 20], float32[20], float32[20], float32[20], float32[3000, 20]  bb23e7419c2e3451                                    3,330     5,278,160          18,092,178                     2,486,365                         11,025                 
vm_mod_fused_nn_contrib_dense_pack_1                                                            408.29    13.08    cpu0      4                     float32[3000, 20], float32[1, 20, 20], float32[3000, 20]  c6382300161e7b61                                    4,120     1,379,177           5,025,882                       866,665                          5,082          NC20n  
vm_mod_fused_nn_contrib_dense_pack_2                                                             42.32     1.36    cpu0      1                       float32[3000, 20], float32[1, 20, 8], float32[3000, 8]  939e1bd5242e3085                                      764       145,090             501,736                        92,334                          2,607           NC8n  
VM::AllocStorage                                                                                 13.74     0.44    cpu0     12                                                                                                                                     841        85,172              69,685                         8,946                         26,549                 
vm_mod_fused_nn_contrib_dense_pack                                                               12.02     0.39    cpu0      1                       float32[3000, 2], float32[1, 2, 20], float32[3000, 20]  9dc8805547517187                                      598        43,390             133,645                        23,230                            561          NC20n  
VM::AllocTensor                                                                                   9.14     0.29    cpu0     10                                                            float32[3000, 20]                                                        988        57,520              63,576                         4,745                          9,389                 
VM::UnknownOp                                                                                     8.71     0.28    cpu0     25                                                                                                                                   2,988        89,378              98,315                         5,586                          9,086                 
vm_mod_fused_reshape_add                                                                          4.47     0.14    cpu0      1                            float32[3000, 8], float32[8], float32[3000, 1, 8]  69c860389db3a274                                      512        17,349              35,091                         8,236                            329                 
VM::ReshapeTensor                                                                                 4.11     0.13    cpu0      1                                                                                                                                     329        27,531               5,332                        10,114                          8,889                 
VM::AllocTensor                                                                                   1.51     0.05    cpu0      1                                                             float32[3000, 8]                                                         84         9,099               7,139                         1,798                          2,546                 
VM::AllocTensor                                                                                   0.81     0.03    cpu0      1                                                          float32[3000, 1, 8]                                                         77         5,173               7,516                           192                            200                 
----------                                                                                                                                                                                                                                                                                                                                                                            
Sum                                                                                           2,084.83    66.79             62                                                                                                                                  14,631     7,137,039          24,040,095                     3,508,211                         76,263                 
Total                                                                                         3,121.32             cpu0      1                                                                                                                                  40,550     9,801,638          25,860,685                     4,808,036                        439,378                 

Configuration
-------------
Number of threads: 1
Executor: VM
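
To make the report easier to read, a few derived numbers can be computed from the three most expensive rows. This is a sketch only: the values are copied from the table above, while summarize and the derived metric names are hypothetical. Instructions per cycle and the backend-stall ratio are rough indicators, not a diagnosis.

```python
# Rows copied from the profiler report above:
# (name, duration_us, count, cycles, instructions, stalled_cycles_backend)
rows = [
    ("vm_mod_fused_reshape_add_exp_subtract_nn_relu_multiply_nn_relu_add_multiply_add_reshape",
     1579.71, 5, 5_278_160, 18_092_178, 2_486_365),
    ("vm_mod_fused_nn_contrib_dense_pack_1", 408.29, 4, 1_379_177, 5_025_882, 866_665),
    ("vm_mod_fused_nn_contrib_dense_pack_2", 42.32, 1, 145_090, 501_736, 92_334),
]
total_us = 3121.32  # "Total" row
sum_us = 2084.83    # "Sum" row (time attributed to the listed entries)


def summarize(rows, total_us):
    """Derive per-call time, share of total, IPC and backend-stall ratio."""
    out = []
    for name, duration_us, count, cycles, instructions, stalled_backend in rows:
        out.append({
            "name": name,
            "share_pct": 100.0 * duration_us / total_us,
            "per_call_us": duration_us / count,
            "ipc": instructions / cycles,
            "backend_stall_ratio": stalled_backend / cycles,
        })
    return out


report = summarize(rows, total_us)
for r in report:
    print(f"{r['name'][:40]:40s} {r['share_pct']:6.2f}% "
          f"{r['per_call_us']:8.2f} us/call  IPC {r['ipc']:.2f}  "
          f"backend stalls {r['backend_stall_ratio']:.0%}")

# Time inside the profiled run that is not attributed to any listed entry.
unattributed_us = total_us - sum_us
print(f"unattributed: {unattributed_us:.2f} us "
      f"({100.0 * unattributed_us / total_us:.2f}% of total)")
```

On these numbers, the fused element-wise kernel (the one containing exp) accounts for about half of the measured time, and roughly a third of the total is not attributed to any listed entry.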

A few important things :

  • I could not profile on the CPU I should be running on (the Intel Xeon Platinum) because of a PAPI error: PAPIError: -7 Event does not exist: perf::STALLED-CYCLES-FRONTEND. So I ran it on another CPU to get an intuition of the performance.

  • There is something I do not fully understand, though: when building with MKL, AutoTVM detects 12 tasks to optimize. Without MKL it finds 6 tasks. In the profiler output above, I count 11 tasks. Maybe I am misunderstanding something.

  • In order to profile only my optimized model, I used with autotvm.apply_history_best, then I built my lib with lib = relay.build(mod, target=target, params=params) so that I have the optimized one. Then I updated my params with params = lib.get_params() using the optimized lib. Does this flow seem correct?

EDIT 3: I set TVM_NUM_THREADS=8 to see if there was any difference. I get about a 4x speedup over the bad performance I had before. This is much better, but still very slow compared to cppflow. I am going to continue exploring this lead.
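
Since the thread count made such a difference, it may be worth sweeping it instead of testing a single value. A minimal sketch, assuming the variable has to be in the environment before the process that loads the TVM runtime starts. env_with_threads is a hypothetical helper, and the child command below is only a stand-in that echoes the variable.

```python
import os
import subprocess
import sys


def env_with_threads(n_threads):
    """Return a copy of the current environment with TVM_NUM_THREADS set."""
    if n_threads < 1:
        raise ValueError("n_threads must be >= 1")
    env = dict(os.environ)
    env["TVM_NUM_THREADS"] = str(n_threads)
    return env


# Stand-in for the real benchmark binary (hypothetical): a child Python
# process that just echoes the variable it sees. Replace `cmd` with the
# command that launches the actual C++ program.
cmd = [sys.executable, "-c", "import os; print(os.environ['TVM_NUM_THREADS'])"]

seen = []
for n in (1, 2, 4, 8):
    proc = subprocess.run(cmd, env=env_with_threads(n),
                          capture_output=True, text=True, check=True)
    seen.append(proc.stdout.strip())

print(seen)
```

Launching each configuration in a fresh process avoids depending on when the runtime reads the variable, which a single long-lived process would not guarantee.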

That might happen because you offload operators to external libraries, and those operators could otherwise be fused.

For AutoTVM, you can do something like this (tutorial):

with autotvm.apply_history_best(your_tuning_log):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)