Issue tuning uint8 InceptionV4 TFLite model for ARM CPU

aymanchaudhry · August 14, 2024, 4:45pm

Hi,

I am currently trying to tune a quantized InceptionV4 TFLite model with TVM for ARM CPU using TVM v0.17, but after compiling the model with the tuning logs the performance of the model does not improve.

Tuning methods attempted:

Method 1: Using TVMC via the command line

Set up an RPC server between the board and the host machine.
Run on the host machine:
python3 -m tvm.driver.tvmc tune --target "llvm -device=cpu_arm -mtriple=aarch64-linux-gnu" --output tuning-record.json --rpc-tracker "0.0.0.0:9090" --rpc-key "board" inception_v4_299_quant.tflite
I also tried setting values for the early-stopping and repeat parameters, and changing the tuner to random.
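For reference, the Method 1 variants and the follow-up compile step can be assembled as below. This is only a sketch: the `--early-stopping`, `--trials`, `--repeat`, `--tuner` and `--tuning-records` flags are the tvmc options as I understand them, the values 800 and 1500 are the ones from my measurements, and the output name `inception_v4_tuned.tar` is a placeholder I made up for illustration.

```python
import shlex

TARGET = "llvm -device=cpu_arm -mtriple=aarch64-linux-gnu"
MODEL = "inception_v4_299_quant.tflite"
RECORDS = "tuning-record.json"


def tune_command(early_stopping=None, trials=None, repeat=None, tuner=None):
    """Build the tvmc tune argv; optional flags are added only when set."""
    cmd = [
        "python3", "-m", "tvm.driver.tvmc", "tune",
        "--target", TARGET,
        "--output", RECORDS,
        "--rpc-tracker", "0.0.0.0:9090",
        "--rpc-key", "board",
    ]
    optional = (
        ("--early-stopping", early_stopping),
        ("--trials", trials),
        ("--repeat", repeat),
        ("--tuner", tuner),
    )
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    cmd.append(MODEL)
    return cmd


def compile_command(output="inception_v4_tuned.tar"):
    """Build the tvmc compile argv that consumes the tuning records.

    The output file name is a hypothetical placeholder.
    """
    return [
        "python3", "-m", "tvm.driver.tvmc", "compile",
        "--target", TARGET,
        "--tuning-records", RECORDS,
        "--output", output,
        MODEL,
    ]


# Print copy-pasteable shell commands for one tuning variant and the compile.
print(shlex.join(tune_command(early_stopping=800, trials=1500, tuner="random")))
print(shlex.join(compile_command()))
```

If the compile step is run without `--tuning-records`, the tuning log is not used at all, so that is the first thing worth double-checking.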

Method 2: Using Auto Scheduler

Modify the Python script from the "Auto-scheduling a Neural Network for ARM CPU" tutorial (tvm 0.18.dev0 documentation) for the InceptionV4 model.
Run the script.
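Roughly, the tune-then-build flow that this script follows looks like the sketch below. It is condensed from my reading of the tutorial rather than copied from it; the function name `tune_and_build`, the tracker host default and the runner settings are assumptions for illustration, and the TVM imports sit inside the function so the snippet can be loaded on a machine without TVM.

```python
def tune_and_build(mod, params, target, log_file,
                   device_key="board", tracker_host="127.0.0.1",
                   tracker_port=9090, num_measure_trials=216):
    """Tune all tasks over RPC, then build with the best records applied.

    Sketch only: host, port and runner settings are assumed values.
    """
    import tvm
    from tvm import auto_scheduler, relay

    # Split the network into tunable tasks, weighted by how often each occurs.
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=num_measure_trials,
        builder=auto_scheduler.LocalBuilder(build_func="default"),
        runner=auto_scheduler.RPCRunner(
            device_key,
            host=tracker_host,
            port=tracker_port,
            timeout=30,
            repeat=1,
            min_repeat_ms=200,
            enable_cpu_cache_flush=True,
        ),
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )
    tuner.tune(tune_option)

    # The records only take effect when the build happens inside
    # ApplyHistoryBest AND with use_auto_scheduler enabled.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            return relay.build(mod, target=target, params=params)
```

The trial budget is shared across all extracted tasks, so 216 trials leaves only a handful of measurements per task for a network the size of InceptionV4.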

Neither approach had any impact on performance on my platform. Below is a breakdown of the measurements; I ran each model for 30 iterations:

Tuning Method	TVM Execution Time (ms)
No Tuning	1018.7021
Method 1	1029.2489
Method 1 + early stopping (800)	1023.4508
Method 1 + early stopping (800) + trials (1500)	1019.2631
Method 2	1020.2225

These values do fluctuate +/- a few ms.
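To put the numbers in perspective, here is a small sketch that computes each result's change relative to the untuned baseline, using only the values from the table above (the helper name `relative_change_percent` is just for illustration). Every tuned configuration lands within roughly 1% of the baseline.

```python
# Execution times in ms, copied from the table above.
baseline_ms = 1018.7021
tuned_ms = {
    "Method 1": 1029.2489,
    "Method 1 + early stopping (800)": 1023.4508,
    "Method 1 + early stopping (800) + trials (1500)": 1019.2631,
    "Method 2": 1020.2225,
}


def relative_change_percent(measured, baseline):
    """Percentage change of `measured` relative to `baseline`."""
    return (measured - baseline) / baseline * 100.0


changes = {
    name: relative_change_percent(ms, baseline_ms)
    for name, ms in tuned_ms.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```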

During Method 2, I did notice the following error being printed for some of the runs:

GA Iter: 0      Max score: 0.9988       Min score: 0.9982       #Pop: 2 #M+: 0  #M-: 0                                                                                                                                                        
[13:01:17] /tvm/src/auto_scheduler/compute_dag.cc:1377: Warning: InferBound fails on the state:                                                                                                                                               
Placeholder: p0, p1, p2, p3                                                                                                                                                                                                                   
parallel b.0@x.0@y.0@ (0,584)                                                                                                                                                                                                                 
  for b (None)                                                                                                                                                                                                                                
    for x (None)                                                                                                                                                                                                                              
      for y (None)                                                                                                                                                                                                                            
        for z (None)                                                                                                                                                                                                                          
          for i0 (None)                                                                                                                                                                                                                       
            for i1 (None)                                                                                                                                                                                                                     
              for i2 (None)                                                                                                                                                                                                                   
                A_padded_M = ...                                                                                                                                                                                                              
          vectorize w (None)                                                                                                                                                                                                                  
            A_interleaved = ...                                                                                                                                                                                                               
  for w.1 (0,2)                                                                                                                                                                                                                               
    for z.1 (0,2)                                                                                                                                                                                                                             
      for b_c.0 (None)                                                                                                                                                                                                                        
        for x_c.0 (None)                                                                                                                                                                                                                      
          for y_c.0 (None)                                                                                                                                                                                                                    
            for ax0 (None)                                                                                                                                                                                                                    
              for ax1 (None)                                                                                                                                                                                                                  
                for ax2 (None)                                                                                                                                                                                                                
                  T_reshape = ...                                                                                                                                                                                                             
            for w_c.0 (None)                                                                                                                                                                                                                  
              for z_c.0 (None)                                                                                                                                                                                                                
                for b_c.1 (None)                                                                                                                                                                                                              
                  for x_c.1 (None)
                    for y_c.1 (None)
                      for w_c.1 (None)
                        for z_c.1 (None)
                          for k.0 (None)
                            for b_c.2 (None)
                              for x_c.2 (None)
                                for y_c.2 (None)
                                  for w_c.2 (None)
                                    for z_c.2 (None)
                                      for k.1 (None)
                                        for b_c.3 (None)
                                          for x_c.3 (None)
                                            for y_c.3 (None)                                                           
                                              for w_c.3 (None)                                                         
                                                vectorize z_c.3 (None)                                                                                                                                                                        
                                                  C_interleaved.local = ...                                                                                                                                                                   
      for y.2 (0,4)
        for w.2 (0,2)
          vectorize z.2 (0,2)
            C_interleaved = ...
parallel b@x@ (0,289)
  for y (0,128)
    C = ...
parallel b@x@y@ (None)
  for z (None)
    conv2d_gemm_output = ...
parallel ax0@ax1@ax2@ (None)
  for ax3 (None)
    T_cast = ...

with: [13:01:17] /tvm/src/te/schedule/bound.cc:175: InternalError: Check failed: (found_attach || stage_attach.size() == 0) is false: Invalid Schedule, cannot find the producer compute(T_reshape, body=[p0[0, ((ax0 * 289 + ax1) * 1024 + ax2) // 1024 // 17 % 17, ((ax0 * 289 + ax1) * 1024 + ax2) // 1024 % 17, ((ax0 * 289 + ax1) * 1024 + ax2) % 1024]], axis=[T.iter_var(ax0, T.Range(0, 1), "DataPar", ""), T.iter_var(ax1, T.Range(0, 289), "DataPar", ""), T.iter_var(ax2, T.Range(0, 1024), "DataPar", "")], reduce_axis=[], tag=injective, attrs={}) along the loop nest specified by compute_at of consumer compute(A_padded_M, body=[T.if_then_else(i1 >= 0 and i1 < 289, T_reshape[i0, i1, i2], T.uint8(0))], axis=[T.iter_var(i0, T.Range(0, 1), "DataPar", ""), T.iter_var(i1, T.Range(0, 292), "DataPar", ""), T.iter_var(i2, T.Range(0, 1024), "DataPar", "")], reduce_axis=[], tag=injective,pad, attrs={})
Stack trace:
  0: tvm::te::InferRootBound(tvm::te::Stage const&, tvm::te::GraphContext const&, std::unordered_map<tvm::tir::IterVar, tvm::Range, std::hash<tvm::tir::IterVar>, std::equal_to<tvm::tir::IterVar>, std::allocator<std::pair<tvm::tir::IterVar const, tvm::Range> > >*)
  1: tvm::te::InferBound(tvm::te::Schedule const&)
  2: tvm::auto_scheduler::ComputeDAG::InferBound(tvm::auto_scheduler::State const&) const                                                                                                                                                     
  3: tvm::auto_scheduler::ComputeDAG::InferBound(tvm::runtime::Array<tvm::auto_scheduler::State, void> const&) const::{lambda(int)#1}::operator()(int) const
  4: _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIvEES3_EZNS1_11_Task_stateIZN3tvm7support12parallel_
  5: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
  6: 0x000071ab617cbee7
  7: std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::packaged_task<void (std::vector<int, std::allocator<int> > const&, std::function<void (int)> const&)>, std::vector<int, std::allocator<int> >, std::function<void (int)> > > >::_M_run()
  8: 0x000071ab5b4b0252
  9: 0x000071ab617c6ac2
  10: __clone
  11: 0xffffffffffffffff

Is this a configuration issue or is it a TVM limitation?

Link to model: https://storage.googleapis.com/download.tensorflow.org/models/inception_v4_299_quant_20181026.tgz

Script changes:

60a61,62
> import tflite.Model
> 
122a125,132
>     elif name == "inception_v4_tflite":
>         input_shape = (batch_size, 299, 299, 3)
>         input_tensor = "input"
> 
>         tflite_model_buf = open("/mnt/inception_v4_299_quant.tflite", "rb").read()
>         tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
>         mod, params = relay.frontend.from_tflite(tflite_model, shape_dict={input_tensor: input_shape}, dtype_dict={input_tensor: dtype})
>         output_shape = (batch_size, 1001)
225c235
< device_key = "rasp4b-64"
---
> device_key = "board"
235c245
< network = "mobilenet"
---
> network = "inception_v4_tflite"
239c249
< dtype = "float32"
---
> dtype = "uint8"
292c302
<         num_measure_trials=200,  # change this to 20000 to achieve the best performance
---
>         num_measure_trials=216,  # change this to 20000 to achieve the best performance
337c347
<     module.set_input("data", data_tvm)
---
>     module.set_input("input", data_tvm)
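Pulled out of the diff, the model-loading branch I added amounts to the standalone sketch below. The wrapper name `load_inception_v4_tflite` and its default arguments are only for illustration (the script does this inline in its `elif` branch), and the file is read in a `with` block here so the handle gets closed. The TVM and tflite imports are inside the function so the snippet loads without them installed.

```python
def load_inception_v4_tflite(model_path, batch_size=1, dtype="uint8"):
    """Load the quantized InceptionV4 TFLite model into Relay.

    Returns (mod, params, input_shape, output_shape), matching the
    shapes used in the script changes above.
    """
    import tflite.Model
    from tvm import relay

    input_tensor = "input"
    input_shape = (batch_size, 299, 299, 3)
    output_shape = (batch_size, 1001)

    # Read the flatbuffer and close the file handle afterwards.
    with open(model_path, "rb") as f:
        tflite_model_buf = f.read()
    tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)

    mod, params = relay.frontend.from_tflite(
        tflite_model,
        shape_dict={input_tensor: input_shape},
        dtype_dict={input_tensor: dtype},
    )
    return mod, params, input_shape, output_shape
```

In my script this would be called as `load_inception_v4_tflite("/mnt/inception_v4_299_quant.tflite")`.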

Thanks, Ayman