Hi,
I am currently trying to tune a quantized InceptionV4 TFLite model with TVM v0.17 for an ARM CPU, but compiling the model with the resulting tuning logs does not improve performance.
Tuning methods attempted:
Method 1: Using TVMC via the command line
- Set up an RPC server between the board and the host machine.
- Run the following on the host machine:
python3 -m tvm.driver.tvmc tune --target "llvm -device=cpu_arm -mtriple=aarch64-linux-gnu" --output tuning-record.json --rpc-tracker "0.0.0.0:9090" --rpc-key "board" inception_v4_299_quant.tflite
- I also experimented with the early-stopping and repeat parameters, and with switching the tuner to random.
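After tuning, I compiled with the records applied, roughly like this (a minimal sketch of the compile step; the output filename is illustrative):

```bash
# Compile the model with the tuning log applied (output name is illustrative).
python3 -m tvm.driver.tvmc compile \
    --target "llvm -device=cpu_arm -mtriple=aarch64-linux-gnu" \
    --tuning-records tuning-record.json \
    --output inception_v4_tuned.tar \
    inception_v4_299_quant.tflite
```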
Method 2: Using Auto Scheduler
- Modified the Python script from the "Auto-scheduling a Neural Network for ARM CPU" tutorial (tvm 0.18.dev0 documentation) to load the InceptionV4 model.
- Ran the script, which applies the tuning log at compile time as sketched below.
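The tutorial script builds the module with the log applied via `ApplyHistoryBest` (sketch below; `mod`, `params`, and `target` come from earlier in the script, and the log file name is illustrative):

```python
import tvm
from tvm import auto_scheduler, relay

log_file = "inception_v4_tflite-arm_cpu.json"  # illustrative name

# Build the module with the best schedules found during tuning applied.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```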
Neither approach had any impact on performance on my platform. Below is a breakdown of the measurements; I ran the models for 30 iterations*:
| Tuning Method | TVM Execution Time (ms) |
|---|---|
| No Tuning | 1018.7021 |
| Method 1 | 1029.2489 |
| Method 1 + early stopping (800) | 1023.4508 |
| Method 1 + early stopping (800) + trials (1500) | 1019.2631 |
| Method 2 | 1020.2225 |
- These values fluctuate by +/- a few ms.
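For reference, timings like the ones above can be collected with the graph executor's time evaluator (a minimal sketch; `module` and `dev` are the graph module and device created earlier in the script):

```python
import numpy as np

# Time 30 single-inference runs and report the mean latency in ms.
ftimer = module.module.time_evaluator("run", dev, number=1, repeat=30)
results_ms = np.array(ftimer().results) * 1e3
print("Mean inference time: %.4f ms (std: %.4f ms)"
      % (results_ms.mean(), results_ms.std()))
```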
During Method 2, I noticed the following error being printed during some of the runs:
GA Iter: 0 Max score: 0.9988 Min score: 0.9982 #Pop: 2 #M+: 0 #M-: 0
[13:01:17] /tvm/src/auto_scheduler/compute_dag.cc:1377: Warning: InferBound fails on the state:
Placeholder: p0, p1, p2, p3
parallel b.0@x.0@y.0@ (0,584)
for b (None)
for x (None)
for y (None)
for z (None)
for i0 (None)
for i1 (None)
for i2 (None)
A_padded_M = ...
vectorize w (None)
A_interleaved = ...
for w.1 (0,2)
for z.1 (0,2)
for b_c.0 (None)
for x_c.0 (None)
for y_c.0 (None)
for ax0 (None)
for ax1 (None)
for ax2 (None)
T_reshape = ...
for w_c.0 (None)
for z_c.0 (None)
for b_c.1 (None)
for x_c.1 (None)
for y_c.1 (None)
for w_c.1 (None)
for z_c.1 (None)
for k.0 (None)
for b_c.2 (None)
for x_c.2 (None)
for y_c.2 (None)
for w_c.2 (None)
for z_c.2 (None)
for k.1 (None)
for b_c.3 (None)
for x_c.3 (None)
for y_c.3 (None)
for w_c.3 (None)
vectorize z_c.3 (None)
C_interleaved.local = ...
for y.2 (0,4)
for w.2 (0,2)
vectorize z.2 (0,2)
C_interleaved = ...
parallel b@x@ (0,289)
for y (0,128)
C = ...
parallel b@x@y@ (None)
for z (None)
conv2d_gemm_output = ...
parallel ax0@ax1@ax2@ (None)
for ax3 (None)
T_cast = ...
with: [13:01:17] /tvm/src/te/schedule/bound.cc:175: InternalError: Check failed: (found_attach || stage_attach.size() == 0) is false: Invalid Schedule, cannot find the producer compute(T_reshape, body=[p0[0, ((ax0 * 289 + ax1) * 1024 + ax2) // 1024 // 17 % 17, ((ax0 * 289 + ax1) * 1024 + ax2) // 1024 % 17, ((ax0 * 289 + ax1) * 1024 + ax2) % 1024]], axis=[T.iter_var(ax0, T.Range(0, 1), "DataPar", ""), T.iter_var(ax1, T.Range(0, 289), "DataPar", ""), T.iter_var(ax2, T.Range(0, 1024), "DataPar", "")], reduce_axis=[], tag=injective, attrs={}) along the loop nest specified by compute_at of consumer compute(A_padded_M, body=[T.if_then_else(i1 >= 0 and i1 < 289, T_reshape[i0, i1, i2], T.uint8(0))], axis=[T.iter_var(i0, T.Range(0, 1), "DataPar", ""), T.iter_var(i1, T.Range(0, 292), "DataPar", ""), T.iter_var(i2, T.Range(0, 1024), "DataPar", "")], reduce_axis=[], tag=injective,pad, attrs={})
Stack trace:
0: tvm::te::InferRootBound(tvm::te::Stage const&, tvm::te::GraphContext const&, std::unordered_map<tvm::tir::IterVar, tvm::Range, std::hash<tvm::tir::IterVar>, std::equal_to<tvm::tir::IterVar>, std::allocator<std::pair<tvm::tir::IterVar const, tvm::Range> > >*)
1: tvm::te::InferBound(tvm::te::Schedule const&)
2: tvm::auto_scheduler::ComputeDAG::InferBound(tvm::auto_scheduler::State const&) const
3: tvm::auto_scheduler::ComputeDAG::InferBound(tvm::runtime::Array<tvm::auto_scheduler::State, void> const&) const::{lambda(int)#1}::operator()(int) const
4: _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIvEES3_EZNS1_11_Task_stateIZN3tvm7support12parallel_
5: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
6: 0x000071ab617cbee7
7: std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::packaged_task<void (std::vector<int, std::allocator<int> > const&, std::function<void (int)> const&)>, std::vector<int, std::allocator<int> >, std::function<void (int)> > > >::_M_run()
8: 0x000071ab5b4b0252
9: 0x000071ab617c6ac2
10: __clone
11: 0xffffffffffffffff
Is this a configuration issue or is it a TVM limitation?
Link to model: https://storage.googleapis.com/download.tensorflow.org/models/inception_v4_299_quant_20181026.tgz
Script changes (diff against the tutorial script):
60a61,62
> import tflite.Model
>
122a125,132
> elif name == "inception_v4_tflite":
> input_shape = (batch_size, 299, 299, 3)
> input_tensor = "input"
>
> tflite_model_buf = open("/mnt/inception_v4_299_quant.tflite", "rb").read()
> tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
> mod, params = relay.frontend.from_tflite(tflite_model, shape_dict={input_tensor: input_shape}, dtype_dict={input_tensor: dtype})
> output_shape = (batch_size, 1001)
225c235
< device_key = "rasp4b-64"
---
> device_key = "board"
235c245
< network = "mobilenet"
---
> network = "inception_v4_tflite"
239c249
< dtype = "float32"
---
> dtype = "uint8"
292c302
< num_measure_trials=200, # change this to 20000 to achieve the best performance
---
> num_measure_trials=216, # change this to 20000 to achieve the best performance
337c347
< module.set_input("data", data_tvm)
---
> module.set_input("input", data_tvm)
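For readability, here is the added model-loading branch from the diff above, written out in full:

```python
import tflite.Model
from tvm import relay

batch_size = 1
dtype = "uint8"
input_tensor = "input"
input_shape = (batch_size, 299, 299, 3)

# Read the quantized TFLite flatbuffer and import it into Relay.
tflite_model_buf = open("/mnt/inception_v4_299_quant.tflite", "rb").read()
tflite_model = tflite.Model.Model.GetRootAsModel(tflite_model_buf, 0)
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={input_tensor: input_shape},
    dtype_dict={input_tensor: dtype},
)
output_shape = (batch_size, 1001)
```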
Thanks,
Ayman