Hi, as a new user I have some questions about using VTA in simulation and RPC server mode:
- Are fully connected layers (and non-quantized convolutional layers) executed by the target CPU (the board's ARM CPU) or by the host CPU (the x86 CPU of my computer)?
- What exactly does the timer() function measure when using VTA in tsim: only the part offloaded to VTA, or also the layers executed by the target ARM CPU? This is related to my first question.
- When I execute the MXNet tutorial (https://tvm.apache.org/docs/vta/tutorials/frontend/deploy_classification.html#sphx-glr-vta-tutorials-frontend-deploy-classification-py) in tsim, the timer() function reports about 90 seconds. Why is this so far from the results in the publication?
- How should I interpret the simulation stats in tsim (cycle_count) and in fsim (inp_load_nbytes, etc.)?
- Is it possible to measure execution time layer by layer, to identify bottlenecks in the neural network?
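For the stats question, my current understanding is that tsim's cycle_count is a raw cycle total, so it can be turned into an estimated wall-clock time by dividing by the VTA clock frequency; please correct me if that's wrong. A toy calculation with made-up numbers (the 100 MHz clock is my assumption, not a measured value):

```python
# My understanding (please correct me): tsim reports raw cycles, so
# estimated wall-clock time = cycle_count / clock frequency.
# Both numbers below are invented for illustration; 100 MHz is an
# assumed VTA clock, not something I measured.
cycle_count = 1_500_000
clock_hz = 100e6  # assumed clock frequency
est_time_s = cycle_count / clock_hz
print(f"estimated time: {est_time_s * 1e3:.2f} ms")  # -> 15.00 ms
```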
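To make the last question concrete, this is the kind of per-layer breakdown I am hoping to get. The snippet is only a plain-Python mock with invented layer names and toy workloads, not real TVM/VTA code:

```python
import time

# Plain-Python mock of the per-layer profile I'd like from TVM/VTA.
# Layer names and workloads are invented purely for illustration.
def fake_layer(work):
    # stand-in for an operator's computation
    return sum(i * i for i in range(work))

layers = {"conv1": 20_000, "conv2": 40_000, "fc1": 10_000}

profile = {}
for name, work in layers.items():
    start = time.perf_counter()
    fake_layer(work)
    profile[name] = time.perf_counter() - start  # seconds per layer

# the slowest layer in this mock would be the bottleneck
bottleneck = max(profile, key=profile.get)
print("bottleneck:", bottleneck)
```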
Thanks in advance