Unable to reproduce benchmark results on Raspberry Pi 3B

Hello,

I am trying to reproduce the benchmarking results described at https://github.com/dmlc/tvm/tree/master/apps/benchmark. I have the latest TVM source code and am deploying to a Raspberry Pi 3B.

When I run the benchmark using

python3 arm_cpu_imagenet_bench.py --device rasp3b --rpc-key rasp3b

I observe the following runtimes:

--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
squeezenet v1.1      204.01 ms           (1.97 ms)
mobilenet            412.53 ms           (79.38 ms)
resnet-18            775.99 ms           (46.59 ms)

These appear noticeably slower than the ones reported on the repo page (shown below for reference):

--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
squeezenet v1.1      92.34 ms            (0.07 ms)
mobilenet            145.22 ms           (0.11 ms)
resnet-18            325.06 ms           (0.23 ms)

I’ve tried two Raspberry Pi 3Bs and two different host CPUs but cannot reproduce the reported results. Has anyone encountered this?

Can you report the individual collected measurements (the results field of the ProfileResult)? It seems the variance of your measurements is very high; you can try increasing the --number parameter in that case.
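
For reference, a minimal sketch of how to print the raw samples; it assumes the time_evaluator-based loop in arm_cpu_imagenet_bench.py, where module, ctx, args, and network stand for objects the script already sets up:

import numpy as np

# ftimer() returns a ProfileResult whose .results field holds every sample
ftimer = module.module.time_evaluator("run", ctx, number=args.number, repeat=3)
prof_res = ftimer()
print(prof_res)  # raw measurements, not just the aggregated mean
print("%-20s %.2f ms (%.2f ms)" % (network, np.mean(prof_res.results) * 1000,
                                   np.std(prof_res.results) * 1000))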

Also, if you are running all of these tests back-to-back on a small number of devices, your Raspberry Pis are likely thermal throttling under the continued (stress-)testing, unless you are using an effective aftermarket cooling solution.
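
If you want to rule throttling in or out directly, here is a hedged helper (it assumes Raspbian's vcgencmd tool is present on the Pi) that decodes the get_throttled flags; run it before and after a benchmark:

import subprocess

# vcgencmd get_throttled prints e.g. "throttled=0x50000"; the value is a bitmask
out = subprocess.check_output(["vcgencmd", "get_throttled"]).decode()
flags = int(out.strip().split("=")[1], 16)
print("under-voltage now:       ", bool(flags & (1 << 0)))
print("currently throttled:     ", bool(flags & (1 << 2)))
print("throttling has occurred: ", bool(flags & (1 << 18)))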

When I run

python3 arm_cpu_imagenet_bench.py --device rasp3b --rpc-key rasp3b --network resnet-18 --number 10

I get the following:

--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
ProfileResult(mean=0.6405168693, results=(0.6238276397, 0.6342533386, 0.6634696296))
resnet-18            640.52 ms           (16.78 ms)

I tried a single inference run of a resnet-18 model on a fresh Raspberry Pi that hadn't been continuously stress-tested, but even then I did not observe any noticeable speedup in runtime.

Can you give some more details on your host compilation environment, e.g., the llvm version?

Also try turning the number down, e.g., number=1. Your settings of number=10 and repeat=3 are more than enough to cause throttling, as they run 30 end-to-end inferences in a row.

I have llvm-6.0 installed.
gcc target is x86_64-linux-gnu.

What other information about my host environment would be helpful?

With number=1 and repeat=3, I get:

--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
ProfileResult(mean=0.6026539773333334, results=(0.604407877, 0.593555052, 0.609999003))
resnet-18            602.65 ms           (6.83 ms)

With number=1 and repeat=1, I get:

--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
ProfileResult(mean=0.609508829, results=(0.609508829,))
resnet-18            609.51 ms           (0.00 ms)

Ok, those results suggest that throttling is not the main cause of the difference here.

We have had issues with different llvm versions, so if it is not too tedious I would recommend also trying llvm-4.0. I think @merrymercy can confirm which version of llvm these schedules were tuned with.

These operators were tuned with llvm-4.0.

We also found that they are slow with llvm-6.0. Can you try building TVM with llvm-4.0?

Later we will release pre-tuned parameters for different llvm versions.
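
(For reference, a sketch of the rebuild, assuming the cmake-based build and that llvm-config-4.0 is on your PATH: copy cmake/config.cmake into your build directory and point USE_LLVM at the right llvm-config,

set(USE_LLVM llvm-config-4.0)

then re-run cmake and make.)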

Thanks, I rebuilt TVM on my host machine with llvm-4.0.

When I run the benchmark with number=1 and repeat=1, I still get:

--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
ProfileResult(mean=0.604963224, results=(0.604963224,))
resnet-18            604.96 ms           (0.00 ms)

Try replacing this line (the rasp3b entry in the target definition)

with

"rasp3b":    ["-model=bcm2837", "-target=armv7l-linux-gnueabihf -mattr=+neon"],

It seems neon is disabled by default in your llvm, because I can reproduce your results with neon disabled.
However, neon is enabled by default in my llvm. I will send a PR to add neon for all targets.
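
As a sanity check (a minimal sketch against the old tvm Python API used at this point in time; the target string mirrors the rasp3b entry above), you can cross-compile a trivial vectorized op and look for NEON instructions, e.g. vadd.f32 on q registers, in the emitted assembly:

import tvm

target = "llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon"
n = 1024
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] + 1.0, name="B")
s = tvm.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=4)
s[B].vectorize(xi)  # should lower to 128-bit NEON ops when +neon is active
lib = tvm.build(s, [A, B], target=target)
print(lib.get_source("asm"))  # grep for q-register instructions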

Thank you, I made that change but it did not resolve the problem for me:

--------------------------------------------------
Network Name         Mean Inference Time (std dev)
--------------------------------------------------
ProfileResult(mean=0.605326326, results=(0.605326326,))
resnet-18            605.33 ms           (0.00 ms)

This is strange. Can you provide the output of gcc -v and cat /proc/cpuinfo on your rasp?

Sure, the output of gcc -v is

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/arm-linux-gnueabihf/6/lto-wrapper
Target: arm-linux-gnueabihf
Configured with: ../src/configure -v --with-pkgversion='Raspbian 6.3.0-18+rpi1+deb9u1' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=arm-linux-gnueabihf- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-libitm --disable-libquadmath --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-armhf/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-armhf --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-armhf --with-arch-directory=arm --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-sjlj-exceptions --with-arch=armv6 --with-fpu=vfp --with-float=hard --enable-checking=release --build=arm-linux-gnueabihf --host=arm-linux-gnueabihf --target=arm-linux-gnueabihf
Thread model: posix
gcc version 6.3.0 20170516 (Raspbian 6.3.0-18+rpi1+deb9u1)

And the output of cat /proc/cpuinfo is:

processor	: 0
model name	: ARMv7 Processor rev 4 (v7l)
BogoMIPS	: 38.40
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4

processor	: 1
model name	: ARMv7 Processor rev 4 (v7l)
BogoMIPS	: 38.40
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4

processor	: 2
model name	: ARMv7 Processor rev 4 (v7l)
BogoMIPS	: 38.40
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4

processor	: 3
model name	: ARMv7 Processor rev 4 (v7l)
BogoMIPS	: 38.40
Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4

Hardware	: BCM2835
Revision	: a02082
Serial		: 00000000e33ec6b6

One quick way to diagnose whether this is a (host) software environment problem or a problem with the Pi 3B itself is to use the docker image. You can follow https://github.com/dmlc/tvm/tree/master/docker and run

docker/bash.sh tvmai/ci-gpu

Then build TVM inside that environment with llvm-config set to llvm-config-4.0; this gives you exactly the same software environment that we use in our builds. If the problem persists, we can at least confirm that the problem is on the device side.
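
For completeness, the in-container steps might look like this (hedged; it assumes llvm-config-4.0 is installed in the image and that you start from the tvm checkout):

docker/bash.sh tvmai/ci-gpu
# inside the container:
mkdir -p build && cp cmake/config.cmake build/
# edit build/config.cmake: set(USE_LLVM llvm-config-4.0)
cd build && cmake .. && make -j4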