On remote:
- I think setting TVM_NUM_THREADS on the remote edge (RPC side) makes no sense; by my logic it doesn't help. I also can't see it used anywhere in the RPC code. The edge receives one sample kernel, tests it (using the multicore CPU or the GPU), then sends the measurement results back. I can't see what could run in parallel on the RPC side, either in principle or in the code. The kernel under test may itself run parallelized, but only one kernel (test case) runs on the edge at a time.
- If one wants parallel searching over remote RPC, one has to use multiple physical edges. Each edge registered to the tracker receives test kernels at the same time, so N edges cut the measurement time to roughly (Time / N).
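As a rough illustration of the (Time / N) point, here is a back-of-the-envelope sketch. It assumes every edge is equally fast and ignores tracker and network overhead; the function name and the numbers are made up for the example, not taken from TVM.

```python
import math


def estimated_measure_time(n_trials, secs_per_trial, n_edges):
    """Estimate wall-clock measurement time when each edge runs one kernel at a time.

    Assumption: all edges are equally fast and scheduling overhead is zero.
    """
    if n_edges < 1:
        raise ValueError("need at least one edge registered to the tracker")
    # Each round hands one test kernel to every edge in parallel.
    rounds = math.ceil(n_trials / n_edges)
    return rounds * secs_per_trial


# Hypothetical numbers: 1000 trials at 2 seconds each.
single_edge = estimated_measure_time(1000, 2.0, 1)
four_edges = estimated_measure_time(1000, 2.0, 4)
print(single_edge, four_edges)
```

In practice the speedup will be somewhat below N, since the host still has to build the kernels and the edges rarely behave identically.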
On host:
- Yes, it matters a lot. This can be observed during the xgboost steps (internal xgb feature re-processing):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19199 cbalint 20 0 24.2g 700932 112244 R 181.8 4.3 0:30.97 tune-mali.py
19198 cbalint 20 0 24.2g 700808 112120 R 154.5 4.3 0:28.81 tune-mali.py
19197 cbalint 20 0 24.2g 700932 112244 R 136.4 4.3 0:31.74 tune-mali.py
19193 cbalint 20 0 24.2g 700796 112108 R 127.3 4.3 0:29.89 tune-mali.py
19195 cbalint 20 0 24.2g 700916 112228 S 127.3 4.3 0:30.40 tune-mali.py
19194 cbalint 20 0 24.2g 700872 112184 S 18.2 4.3 0:29.98 tune-mali.py
19196 cbalint 20 0 24.2g 700924 112236 S 9.1 4.3 0:33.25 tune-mali.py
I think the tutorial sets it to 1 (a safe demo value for any target).
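For the host side, a minimal sketch of raising the value from inside the tuning script could look like the following. The helper name is hypothetical, and it assumes the variable is set before `tvm` is imported so the runtime picks it up.

```python
import os


def set_tvm_num_threads(num_threads=None):
    """Set TVM_NUM_THREADS for this process; defaults to the host core count.

    Hypothetical helper for illustration, not part of TVM.
    """
    if num_threads is None:
        # os.cpu_count() can return None, so fall back to a single thread.
        num_threads = os.cpu_count() or 1
    os.environ["TVM_NUM_THREADS"] = str(num_threads)
    return num_threads


# Call this at the top of the tuning script, before `import tvm`.
n_threads = set_tvm_num_threads()
print("TVM_NUM_THREADS =", os.environ["TVM_NUM_THREADS"])
```

Exporting the variable in the shell before launching the script should have the same effect.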