Hi, I am slightly confused by the tvm.build_config parameters.
How should the parameters for tvm.build_config be chosen?
For example, in the “gpu_imagenet_bench.py” sample, how were the values 1400 or 128 chosen for the target?
My understanding is that "auto_unroll_max_step" refers to a threshold on the loop iteration count, i.e. copies of the loop body are added (the loop is unrolled) up to that threshold. Correct me if I am wrong.
Could you please also explain the other parameters, detect_global_barrier and partition_const_loop?
Also, is there any document I can go through to understand these parameters and their effect on performance?
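For reference, this is roughly the kind of configuration block I am asking about (a sketch, not the exact code from gpu_imagenet_bench.py; the values here are only placeholders):

```python
import tvm

# Placeholder values; not necessarily those used in gpu_imagenet_bench.py.
with tvm.build_config(auto_unroll_max_step=128,
                      detect_global_barrier=False,
                      partition_const_loop=False,
                      unroll_explicit=True):
    # ... compile the model inside this context,
    # e.g. graph, lib, params = nnvm.compiler.build(...)
    pass
```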
The value of auto_unroll_max_step is chosen by trial and error/tuning on each hardware target. The best value varies with the specific loop body and hardware target; we currently do not have an analytical way of choosing the optimal value.
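As a minimal sketch of what that trial and error can look like (assuming the old-style tvm.build_config context manager discussed in this thread; the toy kernel and candidate values below are made up for illustration):

```python
import numpy as np
import tvm

# Toy row-sum kernel; substitute your real workload here.
n = 1024
A = tvm.placeholder((n, n), name="A")
k = tvm.reduce_axis((0, n), name="k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))

ctx = tvm.context("opencl", 0)
a = tvm.nd.array(np.random.uniform(size=(n, n)).astype("float32"), ctx)
b = tvm.nd.array(np.zeros(n, dtype="float32"), ctx)

# Sweep candidate thresholds and time each variant.
for step in [0, 16, 128, 512, 1400]:
    with tvm.build_config(auto_unroll_max_step=step):
        f = tvm.build(s, [A, B], "opencl")
    cost = f.time_evaluator(f.entry_name, ctx, number=10)(a, b).mean
    print("auto_unroll_max_step=%d: %.3f ms" % (step, cost * 1e3))
```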
Thanks. I was trying to use an Nvidia GPU with OpenCL, and when I use the value 1440 it results in a stack overflow.
So how can I find the maximum safe value for this setting on given hardware? Does it relate to any spec of the GPU?
If this is a stack overflow at compile time, it could be due to excessive unrolling. The quick fix is to reduce the unroll extent (see also https://github.com/dmlc/tvm/pull/983).
Hi, one thing I observed is that with auto_unroll_max_step=1440, the stack overflow does not happen if unroll_explicit=False. Is this the right configuration for using auto_unroll_max_step?
If you are on an OpenCL target, that setting may disable the unrolling entirely, which is why you no longer see the error. I would recommend trying out different configurations to see which gives the best performance (it may be that unrolling does not improve performance at all).
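One way to see what is actually happening is to inspect the generated OpenCL source under both settings. A sketch with a made-up kernel: with unroll_explicit=True, TVM expands the loop body itself; with unroll_explicit=False it only emits an unroll hint, leaving the decision to the device compiler (and, as noted above, that hint may effectively be dropped on OpenCL):

```python
import tvm

# Small fixed-extent inner loop, a plausible unrolling candidate.
n = 1024
A = tvm.placeholder((n, 8), name="A")
k = tvm.reduce_axis((0, 8), name="k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))

for explicit in [True, False]:
    with tvm.build_config(auto_unroll_max_step=1440, unroll_explicit=explicit):
        f = tvm.build(s, [A, B], "opencl")
    print("=== unroll_explicit=%s ===" % explicit)
    # The device kernel source shows whether the loop body was expanded.
    print(f.imported_modules[0].get_source())
```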
Yes, you are absolutely right.
I tried different values of auto_unroll_max_step, auto_unroll_max_depth, unroll_explicit, etc. with OpenCL, but there is no improvement in performance; all of them resulted in similar execution times.
What other configuration options could improve performance on OpenCL devices?
I am using an Nvidia GPU with OpenCL, but the performance (~190 ms for single-image inference) is far worse than execution on a CPU with AVX2 enabled (150 ms)!
Generally, tuning the build parameters will not get you anywhere near the performance gains of tuning schedules. So if you are defining new operators, i.e. not using the schedules in topi, schedule optimization would be my recommendation.
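To illustrate the difference in kind: a schedule decides how loops are tiled, bound to the device, and vectorized, which usually matters far more than the build flags. A minimal sketch with a made-up elementwise op (not a topi schedule):

```python
import tvm

# Hypothetical elementwise operator; replace with your own compute definition.
n = 4096
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] * 2.0, name="B")

# Schedule-level optimizations: tile, bind to GPU blocks/threads, vectorize.
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=1024)
tx, vi = s[B].split(tx, factor=4)
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))
s[B].vectorize(vi)

# The same schedule builds for OpenCL or CUDA.
f = tvm.build(s, [A, B], "opencl", name="scale2")
print(f.imported_modules[0].get_source())  # inspect the generated kernel
```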
Why are you using the OpenCL backend for an Nvidia GPU instead of CUDA?
Yes, I agree that I have to use schedules to improve runtime performance.
We are evaluating TVM on GPU devices from different vendors such as Intel and Nvidia, and OpenCL serves as a common platform for them. That is why I am sticking with OpenCL instead of CUDA for now.
Also, I could not find topi schedules for the OpenCL platform on an Nvidia GPU. How can I get started with that?
Sorry for adding one more question:
As I mentioned before, I am also trying to run on Intel Graphics (iGPU), and I found the topi schedules here. Are they invoked automatically when calling nnvm.compiler.build, or should I call those schedules explicitly in my code?