Hi, I am slightly confused by the tvm.build_config parameters.
How should the parameters for tvm.build_config be chosen?
For example, in the “gpu_imagenet_bench.py” sample, how were the values 1400 or 128 chosen for the target?
My understanding is that "auto_unroll_max_step" refers to a threshold on the loop iteration count, i.e. copies of the loop body are added (the loop is unrolled) up to that threshold. Correct me if I am wrong.
Could you please also explain the other parameters, detect_global_barrier and partition_const_loop?
Also, is there any document I can go through to understand these parameters and their effect on performance?
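For reference, this is roughly the kind of configuration block I am asking about (a sketch, not the exact code from gpu_imagenet_bench.py; the values here are only placeholders):

```python
import tvm

# Placeholder values; not necessarily those used in gpu_imagenet_bench.py.
with tvm.build_config(auto_unroll_max_step=128,
                      detect_global_barrier=False,
                      partition_const_loop=False,
                      unroll_explicit=True):
    # ... compile the model inside this context,
    # e.g. graph, lib, params = nnvm.compiler.build(...)
    pass
```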
The value of auto_unroll_max_step is chosen by trial and error/tuning on each hardware target. The best value varies with the specific loop body and hardware target; we currently do not have an analytical way of choosing the optimal value.
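As a minimal sketch of what that trial and error can look like (assuming the old-style tvm.build_config context manager discussed in this thread; the toy kernel and candidate values below are made up for illustration):

```python
import numpy as np
import tvm

# Toy row-sum kernel; substitute your real workload here.
n = 1024
A = tvm.placeholder((n, n), name="A")
k = tvm.reduce_axis((0, n), name="k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))

ctx = tvm.context("opencl", 0)
a = tvm.nd.array(np.random.uniform(size=(n, n)).astype("float32"), ctx)
b = tvm.nd.array(np.zeros(n, dtype="float32"), ctx)

# Sweep candidate thresholds and time each variant.
for step in [0, 16, 128, 512, 1400]:
    with tvm.build_config(auto_unroll_max_step=step):
        f = tvm.build(s, [A, B], "opencl")
    cost = f.time_evaluator(f.entry_name, ctx, number=10)(a, b).mean
    print("auto_unroll_max_step=%d: %.3f ms" % (step, cost * 1e3))
```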
Thanks. I was trying to use an Nvidia GPU with OpenCL, and when I use the value 1440 it results in a stack overflow.
So how can I find the maximum safe value for this setting on given hardware? Does it relate to any spec of the GPU?
If this is a stack overflow at compile time, it could be due to excessive unrolling. The quick fix is to reduce the unroll extent (see also https://github.com/dmlc/tvm/pull/983).
Hi, one thing I observed is that with auto_unroll_max_step=1440, the stack overflow does not happen if unroll_explicit=False. Is this the right configuration for using auto_unroll_max_step?
If you are on an OpenCL target, that setting may disable the unrolling entirely, which is why you no longer see the error. I would recommend trying out different configurations to see which gives the best performance (it may be that unrolling does not improve performance at all).
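One way to see what is actually happening is to inspect the generated OpenCL source under both settings. A sketch with a made-up kernel: with unroll_explicit=True, TVM expands the loop body itself; with unroll_explicit=False it only emits an unroll hint, leaving the decision to the device compiler (and, as noted above, that hint may effectively be dropped on OpenCL):

```python
import tvm

# Small fixed-extent inner loop, a plausible unrolling candidate.
n = 1024
A = tvm.placeholder((n, 8), name="A")
k = tvm.reduce_axis((0, 8), name="k")
B = tvm.compute((n,), lambda i: tvm.sum(A[i, k], axis=k), name="B")
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))

for explicit in [True, False]:
    with tvm.build_config(auto_unroll_max_step=1440, unroll_explicit=explicit):
        f = tvm.build(s, [A, B], "opencl")
    print("=== unroll_explicit=%s ===" % explicit)
    # The device kernel source shows whether the loop body was expanded.
    print(f.imported_modules[0].get_source())
```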
Yes, you are absolutely right.
I tried different values of auto_unroll_max_step, auto_unroll_max_depth, unroll_explicit, etc. with OpenCL, but there is no improvement in performance; all of them resulted in similar execution times.
What other configuration options could improve performance on OpenCL devices?
I am using an Nvidia GPU with OpenCL, but the performance (~190 ms for single-image inference) is far worse than execution on a CPU with AVX2 enabled (150 ms)!
Generally, tuning the build parameters will not get you anywhere near the performance gains of tuning schedules. So if you are defining new operators, i.e. not using the schedules in topi, schedule optimization would be my recommendation.
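To illustrate the difference in kind: a schedule decides how loops are tiled, bound to the device, and vectorized, which usually matters far more than the build flags. A minimal sketch with a made-up elementwise op (not a topi schedule):

```python
import tvm

# Hypothetical elementwise operator; replace with your own compute definition.
n = 4096
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] * 2.0, name="B")

# Schedule-level optimizations: tile, bind to GPU blocks/threads, vectorize.
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=1024)
tx, vi = s[B].split(tx, factor=4)
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))
s[B].vectorize(vi)

# The same schedule builds for OpenCL or CUDA.
f = tvm.build(s, [A, B], "opencl", name="scale2")
print(f.imported_modules[0].get_source())  # inspect the generated kernel
```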
Why are you using the OpenCL backend for an Nvidia GPU instead of CUDA?
Yes, I agree that I have to use schedules to improve runtime performance.
We are evaluating TVM on GPU devices from different vendors such as Intel and Nvidia, and OpenCL serves as a common platform for them. That is why I am sticking with OpenCL instead of CUDA for now.
Also, I could not find topi schedules for the OpenCL platform on an Nvidia GPU. How can I get started with that?
Sorry for adding one more question:
As I mentioned before, I am also trying to run on Intel Graphics (iGPU), and I found the topi schedules here. Are they invoked automatically when calling nnvm.compiler.build, or should I call those schedules explicitly in my code?