How do I write a correct schedule for a simple hybrid script like the following?
for i in range(batch_size):
    valid_count[i] = 0
    for j in bind('threadIdx.x', num_anchors):
        score = data[i, j, 1]
        if score > score_threshold:
            for k in bind('threadIdx.y', box_data_length):
                out_tensor[i, valid_count[i], k] = data[i, j, k]
            valid_count[i] += 1
        if j > valid_count[i]:
            for k in bind('threadIdx.y', box_data_length):
                out_tensor[i, j, k] = -1.0
Currently, the default schedule (https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/vision.py#L10), which works for ir_builder, doesn't work for hybrid script: it fails with a "Did you forget to bind?" error.
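For context, the full hybrid function looks roughly like this (a sketch only; the output_tensor declarations, shapes, and function signature are my reconstruction, not the exact script):

@tvm.hybrid.script
def hybrid_get_valid_counts_gpu(data, score_threshold):
    batch_size = data.shape[0]
    num_anchors = data.shape[1]
    box_data_length = data.shape[2]
    # Two outputs: per-batch count of valid boxes, and the filtered boxes.
    valid_count = output_tensor((batch_size,), 'int32')
    out_tensor = output_tensor((batch_size, num_anchors, box_data_length), 'float32')
    for i in range(batch_size):
        valid_count[i] = 0
        for j in bind('threadIdx.x', num_anchors):
            score = data[i, j, 1]
            if score > score_threshold:
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, valid_count[i], k] = data[i, j, k]
                valid_count[i] += 1
            if j > valid_count[i]:
                for k in bind('threadIdx.y', box_data_length):
                    out_tensor[i, j, k] = -1.0
    return valid_count, out_tensor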
@were
were
February 21, 2019, 11:47pm
2
Can you please paste the whole code instead of a snippet, so that I can replicate your issue quickly?
OK, I wrote a script for you to run:
were
February 22, 2019, 5:01am
4
produce hybrid_get_valid_counts_gpu {
  // attr [0] extern_scope = 0
  // attr [iter_var(threadIdx.y, Range(min=0, extent=6), threadIdx.y)] thread_extent = 6
  // attr [score] storage_scope = "global"
  allocate score[float32 * 1]
  // attr [iter_var(threadIdx.x, Range(min=0, extent=2500), threadIdx.x)] thread_extent = 2500
  for (i, 0, 1) {
    hybrid_get_valid_counts_gpu.v0[i] = 0
    score[0] = data[(((threadIdx.x + (i*2500))*6) + 1)]
    if ((0.000000f < score[0])) {
      hybrid_get_valid_counts_gpu.v1[((threadIdx.y + (i*15000)) + (hybrid_get_valid_counts_gpu.v0[i]*6))] = data[((threadIdx.y + (threadIdx.x*6)) + (i*15000))]
      hybrid_get_valid_counts_gpu.v0[i] = (hybrid_get_valid_counts_gpu.v0[i] + 1)
    }
    if ((hybrid_get_valid_counts_gpu.v0[i] < threadIdx.x)) {
      hybrid_get_valid_counts_gpu.v1[((threadIdx.y + (threadIdx.x*6)) + (i*15000))] = -1.000000f
    }
  }
}
tvm._ffi.base.TVMError: [20:57:22] ~/tvm/src/codegen/codegen_cuda.cc:242: Check failed: scope != "global" (global vs. global)
were
February 22, 2019, 5:01am
5
// attr [score] storage_scope = "global"
allocate score[float32 * 1]
produce hybrid_get_valid_counts_gpu {
  // attr [0] extern_scope = 0
  for (i, 0, 1) {
    hybrid_get_valid_counts_gpu.v0[i] = 0
    // attr [iter_var(threadIdx.y, Range(min=0, extent=6), threadIdx.y)] thread_extent = 6
    // attr [iter_var(threadIdx.x, Range(min=0, extent=2500), threadIdx.x)] thread_extent = 2500
    score[0] = data[((((i*2500) + threadIdx.x)*6) + 1)]
    if ((0.000000f < score[0])) {
      hybrid_get_valid_counts_gpu.v1[((((i*2500) + hybrid_get_valid_counts_gpu.v0[i])*6) + threadIdx.y)] = data[(((i*15000) + threadIdx.y) + (threadIdx.x*6))]
      hybrid_get_valid_counts_gpu.v0[i] = (hybrid_get_valid_counts_gpu.v0[i] + 1)
    }
    if ((hybrid_get_valid_counts_gpu.v0[i] < threadIdx.x)) {
      hybrid_get_valid_counts_gpu.v1[(((i*15000) + threadIdx.y) + (threadIdx.x*6))] = -1.000000f
    }
  }
}
ValueError: Direct host side access to device memory is detected in default_function. Did you forget to bind?
vinx13
February 22, 2019, 5:38am
6
I met this before. Local variables in Python are translated to a global var in the IR. @were, maybe you have an idea on this: [Hybrid] variables on CUDA should have 'local' scope
were
February 22, 2019, 5:43am
7
OK, I now see the problem.
Leyuan uses a single variable inside a "bind" loop body.
This is bad in both current TVM and hybrid script.
In CUDA, a local variable would be lowered to a register, but TVM has no CUDA register abstraction.
Thus, using a single variable inside a bind body should be banned.
If you REALLY want to do it, you should allocate a scratchpad outside the bind body, as sketched below.
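For example, something like this (a sketch; the buffer shape and 'local' scope are illustrative, not a drop-in fix):

# Give each threadIdx.x lane its own slot in an explicitly allocated
# scratchpad, instead of a bare `score` variable inside the bind body.
score = allocate((num_anchors,), 'float32', 'local')
for i in range(batch_size):
    for j in bind('threadIdx.x', num_anchors):
        score[j] = data[i, j, 1]
        if score[j] > score_threshold:
            valid_count[i] += 1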
@were Thanks a lot for solving the issue. I hit another error with your updated hybrid script when running the same script I linked above:
CUDALaunch Error: CUDA_ERROR_INVALID_VALUE
grid=(1,1,1), block=(2500,6,1)
// func_name=default_function_kernel0
// CUDA Source
// -----------
extern "C" __global__ void default_function_kernel0( int* __restrict__ hybrid_get_valid_counts_gpu_v0, float* __restrict__ data, float* __restrict__ hybrid_get_valid_counts_gpu_v1) {
float score[1];
for (int i = 0; i < 1; ++i) {
hybrid_get_valid_counts_gpu_v0[i] = 0;
score[0] = data[((1 + (((int)threadIdx.x) * 6)) + (i * 15000))];
if (0.000000e+00f < score[0]) {
hybrid_get_valid_counts_gpu_v1[((((int)threadIdx.y) + (i * 15000)) + (hybrid_get_valid_counts_gpu_v0[i] * 6))] = data[((((int)threadIdx.y) + (((int)threadIdx.x) * 6)) + (i * 15000))];
hybrid_get_valid_counts_gpu_v0[i] = (1 + hybrid_get_valid_counts_gpu_v0[i]);
}
if (hybrid_get_valid_counts_gpu_v0[i] < ((int)threadIdx.x)) {
hybrid_get_valid_counts_gpu_v1[((((int)threadIdx.y) + (((int)threadIdx.x) * 6)) + (i * 15000))] = -1.000000e+00f;
}
}
}
It seems that hybrid script cannot detect the device's maximum thread count (tvm.target.current_target(allow_none=False).max_num_threads) and size the launch accordingly. In this case I'm launching too many threads: block=(2500,6,1) is 15000 threads per block, far above the usual CUDA limit of 1024 threads per block. How shall I allocate only up to the maximum number of threads available on the device?
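For comparison, the ir_builder-based kernels in topi/cuda cap the launch with a pattern roughly like this (a sketch, not the exact TOPI code):

# Split the anchor loop over blocks so no single block exceeds the
# device thread limit; `num_anchors` is the total amount of work.
ib = tvm.ir_builder.create()
max_threads = int(tvm.target.current_target(allow_none=False).max_num_threads)
tx = tvm.thread_axis("threadIdx.x")
bx = tvm.thread_axis("blockIdx.x")
ib.scope_attr(tx, "thread_extent", max_threads)
ib.scope_attr(bx, "thread_extent", (num_anchors + max_threads - 1) // max_threads)
j = bx.var * max_threads + tx.var  # flat anchor index
with ib.if_scope(j < num_anchors):
    pass  # guarded per-anchor body goes here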
were
February 24, 2019, 10:57pm
9
Can you elaborate a little more?
To be explicit, what are you expecting to be injected into the IR?
I'm expecting blockIdx.x to be bound automatically with the correct thread_extent, since hybrid script doesn't let the user query tvm.target.current_target(allow_none=False).max_num_threads.
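In other words, given some way to query the limit, I could write the split by hand, roughly like this (a sketch; `max_threads` is exactly the value I cannot obtain from inside hybrid script today):

# Hypothetical manual split of num_anchors across blocks and threads,
# with a guard against the tail block running past the end.
for bx in bind('blockIdx.x', (num_anchors + max_threads - 1) // max_threads):
    for tx in bind('threadIdx.x', max_threads):
        if bx * max_threads + tx < num_anchors:
            out_tensor[i, bx * max_threads + tx, 0] = -1.0  # placeholder body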
were
February 25, 2019, 6:47pm
11
PR: were:max-thread → master (opened 06:40PM - 25 Feb 19 UTC)
Followup to this thread of discussion: @Laurawly needs an interface to know the max number of threads to avoid thread overflow.
https://discuss.tvm.ai/t/hybrid-script-gpu-schedule/1720/2
Look at this PR! I added an intrinsic function, max_num_threads, to hybrid script.
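Usage looks roughly like this (a sketch based on the PR description; I haven't spelled out the exact signature, and it assumes the intrinsic folds to a compile-time constant):

@tvm.hybrid.script
def capped(data):
    n = data.shape[0]
    out = output_tensor((n,), 'float32')
    max_threads = max_num_threads()  # new intrinsic: device thread limit
    for bx in bind('blockIdx.x', (n + max_threads - 1) // max_threads):
        for tx in bind('threadIdx.x', max_threads):
            if bx * max_threads + tx < n:
                out[bx * max_threads + tx] = data[bx * max_threads + tx]
    return out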