[auto-scheduler] Sketch Rules for GPU

I’ve had some success with writing sketch rules for sparse computations, which I hope to upstream once my paper is done.

However, I am having some issues auto-scheduling sparse computations on the GPU.

When running with my sketch rule (one that works on the CPU), I fail with the following error:

```
TVMError: Parallel_for error with [17:01:12] ../src/auto_scheduler/search_policy/sketch_policy_rules.cc:769:
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (HasCrossThreadReduction(*state, stage_id)) is false:
```
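
For context, the rule is wired up the same way the sparse tutorial does it, with a `meet_condition_func`/`apply_func` pair preloaded into the `SketchPolicy`. Roughly like below; the op tag, rule name, and rule body are placeholders here, since the real rule is paper-specific:

```python
from tvm import auto_scheduler

def meet_condition_func(search_policy, state, stage_id):
    state = auto_scheduler.loop_state.State(
        state, search_policy.search_task.compute_dag
    )
    # Fire only on the stage carrying my sparse op ("my_sparse_op" is a placeholder tag).
    if state.stages[stage_id].op.tag == "my_sparse_op":
        return auto_scheduler.PreloadCustomSketchRule.APPLY_AND_SKIP_REST
    return auto_scheduler.PreloadCustomSketchRule.PASS

def apply_func(search_policy, state, stage_id):
    ...  # the schedule steps that work on the CPU today

search_policy = auto_scheduler.SketchPolicy(
    task,  # the SearchTask for the sparse workload
    program_cost_model=auto_scheduler.XGBModel(),
    init_search_callbacks=[
        auto_scheduler.PreloadCustomSketchRule(
            meet_condition_func, apply_func, "MySparseRule"
        )
    ],
)
```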

When I run the sparse SketchRule tutorial for a fully-connected layer, but change it to target the GPU, it fails with:

File "../src/support/parallel_for.cc", line 92
TVMError: Parallel_for error with [17:06:03] ../src/te/schedule/bound.cc:144: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (it != rmap->end()) is false: 
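
(To reproduce: I keep the tutorial's task and custom rule unchanged; as far as I can tell, the only relevant change is the target.)

```python
# Assumed to be the only relevant change from the tutorial:
target = tvm.target.Target("cuda")  # instead of the tutorial's LLVM target
```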

It could be related; however, fixing dense is of less interest to me.

PR #6269 (from @jcf94) added sketch policies for the GPU, and this comment even highlights the bit of code that gives me the issue.

I can't find a tutorial about sparse sketch rules for GPUs. Are there any rules of thumb, or primitives I should be using?

I know that when hand-writing a schedule for the GPU, I should be binding loops to blocks and threads, e.g. with:

```python
# Split off a thread-sized inner loop, then bind the pieces
# to the GPU grid and block.
bx, tx = s[data_vec].split(c, factor=max_threads)
s[data_vec].bind(bx, tvm.te.thread_axis("blockIdx.x"))
s[data_vec].bind(tx, tvm.te.thread_axis("threadIdx.x"))
```

Is there something like that I could/should be doing?
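
In loop-state terms, I'm guessing the equivalent inside `apply_func` would look something like the untested sketch below, where I assume the first two iterators of the stage are the spatial axes and take the thread limit from the task's hardware params:

```python
def apply_func(search_policy, state, stage_id):
    s0 = auto_scheduler.loop_state.State(
        state, search_policy.search_task.compute_dag
    )
    max_threads = search_policy.search_task.hardware_params.max_threads_per_block
    # Fuse the (assumed) spatial axes, split off a thread-sized inner
    # loop, and bind the pieces to the GPU grid and block.
    i, j = s0.stages[stage_id].iters[0], s0.stages[stage_id].iters[1]
    fused = s0.fuse(stage_id, [i, j])
    bx, tx = s0.split(stage_id, fused, [max_threads])
    s0.bind(stage_id, bx, "blockIdx.x")
    s0.bind(stage_id, tx, "threadIdx.x")
    return [[s0.state_object, stage_id - 1]]
```

Is that the right shape, or should I be reusing the built-in GPU rules somehow?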