Thoughts on a Simpler Scheduling Language

srush · February 10, 2021, 11:27pm

Hi all,

I find that programming in TVM results in an extremely large number of non-scoped variables. The main problem is that the axes and tensors are not grouped, and simple mistakes result in extremely verbose low-level errors. 90% of my mistakes are just from not keeping tensors and axes grouped together.

I’m curious what people think of a less low-level scheduling language. I generally write my code in this style, which is much less verbose, fixes double splitting, and prevents errors from mixing up which axis belongs to which tensor.

    ll, nn = s.axes(C)
    reduce_axis = s.reduce_axis(C)
    ll = ll.split(TPB)
    nn = nn.split(TPB)
    mm = reduce_axis.split(TPB)
    s.reorder(C, (ll.outer, nn.outer, ll.inner, nn.inner, mm.outer, mm.inner))

    # Bind blocks and threads to C
    ll.outer.bind(te.thread_axis("blockIdx.x"))
    nn.outer.bind(te.thread_axis("blockIdx.y"))
    ll.inner.bind(tx)
    nn.inner.bind(ty)

    # Set up Caching
    ll_A, mm_A = s.axes(AA)
    ll_A = ll_A.split(TPB)
    mm_A = mm_A.split(TPB)
    s.reorder(AA, (ll_A.outer, mm_A.outer, ll_A.inner, mm_A.inner))
    mm.outer.compute_at(AA)
    ll_A.inner.bind(tx)
    mm_A.inner.bind(ty)

Do people have any other tricks? Ideally there would be a really nice way to group together the splitting of two tensors in the same way (in this case ll_A mirrors ll, so why are they separate?)
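
For readers who want to try the grouping idea without TVM, here is a minimal sketch of how an axis object could carry its owning tensor, so that double splits and cross-tensor mix-ups fail early with a short message. Everything in it (`Axis`, `Schedule`, `ScheduleError`, and the `names` argument to `axes`) is hypothetical and only records operations in a list; it is not TVM's API and not necessarily how the wrapper above is implemented.

```python
class ScheduleError(Exception):
    """Raised when an axis is split twice or used with the wrong tensor."""


class Axis:
    """A loop axis that remembers which tensor it belongs to (hypothetical)."""

    def __init__(self, tensor, name, log):
        self.tensor = tensor
        self.name = name
        self.outer = None  # filled in by split()
        self.inner = None
        self._log = log    # shared list of recorded schedule operations

    def split(self, factor):
        """Split once; the two halves are reachable as .outer and .inner."""
        if self.outer is not None:
            raise ScheduleError(f"{self.tensor}.{self.name} is already split")
        self.outer = Axis(self.tensor, self.name + ".outer", self._log)
        self.inner = Axis(self.tensor, self.name + ".inner", self._log)
        self._log.append(("split", self.tensor, self.name, factor))
        return self

    def bind(self, thread_tag):
        self._log.append(("bind", self.tensor, self.name, thread_tag))


class Schedule:
    """Toy schedule that only records operations; it never calls TVM."""

    def __init__(self):
        self.log = []

    def axes(self, tensor, names):
        return [Axis(tensor, n, self.log) for n in names]

    def reorder(self, tensor, order):
        # Reject axes that belong to a different tensor before recording.
        for axis in order:
            if axis.tensor != tensor:
                raise ScheduleError(
                    f"axis {axis.name} belongs to {axis.tensor}, not {tensor}"
                )
        self.log.append(("reorder", tensor, tuple(a.name for a in order)))


s = Schedule()
ll, nn = s.axes("C", ["ll", "nn"])
ll_A, mm_A = s.axes("AA", ["ll_A", "mm_A"])
ll = ll.split(16)
nn = nn.split(16)
s.reorder("C", (ll.outer, nn.outer, ll.inner, nn.inner))
ll.outer.bind("blockIdx.x")

try:
    s.reorder("C", (ll.outer, ll_A))  # ll_A belongs to AA, so this is rejected
except ScheduleError as err:
    print("caught:", err)
```

A real version would forward each recorded operation to the underlying stage instead of appending to a list; the point here is only that the ownership check lives in one place.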

junrushao · February 11, 2021, 6:29am

Hey @srush,

Thanks for asking!

We are actively developing a more straightforward scheduling language based on a new IR called TensorIR. The main features include:

Imperative scheduling. Using a schedule primitive is as simple as applying an independent compiler pass that transforms an old TensorIR into a new one. Like PyTorch’s imperative execution, imperative scheduling lets you print and debug the scheduling process at any step, which provides a smoother debugging experience than schedule-tree based TE scheduling.
Python-first syntax. At any step of scheduling, the TensorIR can be printed as Python syntax, which can in turn be parsed back into TensorIR/schedule state, i.e. it is a round-trippable DSL embedded in Python. The syntax is designed to be human readable and easy to manipulate. For example, @spectrometerHBH and @vinx13 recently implemented block-sparse kernels in TensorIR within 20 lines of this Python DSL, and then applied auto scheduling to it.
Competitive GEMM performance with auto tensorization. We have noticed growing demand for competitive GEMM performance, as you mentioned in the previous thread. TensorIR redesigns the tensorization mechanism, allowing direct embedding of tensor instructions (like Tensor Core) and microkernels; it also comes with an auto scheduling framework that supports searching with an XGB-based cost model. With all those mechanisms, we have a better chance of reaching competitive performance.
New schedule primitives made easy. The imperative scheduling style makes it much easier to introduce more schedule primitives, including loop partitioning, layout rewrite, etc. In your particular case, we have developed a primitive called reverse_compute_at, which computes the consumer under a specific loop of the producer. The shape of the computed region is handled automatically in our schedule, so you don’t have to repeat the splitting, reordering, etc.
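
To make the "each primitive is a pass from an old IR to a new one, printable at any step" idea concrete, here is a toy sketch in plain Python. It is not TensorIR and does not use TVM: `Loop`, `LoopNest`, `split`, and `reorder` are hypothetical names, and the "IR" is just a loop nest with a string body. It only illustrates the imperative, print-after-every-step workflow described above.

```python
from typing import NamedTuple, Tuple


class Loop(NamedTuple):
    name: str
    extent: int


class LoopNest(NamedTuple):
    loops: Tuple[Loop, ...]
    bindings: Tuple[str, ...]  # index definitions introduced by splits
    body: str

    def script(self) -> str:
        """Print the nest as Python-like source, one loop per level."""
        lines = []
        for depth, loop in enumerate(self.loops):
            lines.append("    " * depth + f"for {loop.name} in range({loop.extent}):")
        indent = "    " * len(self.loops)
        lines += [indent + text for text in self.bindings + (self.body,)]
        return "\n".join(lines)


def split(nest: LoopNest, name: str, factor: int) -> LoopNest:
    """Return a new nest with loop `name` split into `name_o` and `name_i`."""
    new_loops = []
    for loop in nest.loops:
        if loop.name != name:
            new_loops.append(loop)
            continue
        if loop.extent % factor != 0:
            raise ValueError(f"{factor} does not divide extent {loop.extent}")
        new_loops.append(Loop(f"{name}_o", loop.extent // factor))
        new_loops.append(Loop(f"{name}_i", factor))
    if len(new_loops) == len(nest.loops):
        raise KeyError(name)
    # The original index is recomputed from the two new loop variables.
    binding = f"{name} = {name}_o * {factor} + {name}_i"
    return LoopNest(tuple(new_loops), (binding,) + nest.bindings, nest.body)


def reorder(nest: LoopNest, order) -> LoopNest:
    """Return a new nest with the same loops in the given order."""
    by_name = {loop.name: loop for loop in nest.loops}
    if sorted(order) != sorted(by_name):
        raise ValueError("order must list every loop exactly once")
    return nest._replace(loops=tuple(by_name[n] for n in order))


nest0 = LoopNest((Loop("i", 64), Loop("j", 64)), (), "C[i, j] = A[i, j] + 1.0")
nest1 = split(nest0, "i", 16)
nest2 = reorder(nest1, ["i_o", "j", "i_i"])
for step in (nest0, nest1, nest2):
    print(step.script(), end="\n\n")
```

Because every primitive returns a fresh nest and leaves its input untouched, any intermediate state can be printed or compared, which is the debugging property the post attributes to imperative scheduling.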

RFC: [RFC] TensorIR: A schedulable IR for TVM. (Since that RFC, we have added some syntactic sugar to make it look simpler.)

We are preparing to upstream the codebase, and will keep the community closely updated on our latest status.

srush · February 11, 2021, 1:35pm

What an amazing answer! Thank you so much for your time and thoughtfulness.