How to optimize the implementation of indexing?

Indexing is very common in some ops. When the indices are passed as a Python list, a tvm.var cannot be used to index that list inside the lambda expression. For example: (image) So tvm.select is used instead:

If the input shape is [5,10,10] and permut is [1,3,4,1,2], the lowered IR can be

However, we expect the select to apply only on the i0 axis, and the best expression may be more efficient without the i0 loop, like this:

    # i0 = 0, copy first 10 * 10
    for (i1, 0, 10) { for (i2, 0, 10) { … } }
    # i0 = 1, copy 3rd 10 * 10
    for (i1, 0, 10) { for (i2, 0, 10) { … } }
    …

So how can I get the IR printed like this?

It looks like simply unrolling i0 gives you the desired result.
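To see why unrolling helps, here is a minimal plain-Python model of the idea (not TVM code). The names `select_chain`, `gather_with_loop` and `gather_unrolled` are made up for illustration, and the index list is assumed to be 0-based. With i0 as a loop variable, the nested select has to be evaluated for every element. Once i0 is a known constant per iteration, the whole chain collapses to one fixed source row and only the plain copy remains.

```python
def select_chain(permut, i0):
    """Model of select(i0 == 0, permut[0], select(i0 == 1, permut[1], ...))."""
    row = permut[-1]  # innermost "else" branch of the chain
    for k in reversed(range(len(permut) - 1)):
        row = permut[k] if i0 == k else row
    return row


def gather_with_loop(data, permut):
    """i0 stays a loop variable: the chain is evaluated for every element."""
    n1, n2 = len(data[0]), len(data[0][0])
    return [
        [[data[select_chain(permut, i0)][i1][i2] for i2 in range(n2)]
         for i1 in range(n1)]
        for i0 in range(len(permut))
    ]


def gather_unrolled(data, permut):
    """i0 is a constant in each unrolled step: the chain folds to one row."""
    out = []
    for i0 in range(len(permut)):  # stands in for the unrolled copies
        src = select_chain(permut, i0)  # resolved once, outside i1/i2
        out.append([list(line) for line in data[src]])  # plain block copy
    return out


# Small stand-in for the [5, 10, 10] input; permut is assumed 0-based here.
data = [[[100 * a + 10 * b + c for c in range(2)] for b in range(2)]
        for a in range(5)]
permut = [0, 2, 3, 0, 1]
assert gather_with_loop(data, permut) == gather_unrolled(data, permut)
```

In a real schedule the folding would be done by the compiler's simplifier after the unroll, so the exact IR you get may differ from this picture.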

I have used unroll + select + inline to pass a constant matrix to TVM. The matrix has many zeros, and TVM can simplify the expression to its simplest form.
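As a rough picture of what that simplification buys you, here is a small plain-Python sketch (again not TVM code, and `unrolled_terms` is a hypothetical helper). It assumes the constant matrix is used in a matrix-vector product y = M x. Once both loops are unrolled and the entries of M are known constants, every term with a zero coefficient can be dropped.

```python
def unrolled_terms(matrix):
    """Write out y = M x row by row, keeping only non-zero coefficients."""
    lines = []
    for i, row in enumerate(matrix):
        terms = []
        for j, coeff in enumerate(row):
            if coeff == 0:
                continue  # a zero entry contributes nothing once unrolled
            terms.append(f"x[{j}]" if coeff == 1 else f"{coeff}*x[{j}]")
        rhs = " + ".join(terms) if terms else "0"
        lines.append(f"y[{i}] = {rhs}")
    return lines


for line in unrolled_terms([[1, 0, 0], [0, 0, 2], [0, 0, 0]]):
    print(line)
# y[0] = x[0]
# y[1] = 2*x[2]
# y[2] = 0
```

The sparser the matrix, the fewer terms survive, which is where the benefit of unrolling comes from in this case.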