[AutoScheduler] Is there a plan to support auto-scheduling ExternOp?

Hi All.

I just noticed that AutoScheduler lacks support for ExternOp. Currently AutoScheduler supports ComputeOp only.

I understand that it is non-trivial to auto-schedule an op with external function calls. However, there are a bunch of topi ops whose algorithms are written purely with tensor expressions, with NO extern function call involved, yet still use “te.Extern()” as their wrapper. It’s really frustrating because those ops can’t be auto-tuned with auto_scheduler. I can give a few examples, such as scatter_nd: tvm/scatter.py at 8fce89500c520c4dc6ce8733172fa87ead107709 · apache/tvm · GitHub
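To make the issue concrete, here is a minimal sketch (the names and the trivial copy body are illustrative, not taken from topi) contrasting a plain te.compute, which yields a ComputeOp that AutoScheduler can analyze, with a te.extern wrapper, whose body is an opaque TIR statement and therefore shows up as an ExternOp that AutoScheduler skips:

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A")

# ComputeOp: a declarative compute that AutoScheduler can analyze and tune.
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
print(type(B.op))  # <class 'tvm.te.tensor.ComputeOp'>

# ExternOp: the body is an opaque TIR statement produced by ir_builder,
# so AutoScheduler has nothing it can search over.
def copy_ir(ins, outs):
    ib = tvm.tir.ir_builder.create()
    src = ib.buffer_ptr(ins[0])
    dst = ib.buffer_ptr(outs[0])
    with ib.for_range(0, n, name="i") as i:
        dst[i] = src[i]
    return ib.get()

C = te.extern([(n,)], [A], lambda ins, outs: copy_ir(ins, outs),
              dtype="float32", name="C")
print(type(C.op))  # <class 'tvm.te.tensor.ExternOp'>
```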

Welcome for comments, @merrymercy @tqchen

Thank you!

I guess the reason they need a wrapper is that they are written in TIR instead of TE? Since AutoScheduler can only tune TE computes, it cannot tune such ops anyway. On the other hand, the AutoTIR that @junrushao is working on supports tuning at all levels, so you may look forward to it.
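For reference, this is roughly what AutoScheduler can consume today: a workload defined entirely with te.compute, registered and tuned through the auto_scheduler API. The workload below is a made-up example and the exact API details may vary with your TVM version; an op wrapped in te.extern never reaches this path.

```python
import tvm
from tvm import te, auto_scheduler

# A made-up TE-only workload; this is the kind of compute DAG AutoScheduler
# can take as a search task.
@auto_scheduler.register_workload
def vector_add(n):
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(func=vector_add, args=(1024,), target=target)

log_file = "vector_add.json"
task.tune(auto_scheduler.TuningOptions(
    num_measure_trials=10,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
))
sch, args = task.apply_best(log_file)
```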

Meanwhile, I’m not sure how much improvement you would get by tuning such ops (e.g., scatter). Their computes are not that complicated, so the tuning space is expected to be small.

Thank you @comaniac, really appreciate it!

The reasoning “…because they are written in TIR instead of TE” makes sense to me. And I agree that in the case of “scatter”, the improvement would be small. I guess Relay’s default schedule is probably good enough for my case.

@comaniac By the way, do you know the specific reason why ops like “scatter” are implemented in TIR instead of TE?

According to my quick count, there are at least 13 ops in Relay that use TIR for their implementations:

  1. argwhere
  2. non_max_suppression
  3. scanop
  4. scatter
  5. scatter_nd
  6. scatter_add
  7. sort
  8. topk
  9. sparse_reshape
  10. sparse
  11. unique
  12. proposal
  13. multibox

It’s possible that those ops need finer control over buffers, which is hard to achieve in TE, but I didn’t write those ops so I don’t know the exact reason.

Thank you, @comaniac .

@jroesch @mbrookhart @ritwikdas54 I noticed you’ve participated in implementing the ops above (found via git blame :stuck_out_tongue_winking_eye:). Could you explain a bit about why TIR was used instead of TE?

TE is a limited declarative programming model; it’s not possible to write operations that do data-dependent indexing in TE.

Anything that’s sort/scatter-related needs to be written directly in the more imperative TIR.
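As a rough illustration (a hypothetical 1-D scatter, not the actual topi implementation), this is the kind of data-dependent store that a te.compute lambda cannot express, because its output index must be a pure function of the loop variables, but which ir_builder-based TIR emits directly:

```python
import tvm
from tvm import te

# Hypothetical 1-D scatter: out[indices[j]] = updates[j].
def scatter_1d(data, indices, updates):
    def gen_ir(ins, outs):
        ib = tvm.tir.ir_builder.create()
        d = ib.buffer_ptr(ins[0])
        idx = ib.buffer_ptr(ins[1])
        upd = ib.buffer_ptr(ins[2])
        out = ib.buffer_ptr(outs[0])
        with ib.for_range(0, data.shape[0], name="i") as i:
            out[i] = d[i]              # start from a copy of the input
        with ib.for_range(0, indices.shape[0], name="j") as j:
            out[idx[j]] = upd[j]       # data-dependent indexing
        return ib.get()

    return te.extern(
        [data.shape], [data, indices, updates],
        lambda ins, outs: gen_ir(ins, outs),
        dtype=data.dtype, name="scatter_1d_ir",
    )

A = te.placeholder((16,), name="A")
I = te.placeholder((4,), dtype="int32", name="I")
U = te.placeholder((4,), name="U")
out = scatter_1d(A, I, U)
```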

I spent a lot of time optimizing the sort/argsort kernels for GPUs; we get pretty good performance on GPUs from multiple vendors, competitive with those vendors’ hand-tuned libraries.

If these TIR kernels are well optimized, they shouldn’t end up being the bottleneck in models.

@mbrookhart Make sense, thank you!

I am working with auto-scheduler and te.extern, but compilation failed. Is there any way to use auto-scheduling with a te.extern op?

It doesn’t seem possible to me, since an ExternOp doesn’t even have a schedule; it simply calls an “extern” implementation such as cuBLAS.