[ANSOR][CUTLASS] Combination of ANSOR and CUTLASS contrib

Hi, I want to use the CUTLASS contrib in my model to take advantage of Tensor Cores in fp16, while keeping the op fusion and automatic tuning of Ansor, which performs well on my model. There seems to be no example that combines Ansor with the CUTLASS contrib, so I did some experimenting.

Here is my pipeline:

  • Call relay.op.contrib.cutlass.partition_for_cutlass to match and replace the functions that CUTLASS supports.
  • Call auto_scheduler.relay_integration.extract_tasks to extract the Ansor tasks.
  • Tune the Ansor tasks.
  • Call relay.build under auto_scheduler.ApplyHistoryBest to compile the model with the Ansor tuning logs.
  • Call finalize_modules to get the updated library.
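
For readers who want to try the steps above, here is a minimal sketch of how they might fit together. It is not the code from my experiments. The function name build_cutlass_ansor and its parameters and defaults are made up for illustration, the "cutlass" target kind with an "sm" attribute assumes a recent TVM release, and the stock single-target extract_tasks is used, so the multi-target modification is not reproduced here.

```python
def build_cutlass_ansor(mod, params, log_file, sm=80, num_trials=3000,
                        lib_path="compile.so", tmp_dir="./tmp"):
    """Hypothetical helper: partition for CUTLASS, tune the rest with Ansor, build."""
    # Imported inside the function so the file still loads on a machine without TVM.
    import tvm
    from tvm import auto_scheduler, relay
    from tvm.contrib.cutlass import finalize_modules
    from tvm.relay.op.contrib.cutlass import partition_for_cutlass

    host = tvm.target.Target("llvm")
    cuda = tvm.target.Target("cuda", host=host)
    # Assumption: recent TVM, where CUTLASS is configured through its own target kind.
    cutlass = tvm.target.Target({"kind": "cutlass", "sm": sm}, host=host)

    # Step 1: offload the patterns CUTLASS supports (dense, conv2d, ...).
    mod = partition_for_cutlass(mod)

    # Step 2: extract Ansor tasks. Stock extract_tasks takes a single target;
    # whether that is enough for a partitioned module depends on the TVM version.
    tasks, task_weights = auto_scheduler.extract_tasks(mod, params, cuda)

    # Step 3: tune all tasks and record the results in log_file.
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(min_repeat_ms=300)
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tuner.tune(
        auto_scheduler.TuningOptions(
            num_measure_trials=num_trials,
            runner=measure_ctx.runner,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
    )

    # Step 4: build with the best Ansor schedules for the non-CUTLASS parts.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=[cuda, cutlass], params=params)

    # Step 5: compile the generated CUTLASS kernels and return the updated library.
    return finalize_modules(lib, lib_path, tmp_dir)
```

The result would then be loaded like any other Relay build, for example through tvm.contrib.graph_executor.GraphModule(lib["default"](tvm.cuda())).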

During this process I ran into a problem: the extract_tasks function accepts only one target, but all targets, including cuda and cutlass, have to be passed in so that call_all_topi_funcs can run. So I made a small modification to the extract_tasks function.

I have verified the performance benefit of this pipeline on our business model: it uses CUTLASS to reach the Tensor Cores while retaining the advantages Ansor has on complex models. I will also try to verify it on some open source models.

Do you think we could organize this pipeline into a test script and submit it to the community? Looking forward to your reply. Thanks!

Thanks for your proposal.

Of course. The community welcomes features that benefit the community.

BTW, here are some of my questions:

  1. Is there any perf comparison between CUTLASS+Ansor and the current CUTLASS contrib build flow (which I suppose is CUTLASS+AutoTVM)?

  2. Have you tried Meta-Schedule? It can tune tensorized programs, i.e. Tensor Core kernels, and is faster than or on par with CUTLASS on most workloads.

Thanks for your reply.

We will do some tests on some open source models to compare the performance of several solutions, which may include CUTLASS+Ansor, CUTLASS+TOPI, Meta-Schedule and TensorRT.

We tested the performance of the BERT model on an A10 GPU with several solutions.

End-to-end latency (ms) on A10

  • CUDA 11.8
  • FP16
  • Input shape: (8, 128)

Ansor (n=3000) | CUTLASS+TOPI | CUTLASS+Ansor (n=3000) | Meta-Schedule (n=3000)
55.8870        | 20.2297      | 17.2543                | 19.2774
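
To make the gaps easier to read, the latencies in the table above can be turned into speedup ratios. This is a small arithmetic sketch; the dictionary keys and the helper name speedup_over are only for illustration, and the numbers are copied from the table.

```python
# End-to-end BERT latencies in ms on A10 (FP16, input shape (8, 128)).
latency_ms = {
    "Ansor": 55.8870,
    "CUTLASS+TOPI": 20.2297,
    "CUTLASS+Ansor": 17.2543,
    "Meta-Schedule": 19.2774,
}


def speedup_over(baseline, candidate="CUTLASS+Ansor"):
    """How many times faster `candidate` is than `baseline` (ratio of latencies)."""
    return latency_ms[baseline] / latency_ms[candidate]


speedups = {
    name: round(speedup_over(name), 2)
    for name in latency_ms
    if name != "CUTLASS+Ansor"
}
for name, ratio in speedups.items():
    print(f"CUTLASS+Ansor vs {name}: {ratio:.2f}x")
# CUTLASS+Ansor vs Ansor: 3.24x
# CUTLASS+Ansor vs CUTLASS+TOPI: 1.17x
# CUTLASS+Ansor vs Meta-Schedule: 1.12x
```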

Conclusion

Combining the above test results with our experience on business models, we draw the following conclusions.

  1. CUTLASS+Ansor can improve on the CUTLASS-only scheme for some models, especially models with a scattered structure. Ansor also fuses the element-wise operations between GEMMs well, and these are difficult for CUTLASS to optimize.

  2. On the BERT model, Meta-Schedule outperforms the plain CUTLASS flow (CUTLASS+TOPI), and we will try Meta-Schedule on our business models in the future.

Test examples: tvm-cutlass-eval/bert at master · qingchanghan/tvm-cutlass-eval · GitHub (Forked from @masahi, thanks!)

@qingchanghan

I want to test this feature. Can you share your modification to TVM, or have you submitted a PR to the TVM GitHub repository?