Hi, I want to use the CUTLASS contrib backend in my model to take advantage of Tensor Cores in FP16, while keeping the operator fusion and auto-tuning of Ansor, which already performs well on my model. I could not find any example that combines Ansor with the CUTLASS contrib backend, so I did some experimenting.
Here is my pipeline:
1. Call relay.op.contrib.cutlass.partition_for_cutlass to match and offload the patterns supported by CUTLASS.
2. Call auto_scheduler.relay_integration.extract_tasks to extract the Ansor tasks.
3. Tune the Ansor tasks.
4. Call relay.build under auto_scheduler.ApplyHistoryBest to compile the model with the Ansor tuning logs.
5. Call finalize_modules to get the updated library.
During this process I ran into one problem: extract_tasks accepts only a single target, but all targets, both cuda and cutlass, have to be passed through to call_all_topi_funcs. I therefore made a small modification to extract_tasks.
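The steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not the exact script used for the results in this thread: it assumes a TVM build with CUDA and CUTLASS enabled and a version that provides the "cutlass" target kind and finalize_modules; the function name build_with_cutlass_and_ansor, the sm value, and the output paths are placeholders; and it assumes that some TVM versions accept an other_targets argument in extract_tasks, which the sketch checks for at runtime instead of taking for granted. Without that argument, a local patch like the one described above is still needed.

```python
import inspect


def build_with_cutlass_and_ansor(mod, params, log_file, sm=80, n_trials=3000,
                                 lib_path="compile.so", tmp_dir="./tmp"):
    """Partition for CUTLASS, tune the remaining ops with Ansor, then build.

    Hypothetical helper; `sm`, `lib_path` and `tmp_dir` are placeholder values.
    """
    # Imported lazily so this file can be loaded without a TVM installation.
    import tvm
    from tvm import auto_scheduler, relay
    from tvm.contrib.cutlass import finalize_modules
    from tvm.relay.op.contrib.cutlass import partition_for_cutlass

    host = tvm.target.Target("llvm")
    cuda = tvm.target.Target("cuda", host=host)
    cutlass = tvm.target.Target({"kind": "cutlass", "sm": sm}, host=host)

    # 1. Offload the patterns that CUTLASS supports.
    mod = partition_for_cutlass(mod, params)

    # 2. Extract Ansor tasks for everything left on the CUDA target.
    #    Assumption: `other_targets` only exists in some TVM versions, so it
    #    is passed only when the installed extract_tasks accepts it.
    extra = {}
    if "other_targets" in inspect.signature(auto_scheduler.extract_tasks).parameters:
        extra["other_targets"] = [cutlass]
    tasks, weights = auto_scheduler.extract_tasks(mod, params, cuda, **extra)

    # 3. Tune the extracted tasks and record the results to `log_file`.
    tuner = auto_scheduler.TaskScheduler(tasks, weights)
    tuner.tune(auto_scheduler.TuningOptions(
        num_measure_trials=n_trials,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    ))

    # 4. Compile with the best schedules found in the Ansor log.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=[cuda, cutlass], params=params)

    # 5. Compile the generated CUTLASS kernels and return the updated library.
    return finalize_modules(lib, lib_path, tmp_dir)
```

The whole partitioned module is passed to extract_tasks rather than only mod["main"], so that the functions created by the partitioning step stay visible during task extraction.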
I have verified the performance benefit of this pipeline on our business model: it uses CUTLASS to run on Tensor Cores while retaining the advantages of Ansor on complex models. I will also try to verify it on some open-source models.
Do you think we could turn this pipeline into a test script and contribute it to the community?
Looking forward to your reply. Thanks!
We will run tests on some open-source models to compare the performance of several solutions, which may include CUTLASS+Ansor, CUTLASS+TOPI, Meta-Schedule and TensorRT.
We tested the performance of the BERT model on an A10 GPU with several solutions.
End-to-end latency (ms) on A10, CUDA 11.8, FP16, input shape (8, 128):

| Ansor (n=3000) | CUTLASS+TOPI | CUTLASS+Ansor (n=3000) | Meta-Schedule (n=3000) |
| --- | --- | --- | --- |
| 55.8870 | 20.2297 | 17.2543 | 19.2774 |
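To make the comparison easier to read, the latencies reported above can be expressed relative to the fastest configuration. The snippet below is just a small arithmetic sketch over the four posted numbers; the helper name slowdown_vs_best is made up for illustration.

```python
# End-to-end latencies (ms) as reported above for BERT on A10, FP16.
latency_ms = {
    "Ansor (n=3000)": 55.8870,
    "CUTLASS+TOPI": 20.2297,
    "CUTLASS+Ansor (n=3000)": 17.2543,
    "Meta-Schedule (n=3000)": 19.2774,
}


def slowdown_vs_best(latencies):
    """Return each latency divided by the smallest latency in the dict."""
    best = min(latencies.values())
    return {name: ms / best for name, ms in latencies.items()}


ratios = slowdown_vs_best(latency_ms)

# Print the configurations from fastest to slowest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} {latency_ms[name]:8.4f} ms  {ratio:.2f}x")
```

Relative to CUTLASS+Ansor, Meta-Schedule comes out at about 1.12x the latency, CUTLASS+TOPI at about 1.17x, and Ansor alone at about 3.24x.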
Conclusion
Combining the test results above with our experience on business models, we draw the following conclusions.
CUTLASS+Ansor can improve on the plain CUTLASS scheme (CUTLASS+TOPI) for some models, especially models with a scattered structure. Ansor also does a good job of fusing the element-wise operations between GEMMs, which are difficult for CUTLASS to optimize.
On the BERT model, Meta-Schedule outperforms CUTLASS+TOPI (19.2774 ms vs. 20.2297 ms), so we will try Meta-Schedule on our business models in the future.