[ANSOR][CUTLASS] Combination of ANSOR and CUTLASS contrib

qingchanghan · December 30, 2022, 8:08am

Hi, I want to use the CUTLASS contrib in my model to take advantage of Tensor Cores in FP16, while keeping the op fusion and automatic tuning of Ansor, which performs well on my model. However, there seems to be no example that combines Ansor with the CUTLASS contrib, so I did some experimenting.

Here is my pipeline:

1. Call relay.op.contrib.cutlass.partition_for_cutlass to match and replace the functions supported by CUTLASS.
2. Call auto_scheduler.relay_integration.extract_tasks to extract the Ansor tasks.
3. Tune the Ansor tasks.
4. Call relay.build under auto_scheduler.ApplyHistoryBest to compile the model with the Ansor logs.
5. Call finalize_modules to get the updated library.
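
The five steps above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual script: the function names build_with_cutlass_and_ansor and cutlass_target_attrs, the parameters sm, log_file and num_trials, and the "compile.so" output name are illustrative assumptions. It also assumes a TVM build with CUDA and CUTLASS enabled. Step 2 passes only the cuda target, which is exactly the point where the post reports that extract_tasks needed a modification, so check the signature of extract_tasks in your TVM version before relying on it.

```python
def cutlass_target_attrs(sm, tmp_dir="./tmp"):
    """Attributes for a "cutlass" target; sm and tmp_dir are caller-supplied (hypothetical helper)."""
    return {"kind": "cutlass", "sm": sm, "tmp_dir": tmp_dir}


def build_with_cutlass_and_ansor(mod, params, sm, log_file, num_trials, tmp_dir="./tmp"):
    """Sketch of the pipeline: partition, extract tasks, tune, build, finalize."""
    # Imported lazily so this file can be loaded on a machine without TVM.
    import tvm
    from tvm import auto_scheduler, relay
    from tvm.contrib.cutlass import finalize_modules
    from tvm.relay.op.contrib.cutlass import partition_for_cutlass

    host = tvm.target.Target("llvm")
    cuda = tvm.target.Target("cuda", host=host)
    cutlass = tvm.target.Target(cutlass_target_attrs(sm, tmp_dir), host=host)

    # Step 1: offload the patterns CUTLASS supports into external functions.
    mod = partition_for_cutlass(mod)

    # Step 2: extract Ansor tasks for the remaining operators.
    # Assumption: only the cuda target is passed here. The post reports that
    # extract_tasks had to be modified to receive the cutlass target as well.
    tasks, task_weights = auto_scheduler.extract_tasks(mod, params, cuda)

    # Step 3: tune the extracted tasks and record the results in log_file.
    scheduler = auto_scheduler.TaskScheduler(tasks, task_weights)
    scheduler.tune(
        auto_scheduler.TuningOptions(
            num_measure_trials=num_trials,
            measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        )
    )

    # Step 4: compile with the best schedules found during tuning.
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=[cuda, cutlass], params=params)

    # Step 5: compile the generated CUTLASS kernels and return the updated library.
    return finalize_modules(lib, "compile.so", tmp_dir)
```

The returned library can then be loaded with the graph executor in the usual way. Keeping the target attributes in a small helper makes it easy to swap the compute capability without touching the rest of the pipeline.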

During this process I ran into a problem: extract_tasks accepts only one target, but all targets, including cuda and cutlass, need to be passed in so that call_all_topi_funcs can be called. So I made a small modification to extract_tasks.

I have verified the performance benefit of this pipeline on our business model: it uses CUTLASS to drive Tensor Cores while retaining Ansor's advantages on complex models. I will also try to verify it on some open-source models.

Do you think we could organize this pipeline into a test script and submit it to the community? Looking forward to your reply. Thanks!

Hzfengsy · December 30, 2022, 9:40am

Thanks for your proposal.

Of course. The community welcomes features that benefit the community.

BTW, here are some of my questions:

1. Is there any perf comparison between CUTLASS+Ansor and the current CUTLASS contrib build flow (which is supposed to be CUTLASS+AutoTVM)?
2. Have you tried Meta-Schedule? It can tune tensorized programs, i.e. TensorCore kernels, and is faster than or on par with CUTLASS in most workloads.

qingchanghan · December 30, 2022, 10:10am

Thanks for your reply.

We will run tests on some open-source models to compare the performance of several solutions, which may include CUTLASS+Ansor, CUTLASS+TOPI, Meta-Schedule and TensorRT.

qingchanghan · January 31, 2023, 3:30am

We tested the performance of several solutions on the BERT model on an A10 GPU.

End-to-end latency (ms) on A10

CUDA 11.8
FP16
Input shape: (8, 128)

| Ansor (n=3000) | CUTLASS+TOPI | CUTLASS+Ansor (n=3000) | Meta-Schedule (n=3000) |
| --- | --- | --- | --- |
| 55.8870 | 20.2297 | 17.2543 | 19.2774 |
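
To make the gaps in the table above easier to read, the latencies can be turned into ratios against CUTLASS+Ansor. The small script below only does arithmetic on the four posted numbers; the helper name speedup_over is an illustrative choice. On these figures CUTLASS+Ansor comes out roughly 3.2x faster than Ansor alone, about 1.17x faster than CUTLASS+TOPI, and about 1.12x faster than Meta-Schedule.

```python
# End-to-end BERT latencies in milliseconds, copied from the table above.
latency_ms = {
    "Ansor": 55.8870,
    "CUTLASS+TOPI": 20.2297,
    "CUTLASS+Ansor": 17.2543,
    "Meta-Schedule": 19.2774,
}


def speedup_over(baseline, candidate="CUTLASS+Ansor"):
    """Return latency(baseline) / latency(candidate); values above 1 mean candidate is faster."""
    return latency_ms[baseline] / latency_ms[candidate]


for name, ms in latency_ms.items():
    print(f"{name:>14}: {ms:8.4f} ms, ratio to CUTLASS+Ansor = {speedup_over(name):.2f}x")
```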

Conclusion

Combining the test results above with our experience on business models, we draw the following conclusions.

1. CUTLASS+Ansor can improve on the performance of the CUTLASS scheme for some models, especially models with a scattered structure. Ansor fuses the element-wise operations between GEMMs well, and these are difficult for CUTLASS to optimize.
2. Meta-Schedule outperforms CUTLASS+TOPI on the BERT model, and we will try Meta-Schedule on our business models in the future.

Test examples: tvm-cutlass-eval/bert at master · qingchanghan/tvm-cutlass-eval · GitHub (Forked from @masahi, thanks!)

twmht · October 23, 2023, 3:57am

@qingchanghan

I want to test this feature. Can you share your modification to TVM, or did you submit a PR to the TVM GitHub repository?