If we want to take advantage of cutlass's epilogue fusion codegen, we have to make it a BYOC backend.
Thanks a lot for your question! I believe BYOC supports annotating different ops with different codegens (cutlass/cublas/cudnn) instead of using the same one globally (@zhiics please correct me if I'm wrong). Regarding performance, since the libraries you mentioned are all actively maintained by NVIDIA, I don't expect single-kernel performance to vary much between them. The aim of bringing in cutlass codegen is to enable more graph-level optimization, in this case more fusion, which TVM is good at.
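For concreteness, here is a minimal sketch of what per-op, multi-target BYOC annotation looks like in Relay. It assumes annotation rules (or pattern tables) have already been registered under the names "cutlass", "cublas", and "cudnn"; the pass then assigns each supported op to one of the listed targets (earlier names generally taking priority), and the rest of the graph stays with the default backend:

```python
import tvm
from tvm import relay
from tvm.relay import transform

# Toy graph: a dense op we would like an external codegen to claim.
x = relay.var("x", shape=(16, 64), dtype="float16")
w = relay.var("w", shape=(32, 64), dtype="float16")
mod = tvm.IRModule.from_expr(relay.nn.dense(x, w))

# Assumes "target.cutlass" / "target.cublas" / "target.cudnn" annotation rules
# (or registered pattern tables) exist for the ops we want to offload.
seq = transform.Sequential([
    transform.AnnotateTarget(["cutlass", "cublas", "cudnn"]),  # per-op target assignment
    transform.MergeCompilerRegions(),                          # group adjacent ops per target
    transform.PartitionGraph(),                                # split regions into external functions
])
partitioned_mod = seq(mod)
```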
Glad to see the RFC! TVM's performance on large GEMMs has troubled me for a long time. Looking forward to further benchmarks of cutlass+fusion against cublas without fusion.
One potential issue: autotvm currently selects the best implementation between the autotuned GEMM and the cublas GEMM based on performance, and then does the fusion. If cutlass is integrated, we would need subgraph-level autotuning and then pick the best candidate at that level.
Thanks for the RFC and it looks exciting.
Like others already mentioned, integrating Cutlass via BYOC works at the graph level instead of the tensor level. As a result, Relay op strategy, AutoTVM, and Ansor won't be able to consider this implementation alongside others such as CUDA or CuBLAS. However, as @Laurawly pointed out, introducing Cutlass is mainly for graph-level optimization (i.e., fusion), and this cannot be done at the TE level at the moment, so the motivation is similar to the TensorRT integration.
Given the above summary, it seems to me that we could have several stages of bringing Cutlass in general:
- Integrate Cutlass via BYOC for now. Users have to decide whether they want to offload as many ops as possible to Cutlass. As a result, the flow becomes:
Relay: For the whole graph, choose between Cutlass and others.
TE/TIR: For the rest of the graph (a.k.a. "others"), choose between CUDA, CuBLAS, CuDNN, etc.
- When TensorIR lands, we should be able to leverage its tensorization rules to make use of Cutlass microkernels in codegen. This way we could have more opportunities to boost performance, because Cutlass becomes tensorized intrinsics in the generated kernels. cc @vinx13
- Once (2) is available, the corresponding auto-scheduler, AutoTIR, should be able to tune the schedule with Cutlass kernels. cc @junrushao
- Stages (2) and (3) enable more interesting research opportunities in TVM fusion strategies. As of now, we only fuse simple patterns (e.g., a reduction op with following injective ops). We can explore how Cutlass could make more fusion patterns useful.
Since stages 2-4 need more time to land, making stage 1 available soon sounds like a pretty good idea to me.
Thanks for your great summary! Just one thing to point out: cutlass is not composed of microkernels, but is instead a collection of CUDA C++ template abstractions. I also look forward to the outcome of bringing the core insight of cutlass to TIR.
As requested, I just tested square GEMMs as well as BERT workloads with batch size 64 and sequence length 128 on an RTX 3090, comparing cublas and cutlass. Here are the results (note that cutlass's output is in fp16 because, by default, it generates the same data type as the input):
GEMM TFLOPS on RTX3090
cutlass: input (fp16, fp16), accum (fp32), output (fp16)
cublas: input (fp16, fp16), accum (fp32), output (fp32)
| M, N, K | cublas (11.2) | cutlass (2.4) |
|---|---|---|
| 512, 512, 512 | 27.812 | 25.678 |
| 1024, 1024, 1024 | 43.845 | 47.353 |
| 2048, 2048, 2048 | 55.144 | 71.100 |
| 8192, 768, 3072 | 66.775 | 73.808 |
| 8192, 768, 768 | 53.214 | 70.544 |
| 8192, 2304, 768 | 63.425 | 73.674 |
Please find my benchmark script here: bench_cublas.py · GitHub
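For reference, here is a minimal sketch of how such GEMM TFLOPS numbers are typically derived. This is not the linked script; it simply times `cupy.matmul` (which dispatches to cuBLAS) with CUDA events and counts 2·M·N·K floating-point operations per GEMM:

```python
import cupy as cp

def gemm_tflops(m, n, k, dtype=cp.float16, iters=100):
    a = cp.random.rand(m, k).astype(dtype)
    b = cp.random.rand(k, n).astype(dtype)
    cp.matmul(a, b)  # warm-up
    start, end = cp.cuda.Event(), cp.cuda.Event()
    start.record()
    for _ in range(iters):
        cp.matmul(a, b)
    end.record()
    end.synchronize()
    ms = cp.cuda.get_elapsed_time(start, end) / iters   # average time per GEMM in ms
    return 2 * m * n * k / (ms * 1e-3) / 1e12            # flops / seconds -> TFLOPS

for m, n, k in [(512, 512, 512), (2048, 2048, 2048), (8192, 768, 3072)]:
    print(f"{m},{n},{k}: {gemm_tflops(m, n, k):.2f} TFLOPS")
```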
I am from CUTLASS. Happy to answer any questions here or in cutlass github. This is exciting!!!
This is exciting stuff!! CUTLASS team will be happy to answer any questions here.
The output datatypes don't match, so this is not a fair comparison. I think you might have only looked into the CUTLASS device-level unit tests. You should use the CUTLASS generator to procedurally generate GEMMs or convolutions for the datatypes you are interested in.
A few bullets to help you run this analysis better:
- Use the CUTLASS generator and profiler to generate and profile CUTLASS GEMM kernels of interest:
  - `cmake ../cutlass -DCUTLASS_NVCC_ARCHS='86' -DCUTLASS_LIBRARY_KERNELS="cutlass_tensorop_f16_s16816gemm_f16*align8"`
  - `make cutlass_profiler -j12`
  - `./tools/profiler/cutlass_profiler --help`
    e.g.: `./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_f16_s16816gemm_f16*align8 --m=M --n=N --k=K`
Thanks a lot for the guide, that’s really helpful. I’ll update the benchmark with the same output data type.
Thanks for the instructions! Our team (mainly @vinx13) is super excited about CUTLASS and is currently investigating potential ways to tensorize with CUTLASS in TensorIR. Would love to discuss more details in the future.
I would like to highlight that the Ampere kernels inside the CUTLASS profiler are currently tuned for A100 (sm80), not the 30xx series (sm86). I recommend using an A100 for benchmarking if possible. The 3090 is designed for running games, not for running tensor cores.
If you have to use sm86, you need to adjust the stage number to get the best performance. SM80 has 160KB of shared memory, but SM86 has only 100KB. Some stage numbers used by SM80 either cannot run on SM86 or have to run at lower occupancy. For example, the important tile size 128x128x32 is configured to use 5 stages. It uses (128x32 + 128x32) x 2B x 5 = 80KB of shared memory, which means 1 (= floor(100/80)) threadblock per SM. If you lower the stage number to 3, the shared memory footprint is (128x32 + 128x32) x 2B x 3 = 48KB, which means 2 (= floor(100/48)) threadblocks per SM.
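A minimal sketch of that occupancy arithmetic, assuming fp16 operands (2 bytes per element), one A tile and one B tile staged per pipeline stage, and the 160 KB (sm80) / 100 KB (sm86) per-SM shared memory figures quoted above:

```python
def smem_per_block_kb(tile_m, tile_n, tile_k, stages, elem_bytes=2):
    # each pipeline stage holds one MxK A tile and one KxN B tile in shared memory
    per_stage_bytes = (tile_m * tile_k + tile_n * tile_k) * elem_bytes
    return per_stage_bytes * stages / 1024

def blocks_per_sm(tile_m, tile_n, tile_k, stages, smem_per_sm_kb):
    return int(smem_per_sm_kb // smem_per_block_kb(tile_m, tile_n, tile_k, stages))

# 128x128x32 tile on sm86 (100 KB/SM): 5 stages -> 80 KB, 1 block/SM; 3 stages -> 48 KB, 2 blocks/SM
for stages in (5, 3):
    kb = smem_per_block_kb(128, 128, 32, stages)
    print(f"stages={stages}: {kb:.0f} KB/block, {blocks_per_sm(128, 128, 32, stages, 100)} block(s)/SM")
```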
The configurations of the Ampere 16816 float tensor core kernels are here: cutlass/generator.py at master · NVIDIA/cutlass · GitHub. Here is an example of how to read them:
TileDescription([256, 128, 64], 3, [4, 2, 1], math_inst, min_cc, max_cc_smem_limited),
[256, 128, 64] - tile size, MxNxK
3 - stage number, minimum stage number is 3 for all Ampere kernels.
[4, 2, 1] - 4 warps in the M dimension, 2 warps in the N dimension, 1 warp in the K dimension; 8 warps in total, i.e., 256 threads
If you find a sweet spot for the SM86 stage numbers, feel free to upstream it to the CUTLASS github. We haven't done it ourselves.
Lastly, just a reminder that the numbers measured today will be outdated by the time your integration is done, because of the newer CUDA compiler and newer CUTLASS code available then.
CUDA 11.3 significantly improves the performance of Ampere/Turing/Volta Tensor Core kernels.
298 TFLOPS was recorded when benchmarking CUTLASS FP16 GEMM on A100. This is 14% higher than CUDA 11.2. FP32 (via TF32) GEMM is improved by 39% and can reach 143 TFLOPS. The same speedup applies to the CONV kernels.
See the discussion in CUDA 11.3 significantly improved the performance of CUTLASS · Discussion #241 · NVIDIA/cutlass · GitHub
It may be time for you to do the benchmarking again.
That's great news! Thanks @hwu36 for sharing this. We can't wait to benchmark cutlass with CUDA 11.3 on T4 and V100 using tensor cores, as well as TF32 on A100.
Hi @Laurawly, do you have an update on this? I'm very interested in this feature and happy to help with the upstreaming effort.
Hi @masahi, glad that you could help. It's already used in production; however, I'm currently focusing on landing it. When I have more bandwidth, I will send a PR, but the code will probably be as-is. If you could help get it ready to be merged upstream, that would be great.
@Laurawly Great!! Very cool to hear that it is being used in production. Yes, you can send your code as it is, and I can do all the necessary integration and cleanup work for upstreaming.
Is there any documentation on using CUTLASS in TVM?
No, but you can check out usage examples in https://github.com/apache/tvm/blob/e7f36487dfdb6c4b7b544be155d3869002d7281b/tests/python/contrib/test_cutlass.py
I also have some E2E examples at https://github.com/masahi/tvm-cutlass-eval
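For anyone looking for a starting point before proper docs exist, here is a rough sketch of the flow in that test. Treat the import path and helper name (`partition_for_cutlass`) as assumptions that may differ between TVM versions; check test_cutlass.py at the commit above for the exact API:

```python
import numpy as np
import tvm
from tvm import relay
# assumed import path; verify against test_cutlass.py for your TVM version
from tvm.relay.op.contrib.cutlass import partition_for_cutlass

# A toy fp16 dense workload, similar in spirit to the cases in test_cutlass.py.
data = relay.var("data", shape=(16, 64), dtype="float16")
weight = relay.var("weight", shape=(32, 64), dtype="float16")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight, out_dtype="float16"))
params = {"weight": np.random.uniform(-1, 1, (32, 64)).astype("float16")}

mod = partition_for_cutlass(mod)                      # offload supported patterns to the CUTLASS backend
lib = relay.build(mod, target="cuda", params=params)  # the rest of the graph goes through the CUDA backend
# test_cutlass.py then profiles the candidate CUTLASS kernels, compiles the
# generated sources with nvcc, and loads/runs the module on the GPU.
```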
Hi, I have another question: how did you add the compute and schedule strategies for operators that use CUTLASS codegen, given that CUTLASS is not included in TOPI? For reference, existing strategy registrations look like this:
@override_native_generic_func("cumprod_strategy")
def cumprod_strategy(attrs, inputs, out_type, target):
    """cumprod generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumprod),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumprod.generic",
    )
    return strategy


@cumsum_strategy.register(["cuda", "gpu"])
def cumsum_strategy_cuda(attrs, inputs, out_type, target):
    """cumsum cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumsum),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumsum.cuda",
    )
    return strategy
Can we use cutlass + Tensor Cores for int8 inference now, or is only FP16 supported? Thanks.