[RFC][BYOC] NVIDIA CUTLASS Integration

Thanks for the instructions! Our team (mainly @vinx13) is super excited about CUTLASS and is currently investigating potential ways to tensorize with CUTLASS in TensorIR. Would love to discuss more details in the future :slight_smile:


I would like to highlight that the Ampere kernels inside the CUTLASS profiler are currently tuned for A100 (sm80), not the 30xx series (sm86). I recommend using an A100 for benchmarking if possible. The 3090 is designed for running games, not for running tensor cores.

If you have to use sm86, you need to adjust the stage number to get the best performance. SM80 has 160KB of shared memory, but SM86 has only 100KB. Some stage numbers used by SM80 are not possible on SM86, or have to run at lower occupancy. For example, the important tile size 128x128x32 is configured to use 5 stages. It uses (128x32 + 128x32) x 2B x 5 = 80KB of shared memory, which means 1 (= floor(100/80)) threadblock per SM. Lowering the stage number to 3 brings the shared memory footprint down to (128x32 + 128x32) x 2B x 3 = 48KB, which allows 2 (= floor(100/48)) threadblocks per SM.
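For illustration, here is a small Python sketch of the shared-memory arithmetic above (my own sketch, not CUTLASS code; it assumes FP16 operands, i.e. 2 bytes per element, and the per-SM shared-memory sizes quoted in this post):

def smem_per_stage_bytes(tile_m, tile_n, tile_k, elem_bytes=2):
    # One pipeline stage buffers an MxK tile of A and a KxN tile of B.
    return (tile_m * tile_k + tile_k * tile_n) * elem_bytes

def threadblocks_per_sm(tile_m, tile_n, tile_k, stages, smem_per_sm_kb):
    smem_kb = smem_per_stage_bytes(tile_m, tile_n, tile_k) * stages / 1024
    return int(smem_per_sm_kb // smem_kb), smem_kb

# 128x128x32 with 5 stages -> 80 KB -> 1 threadblock/SM on sm86 (100 KB);
# dropping to 3 stages -> 48 KB -> 2 threadblocks/SM.
print(threadblocks_per_sm(128, 128, 32, 5, 100))  # (1, 80.0)
print(threadblocks_per_sm(128, 128, 32, 3, 100))  # (2, 48.0)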

The configurations of the Ampere 16816 float tensor core kernels are listed here: cutlass/generator.py at master · NVIDIA/cutlass · GitHub. Here is an example of how to read them:

TileDescription([256, 128, 64], 3, [4, 2, 1], math_inst, min_cc, max_cc_smem_limited),

[256, 128, 64] - threadblock tile size, MxNxK

3 - stage number; the minimum stage number is 3 for all Ampere kernels

[4, 2, 1] - 4 warps in the M dimension, 2 warps in the N dimension, 1 warp in the K dimension; 8 warps in total, i.e. 256 threads
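To make the decoding concrete, here is a tiny sketch (the function and field names are mine, not actual CUTLASS classes) that expands such an entry into the warp tile and thread count:

def decode_tile_description(threadblock_shape, stages, warp_count):
    m, n, k = threadblock_shape
    wm, wn, wk = warp_count
    warps = wm * wn * wk
    return {
        "threadblock tile MxNxK": (m, n, k),
        "stages": stages,
        # Each warp covers threadblock_shape / warp_count of the tile.
        "warp tile MxNxK": (m // wm, n // wn, k // wk),
        "warps": warps,
        "threads": warps * 32,  # 32 threads per warp
    }

print(decode_tile_description([256, 128, 64], 3, [4, 2, 1]))
# -> warp tile (64, 64, 64), 8 warps, 256 threads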

If you find a sweet spot for the sm86 stage numbers, feel free to upstream it to the CUTLASS GitHub repo. We haven't done it ourselves.

Lastly, just a reminder that any numbers measured today will be stale by the time your integration is done, because of the newer CUDA compiler and newer CUTLASS code available at that point.


CUDA 11.3 significantly improves the performance of Ampere/Turing/Volta Tensor Core kernels.

298 TFLOPS was recorded when benchmarking CUTLASS FP16 GEMM on A100, which is 14% higher than with CUDA 11.2. FP32 (via TF32) GEMM improved by 39% and can reach 143 TFLOPS. The same speedups apply to the CONV kernels.

See the discussion in CUDA 11.3 significantly improved the performance of CUTLASS · Discussion #241 · NVIDIA/cutlass · GitHub

It may be time for you to do the benchmarking again. :grinning:


That's great news! Thanks @hwu36 for sharing this. We can't wait to benchmark CUTLASS with CUDA 11.3 on T4 and V100 using Tensor Cores, as well as TF32 on A100. :smile:


Hi @Laurawly, do you have an update on this? I'm very interested in this feature and happy to help with the upstreaming effort.


Hi @masahi, glad that you could help. It's already used in production; however, I'm currently focused on landing it. When I have more bandwidth, I will send a PR, but the code will probably be as-is. If you could help get it ready to be merged upstream, that would be great.


@Laurawly Great!! Very cool to hear that it is being used in production. Yes, you can send your code as it is, and I can do all the necessary integration and clean-up work for upstreaming.


Is there any documentation on using CUTLASS in TVM?


No, but you can check out usage examples in https://github.com/apache/tvm/blob/e7f36487dfdb6c4b7b544be155d3869002d7281b/tests/python/contrib/test_cutlass.py

I also have some E2E examples at https://github.com/masahi/tvm-cutlass-eval
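Not documentation, but roughly the flow those examples follow (a sketch from memory; helper names other than partition_for_cutlass are assumptions and may differ between TVM versions, so treat test_cutlass.py as the source of truth):

import tvm
from tvm import relay
from tvm.relay.op.contrib.cutlass import partition_for_cutlass

def build_with_cutlass(mod, params):
    # Tag CUTLASS-supported patterns for offload to the CUTLASS BYOC codegen.
    mod = partition_for_cutlass(mod)
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="cuda", params=params)
    # The test file then profiles and compiles the generated CUTLASS kernels
    # (via helpers in tvm.contrib.cutlass; exact names vary by version)
    # before creating the runtime module.
    return lib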


Hi, I have another question: how did you add the compute and schedule strategies for operators that use the CUTLASS codegen, given that CUTLASS is not included in TOPI? For example, this is how existing strategies are registered:

@override_native_generic_func("cumprod_strategy")
def cumprod_strategy(attrs, inputs, out_type, target):
    """cumprod generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumprod),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumprod.generic",
    )
    return strategy

@cumsum_strategy.register(["cuda", "gpu"])
def cumsum_strategy_cuda(attrs, inputs, out_type, target):
    """cumsum cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumsum),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumsum.cuda",
    )
    return strategy

Can we use CUTLASS + Tensor Cores for int8 inference now, or is only FP16 supported? Thanks.