Non-predefined Reduction

I want to create a new reduction operation. However, the reduction window is decided by user input rather than being fixed. For example, we want to reduce a 1-dimensional vector A to a smaller size. Instead of summing up every fixed number of elements, the user supplies another vector out_ind, which gives the output index for each element. Here is pseudo-code:

for i in range(M):
    out[out_ind[i]] += A[i]

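In NumPy terms, the intended result is a scatter-add. Here is a minimal reference sketch (the helper name scatter_add_ref is just illustrative; np.add.at accumulates correctly even when indices repeat):

import numpy as np

def scatter_add_ref(A, out_ind, out_size):
    # Reference semantics: accumulate A[i] into out[out_ind[i]].
    out = np.zeros(out_size, dtype=A.dtype)
    np.add.at(out, out_ind, A)
    return out
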
I created an IRModule and tried to compile it for an NVIDIA GPU, but it produces a wrong result.

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer[(1024,), "float32"], 
             out_ind: T.Buffer[(1024,), "int32"], 
             out: T.Buffer[(128,), "float32"]) -> None:
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in T.grid(1024):
            with T.block("A"):
                vi = T.axis.remap("S", [i])
                with T.init():
                    out[out_ind[vi]] = 0.0
                out[out_ind[vi]] += A[vi]

Does anyone have any idea about this? Thanks.


Thanks for your question. It’s pretty interesting.

for i in range(M):
    out[out_ind[i]] += A[i]

This is not a typical reduction computation, which usually reduces whole axes (e.g. the k axis in the GEMM case).

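For contrast, a typical reduction block in TVMScript looks like the following sketch (shapes are arbitrary): the k loop covers the whole axis, so it can be bound as a reduction ("R") axis and T.init() is well-defined.

@T.prim_func
def gemm(A: T.Buffer[(128, 128), "float32"],
         B: T.Buffer[(128, 128), "float32"],
         C: T.Buffer[(128, 128), "float32"]) -> None:
    T.func_attr({"global_symbol": "gemm", "tir.noalias": True})
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = 0.0
            C[vi, vj] += A[vi, vk] * B[vk, vj]
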
I suggest changing your code to something like:

@T.prim_func
def main(A: T.Buffer[(1024,), "float32"],
         out_ind: T.Buffer[(1024,), "int32"],
         out: T.Buffer[(128,), "float32"]) -> None:
    T.func_attr({"global_symbol": "main", "tir.noalias": True})
    for i in range(128):
        with T.block("init"):
            vi = T.axis.remap("S", [i])
            out[vi] = 0.0
    for i in range(1024):
        with T.block("A"):
            vi = T.axis.opaque(1024, i)  # neither a spatial nor a reduction axis
            out[out_ind[vi]] += A[vi]
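
To sanity-check the rewrite before worrying about GPU scheduling, one option (a rough sketch, assuming the prim_func above is wrapped in an @tvm.script.ir_module class MyModule and keeps the global_symbol "main") is to build it for llvm and compare against a NumPy reference:

import numpy as np
import tvm

rt_mod = tvm.build(MyModule, target="llvm")  # CPU build, just for a correctness check

A_np = np.random.rand(1024).astype("float32")
ind_np = np.random.randint(0, 128, size=1024).astype("int32")
expected = np.zeros(128, dtype="float32")
np.add.at(expected, ind_np, A_np)

dev = tvm.cpu()
A_nd = tvm.nd.array(A_np, dev)
ind_nd = tvm.nd.array(ind_np, dev)
out_nd = tvm.nd.array(np.zeros(128, dtype="float32"), dev)
rt_mod["main"](A_nd, ind_nd, out_nd)

np.testing.assert_allclose(out_nd.numpy(), expected, rtol=1e-5)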

Thanks for your reply. I tried your suggestion, but I cannot auto-tune the code. Here is how I auto-tuned it:

import tvm
from tvm import meta_schedule as ms

target = tvm.target.Target("cuda -keys=cuda,gpu -arch=sm_75 -max_num_threads=1024 -thread_warp_size=32 -max_threads_per_block=1024 -max_shared_memory_per_block=49152")

ir_database = ms.tune_tir(
    mod=MyModule,
    target=target,
    max_trials_global=64,
    num_trials_per_iter=64,
    work_dir="./tune_tmp",
    task_name="main"
)

sched = ms.tir_integration.compile_tir(ir_database, MyModule, target)

func = tvm.build(sched.mod, target=target)

And here is a screenshot of the error message.

I wonder if I need to tune manually.

Yes, it cannot be auto-tuned. However, I'm not sure how to optimize such a program.

Since it is a reduction-like computation, we cannot parallelize the main loop i.

Additionally, cross_thread_reduction cannot be applied to this workload.

This is a generic scatter operation; making the loop data-parallel might result in a race condition: two different iterations i1 and i2 can write to the same location if out_ind[i1] == out_ind[i2].
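
Here is a small NumPy analogy of the lost-update problem (not TVM-specific): with duplicate indices, an unsynchronized, vectorized update drops contributions, while a serialized accumulation (np.add.at) keeps them all.

import numpy as np

A = np.array([1.0, 2.0, 3.0], dtype="float32")
out_ind = np.array([0, 0, 1])     # index 0 appears twice

out = np.zeros(4, dtype="float32")
out[out_ind] += A                 # buffered update: the write of 1.0 to out[0] is lost
print(out)                        # [2. 3. 0. 0.]

out2 = np.zeros(4, dtype="float32")
np.add.at(out2, out_ind, A)       # accumulates every contribution
print(out2)                       # [3. 3. 0. 0.]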

In SparseTIR, such blocks are marked as atomic and we generate atomic operations (such as AtomicAdd) correspondingly; however, this part has not been upstreamed yet.
