I want to create a new reduction operation whose window size is decided by user input. For example, we want to reduce a 1-dimensional vector A to a smaller size. Instead of summing up every fixed number of elements, the user supplies another vector out_ind, which gives the output index for each input element. Here is pseudo-code:
```python
for i in range(M):
    out[out_ind[i]] += A[i]
```
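For reference, the intended semantics can be sketched in plain NumPy (the sizes and index values below are just illustrative, not from my actual workload):

```python
import numpy as np

M, N = 8, 3  # illustrative input and output lengths
A = np.arange(M, dtype="float32")                      # [0, 1, ..., 7]
out_ind = np.array([0, 1, 2, 0, 1, 2, 0, 1], dtype="int64")

# Scatter-add: every A[i] is accumulated into out[out_ind[i]].
out = np.zeros(N, dtype="float32")
for i in range(M):
    out[out_ind[i]] += A[i]

# np.add.at performs the same unbuffered scatter-add in one call.
ref = np.zeros(N, dtype="float32")
np.add.at(ref, out_ind, A)
assert np.allclose(out, ref)
```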
I created an IRModule and tried to compile it for an NVIDIA GPU, but it produces a wrong result.
```python
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer[(1024,), "float32"],
             out_ind: T.Buffer[(1024,), "int32"],
             out: T.Buffer[(128,), "float32"]) -> None:
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in T.grid(1024):
            with T.block("A"):
                vi = T.axis.remap("S", [i])
                with T.init():
                    out[out_ind[vi]] = 0.0
                out[out_ind[vi]] += A[vi]
```
Thanks for your question. It’s pretty interesting.
```python
for i in range(M):
    out[out_ind[i]] += A[i]
```
This is not a typical reduction computation, which usually reduces entire axes (e.g. the k axis in the GEMM case). I suggest changing your code to:
```python
@T.prim_func
def main(A: T.Buffer[(1024,), "float32"],
         out_ind: T.Buffer[(1024,), "int32"],
         out: T.Buffer[(128,), "float32"]) -> None:
    for i in range(128):
        with T.block("init"):
            vi = T.axis.remap("S", [i])
            out[vi] = 0.0
    for i in range(1024):
        with T.block("A"):
            vi = T.axis.opaque(1024, i)  # neither a spatial nor a reduction axis
            out[out_ind[vi]] += A[vi]
```
This is a generic scatter operation, and making the loop data-parallel may cause a race condition: two different iterations i1 and i2 can write to the same location if out_ind[i1] == out_ind[i2].
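The same write conflict shows up even in NumPy: buffered fancy-index assignment silently drops duplicate updates, which is exactly what a naive parallel lowering would do when two lanes hit the same output slot (the indices below are illustrative):

```python
import numpy as np

A = np.ones(4, dtype="float32")
out_ind = np.array([0, 0, 1, 1])  # duplicate output indices

# Buffered update: each duplicated index is written only once,
# so one of the two contributions to each slot is lost.
buffered = np.zeros(2, dtype="float32")
buffered[out_ind] += A

# Unbuffered scatter-add: all contributions are accumulated,
# which is what an atomic (or serialized) lowering must guarantee.
correct = np.zeros(2, dtype="float32")
np.add.at(correct, out_ind, A)
```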
In SparseTIR such blocks are marked as atomic, and we generate atomic operations (such as AtomicAdd) for them accordingly; however, that part has not been upstreamed yet.