Phasing out Legacy Components

Thanks @LeiWang1999, I think the main goal here would be to ensure that the IR remains a common, shared component.

Different projects can define their own transformations and leverage the main code base. That would enable us to reuse different tuners and IR transformations out of tree. We have been using such patterns with Relax passes in LLM compilation, and we can likely also use them for TIR compilation to some extent. This is also the main benefit of the modular flow in the new approach.

The way it works is to define a customized TIR lowering pass:

IRModule => My-TIR-Lowering (can be in 3rd party) => Common TIR lowering pipeline

and ensure that the build process can leverage the common representation. Changes to the common IR and unified runtime themselves still need to happen upstream, but those are usually less frequent than transformations.
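
To make the shape of that flow concrete, here is a minimal Python sketch (the pass name and body are hypothetical, and Simplify merely stands in for the common lowering pipeline):

import tvm

# Hypothetical out-of-tree lowering pass: takes an IRModule, returns an IRModule.
@tvm.transform.module_pass(opt_level=0, name="MyTIRLowering")
def my_tir_lowering(mod, ctx):
    # project-specific TIR rewrites would go here
    return mod

# Compose the custom pass with passes from the common pipeline.
seq = tvm.transform.Sequential([
    my_tir_lowering,
    tvm.tir.transform.Simplify(),  # stand-in for the shared lowering passes
])
# lowered_mod = seq(ir_module)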


@tqchen, thanks! This is exactly what we are expecting. However, last time I tried to bring my own tuner into mlc-llm, I encountered an issue:

import tvm  # upstream

relax_mod = relax_transform(relax_mod)

import welder
relax_mod = welder.tune(relax_mod)
# something bad happened

The problem was that when welder is imported, it also imports its own version of TVM, which then invokes load_dlls (for example, to load libcutlass). This ends up overwriting the upstream cutlass lib and leads to some bugs.
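
As a quick diagnostic for this kind of clash (a hedged sketch relying on TVM internals that may differ across versions), one can check which Python package and which libtvm were actually loaded:

import tvm

# Print which TVM Python package was imported and which libtvm it loaded.
print(tvm.__file__)
print(tvm._ffi.base._LIB)  # the underlying ctypes.CDLL handle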

That is right. In such cases, we will need to ensure the downstream project is structured to depend on the same libtvm, so that both project A and project B depend on the same upstream TVM (via an include dependency) but build new optimization transformations on top.

That does mean we need to restructure the projects instead of simply doing in-place modification; for example, MLC LLM adds customized passes and runtime functions on top while taking TVM as a dependency.

Our hope is that by updating the upstream APIs to be more modular, such transformations can happen more organically.

LLMs are fundamentally transforming the paradigm of ML deployment and compilation. Simultaneously, the increasing complexity of ML optimization pipelines has rendered many legacy components inadequate for meeting rapidly evolving requirements.

On the other hand, the open-source community faces a shortage of volunteers willing to maintain these codebases consistently. Consequently, we must prioritize and concentrate our efforts on key strategic approaches to address these challenges effectively.

For most common use cases, the Unity flow can effectively replace legacy components, incorporating features such as static shape auto-tuning and BYOC capabilities. While we acknowledge that some niche scenarios (e.g., microTVM) may not be fully supported initially, we can address these later if strong demand persists.

In summary, I concur that the time has come to gradually phase out legacy components. This strategic move will serve two crucial purposes:

  1. Clean up the codebase: By removing outdated or redundant elements, we can significantly reduce complexity and improve maintainability.

  2. Unify our focus: Concentrating our efforts on the new Unity flow will allow for more efficient development and innovation.


One suggestion that I have for TVM is to add a cleaner exit from the stack.

For example, for OpenCL/CUDA targets, what do I do if I just want the generated kernels?

Note: there is a way to print the source for CL, but unfortunately I have not found a way to get the work group / threadblock sizes and dimensions, which are needed to use the kernels. Surely, those parameters were tuned.

@varunnaw Good point. In my project we use this approach to retrieve attributes, including the dynamic shared memory size and block/grid information (we add these attributes in a TVM pass), which might be helpful to you.

Why is this important?

When users integrate the tvm runtime with 3rdparty frameworks like torch, using dlpack can introduce significant runtime overheads on smaller data shapes, such as gemv and small batched gemv on data-center GPUs. In our benchmarks, we observed delays of around 10 to 50 us. For more details, please refer to this discussion: Strange overhead of tvm.runtime.ndarray.from_dlpack - Apache TVM Discuss.

These overheads arise not only from the ctypes overhead required to initialize a TVMValue from dlpack, but also from occasional calls to CUDASetDevice during the conversion process, which is also costly.
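
For context, here is a rough way one might measure the per-call conversion cost being described (a hedged sketch assuming CUDA-enabled builds of both libraries; the numbers will vary by GPU and driver):

import time
import torch
import tvm

x = torch.randn(4096, device="cuda", dtype=torch.float16)

# Time only the tensor hand-off, not any kernel execution.
start = time.perf_counter()
for _ in range(1000):
    tvm.runtime.ndarray.from_dlpack(torch.utils.dlpack.to_dlpack(x))
elapsed_us = (time.perf_counter() - start) / 1000 * 1e6
print(f"average from_dlpack conversion: {elapsed_us:.1f} us")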

Moreover, when we want to extract the generated code for other usages, TVM doesn't provide a tool to automatically extract the block dims, grid dims, and dynamic shared memory usage (which would help us initialize the dynamic shared memory). Maybe the link I put forward above offers a possible solution. 🙂
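
As a rough illustration of what such a tool could look like (a hedged sketch, not the actual BitBLAS pass), the launch extents can be collected from a lowered, GPU-scheduled PrimFunc by visiting its thread_extent annotations:

import tvm

def collect_launch_params(prim_func):
    """Collect blockIdx/threadIdx extents from a lowered, GPU-scheduled PrimFunc."""
    extents = {}

    def visit(stmt):
        if isinstance(stmt, tvm.tir.AttrStmt) and stmt.attr_key == "thread_extent":
            # stmt.node is an IterVar tagged e.g. "blockIdx.x" or "threadIdx.y"
            extents[stmt.node.thread_tag] = int(stmt.value)

    tvm.tir.stmt_functor.post_order_visit(prim_func.body, visit)
    return extents

# e.g. collect_launch_params(lowered_mod["main"]) -> {"blockIdx.x": 2, "threadIdx.x": 32, ...}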

Thanks!

I'm not familiar with this project BitBLAS. Please correct me if I am wrong: in the code you showed, the IRModule pass that retrieves the threadblock dimensions is get_annotated_device_mod. I'm confused by how the CUDA source wrapper is initialized; an IR module plus a source string is passed? Don't you typically get the source after building the module?

Also, do you initialize the TileDevice class with remote.cl() or remote.cuda() just as tvm examples do?

Here's a Python script that prints the source for a single conv2d (I omitted tuning for brevity). I still don't know how to get work group sizing though. Do you have any advice on how to use your method from BitBLAS here?

import numpy as np
import tvm
from tvm import relay, autotvm
import tvm.relay.testing


target_str = "opencl"
target = tvm.target.Target(target_str, host="llvm -mtriple=aarch64-linux-android")
dtype = "float16"
input_name = "input"
filter_name = "weight"

input_shape=(1, 25, 25, 64)
filter_shape=(3, 3, 64, 96)
filter = np.random.rand(*filter_shape).astype(dtype)

input = tvm.relay.var("input", shape=input_shape, dtype=dtype)
weight = tvm.relay.var("weight", shape=filter_shape, dtype=dtype)
D = relay.nn.conv2d(input, weight, padding=(0, 0), data_layout="NHWC", kernel_layout="HWIO", out_dtype=dtype)

mod = relay.Function([input, weight], D)
params = {
    "weight": tvm.nd.array(filter)
}

with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build_module.build(mod, target, params=params)

print(lib.imported_modules[0].get_source())

Is BYOC the only option for adding graph substitutions? If the substitution is just for one operator, can this be implemented by adding an entirely new operator?

@varunnaw, such an approach only works for single IR modules rather than an end-to-end module. We modified the pass lower_device_kernel_launch https://github.com/LeiWang1999/tvm/blob/bitblas_tl/src/tir/transforms/lower_device_kernel_launch.cc#L244-L245 to inject these attributes. If you want to extract them from the source, I guess we can modify codegen_c to add some extra comments based on these attributes:

extern "C" __global__ void __launch_bounds__(128) main_kernel(half* __restrict__ A, half* __restrict__ B, half* __restrict__ C) {
          // thread_extent [32, 2, 2]
          // block_dims [2, 2, 1]
          // dynamic_size_in_bytes [16384]
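
If codegen were modified to emit comments like the ones above, the launch configuration could then be recovered from the generated source with a simple parse (a hedged sketch; `source` would come from something like lib.imported_modules[0].get_source() in the earlier script):

import re

def parse_launch_comments(source):
    """Parse the injected '// key [a, b, c]' comments out of generated kernel source."""
    params = {}
    for key in ("thread_extent", "block_dims", "dynamic_size_in_bytes"):
        match = re.search(rf"//\s*{key}\s*\[([\d,\s]+)\]", source)
        if match:
            params[key] = [int(v) for v in match.group(1).split(",")]
    return params

# parse_launch_comments(lib.imported_modules[0].get_source())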

TVM has indeed made progress in supporting GenAI and has also performed well on mobile. As one of the downstream players in the ecosystem, Arm China relies on TVM's parsers to interpret various model formats, such as TensorFlow, PyTorch, Caffe, etc. Internally, we have a heavy dependency on Relay and have written some passes to accomplish customized operations.

The AI field is indeed rapidly evolving, and we understand the idea of phasing out legacy components. However, due to our internal dependencies, we have concerns and doubts about the current front-end support for Relax.

  1. From the information we have gathered so far, it seems that Relax may not yet fully replace the front-end capabilities of Relay.
  2. If there is a gradual phase-out of Relay, we would like to know if there is a proven and viable plan for this transition.

Here are some considerations that we can work through together regarding the frontend.

Frontends of interest evolve over time. For example, the latest PyTorch frontend has migrated to the FX graph (fx and inductor), and the respective TVM frontend needs to be updated accordingly. For new frontend needs, bringing them to Relax would enable a clear focus here; it also unblocks the issue of dynamic shape, so we can have a focused effort in these areas. That is why such a conversation is important: it lets us enable that focus.

We can possibly keep certain importer modules and data structures a bit longer if there is community volunteer effort maintaining them. We need to address the testing issue by moving from execution tests to structural tests, and by moving execution tests to nightly, where the model gets imported and then translated to Relax for structural testing. We encourage such efforts to start working on frontend translation directly into Relax when possible.
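
For illustration, a structural test could look roughly like the sketch below (the importer call and the expected module are hypothetical):

import tvm
from tvm.script import ir as I, relax as R

# Expected ("golden") module written in TVMScript; shapes and op are illustrative.
@I.ir_module
class Expected:
    @R.function
    def main(x: R.Tensor((1, 64), "float32")) -> R.Tensor((1, 64), "float32"):
        gv = R.nn.relu(x)
        return gv

# imported_mod = some_frontend.import_model(model)  # hypothetical importer call
# tvm.ir.assert_structural_equal(imported_mod, Expected, map_free_vars=True)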

Coming back to the broader context, this is indeed a hard tradeoff we need to make. The real impact falls on the volunteer developers, and we face a real risk of being burdened by lack of maintenance and slow development, and of the project not surviving in a fast, competitive landscape. That is why it is important to have this conversation and move in this direction. It would also enable a clear call to focus on some of the latest frontend needs through Relax development. I would love to see ideas around them and to work together on some of these directions!

To keep things supported, we should enable release branch cuts that can continue to take maintenance patches on the related components. We can also account for them in community development and contributions.


Here at Infineon one of the key “deciders” for TVM was the availability of the mature (Relay-based) backends for ARM embedded HW (and other COTS targets). Reading between the lines (“release branch, maintenance patch”), it seems that these will effectively be orphaned: no access to new frontends, little or no scope for active enhancement/extension PRs, loss of connection to the TVM community mainstream for anyone still working with them…

Is there any likelihood of these being ported to Relax (ideally by their contributors)?

Without these, TVM would become something of a “non-starter” for our productive use. Dependable and properly maintainable backends for the mainstream ARM compute IP are the “must have”. For our own in-house HW we would just have to grit our teeth, write off our TVM investment, and suffer through hacking up TFLM/PyTorch Edge with in-house performance hacks.

Let us look into some of the frontend needs. One thing we can do is align most of the Relax and Relay ops, so we can try to use GenAI tools to bring some of the Relay frontends to Relax.

I'm currently working on refactoring our project along the methodology we discussed in this thread, building on the TVM core infrastructure by taking TVM as an include dependency and linking against TVM as a shared library.

Here is an example of a CMakeLists.txt that works with TVM:

cmake_minimum_required(VERSION 3.21)
project(TileLang C CXX)

set(CMAKE_CXX_STANDARD 17)

# Define TVM root directory
set(TVM_ROOT ${PROJECT_SOURCE_DIR}/3rdparty/tvm)

# Include directories: project headers plus TVM and its vendored dependencies
include_directories(
    ${PROJECT_SOURCE_DIR}/include
    ${TVM_ROOT}/include
    ${TVM_ROOT}/3rdparty/dlpack/include
    ${TVM_ROOT}/3rdparty/dmlc-core/include
)

# Source files for the project
file(GLOB_RECURSE TileLang_SOURCES
    ${PROJECT_SOURCE_DIR}/src/transform/*.cpp
    ${PROJECT_SOURCE_DIR}/src/op/*.cc
    ${PROJECT_SOURCE_DIR}/src/codegen/*.cc
)

# Create shared library
add_library(TileLang SHARED ${TileLang_SOURCES})

# Link against the upstream libtvm shared library (adjust the hint to where libtvm was built)
find_library(TVM_LIBRARY NAMES tvm HINTS ${TVM_ROOT}/build)
target_link_libraries(TileLang PRIVATE ${TVM_LIBRARY})

I think the key part of this pipeline is ensuring that the TVM-based implementation allows developers to write their own passes (on the C++ side). I'm not sure how we can bind our own C++ transformations and op definitions to Python with the TVM FFI, though. Do we have any example projects or guidelines for this? I'll continue exploring to achieve a cleaner design.

I think it is possible; MLC LLM should serve as an example.

Here is an example of binding a global function: https://github.com/mlc-ai/mlc-llm/blob/main/cpp/serve/radix_tree.cc#L822
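
On the Python side, one common pattern (a hedged sketch; the library path and function name below are hypothetical) is to load the compiled shared library with RTLD_GLOBAL so that its TVM_REGISTER_GLOBAL registrations become visible, and then look them up through the TVM registry:

import ctypes
import tvm

# Load the out-of-tree library (built by a CMake project like the one above)
# with RTLD_GLOBAL so its registered functions land in the shared TVM registry.
_LIB = ctypes.CDLL("build/libTileLang.so", ctypes.RTLD_GLOBAL)

# Fetch a function registered on the C++ side via TVM_REGISTER_GLOBAL.
my_pass = tvm.get_global_func("tilelang.transform.MyPass")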


Update: the Relax ONNX frontend already supports all operators that Relay supports 🙂


Thanks @Hzfengsy for the great effort. It certainly helps us pave the way forward.

As one of the first steps, we plan to phase out the legacy VTA flow. This particular component has stabilized, will remain available in past releases, and is not actively maintained as of now. We also hope to make it simple to bring back some future examples through out-of-tree development, so we can easily customize new compilation flows and enable new applications like this.


As a next step, let us plan to phase out the micro flow, which is mostly based on legacy components. This particular component will remain available in 0.18.0 and previous releases and is not actively maintained as of now. We also hope to empower bringing back some future examples through the unity flow if there are community members interested in that direction.


I have to leave a comment to express my feelings about this.

I just saw the PR for this, and I have to say, this is a truly sad day for me. The thing that first brought me near TVM was microTVM, and the ability to target embedded devices with such a reduced runtime. I have been using it a lot during the last few years, and of course, will continue working with it.

My feeling is that, without it, TVM is not going to be used anymore in papers targeting custom accelerators, which was a very interesting niche that was previously mostly filled by TVM. Some of the features that could be used with it, like USMP or the AoT Executor, were truly amazing, and it is sad that I will not be able to take advantage of them through microTVM in the future. The phase-out of the VTA flow takes TVM in the same direction.

I hope we can take back again the development of microTVM in the future, maybe building some bridge to/from Relax.
