[Feature Request] Support C Host Module Packing

LeiWang1999 · February 18, 2025, 5:41pm

Why We Need This Feature?

Right now, when installing TVM or projects built on TVM, most user issues seem to stem from the LLVM dependency. There isn't a convenient way to install LLVM, and downloading prebuilt binaries isn't ideal either, since they are quite large: compressed packages can be over 1GB.

I think the best solution is to enable C host codegen instead of LLVM. Interestingly, target="cuda" with target_host="llvm" works fine, but target="cuda" with target_host="c" seems uncommon.

Consider the following example:

import tvm
from tvm.script import ir as I
from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(Q: T.Buffer((1, 4096, 32, 128), "float16")):
        T.func_attr({"target": T.target({"arch": "sm_89", "host": {"keys": ["cpu"], "kind": "llvm", "mtriple": "x86_64-unknown-linux-gnu", "tag": ""}, "keys": ["cuda", "gpu"], "kind": "cuda", "max_num_threads": 1024, "tag": "", "thread_warp_size": 32})})
        by = T.launch_thread("blockIdx.y", 32)
        v = T.launch_thread("threadIdx.x", 256)
        for i in T.vectorized(8):
            Q_shared = T.allocate([16384], "float16", "shared.dyn")
            Q_shared_1 = T.Buffer((16384,), "float16", data=Q_shared, scope="shared.dyn")
            Q_1 = T.Buffer((16777216,), "float16", data=Q.data)
            Q_shared_1[v * 8 + i] = Q_1[by * 128 + v * 8 + i]

mod = Module

rt_mod = tvm.build(mod, target="cuda", target_host="c")

print(rt_mod.get_source())
print(rt_mod.imported_modules[0].get_source())


import numpy as np
Q = tvm.nd.array(np.random.randn(1, 4096, 32, 128).astype("float16"), device=tvm.cuda())

rt_mod(Q)

The output is:

Traceback (most recent call last):
  File "/root/tilelang/debug/unit_vectorize_test.py", line 28, in <module>
    rt_mod(Q)
  File "/usr/local/lib/python3.10/dist-packages/tilelang/3rdparty/tvm/python/tvm/runtime/module.py", line 201, in __call__
    return self.entry_func(*args)
  File "/usr/local/lib/python3.10/dist-packages/tilelang/3rdparty/tvm/python/tvm/runtime/module.py", line 128, in entry_func
    self._entry = self.get_function(self.entry_name)
  File "/usr/local/lib/python3.10/dist-packages/tilelang/3rdparty/tvm/python/tvm/runtime/module.py", line 176, in get_function
    raise AttributeError(f"Module has no function '{name}'")
AttributeError: Module has no function '__tvm_main__'

From my debugging, it looks like the C host module (CSourceModuleNode) only holds the source without actually compiling it. I tried to implement the compilation stack, for example by writing something similar to LLVMModuleNode, but it is actually a bit difficult. Any help with the implementation or support would be greatly appreciated!

tqchen · February 18, 2025, 9:57pm

The C compilation problem itself can be solved through export_library, or automatically via JIT in the "Unified Python First Compilation Flow through tvm.compile" proposal.

After that, you should at least be able to call a function by its name (not by calling rt_mod directly, but via rt_mod["main"]). However, PTX packing through a C compiler can be slower than through LLVM, and indeed this path is not well checked.
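
A minimal sketch of the export_library route described above, reusing the build call from the original post. The helper name build_via_export_library and the output file name kernel.so are hypothetical, chosen only for illustration. It assumes a working CUDA toolchain and system C compiler, and, as noted later in the thread, the C host codegen may need a small fix before this path works.

```python
import os
import tempfile


def build_via_export_library(mod, lib_path=None):
    """Build `mod` with a C host, compile it to a shared library, and reload it.

    Hypothetical helper for illustration; not part of TVM.
    """
    # Imported lazily so the helper can be defined without TVM installed.
    import tvm

    # Same build call as in the original post: CUDA device code, C host code.
    rt_mod = tvm.build(mod, target="cuda", target_host="c")

    if lib_path is None:
        # Hypothetical output location; any writable path ending in .so works.
        lib_path = os.path.join(tempfile.mkdtemp(), "kernel.so")

    # export_library runs the system C compiler on the generated host source,
    # which is the step the in-memory C source module does not perform.
    rt_mod.export_library(lib_path)

    # Reload the compiled shared library as a runtime module.
    return tvm.runtime.load_module(lib_path)
```

With the Module and Q from the original post, the reloaded library would then be called by function name, for example lib = build_via_export_library(Module) followed by lib["main"](Q), rather than by calling the module object directly.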

LeiWang1999 · February 19, 2025, 3:54am

Thanks, export_library works for me with a simple fix to the C host codegen. I think we can handle it automatically within CSourceModuleNode.

tqchen · February 19, 2025, 1:42pm

The Executable proposal in "Unified Python First Compilation Flow through tvm.compile" will resolve this part.