Run TVM software stack on RISC-V

zhupijuan_lkl · September 6, 2024, 3:32am

Hi all,

I am trying to run the entire TVM software stack (not just the TVM runtime) on a RISC-V CPU, but I cannot get even a simple case to run. According to the printed error messages, there seems to be an issue during the execution phase after LLVM code generation. Could you all help me figure out what might be the problem?

llvm: 15.0.7
gcc: gcc version 13.1.0 (Debian 13.1.0-6)
tvm: commit 35e74cc4c9c8dec658217ffeea85f2ba25e35a35

case:

import numpy as np
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
import os


a = np.arange(16).reshape(4, 4)
b = np.arange(16, 0, -1).reshape(4, 4)


c_np = a + b

@tvm.script.ir_module
class MyAdd:
    @T.prim_func
    def add(
        A: T.Buffer((4, 4), "int64"),
        B: T.Buffer((4, 4), "int64"),
        C: T.Buffer((4, 4), "int64"),
    ):
        T.func_attr({"global_symbol": "add"})
        for i, j in T.grid(4,4):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                C[vi, vj] = A[vi, vj] + B[vi, vj]

print(MyAdd)

rt_lib = tvm.build(
    MyAdd,
    target="llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c"
)
print("after build...")

a_tvm = tvm.nd.array(a)
b_tvm = tvm.nd.array(b)
c_tvm = tvm.nd.array(np.empty((4,4), dtype=np.int64))
rt_lib["add"](a_tvm, b_tvm, c_tvm)
np.testing.assert_allclose(c_tvm.numpy(), c_np, rtol=1e-5)

error log:

Program received signal SIGSEGV, Segmentation fault.
0x0000003ff0dee6be in llvm::RuntimeDyldELF::computePlaceholderAddress(unsigned int, unsigned long) const ()
   from /mnt/v_nfs/zifeng/v-repos/tvm/build/libtvm.so
(gdb) bt
#0  0x0000003ff0dee6be in llvm::RuntimeDyldELF::computePlaceholderAddress(unsigned int, unsigned long) const ()
   from /mnt/v_nfs/zifeng/v-repos/tvm/build/libtvm.so
#1  0x0000003ff0de686e in llvm::RuntimeDyldImpl::applyExternalSymbolRelocations(llvm::StringMap<llvm::JITEvaluatedSymbol, llvm::MallocAllocator>) () from /mnt/v_nfs/zifeng/v-repos/tvm/build/libtvm.so
#2  0x0000000001c4d720 in ?? ()

@tqchen

tqchen · September 6, 2024, 12:57pm

Seems this is an issue with the LLVM JIT. I remember we have two LLVM JIT modes; @cbalint13 might know, see https://github.com/apache/tvm/pull/15964

Maybe we should switch to the new jit mode as the default.

Another alternative is to run export_library directly, which exports to a shared library that you can then load back.

cbalint13 · September 6, 2024, 1:26pm

@zhupijuan_lkl

Could you please try target="llvm -jit=orcjit -mtriple=riscv64{...}"?

Let me know the outcome; we can fix this properly if there are further issues.
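
For readers following along, here is a minimal sketch of where the `-jit=orcjit` option goes in a target string. The helper name `with_orcjit` is hypothetical (it is not part of TVM); it only manipulates the string, and the base target is the one posted at the top of the thread.

```python
def with_orcjit(target: str) -> str:
    """Return `target` with `-jit=orcjit` placed right after the `llvm` kind."""
    parts = target.split()
    if not parts or parts[0] != "llvm":
        raise ValueError("expected a target string starting with 'llvm'")
    # Drop any existing -jit=... option so the result carries exactly one.
    options = [p for p in parts[1:] if not p.startswith("-jit=")]
    return " ".join(["llvm", "-jit=orcjit", *options])


# Target string from the original post.
base = (
    "llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
    "-mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c"
)
orc_target = with_orcjit(base)
print(orc_target)
```

The resulting string can be passed as `target=` to `tvm.build` in place of the original one.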

tqchen · September 6, 2024, 1:28pm

@cbalint13 if we confirm orcjit works for all cases and we don't need to keep compatibility with older LLVM, perhaps we can simply move towards this as the default

cbalint13 · September 6, 2024, 1:43pm

@tqchen , @zhupijuan_lkl

Let’s confirm it, then I will raise a proper PR to propose the new ORCJIT as the default.

I will also check the code snippet from this thread later on a real RISC-V box, but it should work either in this form (running locally on a real RISC-V machine) or by running on a remote RISC-V machine via the RPC API.
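
A rough sketch of the remote variant, for illustration only: it assumes a library exported as `model.so` that was built for riscv64 and contains the `add` function from the first post, and a TVM RPC server already running on the board. The function name `run_add_over_rpc`, the host, and the port are placeholders, not values from this thread.

```python
def run_add_over_rpc(host: str, port: int, lib_path: str = "model.so"):
    """Upload an exported library to a remote RPC server and run `add` there."""
    # Imports live inside the function so this file loads on hosts without TVM.
    import os

    import numpy as np
    import tvm
    from tvm import rpc

    remote = rpc.connect(host, port)   # open a session with the remote RPC server
    remote.upload(lib_path)            # copy the shared library to the remote side
    rlib = remote.load_module(os.path.basename(lib_path))
    dev = remote.cpu()                 # CPU device on the remote machine

    # Same inputs as the test case in the first post.
    a = np.arange(16, dtype="int64").reshape(4, 4)
    b = np.arange(16, 0, -1, dtype="int64").reshape(4, 4)
    a_tvm = tvm.nd.array(a, dev)
    b_tvm = tvm.nd.array(b, dev)
    c_tvm = tvm.nd.empty((4, 4), "int64", dev)

    rlib["add"](a_tvm, b_tvm, c_tvm)
    result = c_tvm.numpy()
    np.testing.assert_allclose(result, a + b)
    return result
```

On the board, the server side would typically be started with something like `python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9090`, and the function above called from the host as `run_add_over_rpc("<board-ip>", 9090)`.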

As a reminder, the reason for the ORCJIT executor in the RISC-V case was originally raised here: D127842 [RuntimeDyld][RISCV] Minimal riscv64 support

zhupijuan_lkl · September 7, 2024, 2:32am

@tqchen Thank you very much for your reply!

Yes, I also noticed the previous PR submitted by the community regarding orcjit, and I tried it out. However, I found that it still doesn’t work properly and returns the following error:

I set target = "llvm -jit=orcjit -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c"

JIT session error: In graph TVMMod-jitted-objectbuffer, section .text: relocation target "__TVMAPISetLastError" at address 0x3fbd3ac000 is out of range of R_RISCV_HI20 fixup at 0x3fbd3ad266 (add, 0x3fbd3ad000 + 0x266)

Furthermore, I attempted to convert the module to a .so file first and then reloaded it for execution, and the case completed successfully:

import numpy as np
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
import os


a = np.arange(16).reshape(4, 4)
b = np.arange(16, 0, -1).reshape(4, 4)


c_np = a + b

@tvm.script.ir_module
class MyAdd:
    @T.prim_func
    def add(
        A: T.Buffer((4, 4), "int64"),
        B: T.Buffer((4, 4), "int64"),
        C: T.Buffer((4, 4), "int64"),
    ):
        T.func_attr({"global_symbol": "add"})
        for i, j in T.grid(4,4):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                C[vi, vj] = A[vi, vj] + B[vi, vj]

# print(MyAdd)

rt_lib = tvm.build(
    MyAdd,
    target="llvm -jit=orcjit -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c"
)
rt_lib.export_library("model.so", cc="g++")
print("after build...")

a_tvm = tvm.nd.array(a)
b_tvm = tvm.nd.array(b)
c_tvm = tvm.nd.array(np.empty((4,4), dtype=np.int64))


from tvm import runtime
dev_lib = runtime.load_module("model.so")
print("before run...")
dev_lib["add"](a_tvm, b_tvm, c_tvm)

# print("before run...")
# rt_lib["add"](a_tvm, b_tvm, c_tvm)

np.testing.assert_allclose(c_tvm.numpy(), c_np, rtol=1e-5)

zhupijuan_lkl · September 7, 2024, 2:34am

@cbalint13 Thanks for your reply!

I have tried "llvm -jit=orcjit -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c", but it still fails.

JIT session error: In graph TVMMod-jitted-objectbuffer, section .text: relocation target "__TVMAPISetLastError" at address 0x3fbd3ac000 is out of range of R_RISCV_HI20 fixup at 0x3fbd3ad266 (add, 0x3fbd3ad000 + 0x266)

Could you help me see what might have caused the error? Thanks

cbalint13 · September 7, 2024, 10:50pm

zhupijuan_lkl:

JIT session error: In graph TVMMod-jitted-objectbuffer, section .text:
relocation target  "__TVMAPISetLastError" at address 0x3fbd3ac000 is 
out of range of R_RISCV_HI20 fixup at

Hmm, that is interesting. It is not ORCJIT related, it is linkage related: the linker cannot relocate a section above 2G.

I am looking into this; it seems something changed in newer TVM in the meantime.
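
To illustrate the 2G limit mentioned above, here is a small arithmetic sketch. It assumes the usual RISC-V `lui`/`addi` convention, where an absolute address is split into a 20-bit upper part and a sign-extended 12-bit lower part, so the address plus 0x800 has to fit in a signed 32-bit value. The exact check inside LLVM may differ in detail; the function names are made up for this example, and the address is the one from the error message.

```python
from typing import Tuple


def fits_signed(value: int, bits: int) -> bool:
    """True if `value` is representable as a two's-complement `bits`-bit integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def split_hi20_lo12(addr: int) -> Tuple[int, int]:
    """Split an absolute address into lui (hi20) and addi (lo12) immediates."""
    # addi sign-extends its 12-bit immediate, so the upper part is rounded by 0x800.
    if not fits_signed(addr + 0x800, 32):
        raise OverflowError(f"{addr:#x} is out of range for a HI20/LO12 pair")
    hi20 = ((addr + 0x800) >> 12) & 0xFFFFF
    lo12 = addr & 0xFFF
    if lo12 >= 0x800:
        lo12 -= 0x1000  # interpret the low 12 bits as a signed immediate
    return hi20, lo12


# A low address splits cleanly into the two immediates.
print(split_hi20_lo12(0x12345678))

# Address of "__TVMAPISetLastError" from the JIT session error.
target = 0x3FBD3AC000
print(fits_signed(target + 0x800, 32))
```

Under this assumption, the reported target address lies far above the 2 GiB that a HI20 fixup can reach, which matches the "out of range" wording in the error.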

zhupijuan_lkl · September 8, 2024, 6:54am

@cbalint13 Thank you. If you need me to provide any further information or there are any updates, please feel free to let me know.

cbalint13 · September 8, 2024, 12:20pm

I managed to fix this here: https://github.com/apache/tvm/pull/17347

Let me know if you still have issues.

Thank you @zhupijuan_lkl !

zhupijuan_lkl · September 9, 2024, 2:11am

@cbalint13 This patch works for me, and the above case is now running successfully!!