Can't run RPC GPU tutorial on my own device

I’m looking at using RPC to cross compile and run on a Jetson TX2.

I first looked at the RPC tutorial, and I can cross compile for llvm and run remotely on a CPU.

But when I followed the GPU tutorial, switching the remote connection to my device and the target to cuda, I keep getting the following error, and it’s not clear what it means.
Check failed: f != nullptr Cannot find function fuse_pad_kernel0 in the imported modules or global registry

Could anyone give me some context here?

It seems you didn’t export the library correctly. Can you post your code?

I encountered problems when using RPC on my TX2 board, too. It works fine for the llvm target but fails for cuda. Here are the errors:

  File "tvm/_ffi/_cython/./function.pxi", line 267, in tvm._ffi._cy3.core.FunctionBase.__call__
  File "tvm/_ffi/_cython/./function.pxi", line 216, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./function.pxi", line 208, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 132, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: Except caught from RPC call: TVMCall CFunc Error:
Traceback (most recent call last):
  File "/home/nvidia/tvm/python/tvm/_ffi/_ctypes/function.py", line 54, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/nvidia/tvm/python/tvm/rpc/server.py", line 47, in load_module
    m = _load_module(path)
  File "/home/nvidia/tvm/python/tvm/module.py", line 219, in load
    _cc.create_shared(path + ".so", files)
  File "/home/nvidia/tvm/python/tvm/contrib/cc.py", line 33, in create_shared
    _linux_shared(output, objects, options, cc)
  File "/home/nvidia/tvm/python/tvm/contrib/cc.py", line 58, in _linux_shared
    raise RuntimeError(msg)
RuntimeError: Compilation error:
/usr/bin/ld: /tmp/tmp9sycy8oy/lib.o: Relocations in generic ELF (EM: 62)
/usr/bin/ld: /tmp/tmp9sycy8oy/lib.o: Relocations in generic ELF (EM: 62)
/tmp/tmp9sycy8oy/lib.o: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status

Can you take a look at this?

To @Faldict: I met this error when I set target_host incorrectly.

I set target = 'cuda'. So what’s the correct target on the Jetson TX2 board?

We need some llvm host code to launch the cuda kernels, so we do cross compilation for the ARM CPU on the TX2 board.
You should use something like

with nnvm.compiler.build_config(opt_level=opt_level):
    graph, lib, params = nnvm.compiler.build(
        net, target='cuda', 
        target_host='llvm -target=aarch64-linux-gnu', ## ADD THIS LINE
        shape={"data": data_shape}, params=params)

I don’t know the exact target triple for the TX2; you can query it by executing "gcc -v" on your board and looking for the line that starts with Target:
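
For reference, gcc prints that line to stderr; on the TX2 the output should look like this (matching what is reported later in this thread):

$ gcc -v 2>&1 | grep Target
Target: aarch64-linux-gnu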


The problem was solved after adding target_host='llvm -target=aarch64-linux-gnu'. Thanks a lot!

Here is the code from my function which compiles and executes a Keras model on a Jetson.

model is the Keras model - in this case a pretrained keras_resnet50 model
X is the input data, which I preprocess using the ResNet50 preprocess function

target = 'cuda'
target_host = "llvm -target=aarch64-linux-gnu"
lib_fname = 'net.o'

################################
# Compile and upload the model #
################################
# convert the keras model(NHWC layout) to NNVM format(NCHW layout), then compile
sym, params = nnvm.frontend.from_keras(model)
shape_dict = {'input_1': X.shape}
with nnvm.compiler.build_config(opt_level=2):
    graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, target_host=target_host,
                                             params=params)

# Save the library at local temporary directory.
tmp = util.tempdir()
lib_path = tmp.relpath(lib_fname)
lib.save(lib_path)

print("Connecting to device...")
remote = connect_to_jetson()

print("Uploading model to device...")
remote.upload(lib_path)
rlib = remote.load_module(lib_fname)

print("Uploading params to device...")
ctx = remote.cpu() if target == 'llvm' else remote.gpu()
rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}


########################
# Execute remotely TVM #
########################
print("Setting up model on remote device...")
module = graph_runtime.create(graph, rlib, ctx)

# set inputs
module.set_input('input_1', tvm.nd.array(X.astype('float32')))
module.set_input(**rparams)

# run
print("Running model on remote device...")
# module.run()
runtime = time_inference(module)
print("Runtime:", runtime, "ms")

# get output
print("Retrieving output from device...")
out_shape = (1000,)
out = module.get_output(0, tvm.nd.empty(out_shape, 'float32', ctx=ctx)).asnumpy()
top1_tvm = np.argmax(out)

return top1_tvm

Your exporting code only works for CPU. For GPU, you have to make two changes (a combined sketch follows this list):

  1. change

lib_fname = 'net.o'

to

lib_fname = 'net.tar'

  2. change

lib.save(lib_path)

to

lib.export_library(lib_path)
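
Putting the two changes together, a minimal sketch of the GPU export/upload path (reusing the variable names and the connect_to_jetson helper from the code above):

lib_fname = 'net.tar'
lib_path = tmp.relpath(lib_fname)
# export_library packs the cross-compiled host code and the CUDA device
# module into a single archive that the remote RPC server can load
lib.export_library(lib_path)

remote = connect_to_jetson()
remote.upload(lib_path)
rlib = remote.load_module(lib_fname)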

It is good that we have a generic solution for that. Maybe it makes sense to create a complete RPC deployment tip on how to configure common boards and targets, so that next time we get such a question we can directly give a link to the doc.

@merrymercy @eqy

Thanks for the help! I made the changes you mentioned, but now I’m getting a different error:

CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX

Looks like you haven’t configured the correct arch.

I ran gcc -v on my TX2 and the result was Target: aarch64-linux-gnu, which is what I use as my target_host when compiling.

It is due to a mismatch of CUDA versions.

I can reproduce your error if the host uses CUDA 9.1 and the TX2 uses CUDA 9.0.
But I can run the kernel successfully if both of them use CUDA 9.0.
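
A quick sanity check is to compare the toolkit versions before compiling; run the same command on the x86 host and on the TX2 and make sure the reported release matches:

$ nvcc --version    # run on both the host and the TX2; the "release" lines should match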

Thanks! This solves my problem!

To get better performance, I think you should also set the cross compilation target for the CUDA code correctly.
By default TVM will use this function to compile the CUDA code to PTX.
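
If you need to control the arch on the host side, here is a minimal sketch (assuming the tvm.contrib.nvcc helper and the tvm_callback_cuda_compile hook, and assuming the TX2 GPU is compute capability 6.2, i.e. sm_62):

import tvm
from tvm.contrib import nvcc

@tvm.register_func("tvm_callback_cuda_compile")
def tvm_callback_cuda_compile(code):
    # compile the generated CUDA C source to PTX for the device arch
    # instead of whatever the host would pick by default
    return nvcc.compile_cuda(code, target="ptx", arch="sm_62")

Register this before calling nnvm.compiler.build so the generated kernels are built for the board's GPU.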

I have created a PR for solving a similar problem using the tvmc compile command: https://github.com/apache/tvm/pull/11159