Can't run RPC GPU tutorial on my own device

I’m looking at using RPC to cross compile and run on a Jetson TX2.

I first looked at the RPC tutorial, and I can cross compile for llvm and run remotely on a CPU.

But when I followed the GPU tutorial, switching the remote connection to my device and the target to cuda, I keep getting the following error, and it’s not clear what it means.
Check failed: f != nullptr Cannot find function fuse_pad_kernel0 in the imported modules or global registry

Could anyone give me some context here?

It seems you didn’t export the library correctly. Can you post your code?

I encountered problems when using RPC on my TX2 board, too. It works fine for the llvm target but fails for cuda. Here are the errors:

  File "tvm/_ffi/_cython/./function.pxi", line 267, in tvm._ffi._cy3.core.FunctionBase.__call__
  File "tvm/_ffi/_cython/./function.pxi", line 216, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./function.pxi", line 208, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 132, in tvm._ffi._cy3.core.CALL
tvm._ffi.base.TVMError: Except caught from RPC call: TVMCall CFunc Error:
Traceback (most recent call last):
  File "/home/nvidia/tvm/python/tvm/_ffi/_ctypes/function.py", line 54, in cfun
    rv = local_pyfunc(*pyargs)
  File "/home/nvidia/tvm/python/tvm/rpc/server.py", line 47, in load_module
    m = _load_module(path)
  File "/home/nvidia/tvm/python/tvm/module.py", line 219, in load
    _cc.create_shared(path + ".so", files)
  File "/home/nvidia/tvm/python/tvm/contrib/cc.py", line 33, in create_shared
    _linux_shared(output, objects, options, cc)
  File "/home/nvidia/tvm/python/tvm/contrib/cc.py", line 58, in _linux_shared
    raise RuntimeError(msg)
RuntimeError: Compilation error:
/usr/bin/ld: /tmp/tmp9sycy8oy/lib.o: Relocations in generic ELF (EM: 62)
/usr/bin/ld: /tmp/tmp9sycy8oy/lib.o: Relocations in generic ELF (EM: 62)
/tmp/tmp9sycy8oy/lib.o: error adding symbols: File in wrong format
collect2: error: ld returned 1 exit status

Can you take a look at this?

To @Faldict: I met this error when I set target_host incorrectly.

I set target = 'cuda'. So what’s the correct target on the Jetson TX2 board?

We need some llvm host code to launch the cuda kernels, so we do cross compilation for the ARM CPU on the TX2 board.
You should use something like

with nnvm.compiler.build_config(opt_level=opt_level):
    graph, lib, params = nnvm.compiler.build(
        net, target='cuda', 
        target_host='llvm -target=aarch64-linux-gnu', ## ADD THIS LINE
        shape={"data": data_shape}, params=params)

I don’t know the exact target triple for the TX2; you can query it by executing "gcc -v" on your board and looking for the line that starts with Target:
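
For reference, gcc prints that line to stderr; on the TX2 the output should look like this (matching what is reported later in this thread):

$ gcc -v 2>&1 | grep Target
Target: aarch64-linux-gnu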


The problem was solved after adding target_host='llvm -target=aarch64-linux-gnu'. Thanks a lot!

Here is the code from my function which compiles and executes a Keras model on a Jetson.

model is the Keras model - in this case a pretrained keras_resnet50 model
X is the input data, which I preprocess using the ResNet50 preprocess function

target = 'cuda'
target_host = "llvm -target=aarch64-linux-gnu"
lib_fname = 'net.o'

################################
# Compile and upload the model #
################################
# convert the keras model(NHWC layout) to NNVM format(NCHW layout), then compile
sym, params = nnvm.frontend.from_keras(model)
shape_dict = {'input_1': X.shape}
with nnvm.compiler.build_config(opt_level=2):
    graph, lib, params = nnvm.compiler.build(sym, target, shape_dict, target_host=target_host,
                                             params=params)

# Save the library at local temporary directory.
tmp = util.tempdir()
lib_path = tmp.relpath(lib_fname)
lib.save(lib_path)

print("Connecting to device...")
remote = connect_to_jetson()

print("Uploading model to device...")
remote.upload(lib_path)
rlib = remote.load_module(lib_fname)

print("Uploading params to device...")
ctx = remote.cpu() if target == 'llvm' else remote.gpu()
rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}


########################
# Execute remotely TVM #
########################
print("Setting up model on remote device...")
module = graph_runtime.create(graph, rlib, ctx)

# set inputs
module.set_input('input_1', tvm.nd.array(X.astype('float32')))
module.set_input(**rparams)

# run
print("Running model on remote device...")
# module.run()
runtime = time_inference(module)
print("Runtime:", runtime, "ms")

# get output
print("Retrieving output from device...")
out_shape = (1000,)
out = module.get_output(0, tvm.nd.empty(out_shape, 'float32', ctx=ctx)).asnumpy()
top1_tvm = np.argmax(out)

return top1_tvm

Your exporting code only works for CPU. For GPU, you have to make two changes (a combined sketch follows this list):

  1. change

lib_fname = 'net.o'

to

lib_fname = 'net.tar'

  2. change

lib.save(lib_path)

to

lib.export_library(lib_path)
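
Putting the two changes together, a minimal sketch of the GPU export/upload path (reusing the variable names and the connect_to_jetson helper from the code above):

lib_fname = 'net.tar'
lib_path = tmp.relpath(lib_fname)
# export_library packs the cross-compiled host code and the CUDA device
# module into a single archive that the remote RPC server can load
lib.export_library(lib_path)

remote = connect_to_jetson()
remote.upload(lib_path)
rlib = remote.load_module(lib_fname)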

It is good that we have a generic solution for that. Maybe it makes sense to create a complete RPC deployment tip on how to configure common boards and targets, so that next time we get such a question we can directly give a link to the doc.

@merrymercy @eqy

Thanks for the help! I made the changes you mentioned, but now I’m getting a different error:

CUDAError: cuModuleLoadData(&(module_[device_id]), data_.c_str()) failed with error: CUDA_ERROR_INVALID_PTX

Looks like you haven’t configured the correct arch.

I ran gcc -v on my TX2 and the result was Target: aarch64-linux-gnu, which is what I use as my target_host when compiling.

It is due to a mismatch of CUDA versions.

I can reproduce your error if the host uses CUDA 9.1 and the TX2 uses CUDA 9.0.
But I can run the kernel successfully if both of them use CUDA 9.0.
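
A quick sanity check is to compare the toolkit versions before compiling; run the same command on the x86 host and on the TX2 and make sure the reported release matches:

$ nvcc --version    # run on both the host and the TX2; the "release" lines should match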

Thanks! This solves my problem!

To get better performance, I think you should also set the cross compilation target for the CUDA code correctly.
By default TVM will use this function to compile the CUDA code to PTX.
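
If you need to control the arch on the host side, here is a minimal sketch (assuming the tvm.contrib.nvcc helper and the tvm_callback_cuda_compile hook, and assuming the TX2 GPU is compute capability 6.2, i.e. sm_62):

import tvm
from tvm.contrib import nvcc

@tvm.register_func("tvm_callback_cuda_compile")
def tvm_callback_cuda_compile(code):
    # compile the generated CUDA C source to PTX for the device arch
    # instead of whatever the host would pick by default
    return nvcc.compile_cuda(code, target="ptx", arch="sm_62")

Register this before calling nnvm.compiler.build so the generated kernels are built for the board's GPU.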

I have created a PR for solving a similar problem using the tvmc compile command: https://github.com/apache/tvm/pull/11159