I am working on an external codegen that will run on GPU. My external codegen module is a CSourceModule. The code generated in this module will call some CUDA APIs.
If I go through the external codegen workflow and set the target to `cuda -libs=cublas,cudnn`, will my C wrapper get the correct data? In my first test, the program crashes when I try to read the data.
I want to run the whole model on GPU and use my custom GPU code for some subgraphs.
No, that's a different flow. TVM already has built-in cuBLAS and cuDNN support (example). If you set the target with `-libs`, TVM uses its built-in implementation instead of your codegen. To use your codegen, you currently have to annotate the graph with the op-based approach (example) or a customized annotation pass (example).
Note that we have merged a PR to support op-merging for op-based annotation. Check the test case in this PR for details.
Sorry about that, I think I misspoke. I already have the annotation pass set up properly and my codegen is being called. However, when I try to print out one of my inputs from my codegen, the program crashes.
I have a feeling that since the target is `cuda`, the data isn't being moved from the GPU back to the CPU. Is there a way to verify this flow? Do you have an example of external codegen on GPU?
When the target is `llvm`, it works properly.
Btw, this is the same transformer pattern I was using before, just with a different backend implementation.
Ah, I see. One reason might be an empty host module in this case. I'd call out @trevor-m since he has experience offloading subgraphs to TRT while keeping the rest on CUDA.
@jonso if you can get into GetFunction in the external module, it means there is no problem with runtime symbol lookup. Can you check whether the input data is correct? For example, the data you have in the external runtime should come from here:
Hi @jonso, when I do `relay.build` with `target="cuda"`, the data inputs supplied to my runtime module are already placed on the GPU by the graph runtime. The `DLTensor->data` will be a device pointer to the data in GPU memory, and you can pass it directly to CUDA libraries.
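For example, here is a minimal sketch of what a generated wrapper might look like under that assumption; the function name, shapes, and the specific cuBLAS call are just placeholders, not what any particular codegen emits:

```cpp
// Hypothetical generated wrapper: with target "cuda", each DLTensor's data
// field already points to GPU memory, so it can be handed to cuBLAS directly.
#include <cublas_v2.h>
#include <dlpack/dlpack.h>

extern "C" int my_subgraph_gemm(DLTensor* a, DLTensor* b, DLTensor* out) {
  // Illustrative shapes: a is MxK, b is KxN, out is MxN (row-major).
  const int M = static_cast<int>(a->shape[0]);
  const int K = static_cast<int>(a->shape[1]);
  const int N = static_cast<int>(b->shape[1]);

  const float* d_a = static_cast<const float*>(a->data);  // device pointer
  const float* d_b = static_cast<const float*>(b->data);  // device pointer
  float* d_out = static_cast<float*>(out->data);          // device pointer

  cublasHandle_t handle;
  cublasCreate(&handle);
  const float alpha = 1.0f, beta = 0.0f;
  // cuBLAS expects column-major storage; swapping the operand order computes
  // the row-major product out = a * b.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, M, K,
              &alpha, d_b, N, d_a, K, &beta, d_out, N);
  cublasDestroy(handle);
  return 0;
}
```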
If you need to get the data back onto the CPU, you could use cudaMemcpy to move the data as part of your generated code. But it sounds like you want everything to be on the GPU.
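For instance, a small debug helper along these lines (assuming float32, contiguous data; the helper name is made up) could copy a few values back to the host for printing, which avoids the crash you get from dereferencing a device pointer on the host:

```cpp
// Hypothetical debug helper: copy the first n floats of a GPU-resident
// DLTensor back to the host and print them. Dereferencing t->data directly
// on the host would crash, because it is a device pointer.
#include <cuda_runtime.h>
#include <dlpack/dlpack.h>
#include <cstdio>
#include <vector>

static void dump_first_values(const DLTensor* t, int n) {
  std::vector<float> host(n);
  cudaMemcpy(host.data(), t->data, n * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; ++i) std::printf("%f ", host[i]);
  std::printf("\n");
}
```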
Awesome, thanks a lot @trevor-m. One more quick question before I try it out - what data type is `DLTensor->data` in this case? By default, the codegen_c base casts it to the type of the corresponding function argument (in my case, input is a `float*` and input_mask is an `int*`).
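For reference, `DLTensor::data` is declared as a `void*` in dlpack, so the generated glue ends up doing something like the rough sketch below (the wrapper and argument names are made up):

```cpp
#include <dlpack/dlpack.h>

// Rough sketch of the glue a codegen_c-style backend emits: each DLTensor's
// void* data field is cast to the typed pointer the generated body expects.
// With target "cuda" these are device pointers, so they should only be touched
// by kernels or CUDA library calls, never dereferenced on the host.
extern "C" int my_subgraph_wrapper(DLTensor* arg0, DLTensor* arg1, DLTensor* out0) {
  float* input = static_cast<float*>(arg0->data);
  int* input_mask = static_cast<int*>(arg1->data);
  float* output = static_cast<float*>(out0->data);
  // ... generated body: launch kernels / call CUDA libraries with these pointers ...
  (void)input; (void)input_mask; (void)output;
  return 0;
}
```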