Question: BYOC: replace nn.conv2d() with our nucfpga_conv2d()

I am using the latest version of TVM from GitHub. I want to use my own custom operator in place of nn.conv2d.

Following the article "How to Bring Your Own Codegen to TVM", I have changed these places:


GenerateBodyOutput GenerateOpCall(const CallNode* call) {
const auto* op_node = call->op.as<OpNode>();
CHECK(op_node) << "Expect OpNode, but got " << call->op->GetTypeKey();
using ArgFunType = std::function<std::vector<std::string>(const CallNode*)>;
static const std::map<std::string, std::pair<std::string, ArgFunType>> op_map = {
    {"nn.conv2d", {"nuc_fpga_conv2d", Conv2d}},
    {"nn.dense", {"nuc_fpga_dense", Dense}},


extern "C" void nuc_fpga_conv2d(
    int8_t* feature_map, int8_t* kernel, int* out, 
    int batch, int in_ch, int in_height, int in_width,
    int kernel_number, int group, 
    int pad_height, int pad_width, 
    int kernel_height, int kernel_width,
    int stride_height, int stride_width) {
        printf("Calling From nuc_fpga_conv2d\n");


extern "C" TVM_DLL void nuc_fpga_conv2d(
int8_t* feature_map, int8_t* kernel, int* out, 
int batch, int in_ch, int in_height, int in_width,
int kernel_number, int group, 
int pad_height, int pad_width, 
int kernel_height, int kernel_width,
int stride_height, int stride_width);




    file(GLOB NUCFPGA_CODEGEN_SRC src/relay/backend/contrib/nuc_fpga/*.cc)
    file(GLOB NUCFPGA_CONTRIB_SRCS src/runtime/contrib/nuc_fpga/
    message(STATUS "Build with contrib.nuc_fpga")


Then I ran cmake … and make -j4.

When I run the program (ResNet18):

extern_mod = relay.transform.AnnotateTarget(['nuc_fpga'])(mod)
extern_mod = relay.transform.MergeCompilerRegions()(extern_mod)
extern_mod = relay.transform.PartitionGraph()(extern_mod)
print("extern_mod:", extern_mod)
target = tvm.target.Target('llvm')
with tvm.transform.PassContext(opt_level=3):
    grf_mod = relay.build(extern_mod, target=target, params=params)

The graph still shows nn.conv2d, not nuc_fpga_conv2d.

Part of the extern_mod log:

%0 = nn.conv2d(%input0, %conv1.0.weight, strides=[2, 2], padding=[3, 3, 3, 3], channels=64, kernel_size=[7, 7]) /* ty=Tensor[(1, 64, 112, 112), float32] */;

%1 = nn.bias_add(%0, %conv1.0.bias) /* ty=Tensor[(1, 64, 112, 112), float32] */;

%2 = nn.relu(%1) /* ty=Tensor[(1, 64, 112, 112), float32] */;

%3 = nn.max_pool2d(%2, pool_size=[3, 3], strides=[2, 2], padding=[1, 1, 1, 1]) /* ty=Tensor[(1, 64, 56, 56), float32] */;

%4 = nn.conv2d(%3, %layer1.0.conv1.0.weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* ty=Tensor[(1, 64, 56, 56), float32] */;

%5 = nn.bias_add(%4, %layer1.0.conv1.0.bias) /* ty=Tensor[(1, 64, 56, 56), float32] */;

%6 = nn.relu(%5) /* ty=Tensor[(1, 64, 56, 56), float32] */;

It should be nuc_fpga_conv2d instead of nn.conv2d, but it is not.

Did I do something wrong?


The replacement happens in the codegen, which is launched during the build process, so it hasn’t happened yet at the line where you printed extern_mod.

In addition, you should not see nuc_fpga_conv2d in the Relay graph anyway, because nuc_fpga_conv2d is not a Relay op. The implementation of nuc_fpga_conv2d in your codegen does not register an op with Relay. Instead, it just tells Relay that when nn.conv2d is offloaded to nuc_fpga, it should use the function you implemented (i.e., nuc_fpga_conv2d) to perform it.
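To make that distinction concrete, here is a toy Python model of the lookup that the C++ op_map in the question performs. It is only an illustration, not TVM code: the emit_call helper and the argument names are hypothetical, and only the two op-to-symbol pairs come from the snippet in the question.

```python
# Toy model of the codegen's op_map: a Relay op name is looked up and
# replaced by the name of an external C function in the generated code.
# The Relay graph itself is never rewritten.
OP_MAP = {
    "nn.conv2d": "nuc_fpga_conv2d",
    "nn.dense": "nuc_fpga_dense",
}


def emit_call(op_name, args):
    """Return the C call emitted for an offloaded op (hypothetical helper)."""
    if op_name not in OP_MAP:
        raise KeyError(f"unsupported op for this codegen: {op_name}")
    return f"{OP_MAP[op_name]}({', '.join(args)});"


# The graph keeps saying nn.conv2d; only the emitted code names the C symbol.
print(emit_call("nn.conv2d", ["feature_map", "kernel", "out"]))
```

The point of the sketch is that the name nuc_fpga_conv2d only ever appears in the generated source, so printing the Relay module will never show it.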

Thanks for your reply.

But when I run the model:

rlib = tvm.runtime.module.load_module(dso_path)
ctx = tvm.cpu()
rt_mod = graph_executor.GraphModule(rlib['default'](ctx))

The net does not use nuc_fpga_conv2d: I don’t see the “Calling From nuc_fpga_conv2d” output of nuc_fpga_conv2d(). It still uses nn.conv2d.

From your example it’s hard to judge whether nuc_fpga_conv2d is invoked correctly. You may first check the partitioned graph to see if nn.conv2d is partitioned to a function with kCompiler=“nuc_fpga”.
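One quick way to run that check is to search the printed module text for the compiler attribute. The sketch below is a plain-text helper, not a TVM API: offloaded_compilers is a hypothetical name, the two sample strings are made up to stand in for str(extern_mod), and it assumes partitioned functions print an attribute of the form Compiler="nuc_fpga".

```python
import re


def offloaded_compilers(ir_text):
    """Return the set of compiler names found in printed Relay text.

    Assumption: partitioned functions print an attribute such as
    Compiler="nuc_fpga" somewhere in their signature.
    """
    return set(re.findall(r'Compiler="([^"]+)"', ir_text))


# Hypothetical snippets standing in for str(extern_mod).
not_partitioned = "%0 = nn.conv2d(%input0, %conv1.0.weight, strides=[2, 2])"
partitioned = 'fn (%x, Compiler="nuc_fpga", Primitive=1) { nn.conv2d(%x, %w) }'

print(offloaded_compilers(not_partitioned))  # empty: nothing was offloaded
print(offloaded_compilers(partitioned))      # contains "nuc_fpga"
```

With a real module you would pass str(extern_mod). An empty result suggests that the annotation pass did not mark any op for nuc_fpga.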

I have done these:

extern_mod = relay.transform.AnnotateTarget(['nuc_fpga'])(mod)
extern_mod = relay.transform.MergeCompilerRegions()(extern_mod)
extern_mod = relay.transform.PartitionGraph()(extern_mod)
print("extern_mod:", extern_mod)

output is

%0 = nn.conv2d(%input0, %conv1.0.weight, strides=[2, 2], padding=[3, 3, 3, 3], channels=64, kernel_size=[7, 7]) /* ty=Tensor[(1, 64, 112, 112), float32] */;
%1 = nn.bias_add(%0, %conv1.0.bias) /* ty=Tensor[(1, 64, 112, 112), float32] */;
%2 = nn.relu(%1) /* ty=Tensor[(1, 64, 112, 112), float32] */;

The partitioned graph still shows nn.conv2d.

This doesn’t look right, though. It should look like:

%1 = fn(..., kPrimitive=1, kCompiler="nuc_fpga")  {
%2 = %1(...);
%3 = nn.bias_add(...);
%4 = nn.relu(...);

Please check your annotation rules.

Thanks. I checked my annotation rules again and found the problem: I forgot to add

 from .nucfpga import *

in python/tvm/relay/op/contrib/

The problem has been solved. Thank you very much.
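For readers who hit the same symptom, here is a toy Python model of why a missing import silently disables the annotation rules. It is not TVM's registry: REGISTRY, register_op_attr, load_nucfpga_rules and is_offloadable are all hypothetical stand-ins, and the "target.nuc_fpga" key only mirrors the "target.<name>" naming convention. The idea it shows is that decorator-based rules are registered only when the module that defines them is actually executed.

```python
# Toy registry: rules exist only after the code that defines them has run.
REGISTRY = {}


def register_op_attr(op_name, attr_key):
    """Decorator that records a rule under (op_name, attr_key)."""
    def decorator(func):
        REGISTRY[(op_name, attr_key)] = func
        return func
    return decorator


def load_nucfpga_rules():
    """Stands in for executing the annotation module on import."""
    @register_op_attr("nn.conv2d", "target.nuc_fpga")
    def conv2d_supported(expr):
        return True


def is_offloadable(op_name, target):
    """True only if a rule was registered and it accepts the op."""
    rule = REGISTRY.get((op_name, f"target.{target}"))
    return rule is not None and rule(None)


print(is_offloadable("nn.conv2d", "nuc_fpga"))  # rules not loaded yet
load_nucfpga_rules()                            # plays the role of the import
print(is_offloadable("nn.conv2d", "nuc_fpga"))  # rule is now visible
```

In this model nothing raises an error when the rules are missing. The op is simply treated as unsupported, which matches the unpartitioned graph seen earlier in the thread.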

Hey, I have also done the same. But how can I generate C code from here and build an executable to run on my hardware? Any hint you can provide would be appreciated.


A good question and very good answer!