Hi, suppose I have an SoC that combines an ARM core (as the CPU), a DSP with a vector ISA, and a Deep Learning Accelerator (DLA). How do I retarget TVM to it? There should be many scenarios that need this kind of SoC. For example, for object detection with YOLOv3/v4, the ARM core would act as the “target host” that calls the DSP and DLA, the DLA would handle all CNN ops (such as Conv2d, ReLU, BN), and the DSP would handle NMS/sort. Maybe the TVM framework already supports this, but I’m not sure.

You might consider using BYOC to integrate the custom codegen to TVM:

So what you mean is that with BYOC on the SoC above, ARM should be the “target host”, the DSP should be one “ext” target, and the DLA should be another “ext” target, right? However, does that mean the DSP and DLA “kernels” can only be called from ARM, and that data exchange and task synchronization between the DSP and DLA must be routed through ARM?

No. ARM will be the “target”, and both DSP and DLA are the external targets.
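To make that split concrete, here is a toy sketch in plain Python. It is not TVM's BYOC API: the op names, the `OP_TO_COMPILER` table, and the `partition` helper are all assumptions made up for illustration. It only shows the idea of tagging each op with the compiler that should handle it and grouping neighbours into regions, with anything unclaimed falling back to ARM.

```python
# Toy illustration only: plain Python, not TVM's BYOC API.
# The op names and the op-to-compiler table are assumptions for this sketch.
OP_TO_COMPILER = {
    "conv2d": "dla",
    "relu": "dla",
    "batch_norm": "dla",
    "nms": "dsp",
    "sort": "dsp",
}


def partition(ops, default="arm"):
    """Group consecutive ops that map to the same compiler into regions."""
    regions = []
    for op in ops:
        compiler = OP_TO_COMPILER.get(op, default)
        if regions and regions[-1][0] == compiler:
            # Same compiler as the previous op: extend the current region.
            regions[-1][1].append(op)
        else:
            # Different compiler: start a new region.
            regions.append((compiler, [op]))
    return regions


# A hypothetical, heavily simplified detection pipeline.
yolo_like = ["conv2d", "batch_norm", "relu", "reshape", "nms", "sort"]
regions = partition(yolo_like)
for compiler, region_ops in regions:
    print(compiler, region_ops)
```

In this sketch the CNN ops end up in one DLA region, `reshape` stays on ARM, and NMS/sort form a DSP region. A real flow works on a Relay graph rather than a flat list.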

Thanks for replying so fast :). TVM seems to automatically generate the host-side “calling” IRModule/PrimFunc in the tir::SplitHostDevice pass, so for this SoC, ARM should be the “target_host”? If not, should target_host be the server where we compile the NN model? Sorry for bothering you, but I’m a little confused…

ARM is the ‘target_host’; the DSP and DLA are ‘devices’.

The program running on the ‘target_host’ calls instructions on the ‘device’ to accelerate inference.

In TVM cross-compilation, ‘target’ means where inference runs. In your case that is the ‘DSP & DLA’, which is a ‘device’. You then need a ‘target_host’ to run the program.

Agreed, thanks for your kind answer! Another question about graph partitioning: in graph_runtime_codegen, the “ext” funcs are compiled via “LowerExternalFunctions”. After digging through the source code, it seems that for the “relay.ext.<user_compiler>” target, the model’s params (the inputs for each func) are NOT fed into “user_compiler”. That would mean “user_compiler” can NOT do optimizations such as fusing dedicated ops for a specific DSP/DLA, because it doesn’t know anything about the input params. That seems weird to me. Maybe I’m wrong?

@zhiics, @comaniac sorry for bothering you, please give me some tips, thanks!

I didn’t fully get your question. I guess you want to use some Relay optimizations in the partitioned functions? If so, you’re right about the flow that the partitioned external function won’t be optimized by some Relay passes. However, you can apply them in your codegen by defining an optimize pass. See the following example that applies ConvertLayout and FoldConstant passes to the external functions.

Got it! Actually, what I’m asking is this: for an NN model with params (weights, BatchNorm params, etc.), we have graph partitioning for heterogeneous execution, but do we also need a corresponding “params partition”? For an ext.compiler with an optimizer (graph-level optimization), besides the subgraph sent to the ext.compiler for processing, I think the ext.compiler would also need the “sub-params” (b64NDArray) as metadata.

Please trace the ACL example I posted above. It processes the subgraph parameters by itself. The processed subgraph parameters will later be stored in the metadata module along with the other parameters.
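As a rough mental model of a “params partition”, the sketch below collects, for each subgraph, only the parameters that its ops reference. It is plain Python and not how the ACL codegen is implemented: the subgraph names, the `(op, inputs)` tuples, the parameter names, and `split_params` are all hypothetical.

```python
# Toy illustration only: plain Python, not TVM data structures.
# Subgraph names, op tuples, and parameter names are all hypothetical.
def split_params(subgraphs, params):
    """Return {subgraph_name: {param_name: value}} with the params each subgraph uses."""
    per_subgraph = {}
    for name, ops in subgraphs.items():
        used = {}
        for _op, inputs in ops:
            for inp in inputs:
                # An input that is not a known param is a runtime tensor; skip it.
                if inp in params:
                    used[inp] = params[inp]
        per_subgraph[name] = used
    return per_subgraph


params = {"conv_w": [0.1, 0.2], "bn_gamma": [1.0], "bn_beta": [0.0]}
subgraphs = {
    "dla_0": [
        ("conv2d", ["data", "conv_w"]),
        ("batch_norm", ["conv_out", "bn_gamma", "bn_beta"]),
    ],
    "dsp_0": [("nms", ["boxes", "scores"])],
}
sub_params = split_params(subgraphs, params)
print(sub_params)
```

Here the DLA subgraph ends up owning the convolution and BatchNorm constants, while the DSP subgraph has no constants at all.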

I’ll trace it, thanks so much! BTW, the TVM conf videos on YouTube are very useful for me and should also be valuable for other new developers. I appreciate that!

Emmm… sorry to say this, but after tracing the ACL and TensorRT codegens with AsText(), I still don’t see where the “b64ndarrays” in the “metadata” are passed into, for example, ACLCompiler. I printed the input func with AsText and there is no “metadata” in it, so I guess ACLCompiler gets no sub-params as input. I also don’t see where PreProcessModule is called. @comaniac

In detail, what I mean is that in LowerExternalFunctions, the input src_func for the relay.ext.tensorrt external compiler doesn’t include the “subgraph params”. If that’s true, then a hardware vendor who wants to process a subgraph together with its params (such as the weights for conv2d), e.g., to fuse conv2d with add/relu/bn, might need those params.

BINGO! Sorry for the stupid question: “relay.ext.arm_compute_lib.optimize” is called during the PartitionGraph pass. Again, thanks!