TVM for ARM+DSP+DLA

Hi, I'm wondering how to re-target TVM to an SoC that combines an ARM core (as the CPU), a DSP with a vector ISA, and a Deep Learning Accelerator (DLA). There should be plenty of scenarios that need this kind of SoC, for example object detection such as YOLOv3/v4: the ARM core acts as the "target host" that calls into the DSP and DLA, the DLA handles all CNN ops (such as Conv2d, ReLU, BN), and the DSP handles NMS/sort. Maybe the TVM framework already supports this kind of thing, I'm not sure. :frowning:

You might consider using BYOC to integrate your custom codegens into TVM: https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm
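
A minimal sketch of the BYOC partitioning flow, assuming a hypothetical external compiler named "my_dla" for the DLA (the name and the op registration are placeholders, not part of the blog post):

```python
# Hedged sketch: partition a Relay module for a hypothetical "my_dla" external
# compiler. Assumes the ops the DLA supports have already been registered with
# tvm.ir.register_op_attr under the "target.my_dla" attribute, as described in
# the BYOC blog post above.
import tvm
from tvm import relay

def partition_for_my_dla(mod):
    seq = tvm.transform.Sequential([
        relay.transform.AnnotateTarget("my_dla"),   # mark ops the DLA supports
        relay.transform.MergeCompilerRegions(),     # merge adjacent annotated regions
        relay.transform.PartitionGraph(),           # split regions into external functions
    ])
    return seq(mod)
```

A second external compiler (e.g. one for the DSP) can be registered and annotated the same way, so both accelerators end up with their own partitioned functions.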

So you mean that if I use BYOC, for the SoC above, ARM should be the "target host", the DSP should be one "ext" target, and the DLA should be another "ext" target, right? Does that also mean only the ARM core can "call" the DSP and DLA "kernels", and that data exchange/task sync between the DSP and DLA has to be routed through the ARM core?

No. ARM will be the “target”, and both DSP and DLA are the external targets.

Thanks for replying so fast :). TVM seems to automatically generate the host-side "calling" IRModule/PrimFunc in the tir::SplitHostDevice pass, so for this SoC, shouldn't ARM be the "target_host"? If not, is the target_host the server where we compile the NN model? Sorry for bothering you, I'm a little confused…

ARM is the ‘target_host’; the DSP and DLA are each a ‘device’.

The program running on the ‘target_host’ calls into the ‘device’ to accelerate inference.

In TVM cross-compilation, ‘target’ means where the inference runs. In your case that is the DSP & DLA, which act as the ‘device’. You then need a ‘target_host’ to run the host program.
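
To make the split concrete, a hedged sketch of a build call for such an SoC might look like the following. The toy network, the aarch64 triple, and the assumption that the DSP/DLA parts are handled as BYOC external functions are all illustrative, not taken from this thread:

```python
import numpy as np
import tvm
from tvm import relay

# Toy Relay module standing in for a real network.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
out = relay.nn.relu(relay.nn.conv2d(data, weight, kernel_size=(3, 3), channels=16))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], out))
params = {"weight": np.random.rand(16, 3, 3, 3).astype("float32")}

# Code for the ARM host; the exact triple depends on your SoC toolchain.
target = "llvm -mtriple=aarch64-linux-gnu"

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, target_host=target, params=params)
```

The partitioned "ext" functions are handed to the registered external codegens during this build, while everything else is compiled for the ARM target_host.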

Agree with that, thanks for your kind answer! Another question about graph partitioning: in graph_runtime_codegen, the "ext" funcs are compiled via "LowerExternalFunctions". After digging into the source code, it seems that for "ext" funcs targeting "relay.ext.<user_compiler>", none of the model's params (the inputs of each func) are fed into the "user_compiler". That would mean the "user_compiler" cannot do optimizations such as fusing dedicated ops for the specific DSP/DLA, because it knows nothing about the input params. That seems weird to me, but maybe I'm wrong?

@zhiics, @comaniac sorry for bothering you, please give me some tips, thanks!

I didn’t fully get your question. I guess you want to apply some Relay optimizations to the partitioned functions? If so, you’re right that in the current flow the partitioned external functions won’t be optimized by some Relay passes. However, you can apply them in your codegen by defining an optimize pass. See the following example, which applies the ConvertLayout and FoldConstant passes to the external functions.
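
For reference, a hedged sketch of such a hook, modeled on the ACL example but with a placeholder compiler name "my_dla", could look roughly like this; the function is registered under "relay.ext.<compiler>.optimize" and runs ConvertLayout and FoldConstant on the partitioned module:

```python
import tvm
from tvm import relay

@tvm.register_func("relay.ext.my_dla.optimize")
def optimize(mod):
    """Extra Relay passes applied to the module that holds the functions
    partitioned for the hypothetical "my_dla" codegen."""
    seq = tvm.transform.Sequential([
        # Convert to the layout the accelerator prefers; these layouts are illustrative.
        relay.transform.ConvertLayout({"nn.conv2d": ["NHWC", "OHWI"]}),
        # Fold the inserted layout transforms (and other constant expressions) away.
        relay.transform.FoldConstant(),
    ])
    with tvm.transform.PassContext(opt_level=3):
        return seq(mod)
```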

Got it! Actually what I'm asking is this: for an NN model with params (weights, BatchNorm params, etc.), we have graph partitioning for heterogeneous execution, but don't we also need a corresponding "params partition"? I think that for an ext compiler with a graph-level optimizer, besides the subgraph that is sent to the ext compiler for processing, the ext compiler should also receive the corresponding "sub-params" (e.g. as b64NDArray metadata)?

Please trace the ACL example I posted above. It processes the subgraph parameters by itself. The processed subgraph parameters will later be stored in the metadata module along with other parameters.
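
For what it's worth, the way such a flow makes the weights visible to the external compiler is to bind the params into the module as constants before partitioning. A hedged sketch extending the earlier partition_for_my_dla sketch (compiler name still a placeholder):

```python
import tvm
from tvm import relay
from tvm.relay.build_module import bind_params_by_name

def partition_for_my_dla(mod, params=None):
    # Bind the params into the main function as Relay constants first, so the
    # partitioned external functions carry their weights and the codegen can
    # process them (e.g. repack layouts) before they land in the metadata module.
    if params:
        mod["main"] = bind_params_by_name(mod["main"], params)
    seq = tvm.transform.Sequential([
        relay.transform.AnnotateTarget("my_dla"),
        relay.transform.MergeCompilerRegions(),
        relay.transform.PartitionGraph(),
    ])
    return seq(mod)
```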


I’ll trace it, thanks so much! By the way, the TVM conference videos on YouTube are very useful for me and should also be valuable for other new developers, I appreciate that!

Emmm… sorry to say, but after tracing the ACL and TensorRT codegens with AsText(), I still don't see where the "b64ndarrays" in the "metadata" are fed into, e.g., ACLCompiler. I printed the input func with AsText and there is no "metadata" in it, so I guess ACLCompiler receives no sub-params as input. I also don't see where PreProcessModule is called. :frowning: @comaniac

To be more specific, what I mean is that in LowerExternalFunctions in compile_engine.cc, the input src_func for the relay.ext.tensorrt external compiler doesn't include the "subgraph params". If that's true, then a hardware vendor who wants to process a subgraph together with its params (such as conv2d weights), e.g. to fuse conv2d with add/relu/bn, might need those params.

BINGO! Sorry for the stupid question: "relay.ext.arm_compute_lib.optimize" is called during the PartitionGraph pass. Again, thanks!