[BYOC] Run CPU-only inference on partitioned graph?

I’m looking for a way to run a partitioned graph (subgraphs, composite functions, etc.) on CPU only, to assist with integrating a backend into the BYOC framework. The main point would be for us to be able to check at each intermediate step that our I/O for each operator and composite operator matches the ‘ground-truth’ implementation.
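To make the kind of check I mean concrete, here is a minimal sketch of the comparison step only. It assumes the intermediate outputs from both the reference run and the backend run have already been dumped as flat lists of floats keyed by operator name. The function name, the dictionary layout and the sample values are all hypothetical and are not part of any TVM API.

```python
import math


def compare_intermediates(reference, candidate, rel_tol=1e-5, abs_tol=1e-8):
    """Return (name, reason) for every intermediate output that disagrees.

    Both arguments are assumed to map an operator name to a flat list of
    floats (a hypothetical dump format chosen for this sketch).
    """
    mismatches = []
    for name, ref_values in reference.items():
        if name not in candidate:
            mismatches.append((name, "missing from candidate"))
            continue
        cand_values = candidate[name]
        if len(ref_values) != len(cand_values):
            mismatches.append((name, "length differs"))
            continue
        for index, (ref, cand) in enumerate(zip(ref_values, cand_values)):
            # Report only the first differing element for each operator.
            if not math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol):
                mismatches.append(
                    (name, f"element {index}: expected {ref}, got {cand}")
                )
                break
    return mismatches


# Made-up values: the backend disagrees on the last element of relu_0.
reference = {"conv2d_0": [0.5, 1.25, -2.0], "relu_0": [0.5, 1.25, 0.0]}
candidate = {"conv2d_0": [0.5, 1.25, -2.0], "relu_0": [0.5, 1.25, 0.1]}

report = compare_intermediates(reference, candidate)
for name, reason in report:
    print(f"{name}: {reason}")
```

The open question in this thread is how to obtain the reference dictionary in the first place, i.e. how to get CPU results for each partitioned function.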

Currently, if I partition the graph and then run it with the target set to LLVM, it fails unless there is a correct backend implementation for whatever should be offloaded in the partitioned subgraphs.

This is a fair requirement. I guess you need to remove the kCompiler attribute from the partitioned functions so that they are treated as normal Relay functions, but this may result in two problems:

  1. Whether the Relay fusion pass will fuse ops inside this function body (I guess the answer is yes, but it needs verification).
  2. Whether the TE compiler can handle nested functions; otherwise a partitioned function will be treated as a fused function, and the TE compiler may not be able to schedule it due to multiple anchor ops.

cc @jroesch @mbs-octoml @electriclilies

Good points. Fusion inside the subgraph function bodies would probably be undesirable; if it happened only inside composite functions, I think that would be fine.

I thought I had read some time ago about a BYOC + PTQ (post-training quantization) infrastructure that ran inference on CPU for partitioned graphs as an initial step. I can’t seem to find it now, but I’m curious whether that line of work is still active.