Dividing a graph and running it on multiple PEs

Hello,

I am trying to use TVM to divide a graph into smaller subgraphs, then schedule and run them on multiple Arm Cortex-M4 CPUs. I found a good reference in test_pipeline_executor.py, and I can now export the lib and param files for each subgraph.

At this point there are multiple options for the runtime, and I am not sure exactly what to do next. I would like to use uTVM for the implementation. Could you share any ideas on what the workflow would look like, and point me in the right direction to pipeline this execution correctly?

Thank you.