Can TVM split a network by layers and assign the layers to different cores?

Hello community,

I am running inference on a HiKey 970 board with 4 big and 4 little CPU cores (CPU-only inference, no GPU). My question is: can TVM split a network by layers and assign the layers to different cores for inference?

For instance, for a 4-layer neural network: assign the first 2 layers to 3 big cores, the third layer to another big core, and the last layer to the 4 little cores. I am wondering whether TVM supports such layer-level splitting, as shown below?

[image: desired layer-to-core assignment]

I think I might need to modify tvm/src/runtime/threading_backend.cc, and I have gone through posts like "Number of threads used during auto-tuning", but can anyone provide more details?

Any advice or related posts are welcome! Thanks to the community in advance :slight_smile:

There is ongoing work to bring pipeline parallelism to TVM, which seems closest to what you described. See https://github.com/apache/tvm-rfcs/pull/14

cc @hjiang @comaniac

@masahi Awesome! I will definitely take a look at that. Thanks for your reply.

Hello @masahi @hjiang.

I have tried building TVM from hjiang's repo. I found and followed [WIP][Runtime] Pipeline Executor For Compute graph pipeline #7892 to understand how the pipeline works, and I was able to pass test_pipeline_executor.py on the current Apache TVM version (file: tvm/tests/python/relay/test_pipeline_executor.py).

However, since it is still a work in progress, may I kindly ask whether there is any example code to reproduce the pipeline example shown below?

[image: pipeline executor example from the PR]

Thanks for reading, and thanks in advance. :slight_smile:

@popojames we are also working on modeling multi-core systems in the core compiler and pushing this into the AOT executor (see the "TargetDevice" RFC forthcoming soon) by leveraging the Device API. In parallel, we are implementing a C Device API to ensure this approach works on both bare-metal and Linux systems.

@popojames, #7892 is a legacy PR; we are now upstreaming our work as a series of small PRs such as #8702 and #9108. We recommend waiting until the upstreaming is done, because the pipeline executor API may change and we may not be able to maintain #7892 with bug fixes. That said, if you would like to try it, #7892 is a working patch: you can follow the logic in test_pipeline_executor.py to try network splitting and pipelining. A hand-rolled sketch of the splitting idea follows.
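To make the idea concrete before the patches land, here is a hand-rolled sketch of a "network split" in plain Relay. This is the kind of transformation the PR's pipeline_graph helper automates; the toy network, names, and split point below are illustrative only, not the PR's API.

```python
import tvm
from tvm import relay

# Original toy network: dense -> relu -> dense.
x = relay.var("x", shape=(1, 16), dtype="float32")
w0 = relay.var("w0", shape=(8, 16), dtype="float32")
w1 = relay.var("w1", shape=(4, 8), dtype="float32")
act = relay.nn.relu(relay.nn.dense(x, w0))

# Sub-graph 0: everything up to the split point (the relu output).
sub0 = tvm.IRModule.from_expr(relay.Function([x, w0], act))

# Sub-graph 1: consumes sub-graph 0's output as its own input.
mid = relay.var("mid", shape=(1, 8), dtype="float32")
sub1 = tvm.IRModule.from_expr(
    relay.Function([mid, w1], relay.nn.dense(mid, w1)))
```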

Hello @hjiang, thanks for open-sourcing your file “test_pipeline_executor.py”; it is really helpful for me.

Now I am able to split the network into subgraphs. In my case, I split the entire BERT into 3 subgraphs and am trying to assign the 3 subgraphs to different CPU clusters (for instance: subgraph 0 on 2 big cores, subgraph 1 on 2 big cores, subgraph 2 on 4 little cores).

However, I still have some questions.

  1. In your compute-effect example, you assign subgraph 1 to specific CPUs (e.g. A53 CPU 0-2). Can you explain how you set the thread affinity and use only 2 of the 4 cores?

[image: the compute-effect example]

  2. When I call target_list = tvm.testing.enabled_targets(), it shows the following error:

Check failed: (err_code == CL_SUCCESS) is false: OpenCL Error, code=-6: CL_OUT_OF_HOST_MEMORY

Is it possible to get rid of this call and instead set my target list as [2 big cores, 2 big cores, 4 little cores], which corresponds to [CPU 4-5, CPU 6-7, CPU 0-3] on the HiKey 970 board? (A sketch of what I mean is below.)
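For reference, the kind of CPU-only target list I have in mind would look roughly like this (illustrative only; I understand the per-cluster core assignment would still come from thread affinity rather than from the target itself):

```python
import tvm

# CPU-only target list, instead of probing every backend with
# tvm.testing.enabled_targets() (which trips over OpenCL on this board).
# The target string is illustrative for the HiKey 970's Arm CPU.
target_list = [("llvm -mtriple=aarch64-linux-gnu", tvm.cpu(0))]
```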

Thanks again for your inspiration and explanation! It helps a lot!

cc @masahi @comaniac

@popojames, we have a separate patch to support control-flow and data-flow CPU affinity settings; #7982 does not include that logic. Please stay tuned; we will submit the related patch soon.

Hello @hjiang

Thanks very much for the update; I will keep following your progress. May I ask some questions regarding the pipeline executor?

  1. Can I check whether the aforementioned example (subgraph 0 on 2 big cores, subgraph 1 on 2 big cores, subgraph 2 on 4 little cores) can be implemented using the upcoming patch?

  2. Do you have a roadmap for when it will be available?

=================================================================

[Update, Oct 19, 16:45 EST: after rebuilding https://github.com/huajsj/tvm instead of main TVM, I was able to run it!]

  3. I tried to apply your recent commit. May I ask whether you are able to run test_pipeline_executor.py without any errors? When I run this file, it shows TVMError: do not support key mod_indx.

More info about what I did:

A. I updated tvm/python/tvm/contrib/pipeline_executor.py, which contains the build_pipeline and create functions; its PipelineModule class has a run function for benchmarking.

B. I modified the corresponding .cc and .h files.

Should I rebuild TVM from scratch to run the benchmark, or is there any way to apply these changes without rebuilding?

=================================================================

Thanks in advance for your help.

@popojames

  1. Can I check whether the aforementioned example (subgraph 0 on 2 big cores, subgraph 1 on 2 big cores, subgraph 2 on 4 little cores) can be implemented using the upcoming patch?
Yes, TVM will support this example after all of our pipeline executor patches are merged upstream. We have a use case similar to your example, and it is already supported by our internal build.
  2. Do you have a roadmap for when it will be available?
The current plan is to submit all pipeline executor patches, including the affinity patch, by mid-November.
  3. I tried to apply your recent commit. May I ask whether you are able to run test_pipeline_executor.py without any errors? When I run this file, it shows TVMError: do not support key mod_indx.
As your update mentioned, rebuilding fixes this issue.

Hi, I am also very interested in your work and have a couple of questions:

Do you support heterogeneous execution using CPUs and GPUs in combination? Are the necessary layout transformations added automatically?

How will the mapping of layers and subgraphs to targets take place? Will it be done manually by the user?

For now I am focusing only on the CPU, and I set such CPU affinity manually.

Hi, I see from your example picture that each subgraph may use heterogeneous resources (FPGA + A53 CPU 3), and I have two questions:

  1. Does that mean the CPU, GPU, or FPGA runs the same subgraph with different input tensors (input pictures)? Is this called horizontal partitioning?

  2. The CPU and GPU (FPGA) may have very different speedup ratios; how do you deal with this imbalance? Is this kind of partitioning commonly used in SoC/edge-cloud collaboration scenarios?

@Xuyuanjia2014

  1. Does that mean the CPU, GPU, or FPGA runs the same subgraph with different input tensors (input pictures)? Is this called horizontal partitioning?

Different heterogeneous cores run different subgraphs. The first subgraph uses the input tensor as its data; each following subgraph uses the output tensor of the previous subgraph as its input data, as sketched below.

Graph splitting reduces the network depth, so calling it a horizontal partition makes sense.
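Here is a minimal, synchronous sketch of that dataflow with two graph-executor modules (illustrative only; the pipeline executor runs the sub-graphs on different resources and overlaps them, which this toy example does not):

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Two toy "sub-graphs": y = x + 1, then z = y * 2.
x = relay.var("x", shape=(1, 4), dtype="float32")
sub0 = tvm.IRModule.from_expr(relay.Function([x], x + relay.const(1.0)))
y = relay.var("y", shape=(1, 4), dtype="float32")
sub1 = tvm.IRModule.from_expr(relay.Function([y], y * relay.const(2.0)))

dev = tvm.cpu(0)
mod0 = graph_executor.GraphModule(relay.build(sub0, target="llvm")["default"](dev))
mod1 = graph_executor.GraphModule(relay.build(sub1, target="llvm")["default"](dev))

# The first sub-graph takes the network input; the next sub-graph
# takes the previous sub-graph's output as its input data.
mod0.set_input("x", np.ones((1, 4), dtype="float32"))
mod0.run()
mod1.set_input("y", mod0.get_output(0))
mod1.run()
print(mod1.get_output(0))  # [[4. 4. 4. 4.]]
```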

  2. The CPU and GPU (FPGA) may have very different speedup ratios; how do you deal with this imbalance and automatically generate the best subgraph splitting? Is this kind of partitioning commonly used in SoC/edge-cloud collaboration scenarios?

We have an automatic splitting module (in development) that will generate a list of load-balanced subgraphs.

As for the use case, this solution can be used on SoC edge devices as well as in edge-cloud collaboration scenarios.

Thanks. That helps a lot.

Hello @hjiang

I have made some modifications based on your code and added a CPU affinity setting to my function, but it does not seem to work. I am looking for your advice.

According to the following previous posts,

  1. Use all cores in a big.LITTLE architecture - #5 by FrozenGene
  2. Setting the CPU affinity and number of cores locally without RPC Remote - #4 by popojames

I am now using config_threadpool for each subgraph to apply the CPU settings. In other words:

  1. I split the network into N subgraphs (using your pipeline_graph function).
  2. I added a config_threadpool for each subgraph.
  3. I assigned CPU affinity to each subgraph via config_threadpool_<subgraph index>(affinity_mode, num_threads).

For example, if I split into two sub-graphs and want the first graph → 4 little cores and the second graph → 4 big cores, I call config_threadpool_0(-1, 4) and config_threadpool_1(1, 4), as sketched below. But those settings do not behave as I expect.
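Concretely, this is roughly what I tried; I show it here with the stock global function from the linked posts rather than my per-sub-graph wrappers (per those posts, affinity_mode 1 = big cores and -1 = little cores):

```python
import tvm

# runtime.config_threadpool(affinity_mode, num_threads); per the posts
# linked above, affinity_mode 1 = big cores and -1 = little cores.
config_threadpool = tvm.get_global_func("runtime.config_threadpool")

config_threadpool(-1, 4)  # intended: sub-graph 0 on the 4 little cores
# ... run sub-graph 0 ...
config_threadpool(1, 4)   # intended: sub-graph 1 on the 4 big cores
# ... run sub-graph 1 ...
```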

I am wondering how you implemented the CPU affinity setting. Is it possible to share your code for setting the CPU affinity, even if it is not fully ready to be merged into main TVM?

@popojames, please wait a couple of days; I plan to submit the affinity PR soon.

@popojames, the CPU affinity PR has been submitted: https://github.com/apache/tvm/pull/9802. Please refer to the example in “threading_backend_test.cc” for the CPU affinity setting logic.

Hello @hjiang,

Thanks very much for sharing your implementation.

First of all, I want to double-check: if I rebuild a new TVM environment, should I go with this GitHub branch: https://github.com/huajsj/tvm/tree/threadaffinity?


Regarding your reply: as you explained in https://github.com/apache/tvm/pull/9802, you introduced “kSpecify” and “tvm::runtime::threading::Configure” to specify the CPU list for CPU affinity. I have some general questions about this.

  1. I want to double-check: if I use the “kSpecify” mode, is it similar to the original config_threadpool setting described above, but with CPU affinity mode = -2 and the additional parameters “cpus” and “exclude_worker0”?

  2. For the example I mentioned above (splitting the network into two sub-graphs, first graph → 4 little cores, second graph → 4 big cores): (1) should I set “cpus” = [4, 4]? (2) How exactly do I set the little/big CPU affinity order?

  3. How exactly do I launch multiple threads if I call the TVM backend from Python to run the benchmark? I do not fully understand how to implement this part.

  4. Is it possible to share any example code for this CPU affinity setting? I think a simple example, like a hand-made multilayer perceptron with CPU splitting, would be really helpful for me and other users to understand the whole process.

Thanks again for your great help. Happy new year :slight_smile:

@popojames

First of all, I want to double-check: if I rebuild a new TVM environment, should I go with this GitHub branch: https://github.com/huajsj/tvm/tree/threadaffinity?

Yes, a rebuild is necessary to use the CPU affinity feature.

I want to double-check: if I use the “kSpecify” mode, is it similar to the original config_threadpool setting, but with CPU affinity mode = -2 and the additional parameters “cpus” and “exclude_worker0”?

You need to use the “Configure” function, i.e. “tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config)”, to do the CPU affinity settings.

For the example I mentioned above (splitting the network into two sub-graphs, first graph → 4 little cores, second graph → 4 big cores): (1) should I set “cpus” = [4, 4]? (2) How exactly do I set the little/big CPU affinity order?

The “cpus” argument holds CPU IDs, not counts: the 4 little CPUs should look like {0, 1, 2, 3} and the 4 big CPUs like {4, 5, 6, 7}.

How exactly do I launch multiple threads if I call the TVM backend from Python to run the benchmark?

tvm::runtime::threading::Configure is a C++ function, so you can only call it from a C++ library. After splitting the compute graph into 2 sub-graphs, you should run each sub-graph with its own runtime in a different thread and call the said function from that thread.

Is it possible to share any example code for this CPU affinity setting? I think a simple example, like a hand-made multilayer perceptron with CPU splitting, would be really helpful for me and other users to understand the whole process.

tests/cpp/threading_backend_test.cc::TVMBackendAffinityConfigure is the example.

Happy new year :slight_smile:
