Can TVM split work into different layers and assign the layers to different cores?

@popojames we are also working on modeling multi-core systems in the core compiler and pushing this into the AOT executor (see the forthcoming “TargetDevice” RFC) by leveraging the Device API (and, in parallel, we are implementing a C Device API to ensure this approach works on both bare-metal and Linux systems).

1 Like

@popojames, #7892 is a legacy PR; we are now upstreaming our work as a series of small PRs such as #8702 and #9108. We recommend waiting until everything is upstream, because the pipeline executor API may change and we may not be able to maintain #7892 for bug fixes. That said, if you would like to try it now, #7892 is a working patch; you can follow the logic in test_pipeline_executor.py to try network splitting and pipelining.
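For readers who want to try the idea on mainline TVM before those PRs land, here is a minimal hand-made sketch of the same splitting approach. It does not use the pipeline_graph helper from #7892; the toy shapes and split point are made up for illustration only.

```python
import tvm
from tvm import relay

# Stage 1: dense + relu. The network and split point are illustrative only.
data = relay.var("data", shape=(1, 16), dtype="float32")
w0 = relay.var("w0", shape=(8, 16), dtype="float32")
stage1 = tvm.IRModule.from_expr(
    relay.Function([data, w0], relay.nn.relu(relay.nn.dense(data, w0))))

# Stage 2: another dense, consuming stage 1's output shape.
mid = relay.var("mid", shape=(1, 8), dtype="float32")
w1 = relay.var("w1", shape=(4, 8), dtype="float32")
stage2 = tvm.IRModule.from_expr(
    relay.Function([mid, w1], relay.nn.dense(mid, w1)))

# Each stage is built on its own; at runtime the output of stage 1 is fed
# into stage 2, which is the pipelining idea this thread discusses.
lib1 = relay.build(stage1, target="llvm")
lib2 = relay.build(stage2, target="llvm")
```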

2 Likes

Hello @hjiang, thanks for open-sourcing “test_pipeline_executor.py”; it is really helpful for me.

Now I am able to split the network into subgraphs. In my case, I split the entire BERT model into 3 subgraphs and am trying to assign them to different CPU clusters (for instance: subgraph 0 on 2 big cores, subgraph 1 on 2 big cores, subgraph 2 on 4 small cores).

However, I still have some questions.

  1. In your compute-effect example, you assign subgraph 1 to specific CPUs (e.g. A53 CPU 0-2). Can you explain how you set the thread affinity so that only 2 of the 4 cores are used?


  2. When I call target_list = tvm.testing.enabled_targets(),

it shows the following error:

Check failed: (err_code == CL_SUCCESS) is false: OpenCL Error, code=-6: CL_OUT_OF_HOST_MEMORY

Is it possible to get rid of this line and set my target_list to [2 big cores, 2 big cores, 4 small cores], which would correspond to [CPU 4-5, CPU 6-7, CPU 0-3] on the Hikey 970 board?
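For illustration, here is a minimal sketch of what such a hand-made, CPU-only target list could look like. The (target, device) pair format mirrors what tvm.testing.enabled_targets() returns; whether the pipeline test code accepts a list built this way depends on the patch, and pinning a subgraph to specific big/LITTLE cores is done later through thread affinity, not through the target string.

```python
import tvm

# CPU-only target list; the OpenCL probe in enabled_targets() is never run.
# Which physical cores each subgraph ends up on is controlled later by the
# thread-affinity settings discussed below, not by the target itself.
target_list = [
    ("llvm", tvm.cpu(0)),  # subgraph 0: intended for 2 big cores
    ("llvm", tvm.cpu(0)),  # subgraph 1: intended for 2 big cores
    ("llvm", tvm.cpu(0)),  # subgraph 2: intended for 4 small cores
]
```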

Thanks again for your inspiration and explanation! It helps a lot!

cc @masahi @comaniac

@popojames, we have a separate patch to support CPU affinity settings for control flow and data flow; #7982 does not include that logic. Please stay tuned, we will submit the related patch soon.

1 Like

Hello @hjiang

Thanks very much for the update; I will keep following it. May I ask some questions regarding the pipeline executor?

  1. Can I check whether the aforementioned example (subgraph 0 using 2 big cores, subgraph 1 using 2 big cores, subgraph 2 using 4 small cores) can be implemented using the future patch?

  2. Do you have a roadmap for when it will be available?

=================================================================

[Update, Oct 19, 16:45 EST: after re-building https://github.com/huajsj/tvm instead of main TVM, I was able to run it!]

  3. I tried to apply your recent commit. May I ask whether you are able to run test_pipeline_executor.py without any error? When I run it, it shows TVMError: do not support key mod_indx.

More info about what I did:

A. I updated “tvm/python/tvm/contrib/pipeline_executor.py”, which contains the build_pipeline and create functions; its PipelineModule class has a “run” function for benchmarking.

B. I modified the .cc and .h files.

Should I re-build TVM from scratch to run the benchmark, or is there a way to apply these changes without rebuilding?

=================================================================

Thanks for your help in advance.

@popojames

  1. Can I check whether the aforementioned example (subgraph 0 using 2 big cores, subgraph 1 using 2 big cores, subgraph 2 using 4 small cores) can be implemented using the future patch?
Yes, TVM can support this example once all of our pipeline executor patches are merged upstream. We have a use case similar to your example, and it is already supported in our internal build.
  2. Do you have a roadmap for when it will be available?
The current plan is to submit all pipeline executor patches, including the affinity patch, by mid-November.
  3. I tried to apply your recent commit. May I ask whether you are able to run test_pipeline_executor.py without any error? When I run it, it shows TVMError: do not support key mod_indx.
As your update mentioned, rebuilding fixes this issue.
1 Like

Hi, I am also very interested in your work and have a couple of questions:

Do you support heterogeneous execution using CPUs and GPUs in combination? Are the necessary layout transformations added automatically?

How will the mapping of layers and subgraphs to targets take place? Will it be done manually by the user?

For now I am focusing only on CPUs, and I set the CPU affinity manually.

1 Like

Hi, I see from your example picture that each subgraph may use heterogeneous resources (FPGA + A53 CPU 3), and I have two questions:

  1. Does that mean the CPU, GPU, or FPGA runs the same subgraph with different input tensors (input pictures)? Is this called horizontal partitioning?

  2. CPU and GPU (FPGA) may have very different speedup ratios; how do you deal with this imbalance? Is this kind of partitioning commonly used in SoC / edge-cloud collaboration scenarios?

1 Like

@Xuyuanjia2014

  1. Does that mean the CPU, GPU, or FPGA runs the same subgraph with different input tensors (input pictures)? Is this called horizontal partitioning?

Different heterogeneous cores run different subgraphs: the first subgraph uses the input tensor as its data, and each following subgraph uses the output tensor of the previous subgraph as its input data.

Graph splitting reduces the network depth, so calling it a horizontal partition makes sense.
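To make that dataflow concrete, here is a small self-contained sketch with mainline TVM, using two toy stages as stand-ins for real subgraphs (names, ops, and shapes are made up), where the output of one stage is fed as the input of the next.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Two toy "subgraphs": stage 1 doubles its input, stage 2 adds one.
x = relay.var("x", shape=(1, 4), dtype="float32")
stage1 = tvm.IRModule.from_expr(relay.Function([x], x * relay.const(2.0)))
y = relay.var("y", shape=(1, 4), dtype="float32")
stage2 = tvm.IRModule.from_expr(relay.Function([y], y + relay.const(1.0)))

dev = tvm.cpu(0)
m1 = graph_executor.GraphModule(relay.build(stage1, target="llvm")["default"](dev))
m2 = graph_executor.GraphModule(relay.build(stage2, target="llvm")["default"](dev))

# The first subgraph consumes the network input ...
m1.set_input("x", np.ones((1, 4), dtype="float32"))
m1.run()
# ... and each following subgraph consumes the previous subgraph's output.
m2.set_input("y", m1.get_output(0))
m2.run()
print(m2.get_output(0))  # [[3. 3. 3. 3.]]
```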

  2. CPU and GPU (FPGA) may have very different speedup ratios; how do you deal with this imbalance and automatically generate a good subgraph split? Is this kind of partitioning commonly used in SoC / edge-cloud collaboration scenarios?

We have an automatic splitting module (in development) which will generate a list of load-balanced subgraphs.

Regarding the use case, this solution can be used on SoC edge devices as well as in edge-cloud collaboration scenarios.

2 Likes

Thanks. That helps a lot.

Hello @hjiang

I have made some modifications based on your code and added a CPU affinity setting to my function, but it seems to have some problems. I am looking for your advice.

According to the following previous posts,

  1. Use all cores in a big.LITTLE architecture - #5 by FrozenGene
  2. Setting the CPU affinity and number of cores locally without RPC Remote - #4 by popojames

Now I am using config_threadpool for each subgraph to enable the CPU setting. In other words:

  1. I split the network into N subgraphs (using your pipeline_graph function).
  2. Then I add a config_threadpool for each subgraph.
  3. Then I assign the CPU affinity of each subgraph with config_threadpool_numofsubgraph(affinity_mode, num_threads).

For example, if I split into two sub-graphs and want to set the first graph → 4 small cores and the second graph → 4 big cores, I use config_threadpool_0(-1, 4) and config_threadpool_1(1, 4), but those settings do not behave as I expect.
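For context, a minimal sketch of the underlying call I am wrapping, assuming the stock two-argument form of the runtime.config_threadpool global from mainline TVM (the per-subgraph config_threadpool_N helpers above are my own additions):

```python
import tvm

# Global packed function that the linked posts call via
# remote.get_function("runtime.config_threadpool"); here it is used locally.
config_threadpool = tvm.get_global_func("runtime.config_threadpool")

# affinity_mode: 1 = big cores, -1 = little cores, as in the linked posts.
# The underlying thread pool is shared by the whole process, so two calls in
# the same process reconfigure the same pool rather than two separate pools.
config_threadpool(-1, 4)  # before running the subgraph meant for 4 small cores
config_threadpool(1, 4)   # before running the subgraph meant for 4 big cores
```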

I am wondering how you implemented the CPU affinity setting. Is it possible to share your code for setting the CPU affinity, even if it is not fully ready to merge into main TVM?

@popojames, please wait a couple of days; I plan to submit the affinity PR soon.

1 Like

@popojames, the CPU affinity PR has been submitted: https://github.com/apache/tvm/pull/9802. Please refer to the example in “threading_backend_test.cc” for the CPU affinity setting logic.

1 Like

Hello @hjiang,

Thanks very much for sharing your implementation.

First of all, I want to double-check: if I want to re-build a new TVM environment, should I go with this GitHub branch: https://github.com/huajsj/tvm/tree/threadaffinity?


Regarding your reply: as you explained in https://github.com/apache/tvm/pull/9802, you introduced “kSpecify” and “tvm::runtime::threading::Configure” to specify the CPU list for the CPU affinity. For this, I have some general questions.

  1. In this setting, I want to double-check: if I want to use the “kSpecify” mode, is it similar to the original setting shown below (i.e. calling config_threadpool), but with CPU affinity mode = -2 and the added parameters “cpus” and “exclude_worker0”?
  2. For the example I mentioned above, if I want to split the network into two sub-graphs, then set the first graph → 4 small cores and the second graph → 4 big cores: (1) should I set “cpus” = [4,4]? (2) How exactly do I set the small & big CPU affinity order?

  3. I do not fully understand how to implement this instruction:

    How exactly do I launch multiple threads if I call the TVM backend from Python to run the benchmark?

  4. Is it possible to share any example code for this CPU affinity setting? I think one simple example, like a hand-made multilayer perceptron with CPU splitting, would be really helpful for me and other users to understand the whole process.

Thanks again for your great help. Happy new year :)

@popojames

First of all, I want to double-check: if I want to re-build a new TVM environment, should I go with this GitHub branch: https://github.com/huajsj/tvm/tree/threadaffinity?

Yes, a rebuild is necessary to use the said CPU affinity feature.

In this setting, I want to double-check: if I want to use the “kSpecify” mode, is it similar to the original setting shown below (i.e. calling config_threadpool), but with CPU affinity mode = -2 and the added parameters “cpus” and “exclude_worker0”?

The user needs to use the “Configure” function, like "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);", to do the CPU affinity settings.

For the example I mentioned above, if I want to split the network into two sub-graphs, then set the first graph → 4 small cores and the second graph → 4 big cores: (1) should I set “cpus” = [4,4]? (2) How exactly do I set the small & big CPU affinity order?

The 4 small CPUs should be something like {0, 1, 2, 3} and the 4 big CPUs like {4, 5, 6, 7}.

How exactly do I launch multiple threads if I call the TVM backend from Python to run the benchmark?

tvm::runtime::threading::Configure is a C++ function; you can only call it from a C++ library. After splitting the compute graph into 2 sub-graphs, you should run each sub-graph with its own runtime in a different thread and call the said function there.

Is it possible to share any example code for this CPU affinity setting? I think one simple example, like a hand-made multilayer perceptron with CPU splitting, would be really helpful for me and other users to understand the whole process.

tests/cpp/threading_backend_test.cc::TVMBackendAffinityConfigure is the example.

Happy new year :)

1 Like

Hello @hjiang

Thanks for your reply. Actually, I am still somewhat confused about using the CPU affinity, since I am not very familiar with the C++ backend.

Question 1:

According to your answer (“tvm::runtime::threading::Configure is a C++ function; you can only call it from a C++ library. After splitting the compute graph into 2 sub-graphs, you should run each sub-graph with its own runtime in a different thread and call the said function there”), my understanding is that users cannot use a Python function to call this C++ function, as you do in pipeline_executor.

Question 2:

What is the meaning of concurrency_config in the following Configure call? "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);"

Question 3:

For the example of splitting the network into two sub-graphs and setting the first graph → 4 small cores and the second graph → 4 big cores: in the C++ setting, I should set the 4 small CPUs as {0,1,2,3} and the 4 big CPUs as {4,5,6,7} with "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);".

But my question is: since I have two sub-graphs, how exactly can I use this function to do the CPU affinity settings? Should I call it twice?

Thanks again.

@popojames

Question 1: So my understanding is that users cannot use a Python function to call this C++ function, as you do in pipeline_executor.

A Python user can go through the “runtime.config_threadpool” interface to set the CPU list affinity. An example is as follows:

config_threadpool(affinity_mode, num_threads, cpu_list)

But the said way of using config_threadpool in Python may not handle the CPU affinity settings for multiple runtimes. In our full solution, the C++ runtime library calls “tvm::runtime::threading::Configure” in each runtime thread to do the affinity setting; this logic is transparent to the Python user, who just needs to forward the CPU affinity setting into the C++ library.
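A hypothetical Python-side sketch of that forwarding, assuming a build that already contains PR #9802 and that runtime.config_threadpool accepts the extra cpu_list argument exactly as in the example line above; the concrete type of cpu_list (a list of core ids is assumed here) and the -2 value for kSpecify follow this discussion, not a documented API:

```python
import tvm

config_threadpool = tvm.get_global_func("runtime.config_threadpool")

K_SPECIFY = -2  # affinity mode value for kSpecify as discussed above (assumption)

# Hypothetical per-subgraph forwarding; the cpu_list type is an assumption.
config_threadpool(K_SPECIFY, 4, [0, 1, 2, 3])  # first sub-graph: 4 small cores
config_threadpool(K_SPECIFY, 4, [4, 5, 6, 7])  # second sub-graph: 4 big cores
```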

Question 2: What is the meaning of concurrency_config in the following Configure call? "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);"

It sets the CPU affinity for the runtime launched by the current thread; “cpus” is the affinity CPU list.

Question 3: Since I have two sub-graphs, how exactly can I use this function to do the CPU affinity settings? Should I call it twice?

Yes, if you prefer to implement your own runtime, you should create 2 threads and call the said function in each thread.

1 Like

Hello @hjiang, thanks for your explanation.

According to your suggestion, here is my understanding. Please correct me if I make any mistakes.

For the example of splitting the network into two sub-graphs, then setting the first graph → 4 small cores and the second graph → 4 big cores:

  1. I split the network into 2 subgraphs (using your pipeline_graph function).
  2. Then I add a config_threadpool for each subgraph, forwarding the CPU affinity setting into the C++ library.
  3. Then I assign the CPU affinity of each subgraph: config_threadpool_1st_subgraph(-2, 4, {0,1,2,3}), config_threadpool_2nd_subgraph(-2, 4, {4,5,6,7}).

Thanks again for your help.

Hello @hjiang

I rebuilt TVM on Jan 28th, with version tvm-0.9.dev423+g6a274af9c-py3.8-linux-aarch64. I also applied this CPU affinity setting when building TVM so that I can use CPU affinity mode = -2.

I followed the same setting that you mention in the splitting-logic Python file to split the network into 2 subgraphs and tried to run them in pipeline fashion. I am wondering about the following setting:

Does pipeline.run(True) mean the pipeline module runs in sequential mode instead of pipelined mode?

The result I got with the normal graph_executor (without any pipeline setting), using only 4 threads: Mean inference time (std dev): 1326.98 ms (15.09 ms); throughput: 0.75 batch/sec.

The result I got with the current TVM version's pipeline module, using only 4 threads: Mean inference time (std dev): 1318.96 ms (9.06 ms); throughput: 0.76 batch/sec.

The result I got with the previous TVM version's pipeline module, using 8 threads: the throughput was totally different from 0.76 batch/sec.
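For reference, numbers in the “Mean inference time (std dev)” form above are typically produced with a time_evaluator-based measurement like the following self-contained sketch; the toy model, shapes, and repeat counts here are illustrative stand-ins, not the poster's actual BERT benchmarking script.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Toy model standing in for the real workload.
x = relay.var("x", shape=(1, 64), dtype="float32")
mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

dev = tvm.cpu(0)
lib = relay.build(mod, target="llvm")
m = graph_executor.GraphModule(lib["default"](dev))
m.set_input("x", np.random.rand(1, 64).astype("float32"))

# Time the "run" function and print mean / std over the repeats.
ftimer = m.module.time_evaluator("run", dev, number=1, repeat=10)
prof_res = np.array(ftimer().results) * 1000  # seconds -> milliseconds
print("Mean inference time (std dev): %.2f ms (%.2f ms)"
      % (np.mean(prof_res), np.std(prof_res)))
```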

If pipeline.run(True) does run sequentially, may I ask how I can run the subgraphs in pipelined fashion? Or, if that is not implemented/supported yet, may I ask what the timeline is for adding pipelined execution to current TVM?