Can TVM split work into different layers, and assign layers into different cores?

Hi, I am also very interested in your work and have a couple of questions:

Do you support heterogeneous executions using CPUs and GPUs in combination? Are necessary layout transformations added automatically?

How will the mapping of layers and subgraphs to targets take place? Will it be done manually by the user?

For now I am focusing only on the CPU, and I set the CPU affinity manually.

Hi, I see from your example picture that each subgraph may use heterogeneous resources (FPGA + A53 CPU 3), and I have two questions:

  1. Does that mean the CPU, GPU, or FPGA runs the same subgraph with different input tensors (input pictures)? Is this called horizontal partitioning?

  2. The CPU and GPU (FPGA) may have very different speedup ratios. How do you deal with this imbalance? Is this kind of partitioning commonly used in SoC or edge-cloud collaboration scenarios?

@Xuyuanjia2014

  1. Does that mean the CPU, GPU, or FPGA runs the same subgraph with different input tensors (input pictures)? Is this called horizontal partitioning?

Different heterogeneous cores run different subgraphs. The first subgraph takes the input tensor as its data, and each following subgraph takes the output tensor of the previous subgraph as its input.

Graph splitting reduces the network depth, so calling it horizontal partitioning makes sense.
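
The data flow described above can be sketched with plain Python threads and queues. This is only an illustration of chaining subgraphs so that each stage consumes the previous stage's output; it does not use the TVM pipeline executor, and `subgraph_0`, `subgraph_1`, `run_stage`, and `run_pipeline` are hypothetical names introduced here.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def run_stage(fn, inbox, outbox):
    """Apply `fn` to every item from `inbox` and forward the result to `outbox`."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(stages, inputs):
    """Chain `stages` with queues; each stage runs in its own thread."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(_STOP)
    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for two compiled subgraphs.
def subgraph_0(x):
    return x * 2


def subgraph_1(x):
    return x + 1


print(run_pipeline([subgraph_0, subgraph_1], [1, 2, 3]))  # [3, 5, 7]
```

Because every stage has a single worker and the queues are FIFO, results come out in input order, while different inputs can be in different stages at the same time.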

  2. The CPU and GPU (FPGA) may have very different speedup ratios. How do you deal with this imbalance and automatically generate a best-practice subgraph split? Is this kind of partitioning commonly used in SoC or edge-cloud collaboration scenarios?

We have an automatic splitting module (in development) that will generate a list of load-balanced subgraphs.
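
The splitting module itself is not shown in this thread. As a rough idea of what load-balanced splitting can mean, here is a minimal sketch that assumes a per-layer cost estimate is available and splits the layers into contiguous groups so that the slowest group is as fast as possible. The function name `balanced_split` and the cost values are hypothetical, and this is a generic dynamic-programming partition, not the module's algorithm.

```python
from itertools import accumulate


def balanced_split(costs, num_parts):
    """Split `costs` into `num_parts` contiguous groups, minimizing the largest group sum."""
    n = len(costs)
    if not 1 <= num_parts <= n:
        raise ValueError("need 1 <= num_parts <= len(costs)")
    prefix = [0, *accumulate(costs)]
    inf = float("inf")
    # best[k][i]: smallest achievable bottleneck when the first i layers form k groups
    best = [[inf] * (n + 1) for _ in range(num_parts + 1)]
    cut = [[0] * (n + 1) for _ in range(num_parts + 1)]
    best[0][0] = 0
    for k in range(1, num_parts + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i] = bottleneck
                    cut[k][i] = j
    # Walk the recorded cut points backwards to recover the groups.
    groups, end = [], n
    for k in range(num_parts, 0, -1):
        start = cut[k][end]
        groups.append(list(range(start, end)))
        end = start
    return groups[::-1], best[num_parts][n]


# Hypothetical per-layer costs (e.g. measured latency in ms).
groups, bottleneck = balanced_split([4, 2, 3, 7, 1, 5], 2)
print(groups, bottleneck)  # [[0, 1, 2], [3, 4, 5]] 13
```

With heterogeneous targets the cost of a layer depends on the device that runs it, so a real splitter would need per-device costs; this sketch uses a single cost per layer for simplicity.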

As for the use case, this solution can be used on SoC edge devices as well as in edge-cloud collaboration scenarios.

Thanks. That helps a lot.

Hello @hjiang

I have made some modifications based on your code and added a CPU affinity setting to my function, but it seems to have some problems. I would appreciate your advice.

According to the following previous posts,

  1. Use all cores in a big.LITTLE architecture - #5 by FrozenGene
  2. Setting the CPU affinity and number of cores locally without RPC Remote - #4 by popojames

I am now using config_threadpool for each subgraph to apply the CPU setting. In other words:

  1. Split the network into N subgraphs (using your pipeline_graph function).
  2. Add config_threadpool for each subgraph.
  3. Assign a CPU affinity to each subgraph with config_threadpool_numofsubgraph(affinity_mode, num_threads).

For example, if I split the network into two subgraphs and want to map the first graph → 4 small cores and the second graph → 4 big cores, I use config_threadpool_0(-1, 4) and config_threadpool_1(1, 4). But these settings do not behave as I expect.

How did you implement the CPU affinity setting? Could you share the code that sets the CPU affinity, even if it’s not fully ready to merge into main TVM?

@popojames, please wait a couple of days; I plan to submit the affinity PR soon.

@popojames, the CPU affinity PR has been submitted: https://github.com/apache/tvm/pull/9802. Please refer to the example in “threading_backend_test.cc” for the CPU affinity setting logic.

Hello @hjiang,

Thanks very much for sharing your implementation.

First of all, I want to double-check: if I rebuild a new TVM environment, should I use this GitHub branch: https://github.com/huajsj/tvm/tree/threadaffinity?

Regarding your reply: as you explained in https://github.com/apache/tvm/pull/9802, you introduced “kSpecify” and “tvm::runtime::threading::Configure” to specify the CPU list for the CPU affinity. I have some general questions about this.

  1. If I want to use the “kSpecify” mode, is it similar to the original setting shown below (i.e., calling config_threadpool), but with CPU affinity mode = -2 and the added parameters “cpus” and “exclude_worker0”?

  2. For the example I mentioned above, I want to split the network into two subgraphs and map the first graph → 4 small cores and the second graph → 4 big cores. (1) Should I set “cpus” = [4,4]? (2) How exactly do I set the order of the small and big CPUs in the affinity list?

  3. I do not fully understand how to implement this instruction:

    How exactly do I launch multiple threads if I call the TVM backend from Python to run the benchmark?

  4. Is it possible to share any example code for this CPU affinity setting? I think one simple example, like a hand-made multilayer perceptron with CPU splitting, would be really helpful for me and other users to understand the whole process.

Thanks again for your great help. Happy New Year :)

@popojames

First of all, I want to double-check: if I rebuild a new TVM environment, should I use this GitHub branch: https://github.com/huajsj/tvm/tree/threadaffinity?

Yes, a rebuild is necessary to use the CPU affinity feature.

If I want to use the “kSpecify” mode, is it similar to the original setting shown below (i.e., calling config_threadpool), but with CPU affinity mode = -2 and the added parameters “cpus” and “exclude_worker0”?

Users need to call the “Configure” function, for example "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);", to set the CPU affinity.

For the example I mentioned above, I want to split the network into two subgraphs and map the first graph → 4 small cores and the second graph → 4 big cores. (1) Should I set “cpus” = [4,4]? (2) How exactly do I set the order of the small and big CPUs in the affinity list?

The 4 small CPUs should look like {0, 1, 2, 3} and the 4 big CPUs like {4, 5, 6, 7}.

How exactly do I launch multiple threads if I call the TVM backend from Python to run the benchmark?

tvm::runtime::threading::Configure is a C++ function, so you can only call it from a C++ library. After splitting the compute graph into 2 subgraphs, you should run each subgraph with its own runtime in a separate thread and call that function in each thread.

Is it possible to share any example code for this CPU affinity setting? I think one simple example like a hand-made multilayer perceptron with CPU splitting would be really helpful for me and other users to understand the whole process.

tests/cpp/threading_backend_test.cc::TVMBackendAffinityConfigure is the example.

Happy New Year :)

Hello @hjiang

Thanks for your reply. I am still somewhat confused about using this CPU affinity setting, since I am not very familiar with the C++ backend.

Question 1:

According to your answer, tvm::runtime::threading::Configure is a C++ function that can only be called from a C++ library; after splitting the compute graph into 2 subgraphs, each subgraph should run with its own runtime in a separate thread and call that function. So my understanding is that users cannot call such a C++ function from Python, as you do in pipeline_executors:

Question 2:

What is the meaning of concurrency_config in the following Configure call? "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);"

Question 3:

Take the example of splitting the network into two subgraphs and mapping the first graph → 4 small cores and the second graph → 4 big cores. In C++, I should set the 4 small CPUs as {0,1,2,3} and the 4 big CPUs as {4, 5, 6, 7} with "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);"

But since I have two subgraphs, how exactly do I use this function to set the CPU affinity? Should I call it twice?

Thanks again.

@popojames

Question 1: So my understanding is that users cannot call such a C++ function from Python, as you do in pipeline_executors:

Python users can go through the “runtime.config_threadpool” interface to set the CPU list affinity, for example:

config_threadpool(affinity_mode, num_threads, cpu_list)

However, using config_threadpool from Python this way may not be able to set the CPU affinity for multiple runtimes. In our full solution, the C++ runtime library calls “tvm::runtime::threading::Configure” in each runtime thread to set the affinity. This logic is transparent to the Python user, who only needs to forward the CPU affinity setting to the C++ library.

Question 2: What is the meaning of concurrency_config in the following Configure call? "tvm::runtime::threading::Configure(tvm::runtime::threading::ThreadGroup::kSpecify, 0, cpus, concurrency_config);"

It sets the CPU affinity for the runtime launched by the current thread; “cpus” is the list of CPUs to bind to.

Question 3: Since I have two subgraphs, how exactly do I use this function to set the CPU affinity? Should I call it twice?

Yes. If you prefer to implement your own runtime, you should create 2 threads and call the function in each thread.
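
The "one thread per subgraph, set the affinity inside each thread" pattern can be illustrated at the operating-system level with Python's standard library. This sketch does not call TVM at all: it uses `os.sched_setaffinity`, which is Linux-only and, with pid 0, applies to the calling thread. The core ids follow the {0,1,2,3} / {4,5,6,7} layout quoted above, and `run_pinned`, `run_subgraph_0`, and `run_subgraph_1` are hypothetical placeholders.

```python
import os
import threading

LITTLE_CORES = {0, 1, 2, 3}  # core ids quoted in the reply above
BIG_CORES = {4, 5, 6, 7}


def run_pinned(cpus, work, results, key):
    """Pin the calling thread to `cpus` where possible, then run `work`."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        try:
            usable = set(cpus) & os.sched_getaffinity(0)
            if usable:  # skip pinning if this machine lacks the requested cores
                os.sched_setaffinity(0, usable)  # pid 0 means the calling thread
        except OSError:
            pass  # keep the default affinity if pinning is not permitted
    results[key] = work()


# Hypothetical stand-ins for running each subgraph's runtime.
def run_subgraph_0():
    return "subgraph_0 done"


def run_subgraph_1():
    return "subgraph_1 done"


results = {}
threads = [
    threading.Thread(target=run_pinned, args=(LITTLE_CORES, run_subgraph_0, results, 0)),
    threading.Thread(target=run_pinned, args=(BIG_CORES, run_subgraph_1, results, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

The point of the sketch is that the affinity call happens inside the thread that does the work, so each subgraph gets its own setting instead of overwriting a shared one.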

Hello @hjiang, thanks for your explanation.

Based on your suggestion, here is my understanding. Please correct me if I have made any mistakes.

For the example of splitting the network into two subgraphs and mapping the first graph → 4 small cores and the second graph → 4 big cores:

  1. Split the network into 2 subgraphs (using your pipeline_graph function).
  2. Add config_threadpool for each subgraph by forwarding the CPU affinity setting to the C++ library.
  3. Assign the CPU affinity to each subgraph: config_threadpool_1st_subgraph(-2, 4, {0,1,2,3}), config_threadpool_2nd_subgraph(-2, 4, {4, 5, 6, 7})

Thanks again for your help.

Hello @hjiang

I rebuilt TVM on Jan 28th (version tvm-0.9.dev423+g6a274af9c-py3.8-linux-aarch64). I also applied this CPU affinity setting when building TVM so that I can use CPU affinity mode -2.

I followed the settings in your splitting-logic Python file to split the network into 2 subgraphs and tried to run them in pipeline mode. I have a question about the following setting:

Does pipeline.run(True) mean that the pipeline module runs in sequential mode instead of pipeline mode?

Result with the normal graph_executor (without any pipeline setting), using only 4 threads: mean inference time (std dev) 1326.98 ms (15.09 ms); inference throughput 0.75 batch/sec.

Result with the pipeline module in the current TVM version, using only 4 threads: mean inference time (std dev) 1318.96 ms (9.06 ms); inference throughput 0.76 batch/sec.

Result with the pipeline module in the previous TVM version, using 8 threads: the inference throughput is totally different from 0.76 batch/sec.
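
The two throughput figures reported above match the mean latencies if one batch is processed at a time, so that throughput is simply the inverse of latency. That is an assumption about how the benchmark was measured, not something stated in the post; the helper name below is hypothetical.

```python
def throughput_batch_per_sec(mean_latency_ms):
    """Throughput when batches run one after another: 1000 ms divided by latency."""
    return 1000.0 / mean_latency_ms


print(round(throughput_batch_per_sec(1326.98), 2))  # 0.75
print(round(throughput_batch_per_sec(1318.96), 2))  # 0.76
```

Both numbers land on the sequential value, which fits with the two subgraphs running back to back rather than overlapping.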

If so, how can I run them in pipeline mode? Or, if that is not implemented or supported yet, what is the timeline for adding pipelined execution to the current TVM?

Does pipeline.run(True) mean that the pipeline module runs in sequential mode instead of pipeline mode?

Yes. The pipeline executor is still being upstreamed and currently supports only sequential mode.

If so, how can I run them in pipeline mode? Or, if that is not implemented or supported yet, what is the timeline for adding pipelined execution to the current TVM?

As I mentioned in earlier comments, to try the pipeline executor feature please wait until the whole upstreaming is done. As for the timeline, a rough estimate is that the remaining patches may still need one or two months. Please refer to the related tracking issue (https://github.com/apache/tvm/issues/8596) for progress.
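
To make the difference between sequential and pipelined execution concrete, here is a small idealized model. It assumes fixed per-stage latencies, no communication overhead, and no contention between stages, none of which is guaranteed on real hardware; the stage times are hypothetical and the function names are introduced only for this sketch.

```python
def sequential_throughput(stage_ms):
    """One input at a time: each input pays the sum of all stage latencies."""
    return 1000.0 / sum(stage_ms)


def pipelined_throughput(stage_ms):
    """Steady state with overlapping stages: the slowest stage sets the rate."""
    return 1000.0 / max(stage_ms)


# Hypothetical latencies (ms) of two subgraphs on their assigned cores.
stages = [700.0, 600.0]
print(f"sequential: {sequential_throughput(stages):.2f} batch/sec")
print(f"pipelined:  {pipelined_throughput(stages):.2f} batch/sec")
```

In this model the pipelined rate is bounded by the slowest stage, which is why balancing the subgraphs matters once parallel execution is available.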

@hjiang

Thanks for your answer, I will keep an eye on your tracking progress.

For my previous inference evaluations, I built and extended my pipeline executor on top of the earlier PR #7892 (I know it’s outdated and already closed), split the networks into 2 subgraphs, and ran the 2 subgraphs in pipeline mode with “pipeline.run()”.

I used “htop” to check CPU utilization and could see 8 threads running, with CPU utilization at 800% (which means all CPU resources were being used) and higher throughput, so I think these subgraphs were indeed running in pipeline mode.

May I double-check: are those results still valid?

Thanks again for your help! Happy Lunar New Year :)

@popojames, I guess your question is: with the latest TVM, you get the same throughput with the subgraph pipeline enabled as without it; is that normal? The answer is yes, because the parallel feature is still being upstreamed. Hopefully this answers your question, and Happy Lunar New Year :)

@hjiang Thanks for your answer. I understand that with the latest TVM I will get the same result as normal inference (without the pipeline).

Maybe I didn’t make my question clear enough. In the TVM 0.8 dev version, I think the “pipeline.run()” function does run in pipeline mode. My question is: are the results obtained from pipeline.run() in PR #7892 reliable?

Thanks again for your help :)

In the TVM 0.8 dev version, I think the “pipeline.run()” function does run in pipeline mode. My question is: are the results obtained from pipeline.run() in PR #7892 reliable?

PR #7892 is closed. You can experiment with it, but we strongly recommend waiting for the official TVM subgraph pipeline feature once all the upstreaming is done, as we will not maintain the closed PR #7892.

Hello @hjiang

Thanks for your reply. I will wait for official upstreaming and keep an eye on your tracking progress.

Meanwhile, I have extended PR #7892 with this CPU affinity setting.

I was able to pin the desired CPU affinity successfully. For example, the following code means the model is only running on two big cores (and using core 6 and 7).

[image: code screenshot, not preserved]

Following the same logic, and according to your previous answer,

I created two threads for the two subgraphs, setting the 1st graph to LITTLE and the 2nd graph to big. Here is the code:

[image: code screenshot, not preserved]

Here, config_threadpool_0 is the CPU affinity controller for subgraph_0 and config_threadpool_1 is the CPU affinity controller for subgraph_1.

However, I found that with this setup, if two thread configs are set, only the second one takes effect. In other words, the setting in the figure makes subgraph_1 run on 4 big cores, while the setting for subgraph_0 is not applied and it runs in the default mode (which is 4 big cores).

As for another setting with 1st graph to big and 2nd graph to LITTLE,

[image: code screenshot, not preserved]

Here, subgraph_1 runs on 4 small cores and subgraph_0 runs with the default setting (which is 4 big cores). Although this second setting achieves what I want, the overall approach is inflexible and hard to use.

Do you have any comments on this, or is there a better way to create threads and set the CPU affinity from Python?

Thanks.