Autoscheduler and VM

For some reason, my tuning has been stuck on the first task, with output like the one below. I don’t think it’s making any progress. Any idea what’s going on?

UPDATE: After I removed the old log and restarted tuning, it seems to be working again.

----------------------------------------------------------------------                                                                                                       
------------------------------  [ Measure ]                                                                                                                                  
----------------------------------------------------------------------                                                                                                       
Get 64 programs to measure:                                                                                                                                                  
........*T*T*T*T*T*T*T*T                                                                                                                                                     
........*T*T*T*T*T*T*T*T                                                              
........*T*T*T*T*T*T*T*T                                                              
........*T*T*T*T*T*T*T*T                                                              
........*T*T*T*T*T*T*T*T                                                              
........*T*T*T*T*T*T*T*T                                                              
........*T*T*T*T*T*T*T*T                                                              
........*T*T*T*T*T*T*T*T                                                              
Time elapsed for measurement: 666.03 s                                                
----------------------------------------------------------------------                                                                                                       
------------------------------  [ Train cost model ]                                                                                                                         
----------------------------------------------------------------------
Time elapsed for training: 0.26 s                                                     
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]                    
----------------------------------------------------------------------
|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |            - |              - |     64 |                         
|    1 |            - |              - |      0 |

@comaniac @merrymercy

I’m happy to report that, after only a few hours of tuning, the auto-scheduled MaskRCNN run takes 0.23 s, improving on the 0.32 s of the default autotvm schedules. The default autotvm schedules should be reasonable, because most of the convolution layers in MaskRCNN come from resnet50 and I don’t see many fallback warnings even with the default schedules.

Unfortunately I had to kill tuning early because, starting from around 4000 trials, tuning got stuck with 0.0 GFLOPS and BuildTimeoutError. See the output here: https://github.com/masahi/torchscript-to-tvm/blob/master/maskrcnn/out_1229.txt#L5296

Is there a way to get around this? I’m especially confused because the 0.0 GFLOPS results started appearing in the middle of the tuning process; until then everything was working fine.

Is your GPU an AMD GPU or an NVIDIA GPU?

I can explain the meaning of these strings in the output:

........*******E*
......T.T.T*E****
.......T.T*E***E**
........*****E***
........T*****E**E

“.” means compiling a program, “*” means measuring a program, “T” means timeout, “E” means error.

From your output, I see most of the errors are “RunTimeoutError” instead of “BuildTimeoutError”. This means the auto-scheduler compiled the GPU kernels successfully but could not run them on the GPU. Possible reasons are:

  • The GPU runtime or RPC tracker has errors, so we cannot connect to the GPU again.
  • During measurement, we use a lot of Python threads and processes. Sometimes, after running for a long time, we get stuck when creating new threads or processes.

Can you kill the script and run it again to see whether the errors still exist? We can continue the tuning by loading the log file (help).
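For reference, here is a minimal sketch of resuming tuning from an existing log file. The log file name is hypothetical, and mod/params are assumed to be the Relay module and weights of the network being tuned:

from tvm import auto_scheduler

# Hypothetical log file from the previous (interrupted) run.
log_file = "maskrcnn_tune.json"
target = "cuda"

# "mod" and "params" are assumed to exist already (the Relay module and weights).
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# load_log_file warm-starts the task scheduler from the records of the earlier run,
# while RecordToFile keeps appending new measurements to the same file.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=20000,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)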

If you are willing to debug, this is the entrance of measurement. We can insert print statements in this function and its child functions to see which line it gets stuck at. We can then think about solutions.


This time it is an NVIDIA GTX 1070 Ti. MaskRCNN performance is not great even on CUDA. I always try CUDA first before I do something non-trivial with AMD.

I think this was the problem. I saw a CUDA runtime error (CUDA_INVALID something) in the middle of tuning, and after that the tuning got stuck.

Indeed, after restarting the tuning, the error is gone: the tuner is making progress from where it left off before it got stuck. Great!!

Last memo: the merged PR also includes updated logic for identifying whether a task is “complex”. Now we don’t rely on the pattern of the anchor op in Relay. Instead, we traverse the TE compute directly and look for tir.Reduce: as long as a compute has at least one tir.Reduce, we consider it a “complex” task. With the PR, the number of tasks extracted from MaskRCNN dropped from 62 to 54, and all shape_of tasks were gone.
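To illustrate the idea, here is a rough sketch of such a check (not the actual auto_scheduler code; has_reduce is a hypothetical helper). It walks the TE compute DAG and reports whether any ComputeOp body contains a tir.Reduce node:

# Rough sketch only (not the actual auto_scheduler implementation).
import tvm
from tvm import te


def has_reduce(output_tensors):
    """Return True if any ComputeOp reachable from the outputs contains a tir.Reduce."""
    found = False

    def visit(node):
        nonlocal found
        if isinstance(node, tvm.tir.Reduce):
            found = True

    visited = set()
    stack = [t.op for t in output_tensors]
    while stack:
        op = stack.pop()
        if op in visited:
            continue
        visited.add(op)
        if isinstance(op, te.ComputeOp):
            for expr in op.body:
                tvm.tir.stmt_functor.post_order_visit(expr, visit)
        # Continue the traversal through the inputs of this op.
        stack.extend(t.op for t in op.input_tensors)
    return found


# A matmul has a reduction axis, so it would count as "complex" ...
A = te.placeholder((128, 128), name="A")
B = te.placeholder((128, 128), name="B")
k = te.reduce_axis((0, 128), name="k")
C = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
print(has_reduce([C]))  # True

# ... while a purely element-wise compute would not.
D = te.compute((128, 128), lambda i, j: A[i, j] + B[i, j], name="D")
print(has_reduce([D]))  # False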
