Autoscheduler and VM

@merrymercy @comaniac

I’m interested in auto tuning MaskRCNN on GPU via Ansor. This is a dynamic model, so the VM is required. But the workload of each convolution and dense layer should have a fixed shape, so it should be feasible to extract tasks from it.

Does that sound reasonable? If the lack of VM support is simply a missing feature due to limited resources, I’m happy to take it on.

I think MaskRCNN is interesting for Ansor because, in addition to a large number of conv layers, it also has a large dense layer that is currently the second bottleneck after NMS.

I actually have a branch that uses the VM to extract tasks, but I haven’t had a chance to work on it further since I was too lazy to produce a minimal example…

All you need to change should be just relay_integration.py, so it shouldn’t be too complicated, unless I missed something important.

Can you point me to your branch? I can take it from there.

I don’t think MaskRCNN counts as a minimal example, but it can be run on CI using the PyTorch frontend. We actually run the MaskRCNN tutorial frontend/deploy_object_detection_pytorch.py on every CI run. If you simply want to test task extraction, it should be ok.

Oh, I was able to extract tasks from the VM simply by calling relay.vm.compile inside call_all_topi_funcs.
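Roughly, the change looks like the sketch below (not the exact diff; the use_auto_scheduler config flag and the surrounding details are assumptions on my part):

import tvm
from tvm import relay

def call_all_topi_funcs(mod, params, target):
    """Compile the module once so that every TOPI compute is invoked and traced."""
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        # relay.vm.compile instead of relay.build, so dynamic shapes and
        # control flow are handled by the VM compiler during extraction
        relay.vm.compile(mod, target=target, params=params)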

Interestingly, I can extract 47 tasks from Mask RCNN, but extraction fails at some point:

========== Task 47  (workload key: ["2042b858ca74fd9bcba240cbcb4eb7d5"]) ==========
placeholder = PLACEHOLDER [1, 128, 13, 13, 2]
placeholder = PLACEHOLDER [1, 128, 1, 1, 2, 3]
conv2d_NCHWc(n, oc_chunk, oh, ow, oc_block) += (placeholder[n, floordiv(ic, 2), (oh + kh), (ow + kw), floormod(ic, 2)]*placeholder[oc_chunk, floordiv(ic, 2), kh, kw, floormod(ic, 2), oc_block])
placeholder = PLACEHOLDER [1, 1, 1, 1, 3]
T_add(ax0, ax1, ax2, ax3, ax4) = (conv2d_NCHWc[ax0, ax1, ax2, ax3, ax4] + placeholder[ax0, ax1, 0, 0, ax4])

========== Task 48  (workload key: ["7610cb83902a56c2f66079da590238f4"]) ==========
Traceback (most recent call last):
  File "maskrcnn_test.py", line 192, in <module>
    print(task.compute_dag)
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/compute_dag.py", line 220, in __str__
    raw_lines = super().__str__().split("\n")
  File "/home/masa/projects/dev/tvm/python/tvm/runtime/object.py", line 50, in __repr__
    return _ffi_node_api.AsRepr(self)
  File "/home/masa/projects/dev/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (4) /home/masa/projects/dev/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7f1dc04c8e53]
  [bt] (3) /home/masa/projects/dev/tvm/build/libtvm.so(+0x8813b7) [0x7f1dbf9673b7]
  [bt] (2) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::ReprPrinter::Print(tvm::runtime::ObjectRef const&)+0xfd) [0x7f1dbf966fdd]
  [bt] (1) /home/masa/projects/dev/tvm/build/libtvm.so(+0x6f443c) [0x7f1dbf7da43c]
  [bt] (0) /home/masa/projects/dev/tvm/build/libtvm.so(+0x6f2d48) [0x7f1dbf7d8d48]
  File "/home/masa/projects/dev/tvm/src/auto_scheduler/compute_dag.cc", line 1399
TVMError: Unsupported reduction operator(x*y)

I also got a different error on a different run. It seems dynamic shapes are not handled well.

    rv = local_pyfunc(*pyargs)                                                                                                                                               
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/relay_integration.py", line 261, in auto_schedule_topi
    key = register_workload_tensors(dag.hash_key(), io_tensors)                                                                                                              
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/compute_dag.py", line 204, in hash_key
    str_key += str(get_const_tuple(t.shape)) + ","                                                                                                                           
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/utils.py", line 95, in get_const_tuple                                                                         
    return tuple(get_const_int(x) for x in in_tuple)                                                                                                                         
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/utils.py", line 95, in <genexpr>                                                                                   return tuple(get_const_int(x) for x in in_tuple)                      
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/utils.py", line 76, in get_const_int                                                                               exp = opt(exp)                                                                    
  File "/home/masa/projects/dev/tvm/python/tvm/ir/transform.py", line 127, in __call__                                                                                           return _ffi_transform_api.RunPass(self, mod)
  File "/home/masa/projects/dev/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__                                                                               raise get_last_ffi_error()                                                                                                                                               
  [bt] (3) /home/masa/projects/dev/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7fa505553f83]                                                                                    
  [bt] (2) /home/masa/projects/dev/tvm/build/libtvm.so(+0x86e6c4) [0x7fa5049b96c4]
  [bt] (1) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::IRModule tvm::runtime::TVMPODValue_::AsObjectRef<tvm::IRModule>() const+0x481) [0x7fa50495a4a1]
  [bt] (0) /home/masa/projects/dev/tvm/build/libtvm.so(+0x804ff8) [0x7fa50494fff8]
  File "/home/masa/projects/dev/tvm/include/tvm/runtime/packed_func.h", line 1405
TVMError: 
---------------------------------------------------------------
An internal invariant was violated during the execution of TVM.
Please read TVM's error reporting guidelines.
More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
---------------------------------------------------------------
  Check failed: ObjectTypeChecker<TObjectRef>::Check(ptr) == false: Expect IRModule but get tir.Var

It seems the wrong TOPI compute is being called, because I see NCHWc calls.

My branch should be pretty much the same as you did: https://github.com/comaniac/tvm/commit/47e93bd25d4a9f8fa2192d787fc4a2b3edf6d65f

The errors you encountered are exactly what I meant by “missing something important” :sweat_smile:. I’ll use the tutorial you pointed out to take a trace.

@masahi The errors you got are all related to printing. I think it is easy to fix them. We can add more if branches to handle them. In auto-scheduler, we print a compute_dag as a string and compute the hash of this string as the workload key. The workload key is used for matching the tuning logs.
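Simplified, the workload key derivation looks roughly like the sketch below (based on the hash_key frames visible in the traceback above; the real implementation differs in details):

import hashlib

def hash_key(dag, io_tensors):
    # concatenate the printed DAG with the I/O tensor shapes, then hash the string
    str_key = str(dag)
    for t in io_tensors:
        str_key += str(tuple(t.shape)) + ","
    return hashlib.md5(str_key.encode()).hexdigest()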

@masahi @merrymercy I gave it a try, and now this branch (https://github.com/comaniac/tvm/tree/ansor_vm) can extract 64 tasks from the PyTorch Mask R-CNN, but I haven’t checked tuning and compilation yet. Here are some issues I hit and the changes I made:

  1. Improved get_const_tuple to keep Any when the shape is dynamic (see the sketch after this list).
  2. traverse_to_get_io_tensors.traverse crashed when there are multiple compile_engine_const tensors with ndim=0, because te.Tensor.__eq__ does not handle this case, so I made a workaround.
  3. Added a check in auto_schedule_topi to skip TE computes with dynamic shapes, although it doesn’t seem to take effect yet and I don’t know why.
  4. Mask R-CNN has the custom reduction op x*y, which ComputeDAG cannot print. I made a workaround to force it to print something, but I’m not sure whether auto_scheduler has other problems dealing with this op.
  5. I haven’t dug into the reason why it selects NCHWc computes for Conv2D.
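For item 1, the idea is roughly the following sketch (simplified; the actual patch in the branch may differ):

from tvm import tir

def get_const_tuple(in_tuple):
    """Fold static dims to Python ints but keep dynamic dims symbolic."""
    out = []
    for x in in_tuple:
        if isinstance(x, tir.IntImm):
            out.append(x.value)  # static dimension: fold to a Python int
        else:
            out.append(x)        # dynamic dimension (tir.Var / tir.Any): keep as-is
    return tuple(out)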

My next step would be to extract a small DAG from Mask R-CNN to serve as a test case for further development. Any comments are welcome.

Also cc @jcf94 @FrozenGene


We don’t correctly register the unpacked topi.nn.conv2d_nchw for the NCHW layout for auto-scheduler, so the x86 op strategy or alter_op_layout will pick topi.nn.conv2d_NCHWc. topi.nn.conv2d_NCHWc is an autotvm template; auto-scheduler can handle it correctly, but the performance is not guaranteed. In this case, auto-scheduler will start from a packed compute definition specified by a fallback autotvm config.

The correct way to use auto-scheduler is to start from an unpacked layout and let the auto-scheduler decide the layout.

To fix this, we can

  1. Add some checks in the x86 op strategy or alter_op_layout, so we don’t go to topi.nn.conv2d_NCHWc when auto-scheduler is enabled.
  2. or convert the model to NHWC layout (see the sketch below).
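For option 2, a minimal sketch of converting an imported module to NHWC with ConvertLayout before task extraction (assuming mod comes from the frontend import):

import tvm
from tvm import relay

desired_layouts = {"nn.conv2d": ["NHWC", "default"]}
seq = tvm.transform.Sequential(
    [
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
    ]
)
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)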

Updated branch: https://github.com/comaniac/tvm/tree/ansor_vm

  1. Fixed the issue that computes with dynamic shape tensors were not filtered out.
  2. Added test cases to show that now we can extract tasks from a model with control flow.

Remaining issues

  1. There are still lots of extracted tasks that only contain a shape_of op. This is because shape_of has TOpPattern kOpaque, which is higher than kCommReduce (see the sketch after this list).
  2. Although the source tensor of shape_of in Mask R-CNN has a dynamic shape (e.g., (Any, 28, 28)), this cannot be detected by checking I/O tensors, because shape_of has no input tensors and the shape of its output tensor is (3,), which is static.
  3. I found that the NCHWc issue is common to all existing test cases. The current task extraction unit tests also get NCHWc ops, so I would treat it as a separate issue for now.
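To make issue 1 concrete, the current heuristic roughly amounts to comparing the registered TOpPattern against kCommReduce, as in the sketch below (values come from relay’s OpPattern; shape_of being opaque means it passes the check):

from tvm import relay
from tvm.relay.op import OpPattern

def is_tuning_candidate(op_name):
    # an op becomes an auto-scheduler task when its pattern is at least kCommReduce
    pattern = int(relay.op.get(op_name).get_attr("TOpPattern"))
    return pattern >= OpPattern.COMM_REDUCE

print(is_tuning_candidate("shape_of"))  # True: kOpaque (8) >= kCommReduce (3)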

I think it is worth revisiting the decision to repurpose OpPattern as a measure of compute complexity. Especially in more advanced models, there are lots of Opaque ops that need no tuning (shape_of, scatter, cumsum, and other ops that are written using te.extern).

Maybe we can introduce a new attribute to mark only ops that we do want to tune?
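For illustration only, such an attribute could be registered roughly like the sketch below (the attribute name TNoAutoScheduler is made up here; nothing like it exists in the codebase):

import tvm

# mark ops that should never become tuning tasks (hypothetical attribute name)
for op_name in ["shape_of", "scatter"]:
    tvm.ir.register_op_attr(op_name, "TNoAutoScheduler", True)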

Introducing a new attribute is not an ideal solution. We originally used a similar approach that only tuned the ops registered in the relay op strategy, but auto_scheduler should be a lot more flexible; that’s why we use the current approach.

Meanwhile, I do agree that we should have better logic to determine whether a compute is worth being a task.

@comaniac I tried your branch on maskrcnn but still got this error. Is this expected?

Traceback (most recent call last):
  File "maskrcnn_test.py", line 155, in <module>
    auto_schedule()
  File "maskrcnn_test.py", line 106, in auto_schedule
    print(task.compute_dag)
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/compute_dag.py", line 254, in __str__
    raw_lines = super().__str__().split("\n")
  File "/home/masa/projects/dev/tvm/python/tvm/runtime/object.py", line 50, in __repr__
    return _ffi_node_api.AsRepr(self)
  File "/home/masa/projects/dev/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (4) /home/masa/projects/dev/tvm/build/libtvm.so(TVMFuncCall+0x63) [0x7feefe88ea23]
  [bt] (3) /home/masa/projects/dev/tvm/build/libtvm.so(+0x8aa897) [0x7feefdd24897]
  [bt] (2) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::ReprPrinter::Print(tvm::runtime::ObjectRef const&)+0xfd) [0x7feefdd244bd]
  [bt] (1) /home/masa/projects/dev/tvm/build/libtvm.so(+0x71bdec) [0x7feefdb95dec]
  [bt] (0) /home/masa/projects/dev/tvm/build/libtvm.so(+0x71ae18) [0x7feefdb94e18]
  File "/home/masa/projects/dev/tvm/src/auto_scheduler/compute_dag.cc", line 1412
TVMError: Unsupported reduction operator(x*y)

You might need to rebuild TVM for this change in compute_dag.cc: https://github.com/comaniac/tvm/blob/ansor_vm/src/auto_scheduler/compute_dag.cc#L1400

Thanks, I forgot to rebuild. Now I get 64 tasks and have kicked off tuning for CUDA.

Unfortunately, the large dense layer that is the bottleneck of MaskRCNN on CUDA (taking more than 40% of total time) is a dynamic workload whose input is (num_detected_box * 12254), so it cannot be tuned.

Still, I’m looking forward to the outcome of the auto tuning on the following points:

  • Most of the convolution layers in MaskRCNN come from resnet50, so autotvm should already be picking reasonable schedules out of the box. It would be interesting to see how much of a win I’d get by running the resnet50 backbone with auto-scheduled kernels.

  • MaskRCNN from PyTorch is in NCHW layout. I understand that Ansor prefers NHWC, but I wonder how it performs with NCHW. If NCHW perf is not good, I’ll try ConvertLayout and NHWC tuning.

@comaniac I got the following interesting error when tuning the task with ID 48. The tuning script was aborted, so this is not an invalid-code problem but a problem in the auto scheduler itself.

Estimated total latency: - ms   Trials: 3058    Used time : 5765 s      Next ID: 48                                                                                          
----------------------------------------------------------------------                                                                                                       
------------------------------  [ Search ]                                                                                                                                   
----------------------------------------------------------------------                                                                                                       
Generate Sketches               #s: 1                                                 
Traceback (most recent call last):                                                                                                                                           
  File "maskrcnn_test.py", line 155, in <module>                                                                                                                             
    auto_schedule()                                                                                                                                                          
  File "maskrcnn_test.py", line 118, in auto_schedule
    tuner.tune(tune_option)
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/task_scheduler.py", line 324, in tune
    self._tune_task(idx)
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/task_scheduler.py", line 420, in _tune_task
    self.num_measures_per_round, self.measurer
  File "/home/masa/projects/dev/tvm/python/tvm/auto_scheduler/search_policy.py", line 86, in continue_search_one_round
    return _ffi_api.SearchPolicyContinueSearchOneRound(self, num_measure, measurer)
  File "/home/masa/projects/dev/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
...
...
  [bt] (2) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::auto_scheduler::InitThreadBind::Apply(tvm::auto_scheduler::SketchPolicyNode*, tvm::auto_scheduler::State*, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>*) const+0x5a5) [0x7f2f95c1c205]
  [bt] (1) /home/masa/projects/dev/tvm/build/libtvm.so(tvm::auto_scheduler::FuseAllOuterSpaceIterators(tvm::auto_scheduler::State const&, int, tvm::auto_scheduler::Iterator*)+0x3a9) [0x7f2f95c24129]
  [bt] (0) /home/masa/projects/dev/tvm/build/libtvm.so(+0x7b8f28) [0x7f2f95c0ff28]
  File "/home/masa/projects/dev/tvm/src/support/parallel_for.cc", line 92
TVMError: Parallel_for error with [08:39:59] /home/masa/projects/dev/tvm/src/auto_scheduler/search_policy/utils.h:612: 
---------------------------------------------------------------
An internal invariant was violated during the execution of TVM.
Please read TVM's error reporting guidelines.
More details can be found here: https://discuss.tvm.ai/t/error-reporting/7793.
---------------------------------------------------------------
  Check failed: !to_fuse.empty() == false: 

The task 48 is this one:

========== Task 48  (workload key: ["d19859b192abcefb252f6ea36ca4b5c5"]) ==========
placeholder = PLACEHOLDER [1, 256, 13, 13]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 14)) && (i3 >= 1)) && (i3 < 14)), placeholder[i0, i1, (i2 - 1), (i3 - 1)], 0f)
placeholder = PLACEHOLDER [256, 256, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*placeholder[ff, rc, ry, rx])
placeholder = PLACEHOLDER [1, 256, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, 0, 0])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

Interestingly, this is the same task that gave the Unsupported reduction operator(x*y) error earlier. Could this be related?

The task 48 you posted doesn’t have the unsupported reduction operator. Maybe it’s a typo? This is the task 48 I got, and I am able to reproduce the error you got by tuning this task:

=== TASK 48 ===
placeholder = PLACEHOLDER [2]
placeholder_red()reduce(x*y)

This error looks like a bug or a limitation in the auto_scheduler to me (cc @jcf94 ).
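For reference, a tiny compute of the same flavor can be constructed directly in TE (this is just an assumed sketch for illustration, not the serialized task itself):

import tvm
from tvm import te, topi, auto_scheduler

A = te.placeholder((2,), name="placeholder")
B = topi.prod(A, axis=0)  # product reduction x*y over a 2-element tensor
dag = auto_scheduler.ComputeDAG([A, B])
print(dag)  # expected to hit "Unsupported reduction operator(x*y)" in the printer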

I’ve filed a PR first to cover the above changes I made to get to this point, so that others can chime in:

In addition, here is the serialized task 48 and the script to reproduce the error:


This is a bug in auto-scheduler. We haven’t tested this kind of compute_dag before. I think it is easy to fix; I or @jcf94 can take a look later.

Since this task only has two elements, it does not need to be tuned. As a workaround, we can simply delete this task after task extraction with something like

del tasks[48]
del task_weights[48]

Then it will fall back to the topi schedule during compilation.


OK, it seems there are four tasks with the UnsupportedReduce error. For now, I hacked the task extraction code in compile_engine.cc to only send conv2d ops to the auto scheduler. There are still 46 tuning tasks to keep my GPU busy for a while.