Autoscheduler doesn't tune Transpose op correctly

Hi, I am new to TVM and I am working on an application that has to perform a lot of transpose operations on high-rank tensors (often more than 10 indices).

I wanted to see how well the auto-scheduler can optimize these transpose operations, but it doesn’t seem to generate a wide range of schedules to search through:

import tvm
from tvm import te, topi, auto_scheduler

@auto_scheduler.register_workload
def transpose(shape, axes):
    A = te.placeholder(shape, name='Placeholder_A')
    out = topi.transpose(A, axes)
    return [A, out]

axes = [0, 1, 2, 3, 5, 4]
shape = [6, 8, 2, 4, 2, 5]
task = auto_scheduler.SearchTask(func=transpose, args=(shape, axes), target='llvm')

print(task.compute_dag)

log_file = "transpose.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=500,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

task.tune(tune_option)
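
After tuning, I apply the best record from the log to inspect what was generated (apply_best assumes the log contains at least one valid entry):

sch, args = task.apply_best(log_file)
print(tvm.lower(sch, args, simple_mode=True))   # inspect the lowered loop nest
func = tvm.build(sch, args, target='llvm')      # build the tuned function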

I have included the tuning output log below. It looks like the auto-scheduler only explores a relatively small search space and never tries loop reordering, whereas I would expect a transpose like this to admit quite a large space of schedules.

Does anyone know what I can do to make it tune correctly? Any help would be greatly appreciated.

Placeholder_A = PLACEHOLDER [6, 8, 2, 4, 2, 5]
T_transpose(ax0, ax1, ax2, ax3, ax4, ax5) = Placeholder_A[ax0, ax1, ax2, ax3, ax5, ax4]

----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches               #s: 1
Sample Iter: 5  #Pop: 1 #Target: 50     fail_ct: 10239  Time elapsed: 0.79
#Target has been reduced to 25 due to too many failures or duplications
Sample Iter: 10 #Pop: 1 #Target: 25     fail_ct: 20479  Time elapsed: 1.59
#Target has been reduced to 12 due to too many failures or duplications
Sample Iter: 15 #Pop: 1 #Target: 12     fail_ct: 30719  Time elapsed: 2.37
#Target has been reduced to 6 due to too many failures or duplications
Sample Iter: 20 #Pop: 1 #Target: 6      fail_ct: 40959  Time elapsed: 3.16
#Target has been reduced to 3 due to too many failures or duplications
Sample Iter: 25 #Pop: 1 #Target: 3      fail_ct: 51199  Time elapsed: 3.95
#Target has been reduced to 1 due to too many failures or duplications
Sample Initial Population       #s: 1   fail_ct: 53247  Time elapsed: 4.12
GA Iter: 0      Max score: 0.4089       Min score: 0.4089       #Pop: 1 #M+: 0  #M-: 0
GA Iter: 4      Max score: 0.6576       Min score: 0.0138       #Pop: 5 #M+: 63 #M-: 8920
EvolutionarySearch              #s: 5   Time elapsed: 0.64
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 5 programs to measure:
.....*****
==================================================
No: 1   GFLOPS: 0.00 / 0.00     results: MeasureResult(cost:[0.0000], error_no:0, all_cost:0.33, Tstamp:1637765955.24)
==================================================
Placeholder: Placeholder_A
parallel ax0@ax1@ax2@ax3@ax4@ (0,1920)
  vectorize ax5 (0,2)
    T_transpose = ...

==================================================
No: 2   GFLOPS: 0.00 / 0.00     results: MeasureResult(cost:[0.0000], error_no:0, all_cost:0.31, Tstamp:1637765955.50)
==================================================
Placeholder: Placeholder_A
parallel ax0@ax1@ax2@ (0,96)
  for ax3 (0,4)
    for ax4 (0,5)
      vectorize ax5 (0,2)
        T_transpose = ...

==================================================
No: 3   GFLOPS: 0.00 / 0.00     results: MeasureResult(cost:[0.0000], error_no:0, all_cost:0.27, Tstamp:1637765955.74)
==================================================
Placeholder: Placeholder_A
parallel ax0@ax1@ax2@ax3@ (0,384)
  for ax4 (0,5)
    vectorize ax5 (0,2)
      T_transpose = ...

==================================================
No: 4   GFLOPS: 0.00 / 0.00     results: MeasureResult(cost:[0.0000], error_no:0, all_cost:0.33, Tstamp:1637765955.99)
==================================================
Placeholder: Placeholder_A
parallel ax0@ax1@ (0,48)
  for ax2 (0,2)
    for ax3 (0,4)
      for ax4 (0,5)
        vectorize ax5 (0,2)
          T_transpose = ...

==================================================
No: 5   GFLOPS: 0.00 / 0.00     results: MeasureResult(cost:[0.0000], error_no:0, all_cost:0.35, Tstamp:1637765956.23)
==================================================
Placeholder: Placeholder_A
parallel ax0@ (0,6)
  for ax1 (0,8)
    for ax2 (0,2)
      for ax3 (0,4)
        for ax4 (0,5)
          vectorize ax5 (0,2)
            T_transpose = ...

Time elapsed for measurement: 2.38 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.10 s
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population       #s: 1   fail_ct: 2047   Time elapsed: 0.17
GA Iter: 0      Max score: N/A  Min score: N/A  #Pop: 0 #M+: 0  #M-: 0
GA Iter: 4      Max score: N/A  Min score: N/A  #Pop: 0 #M+: 64 #M-: 8913
EvolutionarySearch              #s: 0   Time elapsed: 0.58
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population       #s: 1   fail_ct: 2047   Time elapsed: 0.17
GA Iter: 0      Max score: N/A  Min score: N/A  #Pop: 0 #M+: 0  #M-: 0
GA Iter: 4      Max score: N/A  Min score: N/A  #Pop: 0 #M+: 59 #M-: 8861
EvolutionarySearch              #s: 0   Time elapsed: 0.60
It seems all candidates in the search space have been measured.
----------------------------------------------------------------------
------------------------------  [ Done ]
----------------------------------------------------------------------
No valid state found in this search round. Check if it has traversed all of the search space.

At first glance, I don't think transpose can be sped up significantly with the auto-scheduler or AutoTVM, because it is mainly a data copy without any reduction.

When we have lots of transpose ops in a model, we usually try to reduce them at the graph level. For example, transpose -> dense might be transformed into just a matmul. Another direction is fusing them with the previous/next op that performs a reduction; in that case the transpose op might be inlined, resulting in negligible overhead.
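
As a minimal TE sketch of the inlining point (the trailing sum is made up purely for illustration; any consumer with a reduction plays the same role):

import tvm
from tvm import te, topi

A = te.placeholder((6, 8, 2, 4, 2, 5), name='A')
T = topi.transpose(A, [0, 1, 2, 3, 5, 4])
out = topi.sum(T, axis=5)      # toy consumer op that performs a reduction

s = te.create_schedule(out.op)
s[T].compute_inline()          # the transpose stage disappears; the sum reads A with permuted indices directly
print(tvm.lower(s, [A, out], simple_mode=True))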


Hi, thanks for the reply. I have come across a few attempts to improve the performance of transpositions of high-rank tensors, mostly for quantum chemistry applications:

These approaches use loop reordering and blocking, along with some other heuristics, to improve performance; it seems like the auto-scheduler should be able to try the same kinds of adjustments.
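
For example, here is a hand-written TE sketch of the kind of blocking and loop reordering these papers describe, on a toy 2-D transpose (the size and the 32x32 tile are just illustrative; the papers apply the same idea to higher-rank permutations):

import tvm
from tvm import te, topi

N = 1024
A = te.placeholder((N, N), name='A')
out = topi.transpose(A, [1, 0])

s = te.create_schedule(out.op)
i, j = s[out].op.axis
# Work on 32x32 tiles so both the reads from A and the writes to out
# stay within a small, cache-friendly block.
io, ii = s[out].split(i, factor=32)
jo, ji = s[out].split(j, factor=32)
s[out].reorder(io, jo, ii, ji)
s[out].vectorize(ji)
s[out].parallel(io)
print(tvm.lower(s, [A, out], simple_mode=True))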

Do you think that there is no need for this in TVM because of the graph optimisations you mention, even for high-rank tensors?

No, that's not what I meant. I meant that the transpose workloads in deep learning that TVM has considered are usually optimized along with other reduction ops, so we don't really know how it performs on high-rank transposes.

Meanwhile, you are welcome to share your workloads or even help improve the transpose scheduling :slight_smile:

I would love to try to improve the transpose scheduling, but I am not sure where to begin. Is there a way to define custom schedule generation for a particular operation, like the transpose op?


With the auto-scheduler, we don't need to (and cannot) define a schedule for a particular op. However, if you would like to contribute a schedule for the transpose op, you could take a look at TOPI/AutoTVM. For example, you could start by learning how we register a softmax schedule for CUDA: https://github.com/apache/tvm/blob/main/python/tvm/topi/cuda/softmax.py
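
To make the AutoTVM direction a bit more concrete, a tunable template for this transpose could look roughly like the sketch below. This is a standalone AutoTVM template rather than a proper TOPI schedule registration like the softmax example, and the template name and knobs are made up, so treat it only as a starting point:

import tvm
from tvm import te, topi, autotvm

# Hypothetical tunable transpose template (not part of TOPI today).
@autotvm.template('example/transpose')
def transpose_template(shape, axes):
    A = te.placeholder(shape, name='A')
    out = topi.transpose(A, axes)
    s = te.create_schedule(out.op)

    cfg = autotvm.get_config()
    ax = list(s[out].op.axis)
    # Let the tuner choose split factors for the two innermost output axes ...
    cfg.define_split('tile_a', ax[-2], num_outputs=2)
    cfg.define_split('tile_b', ax[-1], num_outputs=2)
    ao, ai = cfg['tile_a'].apply(s, out, ax[-2])
    bo, bi = cfg['tile_b'].apply(s, out, ax[-1])
    # ... and the order of the inner block (a crude stand-in for a loop-order search).
    cfg.define_knob('inner_order', [0, 1])
    if cfg['inner_order'].val == 0:
        s[out].reorder(*ax[:-2], ao, bo, ai, bi)
    else:
        s[out].reorder(*ax[:-2], ao, bo, bi, ai)
    fused = s[out].fuse(*ax[:-2])
    s[out].parallel(fused)
    return s, [A, out]

task = autotvm.task.create('example/transpose',
                           args=((6, 8, 2, 4, 2, 5), (0, 1, 2, 3, 5, 4)),
                           target='llvm')
print(task.config_space)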