The sizes of the input matrices are NxL and LxM. If the size L is a constant known at compile time, the code is about 1.25x faster than when L is a variable, even though the generated code looks the same and the value of L didn't change.
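For reference, this is roughly what the two variants look like (a minimal sketch with arbitrary sizes, not the tutorial's exact code):

from tvm import te

N, M = 1024, 1024  # arbitrary

# Variant 1: L fixed at compile time
L = 512
A = te.placeholder((N, L), name='A')
B = te.placeholder((L, M), name='B')
k = te.reduce_axis((0, L), name='k')
C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name='C')

# Variant 2: L only known at run time
L_var = te.var('L')
A_v = te.placeholder((N, L_var), name='A')
B_v = te.placeholder((L_var, M), name='B')
k_v = te.reduce_axis((0, L_var), name='k')
C_v = te.compute((N, M), lambda i, j: te.sum(A_v[i, k_v] * B_v[k_v, j], axis=k_v), name='C')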
The tutorial suggested using s[Z].reorder(i_outer, j_outer, k, i_inner, j_inner). I understand why that might be useful, but in practice, not using reorder at all works better than this. Also, I don't understand why k comes before i_inner and j_inner rather than after.
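Concretely, on the matmul above (what the tutorial calls Z), the suggested schedule is roughly (split factors are arbitrary here):

s = te.create_schedule(C.op)
i, j = s[C].op.axis
(k,) = s[C].op.reduce_axis
i_outer, i_inner = s[C].split(i, factor=32)
j_outer, j_inner = s[C].split(j, factor=32)
s[C].reorder(i_outer, j_outer, k, i_inner, j_inner)
# resulting loop nest: i_outer > j_outer > k > i_inner > j_inner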
I think that tutorial doesn't really push the performance, since it mainly focuses on how to create a tuning space. For example, the throughput shown in the log is only 10+ GFLOPS, which is far from what GEMM should achieve. Maybe that's also why the constant shape doesn't bring much speedup.
In my opinion, the best way to understand how to use TVM to optimize GEMM on GPU is to read the TOPI scheduling implementation here. You might be interested in the functions schedule_dense_small_batch and schedule_dense_large_batch.
@comaniac, so I followed schedule_dense_small_batch to implement batched matrix multiplication, and it gave a nice speedup. However, it is still 10x-15x slower than PyTorch’s torch.bmm depending on the GPU.
Did you compare the performance with the small-batch schedule? I remember the optimizations applied to that schedule were also limited. If yours is better than that schedule in TOPI, then you are welcome to help us improve the quality of the TOPI schedules.
Most of the ops so far are specialized for constant dimensions, so in this case we will need to put in constant dimensions (instead of te.var). Because batch_matmul itself is compute-intensive, we might also benefit from AutoTVM instead of using the default schedule.
Note that specialization is necessary for this type of kernel; this is also what cuDNN does (which PyTorch calls under the hood). The easiest way to make use of such kernels is to generate a few candidates and fall back to a normal kernel when the shape does not match. There are also ongoing efforts in the community on automatic bucketing, etc.
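A rough sketch of what I mean (build_bmm_module and common_shapes are made-up names, just for illustration):

# common_shapes would be the handful of (batch, seq_len) combinations you care about;
# build_bmm_module is a hypothetical helper that compiles batch_matmul for the given
# sizes, or for symbolic (te.var) sizes when they are None
specialized = {shape: build_bmm_module(*shape) for shape in common_shapes}
generic = build_bmm_module(None, None)  # te.var based: slower, but works for any shape

def run_bmm(a, b, out):
    mod = specialized.get((a.shape[0], a.shape[1]), generic)  # fall back on shape mismatch
    mod(a, b, out)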
Thanks @tqchen, this was helpful. Using constants as follows:
import topi
from tvm import te

bsz = 24
d1 = 16384
d2 = 512
d3 = 64
A = te.placeholder((bsz, d1, d3), name='A', dtype='float32')  # first tensor
B = te.placeholder((bsz, d2, d3), name='B', dtype='float32')  # second tensor
R = topi.nn.batch_matmul(A, B)  # R[b, i, j] = sum_k A[b, i, k] * B[b, j, k]
s = topi.cuda.batch_matmul.schedule_batch_matmul(R)
makes it 5x faster, which is still 3x slower than PyTorch, but much better (I build and time it roughly as in the sketch at the end of this post). So:
1. How do I improve this further? Use AutoTVM? I have tried it before for custom schedules that I implemented, but I am not sure how to do that for schedule_batch_matmul.
2. More importantly, in my application d2 and d3 are constants, but bsz and d1 are not; bsz has 16 possible values and d1 has 32 possible values, making 512 configurations in total. Does that mean I need to compile 512 different versions of the code? There must be a better solution. Maybe the schedule can be updated to support a range of values instead of constants?
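For completeness, the build/timing sketch mentioned above (the evaluator settings are arbitrary):

import numpy as np
import tvm

f = tvm.build(s, [A, B, R], target='cuda')
ctx = tvm.gpu(0)

a = tvm.nd.array(np.random.rand(bsz, d1, d3).astype('float32'), ctx)
b = tvm.nd.array(np.random.rand(bsz, d2, d3).astype('float32'), ctx)
r = tvm.nd.array(np.zeros((bsz, d1, d2), dtype='float32'), ctx)

evaluator = f.time_evaluator(f.entry_name, ctx, number=100)
print('mean time: %.3f ms' % (evaluator(a, b, r).mean * 1e3))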
Others might be able to jump in to comment on running AutoTVM, which will certainly help.
The above comment (about generating a few key variants and falling back to a normal kernel) is a quick workaround for the constant-shape constraint. I agree that we will need better solutions for dynamic workloads; there are some ongoing efforts, see the dynamic-workload presentations in the Apache TVM User Conference Archive.
I just checked it out, and unfortunately the batch_matmul schedule for CUDA does not have a tuning space (https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/cuda/batch_matmul.py). A schedule with a tuning space should have a set of knobs defined in the schedule, such as this. You are very welcome to add a tuning space to topi.cuda.batch_matmul.schedule_batch_matmul.
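To illustrate what such knobs look like, here is a toy tunable template for batch_matmul in the style of that tutorial (not the actual TOPI schedule; the task name and tiling choices are arbitrary):

import tvm
from tvm import autotvm, te

@autotvm.template('tutorial/batch_matmul')  # arbitrary task name
def batch_matmul_template(batch, M, N, K):
    A = te.placeholder((batch, M, K), name='A')
    B = te.placeholder((batch, N, K), name='B')
    k = te.reduce_axis((0, K), name='k')
    # same contract as topi.nn.batch_matmul: C[b, i, j] = sum_k A[b, i, k] * B[b, j, k]
    C = te.compute((batch, M, N),
                   lambda b, i, j: te.sum(A[b, i, k] * B[b, j, k], axis=k),
                   name='C')
    s = te.create_schedule(C.op)
    b, y, x = s[C].op.axis

    # the knobs: tile sizes are searched by AutoTVM instead of being hard-coded
    cfg = autotvm.get_config()
    cfg.define_split('tile_y', y, num_outputs=2)
    cfg.define_split('tile_x', x, num_outputs=2)
    by, ty = cfg['tile_y'].apply(s, C, y)
    bx, tx = cfg['tile_x'].apply(s, C, x)
    s[C].reorder(b, by, bx, ty, tx)

    s[C].bind(b, te.thread_axis('blockIdx.z'))
    s[C].bind(by, te.thread_axis('blockIdx.y'))
    s[C].bind(bx, te.thread_axis('blockIdx.x'))
    s[C].bind(ty, te.thread_axis('threadIdx.y'))
    s[C].bind(tx, te.thread_axis('threadIdx.x'))
    return s, [A, B, C]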
After a tuning space has been created, you could follow this tutorial to register the schedule function as an AutoTVM task and tune it to see the performance improvement.
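Then, roughly following that tutorial, tuning the toy template above would look like this (the trial count and log file name are placeholders):

import tvm
from tvm import autotvm

task = autotvm.task.create('tutorial/batch_matmul',
                           args=(24, 16384, 512, 64), target='cuda')
measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=3, min_repeat_ms=100))

tuner = autotvm.tuner.XGBTuner(task)
tuner.tune(n_trial=200, measure_option=measure_option,
           callbacks=[autotvm.callback.log_to_file('batch_matmul.log')])

# compile with the best config found during tuning
with autotvm.apply_history_best('batch_matmul.log'):
    with tvm.target.create('cuda'):
        s, args = batch_matmul_template(24, 16384, 512, 64)
        func = tvm.build(s, args)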
Sorry, I haven't dived into the batch_matmul schedule function, so I have no idea. You could do some experiments and file a PR once you have identified a tuning space.
I didn’t quite get your second question. Are you asking how define_split works?
Sorry, the question wasn't clear. I was wondering if there is documentation or a high-level description of the schedule (there are no comments in the code), but I guess the answer is no.
Sorry, as you expected, we don't have such documentation. All schedules were developed and committed by contributors, and since end users are not supposed to read the schedule functions, how a schedule function was developed is not documented. The closest tutorial you can find is this one.
@comaniac, hopefully the last question: do you know how to save multiple modules into one .so file? I want to save multiple versions of the function, as @tqchen suggested, but tvm.runtime.export_library saves only one, and tvm.cc.create_shared doesn't link the host code with the target code the way tvm.runtime.export_library does. I am sure this can be done with gcc or something, but I was wondering if TVM already has a solution, given that it is a common use case.
AFAIK you can only keep one module in one .so file. If you want to keep multiple versions to deal with dynamic workloads, it's better to wait for the corresponding support from @haichen. Otherwise, you can only integrate all versions into a single schedule function for now.
Given that it is not easy to get multiple versions of the function compiled together, I tried to see if schedule_batch_matmul can be changed to work with dynamic input sizes (at the expense of being slightly less efficient). It seems that this line and the one above can be changed to the constants y_bn = x_bn = 64 (see the snippet below). However, when I do so, I get the following error:
The full codegen and the error are here. Any suggestions on what could be wrong? My guess is that some of the scheduling primitives still rely on constant input sizes, but I couldn't pinpoint which one. It is probably compute_at, but I am fairly sure I have used compute_at with variable input sizes before and it worked fine.
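For clarity, the change I tried is roughly this (quoting the original lines from memory, so they may not be exact):

# original: tile sizes derived from the (static) output shape, roughly
# y_bn = get_max_power2_factor(M, 64)
# x_bn = get_max_power2_factor(N, 64)
# my change: fixed tile sizes, so the output shape no longer needs to be constant
y_bn = 64
x_bn = 64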
Any ideas?