My idea looks at adding official TVM tensor expression implementations of kernels from the polybench benchmarking system.
It’s possible that some kernels do not make sense within the TVM programming model, however I still believe there is value here. I’ve started my implementation, but would like to engage with the community during this process.
Polybench is a very common benchmark suite used to benchmark the performance of compilation techniques, especially in the polyhedral compilation community, using small kernels. >2000 papers, when searching Google Scholar with the term “polybench”. To quote from the README of PolyBench/C
PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics simulation, dynamic programming, statistics, etc.).
TVM is not using the polyhedral model, at least not as a first-class citizen. However, in some of my experiments, I have found that using TVM and its auto-scheduler (Ansor), can get much better performance than polyhedral compilers (e.g. polly).
I believe that having an official implementation of these kernels would allow TVM to be more consistently evaluated against these systems by compiler researchers.
The programming model of TVM means that perhaps not all of these kernels can be implemented as an exact match of the original polybench-C kernels, and thus there may be some caveats. However, an open community effort will allows these considerations to be exposed and discussed, and improve the quality of any subsequent research.
Polybench is a fairly stable suite of benchmarks, so its inclusion will not create a large ongoing maintenance overhead to the project.
I believe that having this project under the
apps/ directory of TVM is the place for it to go.
It doesn’t make sense to integrate it in the wider codebase, since these kernels should be standalone.
- The code should be clean and documented, and easy to extend.
- Each kernel should be defined as its own Python file, with CLI arguments for size, target, and whether or not to run Ansor
- Users should be able to run the whole benchmark suite in a single command (so that users unfamiliar with TVM can bring it as a point of comparison).
- The coding conventions (including variable names) should aim to be as close to the polybench-C codebase as possible. This will make comparison easier.
I have began implementation of the linalg kernels, which I believe are the best fit for TVM. The other classes (e.g., datamining, stencils), may be less suited, however we can establish which ones it makes sense to implement.
TVM’s programming model may not be suitable for getting perfect equivalence to the original unmodified kernels. We should aim for “best-effort”, and document any differences.
- Questions around fairness in how TVM does “in-place” computations - as far as I know it doesn’t. Thus we may not be comparing exactly the same thing for some kernels:
- things like 0-initialization may not be performed in the same way
- or we may allocate more memory for cases such as
C := alpha*A*B + beta*C, in TVM the LHS C will be a new array.
I have started developing this
polybench-tvm, under my branch feat/polybench-tvm, and have a few kernels ready, with a default CPU schedule (default GPU schedule in-coming).
However, I want to engage the community to input in terms of direction, and code structure.
Other features I am thinking of:
- Each kernel has various predefined data sizes for benchmarking (e.g., MINI, SMALL, LARGE, etc). These should be included
- I am using NumPy to generate test data, however these tests should be verified against the reference implementation
I will update this table as I go, but does anyone have any thoughts or comments?
|2 Matrix Multiplications (D=A.B; E=C.D)
|3 Matrix Multiplications (E=A.B; F=C.D; G=E.F)
|Alternating Direction Implicit solver
|Matrix Transpose and Vector Multiplication
|BiCG Sub Kernel of BiCGStab Linear Solver
|Multiresolution analysis kernel (MADNESS)
|Toeplitz system solver
|Dynamic programming (2D)
|2-D Finite Different Time Domain Kernel
|FDTD using Anisotropic Perfectly Matched Layer
|Vector Multiplication and Matrix Addition
|Scalar, Vector and Matrix Multiplication
|1-D Jacobi stencil computation
|2-D Jacobi stencil computation
|Matrix Vector Product and Transpose
|2-D Image processing
|2-D Seidel stencil computation
|Symmetric rank-2k operations
|Symmetric rank-k operations