[pre-RFC] Polybench-TVM

My idea looks at adding official TVM tensor expression implementations of kernels from the polybench benchmarking system.

It’s possible that some kernels do not make sense within the TVM programming model, however I still believe there is value here. I’ve started my implementation, but would like to engage with the community during this process.

Motivation

What

Polybench is a very common benchmark suite used to benchmark the performance of compilation techniques, especially in the polyhedral compilation community, using small kernels. >2000 papers, when searching Google Scholar with the term “polybench”. To quote from the README of PolyBench/C

PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics simulation, dynamic programming, statistics, etc.).

Relevance to TVM

TVM is not using the polyhedral model, at least not as a first-class citizen. However, in some of my experiments, I have found that using TVM and its auto-scheduler (Ansor), can get much better performance than polyhedral compilers (e.g. polly).

I believe that having an official implementation of these kernels would allow TVM to be more consistently evaluated against these systems by compiler researchers.

The programming model of TVM means that perhaps not all of these kernels can be implemented as an exact match of the original polybench-C kernels, and thus there may be some caveats. However, an open community effort will allows these considerations to be exposed and discussed, and improve the quality of any subsequent research.

Polybench is a fairly stable suite of benchmarks, so its inclusion will not create a large ongoing maintenance overhead to the project.

Plan

I believe that having this project under the apps/ directory of TVM is the place for it to go. It doesn’t make sense to integrate it in the wider codebase, since these kernels should be standalone.

  • The code should be clean and documented, and easy to extend.
  • Each kernel should be defined as its own Python file, with CLI arguments for size, target, and whether or not to run Ansor
  • Users should be able to run the whole benchmark suite in a single command (so that users unfamiliar with TVM can bring it as a point of comparison).
  • The coding conventions (including variable names) should aim to be as close to the polybench-C codebase as possible. This will make comparison easier.

I have began implementation of the linalg kernels, which I believe are the best fit for TVM. The other classes (e.g., datamining, stencils), may be less suited, however we can establish which ones it makes sense to implement.

Caveats

TVM’s programming model may not be suitable for getting perfect equivalence to the original unmodified kernels. We should aim for “best-effort”, and document any differences.

  • Questions around fairness in how TVM does “in-place” computations - as far as I know it doesn’t. Thus we may not be comparing exactly the same thing for some kernels:
    • things like 0-initialization may not be performed in the same way
    • or we may allocate more memory for cases such as C := alpha*A*B + beta*C, in TVM the LHS C will be a new array.

I have started developing this polybench-tvm, under my branch feat/polybench-tvm, and have a few kernels ready, with a default CPU schedule (default GPU schedule in-coming). However, I want to engage the community to input in terms of direction, and code structure.

Other features I am thinking of:

  • Each kernel has various predefined data sizes for benchmarking (e.g., MINI, SMALL, LARGE, etc). These should be included
  • I am using NumPy to generate test data, however these tests should be verified against the reference implementation

I will update this table as I go, but does anyone have any thoughts or comments?

Benchmark Supported Description
2mm :white_check_mark: 2 Matrix Multiplications (D=A.B; E=C.D)
3mm :white_check_mark: 3 Matrix Multiplications (E=A.B; F=C.D; G=E.F)
adi Alternating Direction Implicit solver
atax :white_check_mark: Matrix Transpose and Vector Multiplication
bicg BiCG Sub Kernel of BiCGStab Linear Solver
cholesky Cholesky Decomposition
correlation Correlation Computation
covariance Covariance Computation
doitgen Multiresolution analysis kernel (MADNESS)
durbin Toeplitz system solver
dynprog Dynamic programming (2D)
fdtd-2d 2-D Finite Different Time Domain Kernel
fdtd-apml FDTD using Anisotropic Perfectly Matched Layer
gauss-filter Gaussian Filter
gemm :white_check_mark: Matrix-multiply C=alpha.A.B+beta.C
gemver :hourglass: Vector Multiplication and Matrix Addition
gesummv Scalar, Vector and Matrix Multiplication
gramschmidt Gram-Schmidt decomposition
jacobi-1D 1-D Jacobi stencil computation
jacobi-2D 2-D Jacobi stencil computation
lu LU decomposition
ludcmp LU decomposition
mvt Matrix Vector Product and Transpose
reg-detect 2-D Image processing
seidel 2-D Seidel stencil computation
symm Symmetric matrix-multiply
syr2k Symmetric rank-2k operations
syrk Symmetric rank-k operations
trisolv Triangular solver
trmm Triangular matrix-multiply
2 Likes