[pre-RFC] Polybench-TVM

Wheest · June 21, 2022, 2:59pm

My idea looks at adding official TVM tensor expression implementations of kernels from the polybench benchmarking system.

It’s possible that some kernels do not make sense within the TVM programming model, however I still believe there is value here. I’ve started my implementation, but would like to engage with the community during this process.

Motivation

What

Polybench is a very common benchmark suite used to benchmark the performance of compilation techniques, especially in the polyhedral compilation community, using small kernels. >2000 papers, when searching Google Scholar with the term “polybench”. To quote from the README of PolyBench/C

PolyBench is a benchmark suite of 30 numerical computations with static control flow, extracted from operations in various application domains (linear algebra computations, image processing, physics simulation, dynamic programming, statistics, etc.).

Relevance to TVM

TVM is not using the polyhedral model, at least not as a first-class citizen. However, in some of my experiments, I have found that using TVM and its auto-scheduler (Ansor), can get much better performance than polyhedral compilers (e.g. polly).

I believe that having an official implementation of these kernels would allow TVM to be more consistently evaluated against these systems by compiler researchers.

The programming model of TVM means that perhaps not all of these kernels can be implemented as an exact match of the original polybench-C kernels, and thus there may be some caveats. However, an open community effort will allows these considerations to be exposed and discussed, and improve the quality of any subsequent research.

Polybench is a fairly stable suite of benchmarks, so its inclusion will not create a large ongoing maintenance overhead to the project.

Plan

I believe that having this project under the apps/ directory of TVM is the place for it to go. It doesn’t make sense to integrate it in the wider codebase, since these kernels should be standalone.

The code should be clean and documented, and easy to extend.
Each kernel should be defined as its own Python file, with CLI arguments for size, target, and whether or not to run Ansor
Users should be able to run the whole benchmark suite in a single command (so that users unfamiliar with TVM can bring it as a point of comparison).
The coding conventions (including variable names) should aim to be as close to the polybench-C codebase as possible. This will make comparison easier.

I have began implementation of the linalg kernels, which I believe are the best fit for TVM. The other classes (e.g., datamining, stencils), may be less suited, however we can establish which ones it makes sense to implement.

Caveats

TVM’s programming model may not be suitable for getting perfect equivalence to the original unmodified kernels. We should aim for “best-effort”, and document any differences.

Questions around fairness in how TVM does “in-place” computations - as far as I know it doesn’t. Thus we may not be comparing exactly the same thing for some kernels:
- things like 0-initialization may not be performed in the same way
- or we may allocate more memory for cases such as C := alpha*A*B + beta*C, in TVM the LHS C will be a new array.

I have started developing this polybench-tvm, under my branch feat/polybench-tvm, and have a few kernels ready, with a default CPU schedule (default GPU schedule in-coming). However, I want to engage the community to input in terms of direction, and code structure.

Other features I am thinking of:

Each kernel has various predefined data sizes for benchmarking (e.g., MINI, SMALL, LARGE, etc). These should be included
I am using NumPy to generate test data, however these tests should be verified against the reference implementation

I will update this table as I go, but does anyone have any thoughts or comments?

Benchmark	Supported	Description
2mm		2 Matrix Multiplications (D=A.B; E=C.D)
3mm		3 Matrix Multiplications (E=A.B; F=C.D; G=E.F)
adi		Alternating Direction Implicit solver
atax		Matrix Transpose and Vector Multiplication
bicg		BiCG Sub Kernel of BiCGStab Linear Solver
cholesky		Cholesky Decomposition
correlation		Correlation Computation
covariance		Covariance Computation
doitgen		Multiresolution analysis kernel (MADNESS)
durbin		Toeplitz system solver
dynprog		Dynamic programming (2D)
fdtd-2d		2-D Finite Different Time Domain Kernel
fdtd-apml		FDTD using Anisotropic Perfectly Matched Layer
gauss-filter		Gaussian Filter
gemm		Matrix-multiply C=alpha.A.B+beta.C
gemver		Vector Multiplication and Matrix Addition
gesummv		Scalar, Vector and Matrix Multiplication
gramschmidt		Gram-Schmidt decomposition
jacobi-1D		1-D Jacobi stencil computation
jacobi-2D		2-D Jacobi stencil computation
lu		LU decomposition
ludcmp		LU decomposition
mvt		Matrix Vector Product and Transpose
reg-detect		2-D Image processing
seidel		2-D Seidel stencil computation
symm		Symmetric matrix-multiply
syr2k		Symmetric rank-2k operations
syrk		Symmetric rank-k operations
trisolv		Triangular solver
trmm		Triangular matrix-multiply