Numerical Stability (?) of matmul example

Hi there,

When following the tutorial on writing ops/kernels with TVM, I noticed that for some operators the numerical results do not always match NumPy's, and we have to set a small atol/rtol to pass the assertion. For example, for matmul:

import numpy as np
import torch
import tvm
from tvm import te, topi


def my_assertion(actual, desired, min_rtol=1, max_rtol=12):
    # Tighten rtol by one decade at a time and report where it first fails.
    for i in range(min_rtol, max_rtol):
        try:
            np.testing.assert_allclose(actual, desired, rtol=0.1 ** i)
        except AssertionError:
            print(f"Pass rtol 1e-{i-1}")
            print(f"Fail rtol 1e-{i}")
            return
    print("Pass all rtol tests")


if __name__ == "__main__":
    B, M, K, N = 1, 20, 2, 20
    dev = tvm.cpu()

    input = te.placeholder((B, M, K), name="input")
    weight = te.placeholder((B, N, K), name="weight")
    out = topi.nn.batch_matmul(input, weight)
    sch = te.create_schedule(out.op)
    tlinear = tvm.build(sch, [input, weight, out])

    idtype = input.dtype
    a = tvm.nd.array(np.random.uniform(size=(B, M, K)).astype(idtype), dev)
    b = tvm.nd.array(np.random.uniform(size=(B, N, K)).astype(idtype), dev)
    c = tvm.nd.array(np.random.uniform(size=(B, M, N)).astype(idtype), dev)
    tlinear(a, b, c)
    
    ta = torch.from_numpy(a.numpy())
    tb = torch.from_numpy(b.numpy())
    tb = tb.transpose(1, 2)
    
    tc = torch.matmul(ta, tb)

    print("Compare TVM with PTH")
    my_assertion(c.numpy(), tc.numpy())

    print("Compare TVM with Numpy")
    na = a.numpy()
    nb = b.numpy()

    nc_slices = [
        np.expand_dims(np.matmul(na[bi], nb[bi].T), axis=0)
        for bi in range(B)
    ]
    nc = np.concatenate(nc_slices, axis=0)
    my_assertion(c.numpy(), nc) 

    print("Compare PyTorch with Numpy")
    my_assertion(tc.numpy(), nc)

While PyTorch’s result perfectly matches NumPy’s, TVM’s output shows a numerical difference at around rtol=1e-6. Is there any way to close the gap?

Numerical stability really depends on the input data and the compute order (for example, with floating-point numbers it may happen that a + b + c != a + c + b), so I don’t think there is any perfect solution here to close the small 1e-6 gap.
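
For instance, a minimal float32 illustration (constants chosen to make the effect obvious):

import numpy as np

a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

# The float32 spacing near 1e8 is 8, so 1e8 + 1 rounds back to 1e8
# and the two addition orders give different answers.
print((a + c) + b)  # 0.0
print((a + b) + c)  # 1.0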

Thanks Junru. I understand the errors/gaps are sometimes unavoidable due to the IEEE float format. But on some operators, e.g., topi.nn.batch_norm, the relative error can be as large as 1e-2 or even 1e-1, which will definitely hurt precision.

You may try different schedules, which result in different compute orders, but it really depends on the distribution of the input tensors. If the inputs are ill-conditioned, neither PyTorch/NumPy nor TVM can guarantee numerical stability.
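
As a rough sketch of how a schedule changes the accumulation order (my own example; the 1024-element size and split factor of 16 are arbitrary), one can rfactor a plain float32 sum and compare it against NumPy:

import numpy as np
import tvm
from tvm import te

# Reduce a 1024-element float32 vector to a scalar.
A = te.placeholder((1024,), name="A")
k = te.reduce_axis((0, 1024), name="k")
S = te.compute((1,), lambda i: te.sum(A[k], axis=k), name="S")

sch = te.create_schedule(S.op)
ko, ki = sch[S].split(S.op.reduce_axis[0], factor=16)
# rfactor builds 16 partial sums first, then reduces them:
# a different accumulation order than a straight left-to-right sum.
SF = sch.rfactor(S, ki)
f = tvm.build(sch, [A, S])

a = tvm.nd.array(np.random.uniform(size=1024).astype("float32"))
s = tvm.nd.array(np.zeros(1, dtype="float32"))
f(a, s)
# The two results may disagree in the last few bits.
print(s.numpy()[0], np.sum(a.numpy()))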

It really depends on the workload and platforms, although 1e-2 is definitely not acceptable in the case of float32.

Here is my result after running your test script:

Compare TVM with PTH
Pass rtol 1e-6
Fail rtol 1e-7
Compare TVM with Numpy
Pass rtol 1e-6
Fail rtol 1e-7
Compare PyTorch with Numpy
Pass all rtol tests

This doesn’t seem like a problem to me, as we usually use 1e-5 as the criterion, which is sufficient for inference.

That holds for matmul, but for batch_norm, numerical stability is indeed an issue.

For example, in the following code

import numpy as np

import torch as th
import torch.nn.functional as F

import tvm
from tvm import te

def get_bn_data(n, c, h, w):
    data = np.random.normal(size=(n, c, h, w)).astype('float32')
    mean = np.random.normal(size=(c, )).astype('float32')

    # draw variances from a normal distribution centered at 1.0
    var = np.random.normal(loc=1.0, size=(c, )).astype('float32')
    # variances must be non-negative, so take the absolute value
    var = np.absolute(var)

    gamma = np.random.normal(size=(c, )).astype('float32')
    beta = np.random.normal(size=(c, )).astype('float32')
    out = np.empty((n, c, h, w), dtype='float32')
    return data, mean, var, gamma, beta, out

def get_bn_param():
    n = 2
    c = 32
    h = 28
    w = 28
    return n, c, h, w

def convert_np_data(*args, constructor=None):
    if constructor:
        return (constructor(x) for x in args)
    return args

def my_assertion(actual, desired, min_rtol=1, max_rtol=12, verbose=True):
    """Return the first exponent i at which rtol=1e-i fails.

    Raises if even rtol=1e-min_rtol fails; returns max_rtol if every
    tolerance up to 1e-(max_rtol-1) passes."""
    for i in range(min_rtol, max_rtol):
        try:
            np.testing.assert_allclose(actual, desired, rtol=0.1 ** i)
        except AssertionError:
            if i == min_rtol:
                raise AssertionError(f"Cannot pass minimal rtol=1e-{min_rtol}")
            if verbose:
                print(f"Pass rtol 1e-{i-1}")
                print(f"Fail rtol 1e-{i}")
            return i
    if verbose:
        print("Pass all rtol tests")
    return max_rtol

# an inference-mode batch norm, analogous to relay.nn.batch_norm
def batch_norm_with_shape(n, c, h, w, eps=1e-5):
    X = te.placeholder((n, c, h, w), name='X')
    mean = te.placeholder((c, ), name='Mean')
    var = te.placeholder((c, ), name='Var')
    gamma = te.placeholder((c, ), name='Gamma')
    beta = te.placeholder((c, ), name='Beta')
    Y = te.compute(
        (n, c, h, w),
        lambda nn, cc, hh, ww: 
            (X[nn, cc, hh, ww] - mean[cc]) / te.sqrt(var[cc] + eps) * gamma[cc] + beta[cc],
        name="batch_norm_2d"
    )
    return X, mean, var, gamma, beta, Y


if __name__ == "__main__":
    from tqdm import tqdm

    n, c, h, w = get_bn_param()
    X, Mean, Var, Gamma, Beta, Y = batch_norm_with_shape(n, c, h, w)
    sch = te.create_schedule(Y.op)
    mod = tvm.build(sch, [X, Mean, Var, Gamma, Beta, Y])

    success = 0
    failure = 0
    for i in tqdm(range(5000)):
        data, mean, var, gamma, beta, out = get_bn_data(n, c, h, w)
        
        data_tvm, mean_tvm, var_tvm, gamma_tvm, beta_tvm, out_tvm = convert_np_data(
            data, mean, var, gamma, beta, out, constructor=tvm.nd.array)
        mod(data_tvm, mean_tvm, var_tvm, gamma_tvm, beta_tvm, out_tvm)

        data_pth, mean_pth, var_pth, gamma_pth, beta_pth, _ = convert_np_data(
            data, mean, var, gamma, beta, out, constructor=th.Tensor)
        out_pth = F.batch_norm(data_pth, mean_pth, var_pth, gamma_pth, beta_pth)
        try:
            pr = my_assertion(out_pth.numpy(), out_tvm.numpy(), verbose=False)
        except AssertionError:
            pr = 0
        if pr < 2:
            failure += 1
        else:
            success += 1

    print(success, failure)

38 of 5000 runs have relative errors as large as 1e-2 on my server.
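
One plausible contributor (my own illustration, not taken from the runs above): the batch-norm output (x - mean) / sqrt(var + eps) * gamma + beta crosses zero, and wherever the true output is near zero, a rounding-level absolute difference between two algebraically equivalent orderings becomes a large relative error. The constants below are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000).astype("float32")
mean, var, gamma, beta, eps = (np.float32(v) for v in (0.1, 0.5, 1.3, -0.7, 1e-5))

y1 = (x - mean) / np.sqrt(var + eps) * gamma + beta    # divide first, scale later
y2 = (x - mean) * (gamma / np.sqrt(var + eps)) + beta  # fold gamma into the scale

rel = np.abs(y1 - y2) / np.maximum(np.abs(y2), np.float32(1e-30))
worst = rel.argmax()
# The largest relative error typically lands where the output is near zero.
print(rel.max(), y2[worst])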

That’s another story then. It would be easier for people to chime in if you could summarize the issues you’ve found, such as a list of ops with accuracy issues on certain platforms, instead of reporting a single op at a time.

It is really hard to say there is a single correct result, because PyTorch/NumPy/TF/TVM each accumulate errors in their own way, depending on the compute order.
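
As a small self-contained illustration of this point (my own example): even within a single process, NumPy’s pairwise summation, a naive left-to-right float32 loop, and math.fsum’s correctly rounded sum can all disagree:

import math
import numpy as np

x = np.random.default_rng(1).normal(size=100_000).astype("float32")

s_pairwise = np.sum(x)            # NumPy uses pairwise summation for float reductions
s_naive = np.float32(0.0)
for v in x:                       # strict left-to-right accumulation in float32
    s_naive += v
s_exact = math.fsum(x.astype("float64"))  # correctly rounded reference

print(s_pairwise, s_naive, s_exact)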

Sure, I have written unit tests for the common ops used in vision and NLP, and I can share a summary of the numerical differences between TVM and other frameworks. How can I get access to the TVM chime channel?