[RFC] Parametrized Unit Tests

Currently, many of the unit tests loop over tvm.testing.enabled_targets() in order to run on all targets being tested. While the @tvm.testing.parametrize_targets decorator does report results for each target, it silently hides disabled targets, which can hide failing tests.

I propose parametrizing the unit tests across both target and array size parameters, to improve the accuracy of the test reports. An implementation of this is shown in PR#8010, along with a modified test_topi_relu.py that uses the new features. The pytest outputs before (left) and after (right) are shown below.

# Frequently used in tests
def test_feature():
    def verify_target(target, dev):
        # do test code here

    for target, dev in tvm.testing.enabled_targets():
        verify_target(target, dev)


# Currently possible, but rarely used
@tvm.testing.parametrize_targets
def test_feature(target, dev):
    # Do test code here
    
    
# Proposed standard
def test_feature(target, dev):
    # Do test code here

If a test_* function accepts parameters target and dev, the test would automatically be run on all enabled targets. Any targets that cannot be tested, either because their compilation wasn’t enabled, or because no such physical device exists, would be explicitly listed as skipped. This method also splits up the single test_feature test into separate tests test_feature[llvm], test_feature[cuda], and so on. Rather than indicating only a single success/failure, splitting up the tests can indicate whether a bug lies within the feature or within a specific runtime.

By making the default test behavior be to test on all targets specified in TVM_TEST_TARGETS, it is harder for failing unit tests to be committed. For example, currently several of the OpenCL unit tests fail. The current CI setup runs tests with USE_OPENCL=OFF, and there is no indication printed that these tests are being skipped.

If tests should run on all targets except some known subset (e.g. everything except llvm), these exclusions can be specified with the new tvm_excluded_targets variable. If this variable is specified in a test module, either as a string or as a list of strings, then the targets listed will be dropped entirely from those tests, and will not be displayed. tvm_excluded_targets can also be specified in a conftest.py file to apply to an entire directory, or applied to a single function with @tvm.testing.exclude_targets. Alternatively, the current behavior of specifying explicit targets with @tvm.testing.parametrize_targets('target1', 'target2') can be used. A sketch of these options is shown below.
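
The following sketch uses the names proposed above; the test function names are hypothetical placeholders.

import tvm.testing

# Module-level exclusion: every test in this file drops llvm entirely.
tvm_excluded_targets = ["llvm"]


# Per-function exclusion with the proposed decorator.
@tvm.testing.exclude_targets("llvm")
def test_feature_without_llvm(target, dev):
    ...  # Do test code here


# Explicit opt-in list, as already possible today.
@tvm.testing.parametrize_targets("llvm", "cuda")
def test_feature_on_listed_targets(target, dev):
    ...  # Do test code here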

It is expected both that enabling unit tests across additional targets may uncover several unit test failures, and that some unit tests may fail during the early implementation of support for a new runtime or hardware. In these cases, the tvm_known_failing_targets variable or @tvm.testing.known_failing_targets should be used instead of excluding the targets altogether. These failing targets show up in the pytest report as being skipped, whereas excluded targets do not show up at all. It is intended that these act as a to-do list, either of newly exposed bugs to resolve, or of features that a newly-implemented runtime does not yet implement. A sketch is shown below.
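
A sketch using the proposed decorator name (the test function is a hypothetical placeholder); the target stays visible in the report as skipped until the bug is resolved.

import tvm.testing

@tvm.testing.known_failing_targets("vulkan")
def test_feature_with_known_bug(target, dev):
    ...  # Do test code here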

Parametrization can be done for array size/shape parameters as well, with similar benefits to test visibility as for the target. The proposed test style below would indicate which of the three array sizes resulted in an error, rather than just a single success/failure for all three.

# Current test style
def verify_prelu(x, w, axis, weight_reshape):
    # Perform tests here
    assert(...)

def test_prelu():
    verify_prelu((1, 3, 2, 2), (3,), 1, (3, 1, 1))
    verify_prelu((1, 3, 2, 2), (2,), 2, (2, 1))
    verify_prelu((1, 3), (3,), 1, (3,))

# Proposed test style
@pytest.mark.parametrize(
    "x, w, axis, weight_reshape",
    [
        ((1, 3, 2, 2), (3,), 1, (3, 1, 1)),
        ((1, 3, 2, 2), (2,), 2, (2, 1)),
        ((1, 3), (3,), 1, (3,)),
    ],
)
def test_prelu(x, w, axis, weight_reshape):
    # Perform tests here
    assert(...)

Once both the target and parameters are parametrized, this allows target-specific tests to use pytest.skip. For example, the following check for float16 support can be displayed as an explicitly skipped test.

# Current method
def verify_relu(m, n, dtype="float32"):
    def check_target(target, dev):
        if dtype == "float16" and target == "cuda" and not have_fp16(tvm.gpu(0).compute_version):
            print("Skip because %s does not have fp16 support" % target)
            return
        # Run test

    for target, dev in tvm.testing.enabled_targets():
        check_target(target, dev)

@tvm.testing.uses_gpu
def test_relu():
    verify_relu(10, 128, "float32")
    verify_relu(128, 64, "float16")


# Proposed standard
@pytest.mark.parametrize(
    "m, n, dtype",
    [
        (10, 128, "float32"),
        (128, 64, "float16"),
    ],
)
def test_relu(target, dev, m, n, dtype):
    if dtype == "float16" and target == "cuda" and not have_fp16(dev.compute_version):
        pytest.skip("Skip because %s does not have fp16 support" % target)
    # Run test

Though not part of the current proposed changes, this also opens the door for other improvements in the future.

  • Using tests as benchmarks for cross-target comparisons, since each function call pertains to a single target.
  • Fuzzing the parametrized values, potentially exposing edge cases.
  • Profiling time required to run each test. If a small percentage of tests are taking the vast majority of time for the CI to run, these tests may be pulled out into nightly tests, allowing the per-pull-request CI to run faster.

@tvm.testing.parametrize_targets already skips tests if the target is not enabled, instead of silently discarding them. So it seems like the main changes you are proposing are: 1. we drop the need to add the parametrize_targets decorator, and 2. instead of specifying which targets a test works on, we exclude the ones it doesn’t work on. But I think the real problem you are hitting is the same one I hit when I introduced @tvm.testing.parametrize_targets: a massive number of tests need to be rewritten so that they use the new test format.

On your last bullet point: we already measure test runtimes in CI (here is the latest on master: tvm/main #961 Test Results [Jenkins]).

Thank you, and that’s a good summary of the proposed changes. For (1), dropping the need to specify the decorator is meant to make the preferred outcome of testing on all targets the easiest style to write, reducing any friction in doing so. For (2), that’s exactly the goal, again so that the desired behavior of testing all functionality on all enabled targets is the default behavior.

Regarding the existing @tvm.testing.parametrize_targets, from my reading of it, if there is an explicit listing of targets (@parametrize_targets("llvm","cuda")) it shows the tests as explicitly being skipped, but if no explicit list is given it defaults to using enabled_targets(). Since enabled_targets() is already filtered by whether the runtime is enabled and the device is present, those targets aren’t included in the parametrization at all. Is that a correct understanding of it?

Completely agreed that there’s a huge number of tests that need to be rewritten. (A first grep finds 327 tests using enabled_targets, compared to only 39 using parametrize_targets.) In part, I was thinking that converting the tests over would be a good low-impact way to spend small chunks of time. In the 5-10 minutes before a meeting, when there isn’t enough time for a deep dive, updating a few tests is a nice granular task. But before I started doing so, I wanted to discuss and make sure there weren’t any downsides/objections that would be raised after the fact.

Thank you for the link on the test runtimes! I knew that the unit tests had pass/fail checks, but I didn’t know that the test runtimes were already measured. And I like that for tests that already use parametrize_targets, the time for each target is reported separately. That’s the sort of use case I was picturing, such as running all topi tests on cuda/opencl/vulkan, and using significant speed discrepancies between runtimes to find where improvements can be made.

Adding notes from a few video chats, so that there is a record of the discussion:

From @tkonolige , confirmed that the current implementation of @tvm.testing.parametrize_targets shows skipped targets if they are explicitly listed in the decorator, but not if they come from TVM_TEST_TARGETS environment variable.

From @tkonolige , some of the unit tests have a significant amount of setup required before the loop over enabled_targets(), and repeating the setup for many targets would increase the runtime of the tests. I agree, this definitely would be an issue and should have a recommended style to avoid duplicating work. I think the best style would be to use xunit-style setup_class to perform the setup, then have the test methods within the class implemented using parametrized fixtures. I’ll test out a few options and get back on it.

From @areusch , recommended using xfail instead of skip for tests marked as known failing. I agree, and will make that change to the PR.

From @areusch , recommended removing the global variables of tvm_excluded_targets and tvm_known_failing_targets, having the decorator as the only method by which to apply these traits. The global variables can silently have a typo, whereas a typo in a decorator throws an error. I agree, and will make that change to the PR.

From @areusch , though perhaps out of scope of this particular RFC, it would be good to have some discussion as to which unit tests should be required to have parametrized targets, and which are allowed to be target-specific (e.g. where should changes in PRs be requested for having non-parametrized targets). My first thoughts would be to rearrange the current tvm/tests/python folder to mimic the current organization of the top-level src and python directories, rather than the current division into unit/integration tests. Then, the tests that are specific to the target subdirectory and the target-specific subdirectories in runtime would be allowed to have non-parametrized tests, while all others would be parametrized.

Someone mentioned that @masahi was looking into ways to compare cuda/vulkan correctness across all topi models, and that this might be of interest to him. The topi tests were exactly the ones I had in mind as a starting point for converting over to the parametrized targets, for exactly the same reason.


thanks for doing this @lunderberg, i think this will be a very positive change.

I do think it’s worth discussing the “how do we use this?” angle more. As a reviewer, there needs to be some very simple way to decide whether a test needs to be parameterized or whether it can be allowed to remain unparameterized. It may be easier to do this by incrementally expanding a subset of the tests that “require parameterization,” rather than starting with a reorganization (I’m not diminishing the point that the tests are not exactly organized according to source file, but it may be quite a large project to reorganize everything). I think a great starting point would be operator tests, where we already have a need to produce an automated report of our operator coverage.

This also implies that we’ll stop allowing tests to be committed with this syntax:

if __name__ == "__main__":
   test_foo()
   test_bar()

and instead require e.g.

if __name__ == "__main__":
    tvm.testing.main()

where tvm.testing.main would be something like:

import sys
import traceback
import pytest

def main():
    # use the call stack to recover the calling test script's __file__
    main_file = traceback.extract_stack()[-2].filename
    sys.exit(pytest.main([main_file] + sys.argv[1:]))

I’m very supportive of doing this.

Finally, we need to address test time, which might be more easily done through incremental implementation. The CI is pretty maxed out right now, and while we’re working to improve this, we can’t really address a need for a lot of added compute immediately. We may need a way to mark tests as “for CI,” “for nightly,” etc in the event we split the CI.

Additional comments from a video chat:

From @AndrewZhaoLuo and @electriclilies, support for having the test directories mimic the structure of the existing src directories. This would make it easier to know where a test should be implemented, and would address @areusch’s concern of which directories are allowed to have non-parametrized targets.

From @mbrookhart, one concern would be how to handle supported targets that are not widely available, not in the current CI, or both. We don’t want to stifle contributions by requiring full testing across all targets prior to making a PR. One suggestion was that it would be good to mark these as known failing targets.

From @tqchen, main thing is that it should be done incrementally, after defining a conversion guide.


@areusch I’ll go for the easier question first. The setup for tvm.testing.main can be even simpler. We don’t need to use traceback, since the script being executed is already in sys.argv[0]. Therefore, we can just run sys.exit(pytest.main(sys.argv)), either in the script itself or as a library function.
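
A minimal sketch of that simpler version, assuming it lives in tvm.testing:

import sys

import pytest

def main():
    # sys.argv[0] is the test script that called us, so pass it to pytest
    # directly, along with any additional command-line arguments.
    sys.exit(pytest.main(sys.argv))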

For the test time, I experimented with a few different ways to make sure that setup common across multiple parameters isn’t repeated. This ended up being trickier than I had expected, mostly because of a note in the pytest documentation indicating that only a single version of each fixture gets cached at a time. Therefore, in a loop over target and array size parameters, if test1 has expensive setup based on the target, and test2 has expensive setup based on the array size, then one of those setups would be repeated, depending on whether the target or the array size is in the outer loop.

To get around this limitation, we can either reorder the tests ourselves such that they have minimal repetition of the expensive setups, or we can cache some of the setups such that the order doesn’t matter. In the examples below, I’ve done the latter with @functools.lru_cache. I’d lean toward this method, since it keeps the parametrized tests in a more readable order, but the caching would need to be something that doesn’t last for the entire pytest session.

import functools
import pytest

import tvm.testing


###############################################################
### Case 1, xunit style, setup is independent of parameters ###
###############################################################

# Works, but the `setup_class` method cannot itself be parametrized.
# This can handle most setup cases, but not cases where the setup
# itself requires parameters, such as array size.

# Requires tvm.testing.pytest to set the scope of "target" fixture to
# be "class" or "function", such that the "class" priority setup is
# the higher priority.  In all later examples, "target" fixture is set
# at "session" priority.


class TestClassic:
    @classmethod
    def setup_class(cls):
        print(f"Expensive setup in class")
        cls.expensive = "expensive"

    def test_class(self, target):
        print(f"Running test_class for target={target}")

    @pytest.mark.parametrize("size", [1, 10, 100])
    def test_class_params(self, size, target):
        print(f"Running test_class_params for target={target}, size={size}")


##################################################################
### Case 2, fixture style no parametrization of expensive step ###
##################################################################

# This doesn't depend on any other parameters, same functionality as
# the xunit-style example.  The `scope="module"` argument is
# necessary, otherwise pytest will repeat the expensive setup.
@pytest.fixture(scope="module")
def expensive_no_dependencies():
    print(f"Expensive setup with no dependencies")
    return "expensive"


def test_case2(expensive_no_dependencies):
    print(f"Running test_case2")


def test_case2_target(expensive_no_dependencies, target):
    print(f"Running test_case2 for target={target}")


@pytest.mark.parametrize("size", [1, 10, 100])
def test_case2_params(expensive_no_dependencies, size, target):
    print(f"Running test_case2_params for size={size}, target={target}")


##############################################################
### Case 3, fixture style, target-based setup is expensive ###
##############################################################

# Can have the expensive setup be based on a parameter.  Here, we need
# to include the @functools.lru_cache, because pytest will only cache
# a single version of each fixture.  While the order of tests can be
# adjusted in pytest_collection_modifyitems, unless we add additional
# cache size ourselves, this doesn't have a good solution when
# different functions need different expensive setups.


@pytest.fixture(scope="module")
@functools.lru_cache
def expensive_target_dependent(target):
    print(f"Expensive target-dependent setup for {target}")
    return "expensive" + target


def test_case3(target, expensive_target_dependent):
    print(f"Running test_case3 for target={target}")


@pytest.mark.parametrize("size", [1, 10, 100])
def test_case3_params(target, expensive_target_dependent, size):
    print(f"Running test_case3_params for target={target}, size={size}")


############################################################
### Case 4, fixture style, size-based setup is expensive ###
############################################################


# The `params=[1,10,100]` argument defines the default parameter
# values of `size`, if tests themselves don't override it with
# pytest.mark.parametrize.
@pytest.fixture(scope="module", params=[1, 10, 100])
def size(request):
    return request.param


# As in case 3, need to do our own caching of the expensive setup.
@pytest.fixture(scope="module")
@functools.lru_cache
def expensive_size_dependent(size):
    print(f"Expensive size-dependent setup for size={size}")
    return "expensive" + str(size)


def test_case4(target, size, expensive_size_dependent):
    print(f"Running test_case4 for size={size}")


# pytest.mark.parametrize can override the sizes used.  The
# `indirect=True` argument is necessary in order to expose the new
# `size` parameters to expensive_size_dependent.
@pytest.mark.parametrize("size", [5, 50], indirect=True)
def test_case4_param_test(target, size, expensive_size_dependent):
    print(f"Running test_case4_param for size={size}, target={target}")

ooc, are the different cases intended to show the entire contents of a test file? i assume the fixture(scope="module") would run for all test cases there.

how could we verify that lru_cache didn’t hide a test problem (e.g. if caching provides a different output to a test than would ordinarily be generated)?

Correct, the different cases are intended to show the entire contents of a test file. The names in this example are chosen so that it can be run with minimal interaction between cases.

For the fixture(scope="module") argument, this indicates when pytest should clean up a fixture, but the fixture is only available to tests that explicitly accept it as a function argument. With the default scope of "function", the cleanup is done between each test. Absent any caching that we add, this scope is what prevents the fixture from being regenerated, repeating the expensive setup, for each target/size parameter.
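
As a small illustration of the scope behavior described above (the fixture names here are purely illustrative):

import pytest


@pytest.fixture  # default scope="function": set up and torn down for every test
def per_test_resource():
    yield "fresh for each test"


@pytest.fixture(scope="module")  # set up once, torn down after the last test in the file
def per_module_resource():
    yield "shared by every test in this file that requests it"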

For the caching, using the cached values should be no different than passing an argument into a verify_* function, or a verify_* function accessing a nonlocal variable. In either case, we’re relying on the test not to make modifications to the setup that is shared between multiple tests.

That said, if we decide to use fixtures as the method to handle expensive setup, I think that we should add a helper function tvm.testing.fixture to reduce the amount of boilerplate, avoid potential mistakes (e.g. having @lru_cache before @pytest.fixture instead of after), and as a place to disable the cache entirely based on an environment variable. I haven’t implemented it yet, but the behavior would be something like the following.

# Using pytest directly, defining parametrized fixture
@pytest.fixture(scope="module", params=[1, 10, 100])
def size(request):
    return request.param

# With helper function, same behavior as above
tvm.testing.fixture(name="size", params=[1,10,100])

# Using pytest directly, defining a cached parameter derived from `size`
@pytest.fixture(scope="module")
@functools.lru_cache
def expensive_size_dependent(size):
    ...

# With helper function, same behavior as above,
# except that caching can be overridden with environment variable.
@tvm.testing.fixture(cache=True)
def expensive_size_dependent(size):
    ...

For the caching, using the cached values should be no different than passing an argument into a verify_* function, or a verify_* function accessing a nonlocal variable. In either case, we’re relying on the test not to make modifications to the setup that is shared between multiple tests.

I think this is the part that worries me: while we do make use of caching today, we just do that with test data files and not generally with arbitrary Python objects. Even with what we have today, we should be running a full, uncached build about weekly (i.e. on the weekends when CI traffic is lighter) to confirm our cached results aren’t breaking tests. Additionally, it’s pretty easy in this case to provide some basic protection against inter-test modifications: you just reload the data file.

We clearly have this issue already today, since many tests just loop and reuse state. To play devil’s advocate, though, these tests tend to be entirely self-contained, so it’s easier to catch modifications to that state in code review. I think before we add infrastructure that could hide that, we should make sure we have a way to disable caching and a periodic CI run that runs without it.

That said, I entirely agree with this idea and think we should just make these things happen to unblock this effort.

Agreed, we don’t want the caching to hide any issues that may be present. I’ve made a couple more updates to the PR, both to implement more flexible caching, and to make it be easier to write tests using the fixtures, with or without setup code.

  • tvm.testing.parameter to define parameters to be available for tests. Functions that accept more than one such parameter are run using all combinations of the parameters.
  • tvm.testing.parameters to define multiple parameters where not all combinations should be tested. (e.g. a pair of width/height parameters)
  • tvm.testing.fixture to define setup code, which may depend on one or more parameters. By default, the setup code is run once for each test function. A cache=True argument can be passed, which caches the results. The cache is cleared once all tests that use it are completed. The cache can be disabled entirely by setting the TVM_TEST_DISABLE_CACHE environment variable to a non-negative integer.

An example test file would then look something like the following.

size = tvm.testing.parameter(1, 2, 3)


def test_case5_size(target, size):
    print(f"Testing on target={target} with size={size}")


shape, dtype = tvm.testing.parameters((10, "float32"), (20, "float16"))


def test_case5(shape, dtype):
    print(f"Testing for shape={shape}, {dtype}")


@tvm.testing.fixture(cache=True)
def case5_pretest_setup_obj(shape):
    print(f"Generating case5_setup for shape={shape}")
    return 5


def test_case5_setup(target, shape, dtype, case5_pretest_setup_obj):
    print(f"Testing on target={target} for shape={shape}, {dtype}, using premade setup")
    print(f"Have access to previously performed setup: {case5_pretest_setup_obj})

The PR has these features implemented, so as soon as we’re set on how we want it to look, we can merge it in and have the new features available.

I think the functionality all makes sense to me. I have some questions on the implementation:

  1. Why do you assign global variables when declaring a param, then shadow them in those functions that use them? This seems like a bad design because unparameterized functions don’t get an unbound variable error if they try to use a param.
    size = tvm.testing.parameter(1, 2, 3)
    
  2. It would be great to qualify cache=True, for example cache_return_value=True or something.
  3. What’s the procedure for reproducing an error in a test case with several parameter axes which does not occur when TVM_TEST_DISABLE_CACHE is provided and the test cases are filtered such that only 1 parameterization runs?
  4. Does it make sense to impose any requirements on the cached values e.g. cached fixtures must return serializable values, such that reproducing the above scenario is likelier to be easier?
  5. Can you give some examples from existing tests of the kinds of things you expect to be cached? I know we cache some testdata, which is clearly serializable. Are there other examples which could help inform a decision about 4? I think either way, it would be great to motivate the type of cache with some examples of things that we expect to cache.

Thank you, this is useful feedback, and these are useful items to capture in eventual documentation. Responses to each question are below.

Unfortunately, this is due to the way that pytest identifies what fixtures are available in a given module. (Both tvm.testing.parameter and tvm.testing.fixture are implemented on top of pytest.fixture.) In fixture.py, Pytest loops over all members of that module, looking for members that are of type FixtureFunctionModule. Calling the tvm.testing.parameter function is equivalent to the following pytest code.

@pytest.fixture(params=[1,2,3])
def size(request):
    return request.param

I agree that I’d like to avoid variable shadowing where possible, but I wasn’t able to find a way to declare a fixture without having an associated global variable. As it is, I think the type difference between a FixtureFunctionModule and an integer or tuple would expose errors when first implementing a test.

The other option would be to declare a parameter with the autouse=True flag. This exposes a parameter to all tests in a module, even if they don’t explicitly accept it as a parameter. While this would prevent accidentally accessing the FixtureFunctionModule instead of the parameter itself, I wouldn’t recommend it for our use, because it can accidentally run tests multiple times. Even if a test doesn’t actually access the autouse parameter, it still gets run once for every parameter defined.
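
For illustration, a small hypothetical example of the autouse pitfall described above:

import pytest


@pytest.fixture(params=[1, 10, 100], autouse=True)
def size(request):
    return request.param


def test_unrelated():
    # Never touches `size`, yet pytest still collects and runs this test three
    # times: test_unrelated[1], test_unrelated[10], test_unrelated[100].
    pass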

Good idea, and changed. I like that wording better, as it clearly states which parameter is being cached.

In that case, the test failure would be due to one test modifying the state of the cached fixture, and another test assuming that an unmodified fixture is being passed in. I would view that as a bug either in the fixture definition (if the fixture is intended to be modified on use, then it shouldn’t be declared as cached), or in the test that makes a modification.

The steps to identify a failing case would be as follows. Steps 1/2 are part of your problem statement, but are included for completeness.

  1. Test with TVM_TEST_DISABLE_CACHE=1 . If the error stops, then the issue is due to some cache-related cross-talk.
  2. Reduce the number of parameters being used for a single unit test, overriding the global parameter definition by marking it with pytest.mark.parametrize. If the error stops, then the issue is due to cross-talk between different parametrizations of a single test.
  3. Run a single test function using python3 -mpytest path/to/my/test_file.py::test_my_test_case. If the error stops, then the issue is due to cross-talk between the failing unit test and some other unit test in the same file.
  4. If it is due to cross-talk between multiple unit tests, run the failing unit test alongside each other unit test in the same file that makes use of the cached fixture. This is the same command-line as above, but passing multiple test cases as arguments. If the error stops when run with a particular unit test, then that test is the one that is modifying the cached fixture.
  5. Run a single test function on its own, with a single parametrization, using python3 -mpytest path/to/my/test_file.py::test_my_test_case[parameter_value]. If the error still occurs, and is still disabled by TVM_TEST_DISABLE_CACHE=1, then the error is in tvm.testing._fixture_cache.

In principle, step #5 shouldn’t be possible, as there is only one value generated and only one use of that value. But in practice, there’s a difference between principle and practice.

I think that restriction would remove some important use cases. For example, suppose a fixture initializes a local rpc server, and returns a connection to it. That connection is not serializable, but would be very useful to cache to avoid needing to repeat the port binding.

Certainly. From a quick search of different unit tests, here are some examples at first glance.

hi @Lunderberg,

thanks for following-up here. a few more comments:

In fixture.py, Pytest loops over all members of that module, looking for members that are of type FixtureFunctionModule .

I do see there is a name parameter to pytest.fixture. Are you sure it’s not possible to register the fixture by name against a closure function, and append them all to a global list in the module to keep them in scope? I am not a pytest expert here, but requiring people to intentionally shadow a global variable name seems like a pretty weird design to me.

I agree autouse=True is not good for us.

I think that restriction would remove some important use cases. For example, suppose a fixture initializes a local rpc server, and returns a connection to it. That connection is not serializable, but would be very useful to cache to avoid needing to repeat the port binding.

I think this is what I’m questioning: should we really build infrastructure that allows us to cache the RPC server across tests? I am not so sure. Reproducing failures in that case seems quite hard. I think I agree with caching data. I don’t know that I agree with caching non-serializable runtime objects.

Hmm. The name parameter for pytest.fixture changes the name by which functions refer to the fixture, which by default is the name of the variable containing it. The fixture variables need to be directly in the module scope, and can’t be held by a global list. We could have tvm.testing.parameter and tvm.testing.fixture use inspect.stack() to walk up the stack to find the calling scope, and insert the fixture at that scope with some name like __tvm_fixture_$INDEX. For tvm.testing.parameter, that would fully solve the issue. For tvm.testing.fixture, we’d also need to find and remove the decorated function from the module scope. That would depend a bit on the order in which decorators are applied, and on whether the variable name gets reassigned after the decorator returns.

Another option might be to walk through all modules in pytest_collection_modifyitems and remove the fixtures from module scope at that point. This happens after the fixtures are collected, but before the tests are run, so that should let us remove the parameter from the module scope so that a NameError gets triggered when the tests are run.
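
A rough sketch of that collection-time cleanup, written against a hypothetical _tvm_generated_fixture marker that tvm.testing would set on the fixtures it creates (this is not the actual PR implementation):

# conftest.py
def pytest_collection_modifyitems(config, items):
    # By this point pytest has already registered the fixtures, so removing
    # the module-level placeholders only affects name lookups inside tests,
    # turning a bare `size` access into a NameError.
    modules = {item.module for item in items if hasattr(item, "module")}
    for module in modules:
        for name, value in list(vars(module).items()):
            if getattr(value, "_tvm_generated_fixture", False):
                delattr(module, name)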

I definitely agree that it’s a bit of a weird design, but it does look like the recommended method with pytest to have the shadowing present. I’ll see if I can avoid the accidental shadowing with the minimal amount of magic.

I think it makes sense to do, primarily because those are the cases that have the largest amount of setup/teardown. Network and database connections tend to be the most frequent examples of test fixtures to cache for that reason, and show up pretty frequently when searching for @pytest.fixture(scope="session"). (Appeals to the majority can’t possibly be a fallacy, right?)

That said, I do think it makes sense to have additional information to verify whether or not the caching is impacting a test failure. What would you think about re-running tests without caching if they fail with caching? This would be with the --last-failed option of pytest, along with the TVM_TEST_DISABLE_CACHE=1 variable. Since it would only re-run the failed tests, it wouldn’t add significantly to the runtime, but would immediately tell us if the caching should be a suspect. I don’t know much about Jenkins setup, but is it possible to have an extra test step be run only if a previous step failed?

I think it makes sense to do, primarily because those are the cases that have the largest amount of setup/teardown. Network and database connections tend to be the most frequent examples of test fixtures to cache for that reason, and show up pretty frequently when searching for @pytest.fixture(scope="session") . (Appeals to the majority can’t possibly be a fallacy, right?)

I think the cases I see here are tvmc downloading test models. Those are firmly in the “serializable” category, so I’m okay with caching those. I don’t know if I buy that we should ever cache, e.g., a TVM RPC server connection, and we don’t have much else in the way of traditional database connections (which at least are well tested enough that only the database content is a concern) to cache here.

Testing this morning, this method worked perfectly, and has the desired effect of turning the function access into a NameError. Thank you for pushing on it, and this gives much better readability for the error messages.

I also played around with adding a __getattr__ function to give even more explicit error messages in that case, but for better or for worse, it looks like a module’s __getattr__ function participates in attribute lookup from outside, but not name lookups from inside the module, so the NameError is probably the best error message we can give.

import tvm.testing

size = tvm.testing.parameter(1, 10, 100)

def test_runs_without_error(size):
    print(f"Size = {size}")

def test_throws_name_error():
    print(f"Size = {size}")

# Output
# FAILED test_asdf.py::test_throws_name_error - NameError: name 'size' is not defined
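
For reference, a rough sketch of the module-level __getattr__ experiment mentioned above (PEP 562; the message text is hypothetical). It only fires for attribute access from outside the module, such as test_module.size, while a bare size inside a test in the same module still raises the NameError shown above:

# Injected into the test module (illustrative only, not part of the PR)
def __getattr__(name):
    if name == "size":
        raise AttributeError(
            "'size' is a tvm.testing.parameter; accept it as a test function argument instead"
        )
    raise AttributeError(name)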

Good point, that there aren’t that many non-serializable use cases, and so it may be a moot point overall. Taking a quick grep, almost all the current tests that use RPC either use auto_scheduler.LocalRPCMeasureContext() or explicitly connect to 127.0.0.1. The few exceptions get the server from environment variables TVM_RPC_ARM_HOST, TVM_POWERPC_TEST_HOST, TVM_IOS_RPC_PROXY_HOST, and apply to very few tests.

Would you be up for tabling this discussion for now, to be resumed if/when we find more example use cases, and only using the cache capability for serializable test data until then?

Hi @Lunderberg, can you write this up as a pull request for the TVM RFCs repository?


@hogepodge The RFC has now been migrated, and is at PR#0007. It should be up to date with all the latest changes/caveats that were covered in this discussion.

@areusch Let’s continue discussion there and close this thread, so that we don’t have multiple discussion locations to track.