Modularize and Modernize TensorIR Tests

As we start to build multiple modules, it is useful to modularize the unit tests with the goal of reducing the number of actual integration tests. Previously, quite a few tests were written in a way that directly invokes end-to-end compilation, and we also have tests coupled with the legacy TE pipeline. This causes several issues: some of these tests run slowly, and when a regression happens it is harder to find the cause because the tests are not unit tests in nature.

While some such tests are still necessary and we would like to keep some legacy tests around for a while, it is important to move to a more unit-testing-oriented regime for new tests and to explicitly mark (group) tests that involve end-to-end execution (and are slower). Having tests in different folders also helps us think more carefully about module boundaries. Of course, we still want to be pragmatic and not too pedantic. For example, we still love the Python-first infrastructure, and some level of coupling is useful for writing tests productively.

To keep things simple, I would like us to try to get things moving starting with one module (TensorIR). Here is how we can do that incrementally for TensorIR (and use it as an example).

  • Start with a new folder tests/python/tir
  • Put new TensorIR unit-tests into this folder
  • Migrate some test cases from existing ones into this folder, with the following goals:
    • Always use TVMScript/IRBuilder before/after to unit-test each pass (see the sketch after this list)
    • Avoid calling the build pipeline
  • For those that involve the end-to-end build pipeline:
    • Move to an explicit folder tests/python/integration/tir
    • It is OK to include some target-specific code generation; in that case, start from (scheduled) TVMScript
    • Ensure such generation is fast (<1 min)
    • Use an explicit naming pattern test_e2e_xxx
    • Move slow tests into a separate folder tests/python/slow/tir
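To make the first goal concrete, here is a minimal sketch of such a pass unit test. It follows the tvm.testing.CompareBeforeAfter pattern already used by the test_tir_transform_* files; the pass, buffer shape, and exact TVMScript syntax are illustrative and may need adjusting for the TVM version at hand.

```python
# Minimal sketch: unit-test a TIR pass against explicit before/expected
# TVMScript, without invoking the end-to-end build pipeline.
import tvm
import tvm.testing
from tvm.script import tir as T


class TestRemoveTrivialEvaluate(tvm.testing.CompareBeforeAfter):
    """RemoveNoOp should drop the constant T.evaluate statement."""

    transform = tvm.tir.transform.RemoveNoOp()

    def before(A: T.Buffer((16,), "float32")):
        T.evaluate(0)  # no-op statement the pass is expected to remove
        A[0] = T.float32(1)

    def expected(A: T.Buffer((16,), "float32")):
        A[0] = T.float32(1)


if __name__ == "__main__":
    tvm.testing.main()
```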

Everything should work as it is. Of course, we can still leave some legacy tests in the old place, and once we are done modularizing we will also have a clearer picture of things.


It is worth pointing out that:

  • Most of the existing tests are CPU-bound, including those that use GPUs for execution (end-to-end tests), which also rely heavily on the CPU for code generation
  • All e2e tests can be decoupled into host-side compilation on the CPU plus execution on a device (e.g. GPUs)
  • Brute-force splitting between fast and slow tests is less efficient, because even slow tests can be CPU-bound and leave the GPU resources largely idle

Therefore, my proposal is: instead of separating fast/slow tests, we should split host-side logic from device execution, building on the TVM RPC infra. Details:

  • Run all tests on CPU with a single or limited number of threads
  • Provide an API via TVM RPC that allows execution of compiled code on an isolated GPU/Hexagon/ARM instance (see the sketch below)
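To illustrate the split, here is a minimal sketch that uses the existing TVM RPC API (rpc.connect, upload, load_module): compilation happens on the host CPU and only the compiled kernel runs on the remote device. The host/port, the kernel, and the test body are illustrative assumptions rather than an existing test.

```python
# Minimal sketch of the host-compile + RPC-execute split.
import numpy as np
import tvm
from tvm import rpc
from tvm.contrib import utils
from tvm.script import tir as T


@T.prim_func
def vector_add(A: T.Buffer((1024,), "float32"),
               B: T.Buffer((1024,), "float32"),
               C: T.Buffer((1024,), "float32")):
    T.func_attr({"global_symbol": "vector_add"})
    for i in T.thread_binding(1024, thread="threadIdx.x"):
        C[i] = A[i] + B[i]


def test_vector_add_on_remote_gpu():
    # Host side: code generation runs entirely on the CPU.
    lib = tvm.build(vector_add, target="cuda")
    tmp = utils.tempdir()
    path = tmp.relpath("vector_add.tar")
    lib.export_library(path)

    # Device side: only the compiled kernel executes on the remote GPU.
    remote = rpc.connect("192.168.1.10", 9090)  # hypothetical RPC server
    remote.upload(path)
    rlib = remote.load_module("vector_add.tar")
    dev = remote.cuda(0)

    a = tvm.nd.array(np.random.rand(1024).astype("float32"), dev)
    b = tvm.nd.array(np.random.rand(1024).astype("float32"), dev)
    c = tvm.nd.array(np.zeros(1024, dtype="float32"), dev)
    rlib["vector_add"](a, b, c)
    np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy(), rtol=1e-5)
```

The same pattern would generalize to Hexagon or remote ARM boards by swapping the target and the remote device handle.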

The advantages of my proposal:

  • Concurrency: a CPU instance could run multiple CI pipelines in parallel;
  • Device utilization: the RPC infra makes sure only minimal logic is executed on device. It routes and manages execution efficiently and thus greatly improves device utilization and lowers the cost.

Thank you, I think these are orthogonal approaches. The first step is mainly to isolate the real unit-test cases into TVMScript-based, before/after-focused tests; the ones that run integration can then be improved in different ways.


BTW, it might be helpful, so I just wanted to share a running log generated with pytest --durations on my local CPU+GPU workstation: pytest_running_log.txt · GitHub
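For reference, a timing log like this can be produced with pytest's built-in --durations report; the path and count below are only illustrative.

```python
# Illustrative only: report per-test timings for the TIR unit tests.
import pytest

# Report the 200 slowest tests; use --durations=0 to report all of them.
pytest.main(["tests/python/unittest", "-q", "--durations=200"])
```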

Some takeaways:

  • The top 162 test cases each take more than 1 s, while the remaining ~4.5k tests are rather fast;
  • Most of the slow tests come either from legacy modules (e.g. autotvm, auto_scheduler, TE schedules) or from end-to-end tests (e.g. running a runtime.Module)
  • There are 60 failing test cases that are not included in our CI. Some fail because the CI instances lack adequate hardware (e.g. tensor-core GPUs >= SM_80), some pass when run alone, and I’m not sure about the rest

Some extra data points regarding TIR tests (pytest_running_log.txt · GitHub):

3 failed, 1539 passed, 7 skipped, 17 xfailed, 2 xpassed, 84 warnings in 103.38s (0:01:43)

Below are the only ones that are slower than 1 s:

============================================================================================================= slowest durations ==============================================================================================================
5.02s call     tests/python/unittest/test_tir_transform_remove_no_op.py::TestRemoveSeparatedOverwriteOfPredicatedLoop::test_compare
3.72s call     tests/python/unittest/test_tir_ir_builder.py::test_while_mandel
2.77s call     tests/python/unittest/test_tir_schedule_tensorize_ldmatrix_mma.py::test_f16f16f16_m16n16k16
2.57s call     tests/python/unittest/test_tir_schedule_tensorize_ldmatrix_mma.py::test_f16f16f32_m16n16k16
2.47s call     tests/python/unittest/test_tir_ptx_ldmatrix.py::test_ptx_ldmatrix
2.25s call     tests/python/unittest/test_tir_analysis_verify_gpu_code.py::test_shared_memory
2.14s call     tests/python/unittest/test_tir_transform_lower_warp_memory.py::test_lower_warp_memory_roundup
2.01s call     tests/python/unittest/test_tir_transform_reduce_branching_through_overcompute.py::TestIdentifyOverwrittenWriteFromEquivalentExpressions::test_compare
1.89s call     tests/python/unittest/test_tir_usmp_algo_hill_climb.py::test_intervals[intervals2]
1.82s call     tests/python/unittest/test_tir_transform_remove_no_op.py::TestRemoveOverwrittenPredicatedLoopWithProvableCondition::test_compare
1.82s call     tests/python/unittest/test_tir_transform_remove_no_op.py::TestRemoveOverwrittenPredicatedLoopWithIdenticalCondition::test_compare
1.77s call     tests/python/unittest/test_tir_transform_reduce_branching_through_overcompute.py::TestIntroduceOverwrittenWrite::test_compare
1.76s call     tests/python/unittest/test_tir_transform_reduce_branching_through_overcompute.py::TestIntroduceSupersetOverwrittenWrite::test_compare
1.71s call     tests/python/unittest/test_tir_transform_inject_ptx_async_copy.py::test_inject_async_copy
1.45s call     tests/python/unittest/test_tir_analysis_verify_gpu_code.py::test_vthread
1.39s call     tests/python/unittest/test_tir_analysis_verify_gpu_code.py::test_num_thread

One thing that is not completely relevant to this discussion, but that I wanted to mention, is that we could perhaps disable CI on draft PRs: by definition they are not complete, and there is a high probability that more commits will be pushed, triggering the CI again.

This could also free up some resources for the other PRs (probably just a very small impact, but it might still make a difference occasionally).


It seems there is a good amount of interest. I think we can start the categorization, which will help a lot of our follow-up work.
