[rfc][ci] Skip slow tests on PRs

A small subset of tests takes up a large portion of the total test runtime. This RFC proposes that we skip these tests on PRs, where CI runtime is critical, and run them only on main.

Long CI runtimes constantly plague TVM developers with slow iteration, exacerbated by flakiness and hard-to-reproduce steps. To reduce runtime, we can execute more work concurrently and make better use of available resources, usually by parallelizing within (#9834) or between (#9733) CI jobs. Another way is to do less work, which is what this RFC proposes. By running some tests only on main, we still get some measure of the coverage these tests provide without burdening PR developers.

The caveat is that main may now fail due to PRs that were green when they were merged, so this will require some buy-in from all TVM developers. However, the runtime savings are significant enough (see below) to make this worth it. Developments like #9554 will also make the revert process much smoother, minimizing disruptions.

Runtime Savings

A preliminary investigation of a single PR with the --durations=100 flag added to pytest shows the following results (code). These totals are distributed across all steps but demonstrate what we could achieve.

[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests

This data was gathered by downloading the text log for each test step in the above CI job and running the code in the gist (a minimal sketch of that parsing appears after the breakdown below). The file names roughly correspond to the matching CI stage / step.

[cutoff=10.0s] @slow in frontend_gpu.ai saves 113.12m by moving 116 tests to main-only
[cutoff=10.0s] @slow in 386_python.ai saves 17.24m by moving 46 tests to main-only
[cutoff=10.0s] @slow in unittest_cpu.ai saves 29.65m by moving 67 tests to main-only
[cutoff=10.0s] @slow in python_arm.ai saves 12.84m by moving 44 tests to main-only
[cutoff=10.0s] @slow in topi_gpu.ai saves 67.19m by moving 65 tests to main-only
[cutoff=10.0s] @slow in unittest_gpu.ai saves 35.64m by moving 65 tests to main-only
[cutoff=10.0s] @slow in integration_cpu.ai saves 67.0m by moving 176 tests to main-only
[cutoff=10.0s] @slow in frontend_cpu.ai saves 44.0m by moving 58 tests to main-only
[cutoff=10.0s] @slow in 386_integration.ai saves 33.27m by moving 58 tests to main-only
[cutoff=10.0s] Total savings (m): 419.95m by skipping 695 tests
[cutoff=20.0s] @slow in frontend_gpu.ai saves 102.01m by moving 69 tests to main-only
[cutoff=20.0s] @slow in 386_python.ai saves 11.65m by moving 21 tests to main-only
[cutoff=20.0s] @slow in unittest_cpu.ai saves 20.93m by moving 27 tests to main-only
[cutoff=20.0s] @slow in python_arm.ai saves 5.1m by moving 9 tests to main-only
[cutoff=20.0s] @slow in topi_gpu.ai saves 61.95m by moving 44 tests to main-only
[cutoff=20.0s] @slow in unittest_gpu.ai saves 28.35m by moving 30 tests to main-only
[cutoff=20.0s] @slow in integration_cpu.ai saves 42.62m by moving 56 tests to main-only
[cutoff=20.0s] @slow in frontend_cpu.ai saves 39.08m by moving 37 tests to main-only
[cutoff=20.0s] @slow in 386_integration.ai saves 26.57m by moving 27 tests to main-only
[cutoff=20.0s] Total savings (m): 338.27m by skipping 320 tests
[cutoff=30.0s] @slow in frontend_gpu.ai saves 94.17m by moving 49 tests to main-only
[cutoff=30.0s] @slow in 386_python.ai saves 7.23m by moving 11 tests to main-only
[cutoff=30.0s] @slow in unittest_cpu.ai saves 16.68m by moving 17 tests to main-only
[cutoff=30.0s] @slow in python_arm.ai saves 3.01m by moving 4 tests to main-only
[cutoff=30.0s] @slow in topi_gpu.ai saves 55.53m by moving 29 tests to main-only
[cutoff=30.0s] @slow in unittest_gpu.ai saves 25.59m by moving 23 tests to main-only
[cutoff=30.0s] @slow in integration_cpu.ai saves 33.5m by moving 33 tests to main-only
[cutoff=30.0s] @slow in frontend_cpu.ai saves 32.04m by moving 19 tests to main-only
[cutoff=30.0s] @slow in 386_integration.ai saves 23.76m by moving 20 tests to main-only
[cutoff=30.0s] Total savings (m): 291.5m by skipping 205 tests
[cutoff=40.0s] @slow in frontend_gpu.ai saves 84.89m by moving 32 tests to main-only
[cutoff=40.0s] @slow in 386_python.ai saves 3.2m by moving 4 tests to main-only
[cutoff=40.0s] @slow in unittest_cpu.ai saves 13.21m by moving 11 tests to main-only
[cutoff=40.0s] @slow in python_arm.ai saves 3.01m by moving 4 tests to main-only
[cutoff=40.0s] @slow in topi_gpu.ai saves 51.79m by moving 22 tests to main-only
[cutoff=40.0s] @slow in unittest_gpu.ai saves 18.98m by moving 12 tests to main-only
[cutoff=40.0s] @slow in integration_cpu.ai saves 25.32m by moving 19 tests to main-only
[cutoff=40.0s] @slow in frontend_cpu.ai saves 29.14m by moving 14 tests to main-only
[cutoff=40.0s] @slow in 386_integration.ai saves 22.05m by moving 17 tests to main-only
[cutoff=40.0s] Total savings (m): 251.59m by skipping 135 tests
[cutoff=50.0s] @slow in frontend_gpu.ai saves 80.34m by moving 26 tests to main-only
[cutoff=50.0s] @slow in 386_python.ai saves 0.87m by moving 1 tests to main-only
[cutoff=50.0s] @slow in unittest_cpu.ai saves 11.1m by moving 8 tests to main-only
[cutoff=50.0s] @slow in python_arm.ai saves 0.0m by moving 0 tests to main-only
[cutoff=50.0s] @slow in topi_gpu.ai saves 48.9m by moving 18 tests to main-only
[cutoff=50.0s] @slow in unittest_gpu.ai saves 17.49m by moving 10 tests to main-only
[cutoff=50.0s] @slow in integration_cpu.ai saves 22.32m by moving 15 tests to main-only
[cutoff=50.0s] @slow in frontend_cpu.ai saves 24.69m by moving 8 tests to main-only
[cutoff=50.0s] @slow in 386_integration.ai saves 16.86m by moving 10 tests to main-only
[cutoff=50.0s] Total savings (m): 222.56m by skipping 96 tests
[cutoff=60.0s] @slow in frontend_gpu.ai saves 74.08m by moving 19 tests to main-only
[cutoff=60.0s] @slow in 386_python.ai saves 0.0m by moving 0 tests to main-only
[cutoff=60.0s] @slow in unittest_cpu.ai saves 9.33m by moving 6 tests to main-only
[cutoff=60.0s] @slow in python_arm.ai saves 0.0m by moving 0 tests to main-only
[cutoff=60.0s] @slow in topi_gpu.ai saves 47.9m by moving 17 tests to main-only
[cutoff=60.0s] @slow in unittest_gpu.ai saves 16.52m by moving 9 tests to main-only
[cutoff=60.0s] @slow in integration_cpu.ai saves 17.71m by moving 10 tests to main-only
[cutoff=60.0s] @slow in frontend_cpu.ai saves 24.69m by moving 8 tests to main-only
[cutoff=60.0s] @slow in 386_integration.ai saves 13.09m by moving 6 tests to main-only
[cutoff=60.0s] Total savings (m): 203.3m by skipping 75 tests
[cutoff=70.0s] @slow in frontend_gpu.ai saves 69.8m by moving 15 tests to main-only
[cutoff=70.0s] @slow in 386_python.ai saves 0.0m by moving 0 tests to main-only
[cutoff=70.0s] @slow in unittest_cpu.ai saves 8.25m by moving 5 tests to main-only
[cutoff=70.0s] @slow in python_arm.ai saves 0.0m by moving 0 tests to main-only
[cutoff=70.0s] @slow in topi_gpu.ai saves 46.86m by moving 16 tests to main-only
[cutoff=70.0s] @slow in unittest_gpu.ai saves 15.45m by moving 8 tests to main-only
[cutoff=70.0s] @slow in integration_cpu.ai saves 16.69m by moving 9 tests to main-only
[cutoff=70.0s] @slow in frontend_cpu.ai saves 23.66m by moving 7 tests to main-only
[cutoff=70.0s] @slow in 386_integration.ai saves 11.97m by moving 5 tests to main-only
[cutoff=70.0s] Total savings (m): 192.68m by skipping 65 tests
[cutoff=80.0s] @slow in frontend_gpu.ai saves 68.62m by moving 14 tests to main-only
[cutoff=80.0s] @slow in 386_python.ai saves 0.0m by moving 0 tests to main-only
[cutoff=80.0s] @slow in unittest_cpu.ai saves 5.75m by moving 3 tests to main-only
[cutoff=80.0s] @slow in python_arm.ai saves 0.0m by moving 0 tests to main-only
[cutoff=80.0s] @slow in topi_gpu.ai saves 41.78m by moving 12 tests to main-only
[cutoff=80.0s] @slow in unittest_gpu.ai saves 14.27m by moving 7 tests to main-only
[cutoff=80.0s] @slow in integration_cpu.ai saves 16.69m by moving 9 tests to main-only
[cutoff=80.0s] @slow in frontend_cpu.ai saves 22.47m by moving 6 tests to main-only
[cutoff=80.0s] @slow in 386_integration.ai saves 11.97m by moving 5 tests to main-only
[cutoff=80.0s] Total savings (m): 181.56m by skipping 56 tests
[cutoff=90.0s] @slow in frontend_gpu.ai saves 67.13m by moving 13 tests to main-only
[cutoff=90.0s] @slow in 386_python.ai saves 0.0m by moving 0 tests to main-only
[cutoff=90.0s] @slow in unittest_cpu.ai saves 5.75m by moving 3 tests to main-only
[cutoff=90.0s] @slow in python_arm.ai saves 0.0m by moving 0 tests to main-only
[cutoff=90.0s] @slow in topi_gpu.ai saves 41.78m by moving 12 tests to main-only
[cutoff=90.0s] @slow in unittest_gpu.ai saves 12.78m by moving 6 tests to main-only
[cutoff=90.0s] @slow in integration_cpu.ai saves 9.57m by moving 4 tests to main-only
[cutoff=90.0s] @slow in frontend_cpu.ai saves 22.47m by moving 6 tests to main-only
[cutoff=90.0s] @slow in 386_integration.ai saves 11.97m by moving 5 tests to main-only
[cutoff=90.0s] Total savings (m): 171.45m by skipping 49 tests
[cutoff=100.0s] @slow in frontend_gpu.ai saves 65.63m by moving 12 tests to main-only
[cutoff=100.0s] @slow in 386_python.ai saves 0.0m by moving 0 tests to main-only
[cutoff=100.0s] @slow in unittest_cpu.ai saves 2.59m by moving 1 tests to main-only
[cutoff=100.0s] @slow in python_arm.ai saves 0.0m by moving 0 tests to main-only
[cutoff=100.0s] @slow in topi_gpu.ai saves 38.51m by moving 10 tests to main-only
[cutoff=100.0s] @slow in unittest_gpu.ai saves 9.62m by moving 4 tests to main-only
[cutoff=100.0s] @slow in integration_cpu.ai saves 9.57m by moving 4 tests to main-only
[cutoff=100.0s] @slow in frontend_cpu.ai saves 22.47m by moving 6 tests to main-only
[cutoff=100.0s] @slow in 386_integration.ai saves 11.97m by moving 5 tests to main-only
[cutoff=100.0s] Total savings (m): 160.36m by skipping 42 tests
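
For reference, here is a minimal sketch of how such numbers can be totaled from a downloaded log. This is not the gist's exact code; it only assumes pytest's standard --durations output format (lines like "75.23s call tests/python/frontend/test_foo.py::test_bar"):

```python
# Minimal sketch, not the gist's exact code: total up pytest --durations
# entries above a cutoff from a downloaded CI text log.
import re
import sys

# Matches the standard pytest durations line, e.g.
#   "75.23s call     tests/python/frontend/test_foo.py::test_bar"
DURATION_LINE = re.compile(r"^\s*(\d+\.\d+)s\s+call\s+(\S+)")

def savings(log_path, cutoff=10.0):
    total_seconds, skipped = 0.0, 0
    with open(log_path) as f:
        for line in f:
            m = DURATION_LINE.match(line)
            if m and float(m.group(1)) >= cutoff:
                total_seconds += float(m.group(1))
                skipped += 1
    return total_seconds / 60.0, skipped

if __name__ == "__main__":
    minutes, n = savings(sys.argv[1], cutoff=float(sys.argv[2]))
    print(f"Total savings (m): {round(minutes, 2)}m by skipping {n} tests")
```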

Implementation

#9915 implements the preliminary machinery for this change. Once it lands, tests can be decorated with @tvm.testing.slow, and CI will decide whether or not to run slow tests based on the following checks:

  • If the branch matches main or ci-docker-staging
  • If text like @ci run slow tests appears in the last commit message or the PR body (this is necessary when a PR specifically edits or concerns a slow test). Note that changes to the PR body will not be reflected until the next time CI runs; implementing a listener / canceller would be much harder than querying at the beginning of each CI run. A sketch of this gating logic is shown below.
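
For illustration, a minimal sketch of that gating logic; the function and variable names here are hypothetical, not the actual #9915 code:

```python
# Hypothetical sketch of the gating checks described above, not the
# actual machinery from #9915.
import re

# Trigger phrase that opts a PR into running slow tests.
TRIGGER = re.compile(r"@ci\s+run\s+slow\s+tests", re.IGNORECASE)

def should_run_slow_tests(branch, last_commit_message, pr_body):
    # Slow tests always run on main and on the Docker staging branch.
    if branch in ("main", "ci-docker-staging"):
        return True
    # Otherwise they run only when explicitly requested; the PR body is
    # queried once at the beginning of each CI run.
    return bool(TRIGGER.search(last_commit_message) or TRIGGER.search(pr_body))
```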

Using a decorator has the advantage of being explicit, compared to an automated system that detects and skips slow tests. Clicking through to the decorator’s short definition makes it clear what is going on, so this shouldn’t add too much of a development burden. Slow tests run by default to minimize disruption (i.e. a developer who hasn’t seen this RFC doesn’t need to change their local workflow), but they can be skipped by setting SKIP_SLOW_TESTS=1.
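
For concreteness, a minimal sketch of what such a decorator could look like (see #9915 for the actual implementation; this version only assumes the SKIP_SLOW_TESTS environment variable described above):

```python
# Hedged sketch of the decorator, not the actual #9915 code. Slow tests
# run by default and are skipped only when SKIP_SLOW_TESTS=1 is set.
import os

import pytest

def slow(func):
    """Mark a test as slow: it runs on main but can be skipped on PRs."""
    return pytest.mark.skipif(
        os.getenv("SKIP_SLOW_TESTS", "0") == "1",
        reason="Slow test: skipped on PRs, run on main",
    )(func)
```

Usage would then look like:

```python
@tvm.testing.slow
def test_some_large_model():
    ...
```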

Finding slow tests is currently manual (see the gist above), but the number of tests is tractable, at least for a start. In the future we plan to fix and monitor test reports, which should include test times and ease this process. However, that shouldn’t hold up this work, since we can get concrete wins today by checking the logs.

Open Questions / Alternatives

  • nightly-only tests are another possibility, but they would make it much harder to identify and revert offending commits, since a developer would have to go over every commit from the last day and check the tests.
  • the threshold cutoff for slow tests is up for debate. I prefer to start aggressive and back off if reverts are a problem, so I would say 10s to start, increasing by 10s if reverts become burdensome. Cutoffs could be specialized per test file / step, but that adds complexity and makes the system harder to reason about.
  • since PRs that are green may need to be reverted, someone will need to monitor the CI output on main and investigate failures. We do not have anyone committed to doing this right now, but since this change treats tests as a black box and doesn’t require domain knowledge of individual tests, I think it’s reasonable for the CI maintainers to take this on as a core responsibility (anyone else who notices failures is of course welcome to revert / fix). In general I think this will also be a good forcing function for monitoring and fixing other flaky / spurious failures on main.

I’m happy to be that someone. I agree that this is a prerequisite for introducing nightly tests. I know the stack well enough to diagnose failures (at a minimum, one needs to differentiate real failures from flaky ones), and since I’m in a different time zone, I can debug “nightly” test issues during my daytime.


See https://gist.github.com/driazati/e009f09ff44c6bc91c4d95a8e17fd6f1 for a detailed per-threshold breakdown of the tests that would be skipped with this change. The diffs in tests/ in this PR show what it would look like. If a test is over the threshold but needs to run on every PR for some reason or another, it can be re-enabled by deleting the decorator.

@driazati I like the idea of having some machinery to identify slow-running tests and skip them on PRs. I think it would be great to think through the rollout plan a bit more here:

  • Many of these long-running tests unfortunately do provide us test coverage. It’s not clear to me how much test coverage we’d lose if we just disabled them.
  • I’d propose we check in this framework to enable us to start moving tests onto main, but I think we should take an incremental approach here rather than just choosing a cutoff. Many of these tests are essentially integration tests that could be rewritten as unit tests to provide equivalent coverage on PRs.
  • I’d also like to investigate whether it’s possible to provide some type of automated tracking, so that we don’t have to manually monitor the main-only test results.

A slower rollout definitely makes sense, though I wouldn’t say this decreases coverage for any tests, since tests marked @slow would still get run on main when the PR is merged (and the associated commits get reverted if tests fail in a non-flaky way). The way I see it, we are currently trading 2 hours of CI time on every PR for faster feedback on a small subset of tests. By moving them to main, they still get run and won’t regress. The reverting of commits out of main and back into PRs means this change is only helpful if (rate of reverts * dev time cost of a revert) / (2 hours * number of PRs) is low, which I would guess it is.
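
To put purely illustrative numbers on that ratio: if 1 in 50 merged PRs ends up reverted and each revert costs a developer 4 hours, the expected revert cost is about 0.08 dev-hours per PR, against the roughly 2 CI-hours saved per PR, a ratio of about 0.04.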

If you buy the “we’re not decreasing coverage” argument, this change with a time cutoff is an easy way to get CI to run on PRs in under 2 hours today, whereas writing unit tests to replace the longer integration tests is a much bigger effort.

One aspect is that this would increase the burden on the TVM committer who merges a PR, because they would then be on the hook to monitor the next main build and verify nothing was broken. I do think that this makes sense, but enabling it across the board seems like it would substantially increase this burden and increase the time it takes to bisect failures (e.g. in the case that two PRs merge in a short window and a single main build is triggered).

I do see the benefits to CI time, though. Perhaps if there is a small set of “worst offenders,” we could prioritize those tests as the first candidates for @slow and still have a decent effect on CI runtime in a relatively short period of time?