[RFC][TESTING] Split testing based on cpu/gpu

tkonolige · August 25, 2020, 4:12pm

Motivation

Our current test suite takes a while to run. One main reason is that tests that only require a cpu are also run on testing nodes that have gpus. With multiple PRs open, the gpu nodes are often the limiting factor: demand is high, so a PR has to wait until a gpu node is freed up before its tests can begin.

Proposal

I propose we explicitly mark tests that require a gpu and run only marked tests on the gpu. Pytest provides a mechanism to do this: markers. Markers allow tests to be decorated with, for example, @pytest.mark.gpu, and pytest can then select only the tests carrying this marker using pytest -m gpu. Markers can be combined with pytest.mark.skipif to make sure that tests are only run when a required gpu is present. I propose we use the following markers:

tvm.testing.uses_gpu for tests that use both the gpu and cpu (see below).
tvm.testing.requires_gpu for tests that require the gpu.
tvm.testing.requires_cuda for tests that require cuda.
tvm.testing.requires_... for tests that require rocm, opencl, etc.
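
As a rough sketch of how such markers could be built on top of pytest (not the final implementation): the decorators below are defined locally and only mirror the proposed names, and gpu_present is a hypothetical stand-in for whatever device check we end up using.

```python
import pytest


def gpu_present() -> bool:
    """Hypothetical device probe; a real one would query the runtime."""
    return False


# Marker-only decorator: the test is selected by `pytest -m gpu`
# but is not skipped, because it also has cpu parts.
uses_gpu = pytest.mark.gpu


def requires_gpu(func):
    """Mark `func` as a gpu test and skip it when no gpu is present."""
    func = pytest.mark.skipif(not gpu_present(), reason="no gpu present")(func)
    return pytest.mark.gpu(func)


@requires_gpu
def test_gpu_add():
    assert 1 + 1 == 2


@uses_gpu
def test_cpu_and_gpu():
    assert True
```

With this, pytest -m gpu selects both tests and pytest -m "not gpu" selects everything else. The gpu marker would also need to be registered under the markers option in the pytest configuration, otherwise pytest warns about an unknown marker.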

Many tests use a variety of different devices, like llvm, cuda, and rocm. There are three main ways that tests use devices:

1. Tests iterate through tvm.relay.testing.config.ctx_list.
2. Tests iterate through tests/python/topi/python/common.py:get_all_backend.
3. Tests iterate through a hand-picked list of targets and check whether each device is enabled with tvm.context(device).exist and tvm.runtime.enabled(device).

None of these methods allows us to separate the gpu parts of a test from the cpu parts. To do this separation, I propose we merge 1. and 2. into a function called tvm.testing.enabled_devices and replace 3. with a function tvm.testing.device_enabled. These two functions would use an environment variable to determine which devices are enabled (a subset of the ones supported by the current build of TVM).
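
To make the idea concrete, here is a minimal sketch of the two functions. The environment variable name TVM_TEST_DEVICES, the comma separator, and the BUILT_DEVICES tuple are all assumptions for illustration; the real functions would ask the TVM build which devices it supports.

```python
import os

# Hypothetical name for the environment variable; not decided in this RFC.
ENV_VAR = "TVM_TEST_DEVICES"

# Stand-in for the devices supported by the current build of TVM.
BUILT_DEVICES = ("llvm", "cuda", "rocm")


def enabled_devices(environ=None):
    """Return the built devices that the environment variable enables.

    If the variable is unset or empty, every built device is enabled.
    """
    environ = os.environ if environ is None else environ
    requested = environ.get(ENV_VAR, "")
    wanted = {name.strip() for name in requested.split(",") if name.strip()}
    if not wanted:
        return list(BUILT_DEVICES)
    # Keep only devices that are both requested and supported by the build.
    return [dev for dev in BUILT_DEVICES if dev in wanted]


def device_enabled(device, environ=None):
    """Return True if `device` should be tested in this run."""
    return device in enabled_devices(environ)
```

A cpu node could then run with TVM_TEST_DEVICES=llvm and a gpu node with TVM_TEST_DEVICES=cuda, so the same test file exercises different devices on each node. Requesting a device that the build does not support simply yields no test for it.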

Cons

Devices we test against are controlled by an environment variable. Environment variables can be hard to discover, so we should document this one well.
Tests that use tvm.testing.device_enabled or tvm.testing.enabled_devices must also mark their testing function with tvm.testing.uses_gpu. If they don't, the test will never be run with gpu devices. A fix would be a special decorator that parameterizes the test over the devices and sets markers appropriately (using pytest.mark.parametrize). Unfortunately, this would require rewriting a large number of tests.
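
For reference, such a decorator might look roughly like the sketch below. It relies on pytest.mark.parametrize (pytest's spelling) and pytest.param, which can attach marks to individual parameters. The name parametrize_devices and the GPU_DEVICES set are hypothetical and only for illustration.

```python
import pytest

# Assumed classification of which devices count as gpus.
GPU_DEVICES = {"cuda", "rocm", "opencl"}


def parametrize_devices(*devices):
    """Parameterize a test over `devices`, marking the gpu cases with `gpu`."""
    params = [
        pytest.param(dev, marks=(pytest.mark.gpu,) if dev in GPU_DEVICES else ())
        for dev in devices
    ]
    return pytest.mark.parametrize("device", params)


@parametrize_devices("llvm", "cuda")
def test_add(device):
    # A real test would build and run a workload on `device`.
    assert isinstance(device, str)
```

Here pytest -m gpu would run only test_add[cuda], and pytest -m "not gpu" would run only test_add[llvm], without the test author having to add a separate marker by hand.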