[RFC] Python Dependencies in TVM CI Containers

Background

This RFC is a follow-on to [RFC] Consolidating Python Dependencies, implemented in PR 7289. It will be followed by a third RFC describing how we might leverage the changes introduced in this RFC to produce version-constrained tlcpack wheels and a CI-tested requirements.txt for development.

After PR 7289 lands, TVM will contain a centralized list of Python dependencies: a list of packages, with version constraints only where they are required for functional reasons (e.g. tests don’t pass without those constraints). However, this list has no bearing on the CI, so it has no teeth.

This RFC proposes changes to the way we install Python packages into TVM CI containers to address the following shortcomings:

S1. Scattered CI Python Dependencies

At the time of writing, there aren’t really any rules per se about how Python packages are installed into CI containers. Here are all the ways:

  • Most dependencies are installed with pip install commands placed in scripts that live in docker/install
  • The set of scripts run for a particular image is listed in docker/Dockerfile.ci_<image>
  • Some late-breaking dependencies are installed in tests/scripts/task_ci_python_setup.sh

Constraints for these packages are just listed on the pip install command line and don’t necessarily match the functional requirements listed in gen_requirements.py. For the most part, these packages are unconstrained, so the actual installed version of a package changes as the containers are rebuilt.
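
For illustration, a typical install step looks roughly like this (a hypothetical sketch; the script name and package names are made up, not copied from docker/install):

    #!/bin/bash
    # hypothetical docker/install/ubuntu_install_example.sh
    set -euxo pipefail
    # unconstrained: whichever versions are current at container-build time get installed
    pip3 install onnx onnxruntime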

This is considered a good thing (it is part of our strategy for ensuring TVM testing remains up-to-date with respect to its dependencies), but a drawback is that TVM developers have to manually query the installed version using pip freeze. Since there are many such containers updated independently, assembling a known-good list of TVM’s Python dependencies (as certified by CI) is not straightforward.

S2. Dependency Conflicts and pip 20.3

Not every requirements.txt describes a set of packages that can be installed together. For example, if we have:

foo
bar>3 

but foo depends on bar<3, then there is no set of package versions that satisfies this file, and a correct resolver should refuse to install it.

However, until the new dependency resolver shipped in pip 20.3 (late 2020), pip (the default Python package manager) considered this file line-by-line, not holistically. In this case, pip would:

  1. Read the dependencies for latest foo
  2. Realize bar<3 was required, and install the latest bar under 3
  3. Read the next line, uninstall bar<3, and install the latest bar at version 3 or higher

The old resolver surfaced conflicts only in limited cases (for example, when foo itself depended on several packages whose declared requirements conflicted). Starting with pip 20.3, the new resolver considers the full set of requirements holistically and refuses to install a conflicting set.
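
To see the difference concretely with the two-line requirements.txt above (foo and bar are placeholders, so this is illustrative only):

    # old resolver: processes the file line-by-line and ends up with a bar newer than 3,
    # silently breaking foo's bar<3 requirement
    pip install -r requirements.txt
    # new resolver (opt-in flag on pip 20.2, default from 20.3): refuses to install the conflicting set
    pip install --use-feature=2020-resolver -r requirements.txt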

Changes to pip install for the CI

Given these changes, we and all of our users are going to have to begin caring that we test against a compatible set of dependencies, so that it will be possible in the future to pip install TVM. In particular, this means that the current approach to the CI (installing packages one-by-one in a set of decentralized scripts) is going to get more and more expensive, as each successive pip install presents the resolver with a progressively trickier set of already-installed packages to reconcile.

The first step was centralizing the Python dependencies, accomplished in PR 7289. Next, we have a choice:

I1. Consolidate all pip install commands in docker/install scripts into a single pip install script which consumes only python/requirements/*.txt.

I2. Feed the python/requirements/*.txt into a tool such as poetry; use poetry lock to produce a poetry.lock file containing a compatible set of python dependencies. Change all pip install commands to poetry install commands, which effectively forces installation of exactly the package versions pinned in the lock file.

I3. Like I2 but translate poetry.lock into constraints.txt and continue to use pip in install scripts.
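
As a rough sketch of I3 (assuming poetry's export command, which is bundled with older poetry releases and shipped as a plugin in newer ones; file names follow the RFC):

    # resolve everything fed in from python/requirements/*.txt at once
    poetry lock
    # translate poetry.lock into pip-compatible pins
    poetry export -f requirements.txt -o constraints.txt
    # install scripts then pass the pins as a constraints file (requirements file path illustrative)
    pip install -c constraints.txt -r requirements.txt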

Independently or Jointly Building CI containers

At the time of writing, CI containers are updated one-by-one and by hand (a process which typically requires some manual bugfixing). This also means that, in implementing any of I{1,2,3} above, the dependency resolver would find a compatible solution only for the packages relevant to a particular CI container. That solution may differ considerably from one computed over all of TVM’s dependencies at once.

For instance, if ci-arm does not depend on onnx, but ci-gpu does, the selected version of torch may differ according to constraints torch places on the onnx package. We may get to a point where some TVM extras_install conflict with one another, but until we reach that point, it seems like we should strive to ensure all parts of TVM can be used at once without package version conflicts.

Therefore, we have a choice in how we implement I{1,2,3} above:

B1. Continue building all ci- containers independently. Implied by I1, unless we install all Python dependencies on one particular container and use pip freeze from that container as the constraints file for the others.

B2. When ci- containers are rebuilt, either periodically or because changes are needed, recompute dependencies assuming all TVM dependencies are installed at once (i.e. I2 or I3). Use a single constraints file for all ci- containers (but see Held-Back Dependencies below).

The author’s opinion is that to truly address the problem imposed by the new pip dependency resolver, approach B2 is needed. However, it does reduce the number of different versions of each of TVM’s dependencies exercised in the CI, and in turn that sacrifices some unquantifiable amount of test coverage.

Recording CI-tested Package Versions

At the end of the CI container build process, each ci- container should have a list of installed package versions: either the constraints.txt used to build it or the output of pip freeze. This file should be placed either in the TVM tree (e.g. docker/constraints/ci-cpu.txt) or in another repository accessible to the public. If in the TVM tree, it should be updated when the Jenkinsfile container version is updated.
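
One way to capture this at build time (a sketch; the locally-built image name and output path are assumptions, not current docker/build.sh behavior):

    # after docker/build.sh finishes, dump the installed package versions from the new image
    docker run --rm tvm.ci_cpu:latest python3 -m pip freeze > docker/constraints/ci-cpu.txt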

These files are useful when users need to look up the packages that make up a known-good TVM configuration. Without these files, the user must download each container and run pip freeze themselves, which is incredibly burdensome considering the size of the containers and the fact that ci-gpu (used to build docs and run many of the tests) requires nvidia-docker to launch.

Note that in the event we choose approach B2 above, far fewer constraints files will need to be recorded. Only variants of the virtualenv introduced by Held-Back Dependencies below would necessitate more than one constraints list.

Held-Back Dependencies

There are a couple of cases where we intentionally either explicitly pin or implicitly hold back dependencies in the CI:

  1. Development tools such as pylint are often challenging to update, as the reported errors can vary dramatically by pylint version. For these tools, motivation to update is also lower: having the best linting is just a nice-to-have relative to landing a TVM bugfix or feature request. Therefore, we may frequently hold back dev tools to make it easier to change or add other dev dependencies.

  2. Increasing Test Diversity. This one is trickier because we “implicitly” hold back dependencies by building each container independently, often a few months apart from the others. Since we just pick whichever package version is available at the time we build a container, this winds up running e.g. our unit tests against a wider range of dependency versions.

    This RFC proposes we accomplish this more purposefully and on a more limited basis, by creating a file, e.g. ci-i386-constraints.txt, in the TVM repo which describes holdbacks of large dependencies such as PyTorch or TensorFlow for e.g. ci-i386.
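
For example, a hypothetical ci-i386-constraints.txt might contain nothing more than a handful of holdbacks (the version bounds below are purely illustrative):

    # hold back large frameworks on ci-i386 only
    tensorflow<2.4
    torch<1.8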

Topics for Discussion

T1. Do we agree it’s worth it to centralize Python dependencies from the CI? What concerns do we have about doing this?

T2. Which change (I1, I2, I3, or other) do you support to pip install for building ci-* containers?

T3. Do you support building all ci- containers at once (i.e. do you support B1 or B2)?

one thing that occurred to me after writing this is that packaging flows such as conda may not be able to satisfy the dependencies we use in the CI. We don’t currently execute Python tests on windows, but we would like to. I think we could work around this like so:

  1. concurrent with building containers, create a conda environment on a Windows machine and produce environment.yml.

  2. either

    a) build a Windows VM at this time that contains the installed conda deps from environment.yml
    b) check in the environment.yml file and rebuild the environment in the CI

  3. run pip freeze from within the created Python venv to capture the revisions used in the Windows CI and record them in the constraints list (e.g. docker/constraints/ci-windows.txt)
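
A rough sketch of that flow on a Windows machine with conda installed (the environment name, python version, and output paths are assumptions; bash-style globbing shown for brevity):

    conda create -n tvm-ci python=3.7
    conda activate tvm-ci
    # install TVM's python requirements into the environment
    for f in python/requirements/*.txt; do python -m pip install -r "$f"; done
    # step 1: record the resolved conda environment
    conda env export > environment.yml
    # step 3: record the pip-level package versions for the constraints list
    python -m pip freeze > docker/constraints/ci-windows.txt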

T1: I think we should centralize all dependencies for CI. With a centralized dependency set, users of TVM can have confidence that if they use the dependency set, they should not expect to hit any issues around dependencies.

T2: I think I2 is the best option here. It guarantees we have a consistent set of dependencies across all containers. I don’t think we can rely on pip to generate a reasonable set of dependencies given constraints.

T3: I think neither of these is a good option. We should decouple updating dependencies from updating the docker images. Forcing developers who want to modify the docker images to also be responsible for making sure new dependencies work is an unnecessary burden. It also makes it hard to diagnose issues that arise from updating dependencies, because they are mixed in with changes to the docker images.

T1: I think centralizing dependencies is pretty important

T2: Using poetry (option 2) seems like the way to go, especially since we can separate development dependencies from production dependencies. We could put “held back” dependencies like pylint into the development dependencies.

T3: I think that using a single constraints file for all the docker images makes sense, but like @tkonolige says, it does seem like a large burden to place on people updating docker images. And then if you want to change a dependency for a normal PR, you need to rebuild the CI-docker images, which again places a large burden on the people updating the docker images.

I don’t have anything new to add, totally agree with analysis:

T1: Deps must be centralized to maintain sanity long term.

T2: I think using lock files makes the most sense, but if there is some reason to do a more flexible thing I am okay giving some ground here.

T3: I think the images must be rebuilt when dependencies change; I don’t see any way around that. If we don’t have unified dependencies across images our lives will be a nightmare.

Thanks for the discussions so far. Everyone seems to agree on T1: it is good to have a single place that users and developers can look into. T2 and T3 are the ones worth discussing.

Trying to summarize a few rationales of the current CI pipeline.

R0: Use Binary Tag for Reproducibility and Stabilization

During testing, the dependencies being used go beyond python dependencies (what a poetry lock file can capture) and will include things like the BLAS version, nnpack code, and the CUDA runtime.

The docker binary image is still the only reliable source of truth if we want to exactly reproduce the test runs. This is why we opted to use binary tags and to stage dependency images during test dependency updates, so developers can pick up these tags.

R1: Manual Fixes are Unavoidable During Dependency Update

Changes to APIs like tensorflow and pytorch will likely require upgrading the test cases, which means manual fixes are unavoidable. This factor also slightly favors building docker images separately, so incremental changes/fixes can be made if needed.

R2: Simple Flow to Build the Docker Image

Different downstream users might prefer a different CI flow that directly builds the docker image from source. Having a simple flow (docker/build.sh) to do so is super useful for these cases.

Discussions

The above factors might lead to the conclusion that I1 is likely the simplest approach to take. It is also the plan that is immediately actionable.

On one hand, while I2/I3 indeed capture more python dependencies via a lock, they do not necessarily capture all the dependencies needed (see R0). Additionally, I2/I3 might involve checking in a lock file (which cannot be meaningfully reviewed, as we do not immediately know the impact of the change) that was produced from a file already in the repo, creating a cyclic-like dependency in the upgrade process (txt file to lock file).

The original constraints file (which can be approved by a human) and the binary tag (which can be verified by CI) generally satisfy the two needs (readability and final pinning), and make R1 generally easier. Additionally, the set of python dependencies can be generated through pip freeze in the ci-gpu image (or the ci-cpu image for cpu-only packages), which means we still meet our original requirement of tracking CI deps.

@tqchen Thanks for the reply! A couple of points:

  • You’re right that I2/I3 don’t capture all dependencies. There is a notable gap between a set of Python dependencies and the docker container: the remaining libraries which may be present in the container.
  • The method by which those libraries are installed is platform-dependent, but so far our approach for the CI has been mostly to depend on a package manager to satisfy them.
  • I think that even if a lockfile can’t be used primarily to reproduce the test environment, there are plenty of other uses–here are a couple:
    • Determining which version of a C library is suitable when trying to install TVM on a system other than the docker containers. You might prefer to at least know the version of the Python package which uses it. I had trouble finding a non-dev dependency like this but I think this pattern is very common, and Pillow is an example of such a package.
    • Oftentimes a user isn’t trying to exactly reproduce the regression; they just want to know which minor version of a package to install, e.g. tensorflow 2.1 vs 2.2. This can make a big difference, especially when the main problem is API incompatibility. Even if a user can’t 100% reproduce the regression, it’s often the case that if they’re using TVM in an application, they won’t be able to use the TVM CI containers as a starting point.

Updating Docker Containers

I think it’s worth thinking through the case of updating the docker containers a little more because I’m not sure I follow your reasoning here.

The current process is as follows, per-container:

  1. update docker/install and Dockerfile as needed to modify the container
  2. run docker/build.sh to build the new container image
  3. Fix problems and rerun steps 1 & 2 as necessary.
  4. test the containers – discussed later
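
For reference, step 2 is roughly the following, where the image name argument is assumed to match the docker/Dockerfile.ci_<image> suffix:

    ./docker/build.sh ci_gpu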

The current build process is quite straightforward–it’s just building a docker container. I2 and I3 propose to add these changes:

  1. prior to building the containers, create a lockfile
  2. rebuild all containers when the lockfile changes, rather than just updating one

Change 2 shouldn’t significantly burden developers unless the build process is error-prone. We should fix that if it is, and perhaps a nightly container build isn’t a bad Jenkins job to configure so we can address these failures as they happen.

Change 1 is the one that needs to be thought through. Ordinarily, building a lockfile means running a single command on the target platform, and this process often takes about a minute. That would not be a significant burden. The tricky part is that we have multiple platforms in the CI (three, in fact: manylinux1_i686, manylinux1_x86_64, and manylinux_arm), and it’s not possible to build the lockfile for a platform other than the one you’re running on.

In thinking this through again, I realized that actually this is the same problem as the one we have with Windows: it’s not possible to use a single lockfile for all platforms. If you are switching platforms, you are going to have to accept you may need to switch dependency versions.

So let’s return to the ultimate goals of the I2/I3 approach:

  1. Ensure all features of TVM can actually be used simultaneously on a given platform.
  2. Standardize dependencies so that the unit tests don’t fail for different reasons on different containers.
  3. Provide enough input to allow the tlcpack build process to place narrower constraints on the install_requires included there, based on the versions used in CI

In light of the above, all of these goals can be achieved only within a single pip platform. So I’d propose a new way to install dependencies while addressing this, I4:

I4. A multi-part approach:

  1. For each pip platform that has a ci- container built around it: produce a lockfile either by:

    1. pip install -r all TVM requirements and pip freeze
    2. poetry lock

    The lockfile becomes part of the artifacts created when the containers are built. When multiple containers are built for the same pip platform (e.g. ci-lint, ci-cpu, ci-gpu, ci-qemu, ci-wasm), pick one container which will build the lockfile. The rest will merely consume this lockfile when installing packages, either as a pip constraints file or a poetry lockfile (a rough sketch appears after this list).

  2. I would argue that between platforms, the version standardization we care about is effectively TVM direct Python dependencies to the minor version level. For instance, if we add Windows tests, we prefer to test against e.g. tensorflow 1.6 on both linux and Windows. To address this, when building containers, we would first build a pilot platform e.g. manylinux1_x86_64, scan the resulting lockfile for direct dependencies of TVM, and produce a loose constraint file for the remaining platforms. For example, if the lockfile contained an entry tensorflow==1.6.8, we may add tensorflow>=1.6,<1.7 to the cross-platform lockfile.

    → This may not always be possible (conda may not update fast enough for us, etc), so we will need to have an override file per-platform. This is complex, but it’s likely that this is happening anyway in the release process now, and the reasons why different packages are being installed aren’t documented anywhere.
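
A sketch of step 1, option (1), as it might run inside the designated lockfile-building container (the way the requirements files are aggregated and the lockfile name are assumptions):

    # pip accepts repeated -r flags, so all requirements files can be resolved in a single invocation
    python3 -m pip install $(printf -- '-r %s ' python/requirements/*.txt)
    # the frozen environment becomes the per-platform lockfile artifact described above
    python3 -m pip freeze > ci-manylinux1_x86_64.lock.txt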

Checking in new containers

To check in the built containers, these things need to happen:

  1. upload candidate containers to docker hub under your username
  2. prepare a TVM PR which updates the Jenkinsfile to reference these test containers
  3. have a TVM committer push the PR to ci-docker-staging.
  4. await test results and iterate as needed.
  5. push candidate containers to tlcpack docker hub account
  6. amend the TVM PR to point at tlcpack images and submit it.

Here we’re proposing that in step 2, we also include the per-platform lockfiles produced from container building.

tlcpack Dependencies

Finally: I4 proposes we produce a cross-platform constraints file. This isn’t discussed as part of this RFC (it’s the next one), but it would be nice to produce a tlcpack wheel which includes some loose version constraints (e.g. so that by depending on tensorflow, a user running pip install tlcpack would install tensorflow 1.x and not 2.x). The version constraints in the cross-platform constraints file are exactly those we should use in such a tlcpack wheel.

Previously, the follow-on RFC had proposed to analyze the CI lockfile and produce the tlcpack dependencies. I4 just proposes to move that analysis earlier, and effectively means that the CI would be testing against the same constraints present in the tlcpack wheel. This seems like a better change to me.

I’m open to changing build.sh to reference the previously-used lockfile, which should make it possible to update containers independent of the Python deps used.

tagging a few others who replied to the previous RFC

@mjs @leandron @comaniac

One potential idea is that we can start with I1, which is simpler, then see if there is a need to move to I4. Personally I don’t think we really need stronger consistency of packages across the CI spectrum; sometimes we might want to diversify it a bit (e.g. use different compilers such as clang/gcc). We are also producing different tlcpack binary packages for different systems, so we could derive the cpu package from ci-cpu and the gpu package from ci-gpu.

Additionally, most of the relevant dependencies should already be captured by the constraints file (e.g. tensorflow==1.6), so there is not a lot of room to pin things down further.

here’s what I don’t like about I1:

  • it doesn’t ensure dependency versions are compatible across all the various extras in each container architecture
  • it still means that ci-lint or ci-gpu could wind up with different package versions from ci-cpu

I do agree that most of the large direct dependencies are quite pinned, and we could probably accomplish a lot from a ci-constraints.txt which just explicitly pins those direct dependencies.

Yes, I agree with you that different containers could have different packages for things that are not pinned (thus my comment about the need for strong consistency among ci variants).

My last post is mainly about how big an issue that can actually cause, since we can pin down most direct deps, and also build binary packages for different platforms independently (from the corresponding ci container).

one thing that’s particularly annoying is when an error occurs in ci-gpu. I can’t run ci-gpu locally because I don’t have nvidia-docker or a compatible GPU, and people on e.g. Windows will be in exactly the same boat. it’s sometimes possible to quickly run the failing code locally in e.g. ci-cpu and reasonably expect it to fail in the same way. but you may or may not reproduce the error in doing that. in this case, you may wonder whether there is a discrepancy between Python deps that’s responsible, but there’s no good way to eliminate that variable.

I agree with your sentiment that strong consistency might not contribute to the overall health of the CI, but lack of consistency can make problems harder to debug when you don’t have access to the full spectrum of CI variants. I think we should be judicious when we allow the CI to vary, and i’m not sure that e.g. allowing attrs to vary between containers buys us anything other than uncertainty.

there are a couple of other problems that arise from a lack of standardized packages:

  • if you want to replicate the CI on some other linux system: which container’s pip freeze should you use? right now, you have to be a CI expert to know the answer to that. it would be hard, as a newcomer to the project, to decipher which container tests which functionality, so you probably just guess and hope you picked right.
  • if we ever start relying on dev packages from ci-gpu which are also present in ci-lint (i.e. rerunning pylint for some reason), then discrepancies in those precise versions could cause late-stage CI failures
  • it makes it difficult to rebuild the CI containers if you want to change something other than the Python packages, since you can’t avoid signing up to update them.

Okay, it seems like so far as lazy consensus goes, we have support for:

  1. Installing Python dependencies based off of the generated requirements from python/gen_requirements.py.
  2. Creating a ci-constraints.txt file based off of e.g. semver or other restrictions of those requirements to selectively pin TVM’s immediate dependencies
  3. Checking in the output of pip freeze from each container.
  4. No change to require all ci- containers to be built at once (though they will all be built one time after this change is implemented, to check in the pip freeze output).

From this, we can use the pip freeze output and ci-constraints.txt output to drive tlcpack constraints and a developer-facing requirements.txt, which was my ultimate goal with these RFCs. Further RFCs will be posted to propose those processes.

In the future, we can consider further consolidating container dependencies into a lockfile. For now, the ci-gpu container is going to install everything, so there should be some path to achieving a configuration of our dependencies that satisfies pip install.

As I’m working on this now finally, we discussed the implementation so far in the TVM Community Meeting this morning. Here are notes:

  • @kparzysz: Should we consider conda for this? It locks down not only Python dependencies but also C++ dependencies.

    • @tqchen originally preferred to have requirements.txt so users weren’t forced to use conda. This is the reason why gen_requirements.py creates requirements.txt.
    • @mousius mentioned that he’s had teams move off conda because it’s too heavy-handed and just wraps the Python dependency problem without adding too much value
    • @hogepodge notes that conda dependency solvers can take hours to update, and that’s contributed to an overall negative experience with conda for him
    • @areusch mentioned that adding another packaging system in the loop could make it hard for dependencies to stay current
    • Ioannis says that adding conda would add yet another dependency to TVM, and that poetry is built on pip.
  • @kparzysz asks: if we’re going to consider relaxing the ci-constraints.txt in the released package, shouldn’t we consider testing against those relaxed constraints? won’t there be a combinatorial explosion there?

    • @areusch notes: we could consider using tox to test different combinations, but another question here is: we should probably unit-test the apache-tvm wheel before we release it, but right now we run the tests in-tree so we might not have any expectation they’d pass against the wheel. the Python ecosystem does kind of have an expectation that things float around, and we should be judicious about what we place in install_requires, since that could constrain downstream users.
    • @driazati notes: we aren’t going to necessarily be able to test all combinations, but we can look at reported bugs and constrain as necessary. one concern about departing from a normal way to specify dependencies in setup.py is that if someone does want to fix a weird versioning problem by constraining it in the release, they’ll have to learn our bespoke process and that’s extra work on this proposal.
  • @driazati notes: we could also produce purpose-built CI images–i.e. one for GPU unit tests, one for GPU integration tests, etc. this would help shrink them from e.g. the current 20GB size, which is huge by docker image standards.

    • @areusch notes: would be good to see how many images we might want and see how much disk savings there are, and whether we could build e.g. integration tests on top of unit tests.

@kparzysz: re: the incompatibilities between tensorflow and paddlepaddle, could we avoid installing the two at once to avoid over-burdening users?

  • @areusch: we do declare extras right now, but you can’t install two incompatible extras at once. this does mean that if we did get to a scenario where we couldn’t work around this with PEP 508 environment markers, we’d need to move to an architecture where e.g. an import was done by invoking a subprocess, and therefore the conflicting packages could live in a separate virtualenv.

It seems like poetry lock is actually platform-agnostic; I see no differences between the 3 different poetry.lock files generated by running poetry on aarch64, i386, and x86_64. That’s great, because it means we can simplify the approach here and just add a single image without needing to make every ci_ image depend on a common base. I’ll update the PR to take that approach.