[RFC] Consolidating TVM Python Dependencies

@tqchen @mjs thanks for your replies

There is also a new pip dependency resolver that is about to become the default. It should alleviate a number of the reproducibility problems, though I'm not sure whether there is a replacement flow for generating e.g. a CI lockfile. Assuming everyone upgrades to pip 20.3, the main concern we need to address in the long run is how we manage the list of dependencies in the repo. In practice, though, I don't think this upgrade will be standard for Linux users until the major distributions pick it up.

Regarding install_requires vs requirements.txt: this makes sense to me. I think it means we could use the following rules to decide when to promote a version-constrained requirement from requirements.txt to install_requires (NOTE: every package listed in requirements.txt should appear in install_requires, just possibly without a version constraint); a sketch of applying these rules mechanically follows the list:

  1. Semver dependencies should be placed into install_requires as foo >= X.Y, foo < (X+1). We should only constrain semver dependencies down to the minor version level (and even then, only when absolutely necessary).
  2. “At least” dependencies (of the form foo >= 2.1) should be placed into install_requires. We should scrutinize these to see whether they really follow semver, and we should also consider adding a foo < (X+1) constraint. An “at least” dependency may not explicitly state that it follows semver rules, so it won't always be obvious that rule #1 applies, but given a clear versioning pattern, capping below the next major version is probably prudent.
  3. Precise version pins in requirements.txt should be carried into install_requires, but we should reserve exact pins for extreme cases.
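
To make the promotion concrete, here is a minimal sketch (not an actual TVM tool; the `promote` name and the limited requirement formats it accepts are illustrative assumptions) of turning a requirements.txt-style line into an install_requires entry per rules #1–#3:

```python
# Sketch only: derive an install_requires entry from a requirements.txt line.
import re

def promote(requirement: str) -> str:
    """Turn e.g. "foo >= 2.1" into "foo >= 2.1, < 3" (rules #1/#2) and pass
    exact pins (rule #3) or bare package names through unchanged."""
    match = re.match(r"^\s*([A-Za-z0-9_.\-]+)\s*(>=|==)?\s*([0-9.]+)?\s*$", requirement)
    if not match:
        raise ValueError(f"unrecognized requirement: {requirement!r}")
    name, op, version = match.groups()
    if op is None:
        # No constraint in requirements.txt: list just the package name.
        return name
    if op == "==":
        # Rule #3: carry the exact pin through unchanged.
        return f"{name} == {version}"
    # Rules #1/#2: keep the lower bound, cap below the next major version.
    next_major = int(version.split(".")[0]) + 1
    return f"{name} >= {version}, < {next_major}"

assert promote("numpy >= 1.19") == "numpy >= 1.19, < 2"
assert promote("attrs") == "attrs"
```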

Now, regarding the file layout:

  • I like @tqchen's proposal for requirements.txt and requirements-extra.txt. These can live in the python/ subdirectory of the tvm repo. Potentially, setup.py could just apply some set of rules (i.e. the ones above) to generate install_requires from these files without being overly specific; a sketch of that follows this list.
  • For the CI: it's not clear to me that we need to pin beyond what's done in requirements.txt. A constraints file makes sense if we do need that; the main case I can think of is when a package introduces an accidentally-breaking revision. If we are cutting a release and the CI needs constraints beyond requirements.txt, perhaps we should consider promoting those constraints to install_requires. Finally, we do need a way for developers to get a list of which versions ended up being used in the CI (because right now, if you don't have a GPU, you can't produce this list), but we don't need to discuss that here.
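
As a rough illustration of the setup.py side, here is a sketch (file names and layout are assumptions, not the final proposal) of reading the two requirements files into install_requires and extras_require; the rule-application helper sketched above could additionally be applied to each parsed line:

```python
# Sketch only: derive install_requires/extras_require from requirements files
# living next to setup.py.  File names here are assumptions, not the final
# TVM layout.
import os

def parse_requirements(path):
    """Read a pip-style requirements file, skipping comments and blank lines."""
    entries = []
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if line:
                entries.append(line)
    return entries

here = os.path.dirname(os.path.abspath(__file__))
install_requires = parse_requirements(os.path.join(here, "requirements.txt"))
extras_require = {
    "extra": parse_requirements(os.path.join(here, "requirements-extra.txt")),
}
# These lists would then be passed to setuptools.setup(install_requires=...,
# extras_require=...).
```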

Finally, I'd like to think through some common dependency-management cases:

C1. Someone adds a new core dependency

  1. Edit requirements.txt and insert the new dependency in alphabetical order.
  2. Ensure no requirements-extra.txt file also specifies this dependency.
  3. Run a tool to validate requirements.txt (setup.py? a sketch of such a check follows this list).
  4. Update the CI containers and submit a Jenkinsfile change.
  5. Submit the requirements.txt change in a PR along with the new Python code that uses the dependency.
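
For step 3, the validation tool could be as simple as the following sketch (file names and the specific checks are assumptions): it verifies that requirements.txt stays alphabetized and that no package is listed in both files.

```python
# Sketch of a requirements validation step: check alphabetical order in
# requirements.txt and flag packages duplicated in requirements-extra.txt.
import re
import sys

def package_names(path):
    """Extract lowercase package names from a pip-style requirements file."""
    names = []
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()
            if line:
                names.append(re.split(r"[<>=!~\s]", line, maxsplit=1)[0].lower())
    return names

def main():
    core = package_names("requirements.txt")
    extra = package_names("requirements-extra.txt")
    errors = []
    if core != sorted(core):
        errors.append("requirements.txt is not in alphabetical order")
    duplicates = set(core) & set(extra)
    if duplicates:
        errors.append("listed in both files: " + ", ".join(sorted(duplicates)))
    for error in errors:
        print(f"error: {error}", file=sys.stderr)
    sys.exit(1 if errors else 0)

if __name__ == "__main__":
    main()
```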

C2. Someone adds a new extras dependency

  • Same as C1, but with requirements.txt and requirements-extra.txt swapped.
  • The test path in CI isn't as clear, but that's a separate problem.

C3. A pinned or semver package's version needs to be updated

  1. Edit requirements.txt to update the version constraint.
  2. Test locally.
  3. Rebuild the CI containers with the new version.
  4. Test with the new CI containers (how: TBD).
  5. Update the CI containers and submit a Jenkinsfile change.
  6. Submit the requirements.txt change in a PR along with any code changes needed for the new version.

These all seem reasonable to me, though we should consider how to improve the “update the CI containers” step in a separate RFC; specifically:

  • ensuring all containers use the same dependency versions (a rough sketch of such a check follows this list)
  • documenting the actual versions used
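
For the first point, one possible direction (not existing TVM tooling; capturing `pip freeze` output from each container is assumed to happen elsewhere) is a small script that diffs the frozen versions across containers:

```python
# Rough sketch: compare `pip freeze` output captured from each CI container
# and report packages whose installed versions differ between containers.
import sys

def parse_freeze(path):
    """Parse `pip freeze` output into a {package: version} dict."""
    versions = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if "==" in line:
                name, version = line.split("==", 1)
                versions[name.lower()] = version
    return versions

def main(paths):
    freezes = {path: parse_freeze(path) for path in paths}
    all_packages = set()
    for versions in freezes.values():
        all_packages.update(versions)
    mismatched = False
    for package in sorted(all_packages):
        seen = {path: versions.get(package, "<missing>")
                for path, versions in freezes.items()}
        if len(set(seen.values())) > 1:
            mismatched = True
            print(package + ": " + ", ".join(f"{p}={v}" for p, v in seen.items()))
    sys.exit(1 if mismatched else 0)

if __name__ == "__main__":
    main(sys.argv[1:])
```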