Auto-build and test TVM CI docker images nightly

Currently we don’t do any automated testing to make sure our Docker images are healthy, so it is not uncommon for the images to be broken without us having any visibility of the issues. We only find out when we decide to update the images, at which point it causes massive pain (e.g. Rebuild ci-arm, ci-cpu, and ci-gpu container · Issue #8177 · apache/tvm · GitHub).

Rebuilding our Docker images takes at least a couple of hours on its own, so it is currently impractical to rebuild the images for every PR or merge.

To give visibility into issues in our Dockerfiles, I’d like to propose an automated build that uses our existing infrastructure to regenerate the images from scratch once a day, so that problems can be spotted early without further increasing the time it takes to validate our PRs. It would also allow the work of maintaining the images to be spread across the community.

This proposal can be implemented by two independent Jenkins pipelines. Here is a summary of what they would do:

  1. P1: daily-docker-images-rebuild: fetches the latest Dockerfile definitions from the TVM repository and rebuilds the images from scratch. If successful, uploads the images to a “tlcpack-staging” (provisional name) DockerHub account.
  2. P2: daily-docker-image-validate: pulls the latest images from “tlcpack-staging” (provisional name) and runs our existing tests on them, using the latest TVM sources.

As mentioned, the pipelines are independent, and they are not expected to make any changes to our production CI (the images used to run our CI on GitHub pull requests) without manual intervention.

Going a bit more in detail on what each pipeline would accomplish:

daily-docker-images-rebuild

  • This pipeline is triggered by a timer, running once a day
  • Fetches the latest TVM sources
  • Uses docker/build.sh to rebuild all images currently used in CI: ci_lint, ci_cpu, ci_gpu, ci_arm, ci_i386, ci_wasm and ci_qemu (see the sketch after this list)
  • Tags each image with two tags: latest and a timestamp-based tag YYYY-MM-DD-HH-MM-SS-<short_latest_tvm_git_hash>
  • If successful, uploads the images to an account on DockerHub
  • In all cases (success or failure), sends notifications somewhere visible to the community, e.g. a Discord channel or mailing list
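
To make this a bit more concrete, here is a rough sketch of what such a Jenkinsfile could look like. The node label, credentials id, notification address and the local tag produced by docker/build.sh are all assumptions, not a final design:

```groovy
pipeline {
    agent { label 'docker-build' }      // hypothetical builder node label
    triggers { cron('H 2 * * *') }      // once a day, hash-spread by Jenkins
    stages {
        stage('Checkout TVM') {
            steps {
                git url: 'https://github.com/apache/tvm.git', branch: 'main'
            }
        }
        stage('Rebuild and push CI images') {
            steps {
                script {
                    def images = ['ci_lint', 'ci_cpu', 'ci_gpu', 'ci_arm',
                                  'ci_i386', 'ci_wasm', 'ci_qemu']
                    def shortHash = sh(script: 'git rev-parse --short HEAD',
                                       returnStdout: true).trim()
                    def stamp = sh(script: 'date +%Y-%m-%d-%H-%M-%S',
                                   returnStdout: true).trim()
                    withCredentials([usernamePassword(
                            credentialsId: 'tlcpack-staging-dockerhub',  // hypothetical credentials id
                            usernameVariable: 'DH_USER',
                            passwordVariable: 'DH_PASS')]) {
                        sh 'echo "$DH_PASS" | docker login -u "$DH_USER" --password-stdin'
                        for (img in images) {
                            sh "./docker/build.sh ${img}"
                            // docker/build.sh produces a locally tagged image; the exact
                            // local name may differ, so treat these tag commands as illustrative.
                            sh "docker tag tvm.${img}:latest tlcpack-staging/${img}:latest"
                            sh "docker tag tvm.${img}:latest tlcpack-staging/${img}:${stamp}-${shortHash}"
                            sh "docker push tlcpack-staging/${img}:latest"
                            sh "docker push tlcpack-staging/${img}:${stamp}-${shortHash}"
                        }
                    }
                }
            }
        }
    }
    post {
        always {
            // Placeholder notification; a Discord webhook or mailing list would work too
            // (emailext assumes the Email Extension plugin is installed).
            emailext(to: 'tvm-ci-notifications@example.org',
                     subject: "daily-docker-images-rebuild: ${currentBuild.currentResult}",
                     body: "${env.BUILD_URL}")
        }
    }
}
```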

daily-docker-image-validate

  • This pipeline is triggered by a successful run of “daily-docker-images-rebuild”
  • Fetches the latest TVM sources and runs the existing tvm/Jenkinsfile, pointing it at the images generated by the pipeline above (see the sketch after this list)
  • In all cases (success or failure), sends notifications somewhere visible to the community, e.g. a Discord channel or mailing list
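
Again as a rough illustration only: one way to wire this up is an upstream trigger plus a parameterized downstream job. The job name tvm-ci-staging and the image parameters are hypothetical; the real mechanism for pointing tvm/Jenkinsfile at different images still needs to be worked out:

```groovy
pipeline {
    agent any
    triggers {
        // Fire only after a successful run of the rebuild pipeline above.
        upstream(upstreamProjects: 'daily-docker-images-rebuild',
                 threshold: hudson.model.Result.SUCCESS)
    }
    stages {
        stage('Run existing CI against the staging images') {
            steps {
                // Hypothetical downstream job: a copy of the regular TVM CI whose
                // Jenkinsfile reads its ci_* image names from parameters instead of
                // the hard-coded tlcpack/ci-* tags.
                build job: 'tvm-ci-staging', parameters: [
                    string(name: 'CI_LINT_IMAGE', value: 'tlcpack-staging/ci_lint:latest'),
                    string(name: 'CI_CPU_IMAGE',  value: 'tlcpack-staging/ci_cpu:latest'),
                    string(name: 'CI_GPU_IMAGE',  value: 'tlcpack-staging/ci_gpu:latest'),
                    // ...and so on for ci_arm, ci_i386, ci_wasm, ci_qemu
                ]
            }
        }
    }
    post {
        always {
            // Same notification placeholder as in the rebuild sketch.
            emailext(to: 'tvm-ci-notifications@example.org',
                     subject: "daily-docker-image-validate: ${currentBuild.currentResult}",
                     body: "${env.BUILD_URL}")
        }
    }
}
```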

Next steps

I have a draft job that implements “daily-docker-images-rebuild”, and I’ll be posting that in the next few days. In the meantime, I’d like to ask for feedback and ideas on how to deal with the issues described here.

cc @areusch @tqchen @ramana-arm @haichen @jroesch @thierry @Lunderberg @mbrookhart


Thanks @leandron for the proposal! I agree this will be a great help in monitoring the container rebuild process for problems and should reduce the headache typically involved with updating containers.

I think we could prototype this first using [CI] How to run your own TLCPack CI (and: proposing future improvements to https://ci.tlcpack.ai) so we can iterate without impacting CI runtime, and then migrate it to the production TVM CI to run at night when CI load is lessened.

Below I scope out a couple ideas for future work which may help to motivate this project.

Future work: use autobuilt containers for production TVM CI

I think it would be interesting to implement this and then consider only allowing containers built by this process to be promoted to official tlcpack/ci-* containers. It’s likely we would need some additional work on top of this to provide a flexible enough interface (e.g. building selected containers on-demand, likely gated to committers) to support this workflow. However, the benefit is that all containers would then be built from a known clean revision of TVM, so a reproducible build is more likely to occur.

To be sure, this approach doesn’t provide 100% reproducibility (the container build process pulls in a bunch of external dependencies, e.g. apt packages, LLVM, etc.), but it ensures those dependencies are documented and gives us a path to collaborate on moving further in that direction, should we so desire.
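
To illustrate what a committer-gated promotion could look like, here is a purely hypothetical sketch; the tags, the version number and the "tvm-committers" approver group are invented:

```groovy
pipeline {
    agent any
    parameters {
        // Hypothetical parameters: the timestamped staging tag produced by the
        // nightly rebuild, and the official version tag to publish.
        string(name: 'STAGING_TAG',  defaultValue: '2021-07-15-02-00-00-abc1234')
        string(name: 'OFFICIAL_TAG', defaultValue: 'v0.77')
    }
    stages {
        stage('Promote ci_gpu') {
            steps {
                // Gate on a human approval; "tvm-committers" is a made-up group name.
                input message: "Promote tlcpack-staging/ci_gpu:${params.STAGING_TAG} to tlcpack/ci-gpu:${params.OFFICIAL_TAG}?",
                      submitter: 'tvm-committers'
                sh """
                    docker pull tlcpack-staging/ci_gpu:${params.STAGING_TAG}
                    docker tag  tlcpack-staging/ci_gpu:${params.STAGING_TAG} tlcpack/ci-gpu:${params.OFFICIAL_TAG}
                    docker push tlcpack/ci-gpu:${params.OFFICIAL_TAG}
                """
            }
        }
    }
}
```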

Future work: Build status dashboard

I think it would be great to also consider creating a concise status dashboard that shows a matrix of the build outcomes by container and date. This would make it easy to diagnose failures and bisect the range of PRs which may be suspect.

Future work: TVM Python dependencies

[RFC] Python Dependencies in TVM CI Containers proposed some efforts to capture the set of Python deps used in the CI and improve their consistency. With this process in place, we should be able to finally build the constraints list of x86_64 dependencies. This would allow us to ensure that Python packages in ci-cpu, ci-gpu, and ci-lint match. This has been a point of confusion for me when debugging CI failures in the past.
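
As a rough idea of how the nightly job could surface such mismatches, a stage like the following could diff the installed Python packages across the staging images (the image names and the availability of pip3 in each image are assumptions):

```groovy
pipeline {
    agent any
    stages {
        stage('Compare Python packages across CI images') {
            steps {
                // Dump the installed Python packages from each staging image and diff
                // them; a non-empty diff fails the build. Assumes the images exist under
                // tlcpack-staging and ship pip3 on PATH.
                sh '''
                    for image in ci_lint ci_cpu ci_gpu; do
                        docker run --rm tlcpack-staging/${image}:latest pip3 freeze | sort > ${image}.txt
                    done
                    diff ci_cpu.txt ci_gpu.txt
                    diff ci_cpu.txt ci_lint.txt
                '''
            }
        }
    }
}
```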


I’ve just submitted a draft Jenkins pipeline that accomplishes the main idea presented here:

There are a few things we still need to agree upon:

Q1: Where do we upload the staging images to?

  • Suggestion: create a new tlcpack owned DockerHub account tlcpack-staging

Q2: What notification do we send when a job fails?

  • Suggestion: create an e-mail list with the interested people.

Q3: How to tackle issues coming from the Docker images rebuild?

  • I think this should be a shared responsibility among the committers. Perhaps something we can discuss in the next community meeting? (cc @hogepodge)

cc @tqchen @mbrookhart @areusch @Mousius


@leandron would you be available to talk about this work at the next TVM Community meeting this Thursday, July 22 at 9 AM PT?

Yes, I’ll add myself to the agenda.

@leandron thanks for posting this! including some of my thoughts replying to your questions:

Q1: Where do we upload the staging images to?

this seems reasonable. Probably a new DockerHub organization called that, and the bot can be a DockerHub account called tlcpack-ci-build-bot.

Q2: what notification we send when a job fails?

either an e-mail list or a Discord notification seems good. @tqchen can comment on whether Discord is sufficient.

Q3: How to tackle issues coming from the Docker images rebuild?

We will need to do this as a community. I think Discord may be a good place for these higher-bandwidth debugging conversations to start. however, we should establish a process for logging failures so that we can keep track of them. in my experience, there is often a long tail of difficult-to-trigger, flaky CI failures. when you have these, you want to create a set of e.g. GH issues and link to the Jenkins log for each of these runs. After you’ve accumulated enough logs, you’ll begin to form theories; at this point, the list of links to failure logs is invaluable to resolving the underlying issue.

I agree that discussing this further at the community meeting makes sense. There are some open questions as to whether we have an “on-call” or something.

[update]

daily-docker-images-rebuild is now available at daily-docker-image-rebuild [docker-images-ci] [Jenkins]


New update:

daily-docker-images-validate is now submitted as a PR at: