Currently we don’t do any automated testing to make sure our Docker images are healthy, so it is not uncommon for the images to be broken without us having any visibility of the issues. We only find out when we decide to update the images, and by then fixing them causes massive pain (e.g. “Rebuild ci-arm, ci-cpu, and ci-gpu container”, Issue #8177 in apache/tvm on GitHub).
Rebuilding our Docker images takes at least a couple of hours on its own, so it is currently impractical to rebuild them for every PR or merge.
To give visibility of issues in our Dockerfiles, I’d like to propose an automated build that uses our existing infrastructure to regenerate the images from scratch once a day. That way problems can be spotted early without further increasing the time it takes to validate our PRs, and the work of maintaining the images can be spread across the community.
This proposal can be implemented by two independent Jenkins pipelines. Here is a summary of what they would do:
- P1: daily-docker-images-rebuild: fetches the latest Dockerfile definitions from the TVM repository and rebuilds the images from scratch. If successful, uploads the images to a “tlcpack-staging” (provisional name) DockerHub account.
- P2: daily-docker-image-validate: pulls the latest images from “tlcpack-staging” (provisional name) and runs our existing tests on them, with the latest TVM sources.
As mentioned, the pipelines are independent, and they are not expected to make any changes to our production CI (the images used to run our CI on GitHub pull requests) without manual intervention.
Going into a bit more detail on what each pipeline would accomplish:
daily-docker-images-rebuild
- This pipeline is triggered by a timer, running once a day
- Fetches the latest TVM sources
- Uses `docker/build.sh` to rebuild all images currently used in CI: `ci_lint`, `ci_cpu`, `ci_gpu`, `ci_arm`, `ci_i386`, `ci_wasm` and `ci_qemu`.
- Tags images with two tags: `latest` and a timestamp-based tag `YYYY-MM-DD-HH-MM-SS-<short_latest_tvm_git_hash>`
- If successful, uploads them to an account on DockerHub
- In all cases (success or failure), sends notifications somewhere visible to the community, e.g. a Discord channel or mailing lists
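To make the tagging scheme concrete, here is a minimal Python sketch of how the two tags and the resulting upload references could be computed. It is only an illustration, not the draft job itself: the function names, the `account/image:tag` repository layout, and the example hash `abc1234` are assumptions of mine, and the account name is the provisional one from this post.

```python
from datetime import datetime, timezone
from typing import List

# Image names as listed in this proposal.
CI_IMAGES = ["ci_lint", "ci_cpu", "ci_gpu", "ci_arm", "ci_i386", "ci_wasm", "ci_qemu"]


def image_tags(built_at: datetime, short_git_hash: str) -> List[str]:
    """Return the two tags for one rebuild: `latest` and a timestamp-based tag."""
    stamp = built_at.strftime("%Y-%m-%d-%H-%M-%S")
    return ["latest", f"{stamp}-{short_git_hash}"]


def staging_references(account: str, tags: List[str]) -> List[str]:
    """List every `account/image:tag` reference a successful run would upload.

    The repository layout (one repository per image, named after the image)
    is an assumption made for this sketch.
    """
    return [f"{account}/{image}:{tag}" for image in CI_IMAGES for tag in tags]


if __name__ == "__main__":
    # "abc1234" stands in for the short hash of the latest TVM commit.
    tags = image_tags(datetime.now(timezone.utc), "abc1234")
    for ref in staging_references("tlcpack-staging", tags):
        print(ref)
```

Keeping a timestamped tag next to `latest` means a broken daily build never hides the last image known to work, since each run leaves its own immutable reference behind.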
daily-docker-image-validate
- This pipeline is triggered by a successful “daily-docker-images-rebuild”
- Fetches the latest TVM and runs the existing `tvm/Jenkinsfile`, pointing at the images generated by the pipeline above
- In all cases (success or failure), sends notifications somewhere visible to the community, e.g. a Discord channel or mailing lists
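One possible way to point the existing Jenkinsfile at the staging images is to rewrite its image declarations before the run. The Python sketch below assumes the Jenkinsfile declares each image as a top-level assignment of the form `ci_<name> = "<image reference>"`; that format, the helper name `point_at_staging`, and the sample text are assumptions for illustration, so the real mechanism may well end up different (for example, pipeline parameters).

```python
import re

# Matches lines such as: ci_cpu = "some/image:tag"  (assumed declaration format)
IMAGE_ASSIGNMENT = re.compile(r'^(ci_\w+)\s*=\s*"[^"]*"', flags=re.MULTILINE)


def point_at_staging(jenkinsfile_text: str, account: str, tag: str) -> str:
    """Rewrite every `ci_<name> = "..."` assignment to use a staging image."""

    def replace(match):
        name = match.group(1)
        return f'{name} = "{account}/{name}:{tag}"'

    return IMAGE_ASSIGNMENT.sub(replace, jenkinsfile_text)


if __name__ == "__main__":
    # Invented sample text, not the contents of the real Jenkinsfile.
    sample = 'ci_lint = "example/ci-lint:v1"\nci_cpu = "example/ci-cpu:v1"\n'
    print(point_at_staging(sample, "tlcpack-staging", "latest"))
```

Because the rewrite happens only in the validation job’s working copy, the Jenkinsfile used for pull requests is left untouched, which matches the goal of not changing production CI without manual intervention.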
Next steps
I have a draft job that implements “daily-docker-images-rebuild”, and I’ll be posting that in the next few days. In the meantime, I’d like to ask for feedback and ideas on how to deal with the issues described here.
cc @areusch @tqchen @ramana-arm @haichen @jroesch @thierry @Lunderberg @mbrookhart