Summary
Rebuild Docker images on each CI run rather than using the pinned Docker Hub images in the Jenkinsfile.
Guide
Note: This is a spin off discussion from https://discuss.tvm.apache.org/t/rfc-a-proposed-update-to-the-docker-images-ci-tag-pattern/12034/2
We currently run various stages in Jenkins largely through stage-specific Docker images, such as `ci_cpu` for CPU jobs, `ci_gpu` for GPU jobs, etc. These images are uploaded manually to the tlcpack Docker Hub account after a lengthy update process (example).
Instead, we could build Docker images on each job (i.e. every commit to a PR branch or `main`). This immediately removes the need for the manual update process to Docker Hub. However, in the simplest setup this would add significant runtime to CI jobs (roughly 30 minutes). Using Docker layer caching and automated updates should avoid this overhead except when it is actually needed (i.e. during infrequent changes to the Docker images).
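As a rough sketch of how the caching would keep the common case fast (the image name, tag, and Dockerfile path below are illustrative assumptions, not the actual tlcpack tags), pulling the last published image and passing it as a cache source lets unchanged layers be reused so only modified layers are rebuilt:

```shell
#!/bin/sh
# Hypothetical published image; the real tlcpack tag may differ.
BASE=tlcpack/ci-cpu:v0.80

# Warm the local layer cache with the last published image.
# "|| true" keeps the build going if the pull fails (e.g. first run).
docker pull "$BASE" || true

# Reuse cached layers where the Dockerfile is unchanged; only layers
# that changed (and those after them) are actually rebuilt.
docker build --cache-from "$BASE" \
  -t ci-cpu:pr-build \
  -f docker/Dockerfile.ci_cpu docker/
```

When the Dockerfile has not changed, this build should complete in seconds since every layer comes from the cache.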
The process would look like:

- Author pushes a commit; Jenkins runs lint and kicks off jobs per execution environment (e.g. ARM, CPU, GPU)
- In each environment, `docker pull` the relevant base image from the tlcpack Docker Hub to populate the build cache
- Run `docker build` for the relevant image using the code from the commit
- Pack the Docker image as a `.tgz` for use in later stages
- Run the rest of the build using the saved image
- If run on `main` and the Docker image changed, upload the new version to tlcpack automatically
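The steps above could be sketched roughly as follows (the image names, tags, and Jenkins environment variables are assumptions for illustration, not the final implementation):

```shell
#!/bin/sh
set -e

BASE=tlcpack/ci-cpu:latest          # hypothetical published tag
IMAGE=ci-cpu:${GIT_COMMIT:-local}   # per-build tag

# 1. Populate the layer cache from the last published image.
docker pull "$BASE" || true

# 2. Build the image using the code from this commit.
docker build --cache-from "$BASE" -t "$IMAGE" docker/

# 3. Pack the image as a .tgz so later stages can reuse it
#    without rebuilding.
docker save "$IMAGE" | gzip > ci-cpu.tgz

# ...in a later stage, restore the image from the archive:
docker load < ci-cpu.tgz

# 4. On main, publish the rebuilt image automatically.
if [ "${BRANCH_NAME:-}" = "main" ]; then
  docker tag "$IMAGE" "$BASE"
  docker push "$BASE"
fi
```

A real Jenkinsfile would also need to decide whether the image actually changed (e.g. by comparing image digests) before pushing, which is omitted here.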
We are able to update `main` automatically because changes will no longer need to go through `ci-docker-staging`: they can run as normal PR tests, and the Docker images will be rebuilt and used in the PR itself.
Drawbacks
This requires experimentation to ensure it can be implemented with low overhead. The Docker images for the CPU and GPU environments are large (around 25 GB), and moving that much data between stages may have a noticeable impact on runtime.