[RFC] Rebuild Docker images per commit

Summary

Rebuild the Docker images on every build rather than using the pinned Docker Hub images referenced in the Jenkinsfile.

Guide

Note: This is a spin off discussion from https://discuss.tvm.apache.org/t/rfc-a-proposed-update-to-the-docker-images-ci-tag-pattern/12034/2

We currently run various stages in Jenkins largely through stage-specific Docker images, such as ci_cpu for CPU jobs, ci_gpu for GPU, etc. These images are uploaded manually to the tlcpack Docker Hub account after a lengthy update process (example).

Instead, we could build the Docker images on each job (i.e. on every commit to a PR branch or to main). This immediately removes the need for the manual Docker Hub update process. However, in the simplest setup this would add significant runtime to CI jobs (roughly 30 minutes). Using Docker layer caching and automated updates should limit that overhead to the cases where it is actually needed (i.e. the infrequent changes to the Docker images); a rough sketch of the flow follows the list below.

The process would look like:

  • Author pushes a commit; Jenkins runs lint and kicks off a job per execution environment (e.g. ARM, CPU, GPU)

  • In each environment, docker pull the relevant base image from the tlcpack Docker Hub to populate the build cache

  • Run docker build for the relevant image using the code from the commit

  • Pack the Docker image as a .tgz for use in later stages

  • Run the rest of the build using the saved image

  • If run on main and the Docker image changed, upload the new version to tlcpack automatically
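As a concrete but hypothetical sketch of the per-environment steps above, using the CPU image as an example (the image names, tag, Dockerfile path, and artifact name are placeholders, not the actual scheme):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical image names and tags; the real naming scheme would need deciding.
BASE_IMAGE=tlcpack/ci-cpu:v0.80          # last published image, used as cache
LOCAL_IMAGE=tvm-ci-cpu:${GIT_COMMIT}     # image rebuilt from this commit

# 1. Pull the last published image so its layers populate the build cache.
docker pull "${BASE_IMAGE}" || true

# 2. Rebuild the image from the Dockerfile in the commit, reusing cached layers.
docker build \
  --cache-from "${BASE_IMAGE}" \
  -t "${LOCAL_IMAGE}" \
  -f docker/Dockerfile.ci_cpu docker/

# 3. Pack the image so later stages can load it instead of rebuilding.
docker save "${LOCAL_IMAGE}" | gzip > ci_cpu_image.tgz

# Later stages would then do roughly:
#   docker load < ci_cpu_image.tgz
#   docker run --rm "${LOCAL_IMAGE}" <build/test command>

# 4. On main only, and only if the image changed: publish it to tlcpack.
#   docker tag "${LOCAL_IMAGE}" tlcpack/ci-cpu:<new tag>
#   docker push tlcpack/ci-cpu:<new tag>
```

When nothing under docker/ has changed, step 2 should be almost entirely cache hits, so the extra cost over today's flow would mostly be the initial pull.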

We can update main automatically because changes no longer need to go through ci-docker-staging: they run as normal PR tests, and the Docker images are rebuilt and used within the PR itself.

Drawbacks

This requires experimentation to ensure it can be implemented with low overhead. The Docker images for the CPU and GPU environments are large (around 25 GB), and moving that much data around may have a noticeable impact on runtime.

cc @manupa-arm @leandron @areusch


Hi @driazati ,

I would support this.

This would be a great improvement, as patches would always be verified in the environment they are meant to be verified in, without having to merge Docker changes first and then run the ci-docker-staging job with the other changes.

@Mousius what do you think?

I think every advance that closes the gap between the Docker images being updated and the PRs is very welcome.

One of the reasons this isn't already in place, even though it seems like the logical approach, is security (based on a chat with @tqchen a while ago). We can't blindly rebuild Docker images for any PR, because that opens the door for random people to run arbitrary commands on our Jenkins nodes just by submitting a PR that changes our shell scripts, e.g. build.sh or bash.sh, which run outside a container.

Does the proposed change here address this fundamental issue of the way our CI is organised?

Running arbitrary commands directly on the node is already possible in several places in the Jenkinsfile, wherever repository scripts are invoked on the host before a container is started.
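For example (an illustrative sketch, not quotes of actual Jenkinsfile lines), steps along these lines invoke repository scripts directly on the host, before any container exists:

```bash
# Repository scripts that the Jenkinsfile runs on the node itself, outside of
# any container (illustrative): a PR that edits them controls what runs here.
docker/build.sh ci_cpu                               # builds/pulls the CI image
docker/bash.sh ci_cpu ./tests/scripts/task_build.sh  # wrapper around docker run
```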

So our security is relatively poor even today. We can mitigate risks by:

  • ensuring every step is wrapped in a reasonable timeout
  • getting rid of persistent nodes (workers) and using only autoscaled nodes
  • requiring approval to run CI for first-time contributors (GitHub Actions update: Helping maintainers combat bad actors | The GitHub Blog), though Jenkins may not support this easily
  • making sure that scripts run outside of Docker are checked out from the target branch rather than the PR branch for forked PRs (similar to how we manage the Jenkinsfile now; sketched below). This would be troublesome for CI development by non-committers (i.e. me) but should lock down most of the vulnerability surface. We could take this further and only rebuild Docker images on branches, which would still make testing / updating easier without the risks.
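A minimal sketch of that last mitigation, assuming the host-executed scripts are the docker/ wrappers (the remote name and paths are assumptions, not the actual implementation): for a forked PR, re-check those files out from the target branch before anything runs on the host.

```bash
# For forked PRs: take the host-executed wrapper scripts from the target branch
# (e.g. main) rather than from the PR head, before any of them run on the node.
git fetch origin main
git checkout origin/main -- docker/build.sh docker/bash.sh
```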

@driazati @leandron ,

I think this proposal will benefit all the work that requires updates to dependencies. @masahi @Leo-arm @elenkalda-arm

I would suggest we scope this proposal to the scripts that are relevant to it (since, as you note, there are already other places an attacker could exploit anyway). Isn't it just build.sh that we need to check out from main?

make sure scripts run outside of docker are checked out from the target branch and not the PR branch for forked PRs (similar to how we manage the Jenkinsfile now).

I think this approach should address the concern. @driazati, I understand the downside of not being able to test things out in the upstream CI, but how much of a concern is that for the scripts in question (I'm thinking it is just build.sh, but maybe I am wrong)?

We could take this further and only rebuild docker images on branches, which would still make testing / updating easier without the risks.

I am not sure I follow this proposal. Can you elaborate?

cc : @areusch

It's mostly that we trust all code running on a branch but not necessarily code in PRs from forks, the same way Jenkins treats the Jenkinsfile today. After thinking about this for a bit I don't think it's a huge deal, since nodes are mostly ephemeral; we just need to ensure we don't pass secrets via environment variables to untrusted code. Also, we probably wouldn't want to rebuild Docker images on every commit anyway, simply because the images themselves are quite large (25+ GB in some cases), but we can pretty easily detect changes to the docker/* directory and rebuild only in that case.
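Detecting that could be as simple as diffing against the merge base with the target branch (a sketch; the remote and branch names are assumptions):

```bash
# Rebuild the CI images only when something under docker/ changed relative to
# the target branch; otherwise fall back to the pinned tlcpack images.
git fetch origin main
if git diff --name-only origin/main...HEAD -- docker/ | grep -q .; then
  echo "docker/ changed; rebuilding CI images for this run"
else
  echo "docker/ unchanged; using the pinned tlcpack images"
fi
```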

I support improving the process of updating the CI docker images. A couple of thoughts here:

  • I agree the Jenkinsfile security is a little bit arbitrary. I think maybe there could be an aspect of protecting the Jenkins master (i.e. I think you can get references to internal shared Java objects from the Jenkinsfile and mess with them). I agree it doesn’t stop folks from running arbitrary commands.
  • Generally speaking we don’t tend to reuse layers in Docker containers, so pulling the base image may make the docker build shorter, but I don’t think we should expect the typical docker build to result in a small incremental change. Rather, these images are quite large, and I’m not sure we should be stashing them via Jenkins.
  • A related problem arises once executors are ephemeral: previously builds started almost immediately, but now it’s much more likely we’ll need to pull a Docker image first. I wonder if we might consider a pull-through cache or some other type of local registry as a way to both pull images and store builds that originated in the CI (a sketch follows this list).
  • From a security perspective, I think ephemeral machines help; I’d also add pre-declaring network dependencies and firewalling the executors as possible remedies.
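For what it's worth, the stock registry image can run as a pull-through cache of Docker Hub; a rough sketch follows (the host name, port, and storage path are assumptions), with the caveat that a proxy-mode registry only serves pulls, so storing CI-built images would need a separate plain registry alongside it:

```bash
# Run a Docker Hub pull-through cache near the Jenkins executors.
docker run -d --name dockerhub-mirror \
  -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  -v /var/lib/registry-cache:/var/lib/registry \
  registry:2

# Each executor's Docker daemon is then pointed at the mirror, e.g. in
# /etc/docker/daemon.json:
#   { "registry-mirrors": ["http://dockerhub-mirror.internal:5000"] }
```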