[CI] How to run your own TLCPack CI (and: proposing future improvements to https://ci.tlcpack.ai)

Hi all,

One piece of documentation that’s been sorely missing from TVM is a complete picture of how it is tested. In collaboration with the OctoML Infrastructure team, I’ve been working to fill this gap by creating an Infrastructure-as-Code repository that fully describes the TLCPack CI (e.g. Jenkins running at https://ci.tlcpack.ai, which provides CI services for the TVM project). An Infrastructure-as-Code repo documents the parts of the CI outside the Jenkinsfile which nevertheless affect the outcome of the CI. As examples:

  • Which AWS nodes are used to run e.g. ci-gpu tasks
  • How Jenkins is configured
  • The serving infrastructure used to run https://ci.tlcpack.ai

There are quite a few use cases for such a repository:

  1. It explains to the community how we configure the CI
  2. It serves as a reference for those who would like to run their own internal copy of the CI, particularly those who want to test yet-to-be-contributed code or run on hardware not available in the cloud.
  3. It allows us to scale maintenance operations on the TLCPack CI
  4. It brings in operational knowledge from DevOps engineers, who are often more experienced with the intricacies of cloud hosting.
  5. It provides us a path to standardize the CI runtime environment around a documented configuration.

This project is still ongoing and there are a few different milestones:

  1. Build a repository which can be used to launch a Jenkins instance plus executors which can run a TVM Jenkinsfile end-to-end and achieve a passing result. # <-- we are here
  2. Productionize the repository (e.g. automate common tasks that occur when running the production https://ci.tlcpack.ai server) so that it can be used with long-running Jenkins instances.
  3. Perform a long-running test of the Jenkins server and e.g. compare CI results over a period of time against the currently-running CI to ensure we’ve exactly matched the configuration.
  4. Propose to replace the production https://ci.tlcpack.ai with one managed by the Infrastructure-as-Code repo.

Ultimately, the final milestone here ensures that we have a repository that accurately documents the production CI. A separate RFC and community discussion will follow at the point in time we are ready to consider doing this.

However, for now I want to share the repository we’ve developed to meet milestone 1. This repository allows you to launch a copy of the TLCPack CI in your own AWS account.
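The repo's README is the authority on the exact launch procedure; as a rough sketch of what "launch a copy in your own AWS account" typically looks like for an Infrastructure-as-Code repository, assuming a Terraform-based layout (the tool choice and the `jenkins_instance_type` variable are illustrative assumptions, not documented features of the repo):

```shell
# Hypothetical workflow for standing up a personal copy of the CI.
# Terraform and the variable name below are assumptions -- consult the
# repo's README for the real steps and variable names.

# Authenticate against your own AWS account first.
export AWS_PROFILE=my-ci-sandbox

# Initialize providers/modules, then preview what will be created.
terraform init
terraform plan -var="jenkins_instance_type=m5.xlarge"

# Launch the Jenkins instance and executors.
terraform apply -var="jenkins_instance_type=m5.xlarge"

# Tear everything down when finished to avoid ongoing AWS charges.
terraform destroy
```

Whatever the actual tooling, the key property is the same: the full Jenkins-plus-executors setup is reproducible from the repo rather than hand-configured.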

I’d love to get any feedback you may have on this repo. I’d also be interested in feedback on the idea of collaborating on a repository like this one as we continue developing the CI in the long term.

One known problem I’m addressing: the ARM container should be publicly buildable, but it is currently privately owned. I’ll attempt to fix this next week; in the meantime, you’ll need to set the number of ARM nodes to 0.
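If the executor counts are exposed as input variables in a Terraform-style setup, disabling the ARM nodes might look something like the following (the variable name `arm_node_count` is a hypothetical placeholder; check the repo's variables file for the real name):

```shell
# Hypothetical: override the ARM executor count when provisioning, so the
# privately-owned ARM container is never needed. The variable name is an
# assumption, not taken from the repo.
terraform apply -var="arm_node_count=0"
```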

Andrew
