CI Monitoring Rotation

TVM’s CI fails on main and on PRs with unrelated changes all the time (some examples from the last few days: a, b, c, d). These failures come from many areas, mainly flaky test cases and CI infra issues. We should establish an on-call rotation in which one specific person is on the hook for identifying and triaging these issues so we can get them under control.

The work required would be pretty minimal:

  • If there is a failure on main and it’s a test case: file an issue from the link in the failing job and open a PR to disable the test (or to fix it if the error is trivial, such as a floating-point comparison with too tight a tolerance). Try to look at the git history and tag relevant people.
  • Otherwise, report an issue with [ci] in the title so subscribed people are tagged.
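
To make the two cases above concrete, here is a minimal sketch of the triage rule in Python. It is only an illustration, not existing TVM tooling: the `Failure` record, its fields, and the `triage` function are hypothetical names, and the returned strings simply restate the steps from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    """Hypothetical record describing one failed CI job."""

    on_main: bool              # did the job run against main?
    is_test_case: bool         # was the failure caused by a specific test?
    trivial_fix: bool = False  # e.g. a tolerance that is too tight


def triage(failure: Failure) -> List[str]:
    """Return the steps the on-call person would take for this failure."""
    if failure.on_main and failure.is_test_case:
        steps = ["file an issue from the link in the failing job"]
        if failure.trivial_fix:
            steps.append("open a PR to fix the test")
        else:
            steps.append("open a PR to disable the test")
        steps.append("look at the git history and tag relevant people")
        return steps
    # Everything else (infra problems, failures that are not a test case, ...)
    return ["report an issue with [ci] in the title"]
```

For example, `triage(Failure(on_main=True, is_test_case=False))` falls through to the second bullet and returns only the `[ci]` issue step.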

To implement it, we could use a Google Calendar with 1-week rotations and coordinate in the #tvm-ci-failures Discord channel (which already gets a link posted automatically any time a job fails on main).

cc @areusch @Mousius @leandron @MJKlaiber @gromero @comaniac @kparzysz


Awesome idea @driazati! This is also a great way to get started in the community for newcomers :smile_cat:

Rather than a Google Calendar, do you think we could reuse some of the issue tooling we already have to let people sign up and opt out? It could regenerate a schedule on a weekly cadence, publish it, and ping those on call in the next two weeks so they can compare notes on issues that may have come up.
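
A rough sketch of what such a schedule generator could look like, in Python. Everything here is an assumption for illustration: the function names, the round-robin assignment, weeks starting on Monday, and reading "the next two weeks" as the current week plus the following one. The names in the usage note are placeholders, and it does not touch any real GitHub or Discord API.

```python
from datetime import date, timedelta
from typing import Dict, List, Set


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def build_schedule(
    signed_up: List[str], opted_out: Set[str], start: date, weeks: int
) -> Dict[date, str]:
    """Assign one person per week, round-robin, skipping anyone who opted out."""
    people = [p for p in signed_up if p not in opted_out]
    if not people:
        raise ValueError("nobody is available for the rotation")
    first = week_start(start)
    return {
        first + timedelta(weeks=i): people[i % len(people)]
        for i in range(weeks)
    }


def people_to_ping(schedule: Dict[date, str], today: date) -> List[str]:
    """Return who is on call this week and next week, in that order."""
    this_week = week_start(today)
    upcoming = [this_week, this_week + timedelta(weeks=1)]
    return [schedule[w] for w in upcoming if w in schedule]
```

Calling `build_schedule(["alice", "bob", "carol"], {"bob"}, date(2022, 5, 25), 4)` would alternate weeks between the two placeholder people who did not opt out, and `people_to_ping` on that schedule gives the pair to tag so they can hand over open issues.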

@areusch do we need a process RFC for this?

My thinking is that we don’t need a Process RFC to report CI flakes because we should already be doing that. If we wanted to be able to recognize merit for advancement based solely on someone taking a rotation here, it would be a good idea to do a Process RFC. I’d suggest we start by trying this out and figuring out any further details, then consider writing this up so that it can be broadcast to the community in a stable location.

I like the idea in its current form.

I was planning to start doing this myself a bit and build up a runbook which we could broadcast at the next community meeting.

Our existing tooling is pretty specific to its current use case, and I don’t particularly want us to add another script to the existing pile if we can help it, hence the suggestion to rely on existing tools like Google Calendar and Discord.

I also agree; this seems to be a practical approach.

@driazati regardless of whether we use GitHub Actions, can we get an issue on the repo to provide visibility to other people who may not be active on Discord or these forums?

An issue makes sense, especially if it’s pinned. I got rid of the calendar and moved it to https://github.com/apache/tvm/issues/11462