TVM’s CI fails on `main` and on PRs with unrelated changes all the time (some examples from the last few days: a, b, c, d). The failures come from many areas, mainly flaky test cases and CI infra issues. We should establish an on-call rotation in which one specific person at a time is responsible for identifying and triaging these issues so we can get them under control.
The work required would be pretty minimal:
- If there is a failure on `main` and it’s a test case: file an issue from the link in the failing job and open a PR to disable the test (or fix it if the error is trivial, such as a floating point comparison with too tight a tolerance). Try to look at the git history and tag relevant people.
- Otherwise, report an issue with `[ci]` in the title so subscribed people are tagged.
To implement it we could use a Google Calendar with 1-week-long rotations and coordinate in the #tvm-ci-failures Discord channel (which already gets a link posted automatically any time a job fails on `main`).
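To make the 1-week rotation concrete, here is a minimal sketch of how the person on call for a given day could be computed. The roster names, the start date, and the helper names `on_call` and `shift_bounds` are all hypothetical placeholders for illustration, not a proposed assignment or an existing tool.

```python
from datetime import date, timedelta

# Hypothetical roster and start date; placeholders only.
ROSTER = ["alice", "bob", "carol"]
ROTATION_START = date(2022, 1, 3)  # a Monday, chosen arbitrarily


def on_call(day, roster=ROSTER, start=ROTATION_START):
    """Return the roster entry on call on `day`, with one-week shifts."""
    if day < start:
        raise ValueError("day is before the rotation starts")
    weeks_elapsed = (day - start).days // 7
    # Wrap around to the first person once everyone has had a shift.
    return roster[weeks_elapsed % len(roster)]


def shift_bounds(day, start=ROTATION_START):
    """Return the first and last day of the shift that contains `day`."""
    if day < start:
        raise ValueError("day is before the rotation starts")
    weeks_elapsed = (day - start).days // 7
    first = start + timedelta(weeks=weeks_elapsed)
    return first, first + timedelta(days=6)
```

A shared calendar would serve the same purpose for most people; something like this would only matter if we wanted a bot to tag the current on-call person when a failure link is posted.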
cc @areusch @Mousius @leandron @MJKlaiber @gromero @comaniac @kparzysz