[RFC][CI] Add a `[skip ci]` tag to shortcut builds and tests

Summary

TVM has had several breakages on main lately that affect developers. When main is broken it is difficult or impossible to get signal on PRs, blocking everyone’s development until a fix is merged. So it is a high priority that fixes, both reverts and forward fixes, are merged as fast as possible. Right now TVM requires the Jenkins tvm-ci/pr-merge job to finish, which can take several hours. This proposes a flag, [skip ci], that can be added to commit messages to shortcut CI, enabling a quicker turnaround for fixes.

Proposal

If a PR contains the text [skip ci] in the commit subject (the first line of the commit), detect this in Jenkins and skip all steps after the lint / sanity check. This shortens CI runtime from hours down to under 10 minutes. GitHub Actions implements something similar, so this only covers Jenkins.

This would ideally avoid situations like with this PR and the fix, which was not landed until 8 hours later. Developers in the meantime had to deal with CI erroring out in the lint stage and not testing anything.

This flag should mark the corresponding PR with a label to indicate that CI was skipped without making reviewers click through to Jenkins.

Skipping CI is dangerous and can lead to further breakages, so the flag should only be used for fixes and if the forward fix is trivial or it’s a revert. The final call over these things should rest with committers, so they should be the only ones using this flag.

Implementation

As implemented in #9554, Jenkins doesn’t query the PR at all so adding [skip ci] to the PR title does nothing. Developers have to know to include [skip ci] in their commit messages. The behavior can be toggled with empty commits:

# Developer pushes a fix
git commit --message "some fix"
git push

# Shortcut CI
git commit --allow-empty --message "[skip ci] Trigger skipped CI"
git push

# Run full CI
git commit --allow-empty --message "Run full CI"
git push

The Jenkins SCM Skip plugin implements something similar, but the complexity is about the same (i.e. it still needs to be integrated at every usage point and is more difficult to extend).

Open Questions

There are some questions around when the [skip ci] flag should be applied:

  • What tests should be run locally for a CI fix?
  • Who should submit those PRs?
  • What verification should committers do?
2 Likes

I think overall having a mechanism to more quickly apply CI fixes would dramatically improve productivity, and also enable folks to skip ci when they know a) either it does not need to be re-run or b) will be sending follow up changes that can be checked as a unit later saving CI resources for others.

thanks for raising this, @driazati ! I am wary of abusing a mechanism like this, but given how slow TVM CI is, I do agree that having a way to quickly fix or revert PRs that cause build breakages is a good thing. I’m in support of merging such a mechanism.

I do think we should include documentation that guides when it’s okay and not okay for a committer to use this power. I am also wondering if it’s possible to indicate this more prominently (e.g. can Jenkins post up on the PR saying it was skipped? can we require it to be in the PR title?).

I’d suggest these situations when you could skip CI:

  1. The original one: when a previous PR breaks CI and the PR in question is either a rollback or a small fix-forward which has been verified by the TVM committer locally (or the TVM committer trusts the author to have verified the fix). Local verification should not be done when the breakage involves a test that uses the GPU.
  2. When you are updating a CI image and it has already passed CI in another context.

When a committer skips CI, they should monitor the next main build at https://ci.tlcpack.ai.

For all other cases, I’d like us to continue to work to improve the speed of the CI rather than encouraging the community to bypass the CI. Bypassing the CI will create a different problem (where we don’t have visibility into what is slow). I’d like to encourage people to be very judicious when using this power.

Tagging a few others who may have thoughts here: @leandron @mousius @manupa-arm @grant-arm @matt-arm @junrushao1994 @tqchen

Bypassing CI certainly is a band-aid solution for things like “I know CI won’t run anything useful for this PR.” Like with the tutorials and Docker updates we should try to have CI automatically figure that out itself, though the manual tag is still useful in the meantime. For breakages though I think there’ll always need to be a (well documented) shortcut to revert a commit on the scale of minutes rather than hours.

I like this proposal. Do we want to safeguard with a constraint that makes sure the CI is always green on main? For example, we allow “skip ci” to take into effect on PRs, but a PR needs to be run fully at least once before merge

My 2 cents is to really identify places where we would need to do this and explore using a structured filter to avoid running the CI (e.g. Jenkinsfile and docker install shell scripts).

By thinking about it for some more, I guess apart from shell scripts that affect docker image build and the Jenkinsfile, this feature will benefit reverting PRs with flaky tests that got “lucky” in the CI. In the absense of the proposed feature, the developers will need to disable such tests.

My question would be that : is there more usecases that this could help given that a) we have a structured filter that skips CI for Jenkinsfile and docker install script changes and b) while being conscious about the fact that we could always disable flaky tests while the revert is being done in parallel ?

not quite sure I follow here?

I think this is a great question. My primary concern with this mechanism is abuse, and you’re right in pointing out that expanding the docs-only PR filter to include e.g. Jenkinsfile would also address the problem this RFC is attempting to address.

One thing is that disabling flaky tests while doing a revert is going to take 2x CI runtime (once to find the flaky test, and once to disable it). Perhaps this use case would still benefit from some type of skip-CI. I wonder if it’s possible to identify such reverts by looking at the commit message and/or with a diff? (Perhaps not, since previously-green commits could land between when main was broken and when the revert is tested)

It does seem like we could also get by with skip-CI only for reverts and a Jenkinsfile filter to explicitly bypass the filter in the cases I mentioned in my previous post.

The main use of this at least for now should be solely to keep main green and avoid breakages that affect other developers running PRs, whereas right now we have no “escape hatch” that committers can use to quickly unbreak things (and CI on main should never be skipped so signal shouldn’t really be lost). Selectively running things is generally nice to have but also is what caused the breakage linked in the proposal (i.e. the breaking PR was green but subsequent ones were not), so a manually applied, sparingly-used skip would help. Stepping back a bit it’d be nice to codify some of the process around breakages, such as:

  • Who is responsible for monitoring for breakages
  • What to do about flaky tests / intermittent infra failures
  • Who is responsible for submitting a fix
  • When is a fix acceptable vs. a revert
  • How to actually do these things

This shouldn’t block any of the other fixes discussed though, things like disabling flaky tests and choosing what to run are generally good for developer experience in the long run

1 Like

Just putting some more notes here, but I think if we can just document the rules we’ve been discussing, and amend the documentation so that committers know the rules, I would suggest we give this idea a trial period (also given @driazati’s PR implements almost all suggestions so far), so that we can revisit in a few months time, as a community.

It needs to be very explicit (guaranteed by the underlying systems/CI) that CI was skipped for a given patch. We can’t rely on committers going and checking the Jenkins job manually to verify that not a full round of checks was done on a regular day-to-day PR.

On this, I think an automated tag on Github that makes it very explicit is a good first step. Perhaps also notifying on some Discord channel, would be another idea.

I agree with @areusch and also worry that the mechanism will be abused, not by bad intentions but because people will be sure that their one-liner PR to fix a typo in a documentation is safe and so much quick to go without CI, and it will bite is in future.

So we need to be strict on who can actually submit this sort of PR in the first place, and clearly delimit the cases in which is OK to use such practice: I like the ideas in [RFC][CI] Add a `[skip ci]` tag to shortcut builds and tests - #3 by areusch.

To the original questions:

I think at least linting is mandatory and cheap enough to be always run. It is a very easy mistake to make in an emergency PR.

I feel this should be limited to committers, but have no strong opinion.

Given the normal case for a skip ci PR is an emergency fix, I think local verification done by committer, locally, is a good start.

I would like also to propose that we give this mechanism a trial period, and agree on revisiting it’s efficacy in a few months time.

Summarizing the discussion a bit here:

  • There is consensus that such a bypass mechanism could be useful
  • There is widespread concern of abuse. Due to this concern, it’s been suggested to also improve our CI filter to skip parts of the CI for certain changes.
  • There is consensus that committers should enforce that some verification be done prior to submitting the PR.

Given the above, my suggestion would be we move forward as follows:

  1. We move forward with a trial period of [skip ci] but require it to be present in the PR title so that it’s clear from git log.
  2. We limit [skip ci] to quickly reverting PRs which caused breakages.
  3. We implement the CI filter suggestions from @manupa-arm to avoid introducing [skip ci] into our regular development flow. This way, if we decide to abandon the mechanism after a period of time, we do so independent of regression in our normal development workflow.

How does this sound to everyone?

1 Like

On the implementation side a bot (likely this new one: https://github.com/tvm-bot) with triage-level permissions could ensure that [skip ci] is present in the PR title if any of the commits in the PR have skipped CI (if a commit comes later on where CI isn’t skipped, the submitter can manually remove the tag from the PR and the bot can remove the label), similar to other Apache repos with triaging bots on PRs. Optionally it could add a label as well (which has the benefit that submitters without committer status can’t edit labels directly)

It seems like it would be simpler for us to rely on both the PR title and the commit being built starting with [skip ci] (with a clear indication of why CI was skipped or not skipped in the CI logs). This would obviate the need for a bot to be writing PR titles or adding labels. We still may need the bot to authenticate though in lieu of installing a GitHub App to the repo.

This is now landed in https://github.com/apache/tvm/commit/586944e5be7ddda744c859dfe55b8a31c9a3eb3d!

1 Like

Do we encourage using skip ci for other benign changes like typo-only fix?

I don’t think so at this point (but that may change in the future). We’re planning to add more filtering for CI so doc-specific changes don’t trigger full CI so people don’t have to rely (or overuse) on skip CI as a sledgehammer

@masahi correct–i think it would be good to avoid going too far down the path of carving out exceptions, as we detailed above.

however, we have made changes to allow for quick fixes of typos in docs. it would be great if we could similarly detect comment-only PRs and run a shorter CI for them.