Block GitHub merge if the CI Windows/macOS build fails?

Currently, TVM CI does not block merging code on GitHub if CI fails to build TVM on Windows or macOS. Because of this, we had an incident where we merged a PR without noticing a failure on Windows.

I’m proposing to change TVM CI to block merging by default if the Windows/macOS build fails, as it already does for the rest of the builds on various architectures.

I would like to capture the community’s thoughts on this and have a discussion here.

Thanks!

cc @areusch @driazati @leandron @Mousius @tqchen

I think this seems pretty reasonable to me, but I’d also like to get the community’s thoughts. We don’t currently run many tests on Windows/Mac, so we may also need to consider whether we want to run those as well. Another thing is that it could be a bit challenging to reproduce those errors without more detailed instructions, since most of us develop on Linux via the Docker containers. It does seem like those containers are built with Packer, so we may be able to leverage that.

I think that is a good idea, as it will make the criteria more robust and also keep the codebase from bit-rotting on non-Linux platforms. Thanks for bringing this up @mehrdadh.

Happy with what the community decides. One thing to note is that we might want to watch more actively for GH Actions dependency updates. For example, as GH Actions drops support for certain versions of MSVC or Xcode, we will need to be able to upgrade, or turn the block off, although such changes are relatively infrequent.

I think they cause soft outages, so we would notice this pretty quickly, before we’re perma-busted.

cc @Hzfengsy @kparzysz @comaniac

If we’re running CI builds they should carry some weight, so I think we should do this. There are a couple of ongoing follow-ups we’ll have to do regarding flaky CI runs though (e.g. this one):

  • Committers will have to be on top of these for PRs they review and re-run specific flaky failures using the button on GitHub as necessary (to avoid making the PR author re-push and trigger Jenkins again)
  • The OSS team will need to monitor / note flaky failures and add fixes where possible (e.g. for the failure linked above we could have a backoff + retry for the checkout step)
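
As a rough illustration of the backoff + retry idea mentioned above, here is a minimal Python sketch. It is not taken from TVM's CI scripts: the name `retry_with_backoff` and its parameters are hypothetical, and `step` stands in for whatever flaky operation (such as a checkout) gets wrapped.

```python
import random
import time


def retry_with_backoff(step, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call step() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration only. The wait before retry k
    (0-based) is base_delay * 2**k seconds, capped at max_delay, plus up
    to 10% random jitter so parallel jobs do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return step()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the real failure
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
```

A checkout step could then be wrapped as `retry_with_backoff(do_checkout)`, where `do_checkout` is any callable that raises on failure. Capping the attempts matters here: a failure that is not flaky should still fail the build instead of retrying forever.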

Hmm, I could swear that the Windows error was blocking subsequent PRs. In any case, I agree that we shouldn’t knowingly let these builds fail.

Thanks everyone for the feedback!

I like @areusch’s suggestion of using Packer to build a Windows container. I think we could do that as the first step and document how to reproduce Windows errors. This will help community members understand the flow. Following this, we could change CI to block merging on Windows failures.