In the past we used sphinx-gallery to generate the notebooks. One trade-off to balance is that a lot of the notebooks are integrations that take a long time to run. Ideally we should move some of them to a separate pipeline that triggers independently, or run them as a nightly job, with the errors covered by some debuggable unit test cases.
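
For illustration only, here is a rough sketch of how that split could look in a `conf.py`, assuming the slow integration examples share a filename prefix. The `integration_` prefix, the `DOCS_RUN_SLOW` environment variable, the directory names, and the `gallery_conf` / `would_execute` helpers are all hypothetical; `examples_dirs`, `gallery_dirs`, and `filename_pattern` are sphinx-gallery config keys, and files that do not match `filename_pattern` are still rendered but not executed. The patterns assume `/` as the path separator.

```python
import os
import re


def gallery_conf(run_slow: bool) -> dict:
    """Build a sphinx_gallery_conf dict (hypothetical helper).

    When run_slow is False, files whose names start with the assumed
    "integration_" prefix are excluded from execution.
    """
    conf = {
        "examples_dirs": ["examples"],      # assumed source directory
        "gallery_dirs": ["auto_examples"],  # assumed output directory
    }
    if run_slow:
        # Nightly / separate pipeline: execute every example file.
        conf["filename_pattern"] = r"\.py$"
    else:
        # Regular build: execute only files NOT prefixed with "integration_".
        conf["filename_pattern"] = r"/(?!integration_)[^/]*\.py$"
    return conf


def would_execute(path: str, conf: dict) -> bool:
    """Local check of which paths a pattern selects (illustrative only)."""
    return re.search(conf["filename_pattern"], path) is not None


# DOCS_RUN_SLOW is an assumed variable that the nightly job would set to "1".
sphinx_gallery_conf = gallery_conf(os.environ.get("DOCS_RUN_SLOW") == "1")
```

The regular docs build would then leave `DOCS_RUN_SLOW` unset, and the nightly or independently triggered pipeline would set it to `1`, so the same `conf.py` serves both.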