Incremental recompilation of models

One great feature of many compiler toolchains is incremental recompilation.

For example, imagine a C++ project with 40 source files that I compile, and then I change a few lines in one of the files. A clever compiler toolchain will only recompile the parts of the program touched by that change.

This is essential in projects with high compilation times.

Now, I have been exploring the auto-scheduler in TVM, and I am interested in the performance impact of different schedules. The auto-scheduler evaluates schedules on standalone workloads (sub-graphs, distinct from the whole model).
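
For context, the log file is produced by tuning each extracted task (workload) separately. A minimal sketch of that flow, assuming mod, params, target, and log_file carry the same names as in the build snippet below, and with an illustrative trial budget:

from tvm import auto_scheduler

# Extract the tunable tasks (the "workloads" A, B, C, ...) from the Relay module.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# Tune each task; every measured schedule is appended to log_file as a JSON record.
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # illustrative budget
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)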

However, I am interested in the performance of these schedules in the full tensor program.

If I have a logfile with schedules for three workloads (A1, B1, C1), I can compile with:

# Assumes mod, params, target, target_host, and log_file are already defined.
import tvm
from tvm import relay, auto_scheduler

# Use the best tuned schedule from the log file for each workload.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(
            mod, target=target, target_host=target_host, params=params
        )

However, let’s say I have an alternative schedule for B, and thus can generate a second log file (A1, B2, C1). Right now, as I understand it, to evaluate this schedule I would need to recompile the whole model from scratch, even though only one part of the model has changed.

Thus, I am interested in how incremental compilation could be achieved in TVM: recompile only workload B, leaving A and C untouched.

If we know the parts of the graph we want to recompile, how complicated would it be to do this? What parts of TVM would need to be changed and extended? Any challenges or shortcuts you can foresee?

I would like it to be as conceptually simple as compiling a standalone module for the part that changed, then running something like lib.changed_part = new_subgraph_lib

@wheest could you explain the motivation a bit more? Do you find the actual compilation time of relay.build to be excessively long, or, e.g., have you already deployed compiled code for ops A and C and want to reduce the size of the updated binary?

My motivation is reduction of the compilation time, rather than saving space.

For the work I’m looking at, the time for relay.build starts to dominate once I evaluate a sufficiently large number of variants, especially for more complicated tensor programs.

Could you split your Relay function into two or three parts and then invoke relay.build on each part separately?
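
For illustration, here is a minimal sketch of what that manual splitting might look like (the network, shapes, and split point are made up): each half is compiled on its own, and the two executors are stitched together by feeding one's output into the other.

import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def build_part(func):
    # Compile one sub-network into its own library.
    with tvm.transform.PassContext(opt_level=3):
        return relay.build(tvm.IRModule.from_expr(func), target="llvm")

# Part 1: a small convolution block.
data = relay.var("data", shape=(1, 3, 32, 32), dtype="float32")
w1 = relay.var("w1", shape=(16, 3, 3, 3), dtype="float32")
part1 = relay.Function([data, w1], relay.nn.relu(relay.nn.conv2d(data, w1, padding=(1, 1))))

# Part 2: consumes part 1's output.
mid = relay.var("mid", shape=(1, 16, 32, 32), dtype="float32")
part2 = relay.Function([mid], relay.nn.global_avg_pool2d(mid))

dev = tvm.cpu()
m1 = graph_executor.GraphModule(build_part(part1)["default"](dev))
m2 = graph_executor.GraphModule(build_part(part2)["default"](dev))

# Run part 1, then hand its output to part 2 (the manual stitching).
m1.set_input("data", np.random.rand(1, 3, 32, 32).astype("float32"))
m1.set_input("w1", np.random.rand(16, 3, 3, 3).astype("float32"))
m1.run()
m2.set_input("mid", m1.get_output(0))
m2.run()
print(m2.get_output(0).shape)  # (1, 16, 1, 1)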

I can see your pain point, and I think this is a nice-to-have feature. Manually splitting a module would become unmanageable when we want to run the whole model (we would need to stitch the separately compiled modules back together).

I believe we already have some form of caching at the Relay function level. Maybe we can offer an option that makes that cache persist across different invocations of relay.build()?

cc @jroesch

If the only “incremental” part that is needed is the auto-scheduler/AutoTVM tuning, you can just concatenate the log files together and have the build consider all of the log entries at once. Finer-grained incremental compilation is a complex topic that requires designing around it as a first-class constraint.
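
To illustrate the concatenation approach (a sketch; the file names are illustrative, and mod, target, target_host, and params are the same as in the original build snippet): auto-scheduler logs are line-based JSON records, so the variant logs can simply be appended into one combined file, and ApplyHistoryBest then keeps the best measured record per workload.

import tvm
from tvm import relay, auto_scheduler

# Append the variant logs into one combined log file.
combined_log = "combined.json"
with open(combined_log, "w") as out:
    for log in ["schedules_A1_B1_C1.json", "schedules_A1_B2_C1.json"]:
        with open(log) as f:
            out.write(f.read())

# Build once against the combined history.
with auto_scheduler.ApplyHistoryBest(combined_log):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, target_host=target_host, params=params)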

We could introduce caching in some places, but it’s very likely the caching would be incorrect, leading to issues later. For example, the Rust compiler took about three years to implement correct, end-to-end incremental compilation.

If you want to run a search process inside the compiler, you can just save the IRModule at a certain step and replay the saved modules against each other; this is how we plan on doing it as we land AutoTIR and TensorIR.
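
For instance, an IRModule snapshot can be serialized at one step and reloaded later to replay or compare against another variant. A minimal sketch using tvm.ir.save_json / tvm.ir.load_json on a toy module:

import tvm
from tvm import relay

x = relay.var("x", shape=(1, 8), dtype="float32")
toy_mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.relu(x)))

# Save a snapshot of the IRModule at this point in the pipeline.
with open("snapshot.json", "w") as f:
    f.write(tvm.ir.save_json(toy_mod))

# Later: reload the snapshot and compare/replay it against another variant.
with open("snapshot.json") as f:
    replayed = tvm.ir.load_json(f.read())
assert tvm.ir.structural_equal(toy_mod, replayed)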