Two missing pieces for training

Thank you for the information! I’m really looking forward to the BERT and DLRM model :smiley:

Note that optimizers often have internal state (such as velocity for momentum SGD, other gradient statistics, etc.), so we should view an optimizer as an update function of (weights, grads, state) -> (weights', state') . … Regardless, by defining the optimizer functionally this way, we could also compile the update step externally and easily avoid inlining.

I love the idea of representing the optimizer as a Relay function and letting the user choose whether to inline it! I'm totally fine with this as long as the user can separate the main computation from the gradient update~
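To make the functional view concrete, here is a minimal sketch in plain Python/NumPy (not the actual Relay API): the optimizer step is a pure function `(weights, grads, state) -> (weights', state')`, so it can be compiled as a separate function rather than inlined into the main graph. The function name and state layout are my own illustration, not from the RFC.

```python
import numpy as np

def momentum_sgd_update(weights, grads, state, lr=0.1, momentum=0.9):
    """Pure momentum-SGD step: returns updated weights and updated state,
    without mutating any inputs."""
    new_velocity = momentum * state["velocity"] + grads  # internal optimizer state
    new_weights = weights - lr * new_velocity            # apply the update
    return new_weights, {"velocity": new_velocity}

# Usage: the caller threads the state through, exactly like the
# (weights, grads, state) -> (weights', state') signature above.
w = np.array([1.0, 2.0])
g = np.array([0.5, 0.5])
s = {"velocity": np.zeros_like(w)}
w, s = momentum_sgd_update(w, g, s)
```

Because the update is pure, whether it is inlined into the training step or kept as a separate compiled function is just a compilation choice, not a semantic one.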

Regarding training vs inference mode, note that TVM already has a distinction in the compilation workflow via the SimplifyInference pass: when compiling training graphs that include dropout and batch norm, disabling SimplifyInference keeps these operators.

If I understand this correctly, it means we will not have separate training and inference versions of ops like dropout and batch norm. Instead, every dropout in the Relay graph will be interpreted as the training version, and an additional PRNG key argument will be added to the main function. In that case, if we want to run evaluation during training (after each epoch, for example), do we need to transform the module with SimplifyInference every time? And what about ops that behave differently in training and inference but cannot be eliminated by SimplifyInference?
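To illustrate the distinction being asked about, here is a hedged sketch in plain NumPy (not TVM code): training-mode dropout needs an explicit PRNG key and genuinely changes the computation, while the inference version is simply the identity, which is what a pass like SimplifyInference rewrites it to. The function names here are illustrative only.

```python
import numpy as np

def dropout_train(x, key, rate=0.5):
    """Training mode: randomly zero units and rescale the rest.
    The randomness is driven by an explicit PRNG key argument."""
    rng = np.random.default_rng(key)
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def dropout_infer(x, rate=0.5):
    """Inference mode: dropout is the identity, so no PRNG key is needed
    and the op can be eliminated from the graph entirely."""
    return x

x = np.ones(4)
y_train = dropout_train(x, key=0)  # stochastic: depends on the key
y_infer = dropout_infer(x)         # deterministic: always equals x
```

This is why evaluation requires switching the graph itself (e.g. rerunning SimplifyInference on the training module): the training op has a different signature, not just different weights.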

BTW, is there any plan to lower only part of the graph from frameworks to TVM, like the nice blog post Bridging PyTorch and TVM did with BERT? In practice, we may encounter people using all kinds of rare ops in their training, even custom ops. It would be great if we could optimize just part of the original training graph (for example, the common backbone or the loss) and leave the other ops to the original framework. The main barrier would be how to separate the forward and backward graphs, as mentioned in the blog post. I wonder if you have any thoughts on this? Thank you~