An enabling framework for int8 quantization

We discussed this at the TVM Community Meeting this past Wednesday. Here are the notes:

  • @mbaret: you mentioned some errors when trying to collect quantization information at the Relay level. what were those errors?

    • Nikhil: quantization in TVM isn’t applied to batch_norm directly; instead, batch_norm is legalized into multiply and add ops, and quantization happens on those
  • @mbaret: does batch_norm have a topi impl? can we run it directly rather than legalizing it to multiply/add?

    • running batch_norm directly would remove the need to track the intermediate tensors and unblock this quantization effort.
    • Nikhil: note we’d need to verify that all relevant operators had TOPI impls before this would really be a solution.
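For context on why the range stats land on multiply/add rather than on batch_norm itself: legalizing an inference-mode batch_norm folds it into a per-channel multiply and add. A minimal scalar sketch of that folding (plain Python with illustrative names, not TVM APIs):

```python
import math

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    # Reference inference-mode batch_norm on a single channel value.
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def legalize_batch_norm(gamma, beta, mean, var, eps=1e-5):
    # Fold batch_norm into an equivalent multiply/add pair:
    #   bn(x) == scale * x + shift
    # After this legalization, only the multiply and add remain in the
    # graph, so that's where quantization stats get collected.
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift

scale, shift = legalize_batch_norm(gamma=1.5, beta=0.2, mean=0.4, var=2.0)
for x in (-1.0, 0.0, 0.7, 3.2):
    assert abs(batch_norm(x, 1.5, 0.2, 0.4, 2.0) - (scale * x + shift)) < 1e-9
```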
  • Mikael: the presented conv2d/batch_norm/relu is a typical fusion pattern, and the intermediate tensors inside a fused block can usually be ignored. is this why it’s hard to collect stats on each of these intermediate tensors?

    • This quantization approach is meant to collect as much range information as it can by looking at all the available ops; you can ignore intermediate ops as needed. it aims to be a generic way to collect quantization stats irrespective of the quantization target.
  • @areusch: wondering if it’s possible to consider tracking the Relay variables rather than the operator nodes?

  • @ulfhanebutte: seems to be a similar proposal

  • @areusch: looking at the Relax proposal and other graph-level transforms, you might start introducing more complex optimizations that significantly mutate the structure of the graph. after those passes the original structure may be gone, but you might still have a tensor or partial tensor that contains the data. we’re looking into this a bit at OctoML for debugging.

  • @ulfhanebutte: note that this is bigger than debugging; we just need a way to reference back to something from the originally-imported model in a way we all agree on.

  • @dchickles: is this the main shortcoming of the proposal? trying to understand whether this is a problem we should solve in this proposal or leave to future work.

    • @areusch: not sure if it’s a shortcoming, but wondering how profileID might be kept after scheduling. need to think a little bit more about this. IR changes tend to have a pretty big blast radius; everyone who’s writing a pass needs to think about questions like, “should we forward this ID, or erase this ID? does our transformed IR truly represent the original layer?” there are some cases where it’d be straightforward to transform a layer, but others, e.g. without a feedforward graph, where it could be more challenging.
  • @dchickles: what are the next steps here?

    • @areusch : it’d be good to think over the implications of needing to implement this in various passes. narrowing the scope of the RFC to just tracking values would probably help and allow us to recruit other use cases to validate the approach.
    • @mbaret: the most obvious blocking issue here is that we should develop a clearer picture of the errors that happened previously and motivate the need for this approach. also, could we consider adding annotations, an AnnotateOp, or some sort of profile op between the original layers? those dummy ops would retain their position in the graph, and we could then instrument off their locations.
    • @areusch: similar–it seems like it’s easier to keep the annotation on the edge or dummy op.
    • Senad: similar to begin/end compiler annotation?
    • @mbaret: yes
  • Senad: that seems equivalent, so what are the next steps? look at the original issue?

  • @ulfhanebutte: this turns each edge into a node-edge combination, so you can use it to follow what’s happening on the edges; you’d have something like an EdgeObserver.

    • @mbaret: yeah, if the node’s job is just to forward its inputs. could have different types of nodes, e.g. for debugging, printing tensors, etc. agree this adds complexity, but all of these approaches have trade-offs.

    • Mikael: re: quantization: when you read an ONNX graph, simulated-quantize or fake-quantize nodes are there, and when we read the graph they’ll be fused together. it would be helpful to track that fusion; I don’t think we have that pass in TVM now. the tracking could be even more important here because this fusion is often a source of error. if you have a fake-quantize graph, you’re obliged to fuse the nodes together, and tracking that would be useful. is that part of this RFC?

    • @areusch: agree this would be useful. thinking about the tracking: how can we add it in such a way that it doesn’t interfere with the typical transformation/optimization pass? should it be carried as part of an op, inserted into the graph, or added to the variables we represent in the graph? whatever way we do it, it should be simple to carry forward.

      also regarding the transformations that are made to the graph: is it possible to reconstruct what you want if we track things at the edge level? or do we need more in the IR? in general we should aim to minimize the amount of state in the base IR node. from a complexity-reduction pov, we should be more explicit when possible, balancing that against adding too much state to the graph.
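To make the EdgeObserver idea above concrete: the proposal is to insert identity "dummy" nodes that forward their inputs untouched while recording range stats off to the side, so graph semantics are unchanged and passes can treat the node like any other op. A toy sketch (plain Python, no TVM; all names are hypothetical):

```python
class Observer:
    """Identity 'dummy op' placed on an edge: forwards its input
    unchanged while recording the value range seen at that edge."""

    def __init__(self, name):
        self.name = name
        self.low = float("inf")
        self.high = float("-inf")

    def __call__(self, values):
        # Record range stats as a side effect...
        self.low = min(self.low, min(values))
        self.high = max(self.high, max(values))
        # ...then forward the tensor untouched.
        return values

# Hypothetical two-op pipeline with an observer on the intermediate edge.
def conv2d(xs):   # stand-in producer
    return [2 * x for x in xs]

def relu(xs):     # stand-in consumer
    return [max(0, x) for x in xs]

obs = Observer("conv2d_out")
out = relu(obs(conv2d([-3, -1, 0, 2])))
assert out == [0, 0, 0, 4]           # same result as without the observer
assert (obs.low, obs.high) == (-6, 4)  # range of the intermediate tensor
```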

  • Mikael: could have the PassManager track transformations with a generic tracking mechanism.

    • @areusch: need to think about this more.
    • @ulfhanebutte: one pro of putting it in the IR though is that it’s easy to print without the need to reference PassManager. should think about ease-of-use.
    • Senad: main point is to create a base framework that’s generic and decoupled, and then a separate use-case specific framework.
  • @mbaret: want to bring up device annotation as a case study, where we tried to add state to the base IR node before but ran into trouble. we originally tried to add device annotations to each op, but ultimately backed that out and went to an explicit on_device op, which was more naturally handled by passes.

  • @areusch: for next steps, it’d be great if Matthew or anyone else wants to look at the batch_norm problem; that could unblock you. but I do think adding a tracking mechanism would be very helpful. if you’re interested in moving forward, let’s open a more targeted thread on the forum. we can have another thread if we want to discuss ways to track graph transformations as well.

  • @ulfhanebutte: since TVM has grown dramatically, we’re likely running into these needs more frequently. are there other groups working on this in TVM?

  • @areusch: we don’t have a specific working group; establishing one wouldn’t be the worst idea. create a forum subcategory or mailing list? Relax is also intended to be a new graph-level IR, so there has been some focus on dev tooling there. could be good to get those folks involved.

Also, not from the meeting but an addition: consider consulting the UMA folks as well (cc @MJKlaiber).