An enabling framework for int8 quantization

ulfhanebutte · August 3, 2022, 11:24pm

Dear TVM community,

We at Marvell would like to continue the discussion on quantization within TVM. As a starting point, we take Masahi’s response to the question by JosseVanDelm (Dec 2021): status on quantization in TVM https://discuss.tvm.apache.org/t/status-on-quantization-in-tvm/11668

We agree that using models pre-quantized by DL frameworks is the preferred way to quantize a network. In particular, we embrace the QDQ ONNX representation of quantized networks.

However, there is a need for wider quantization support within a TVM flow, beyond what is currently supported.

To enable HW vendors within their BYOC TVM flow, we propose, and are committed to delivering to the TVM community, a tensor range profiling functionality that generates the needed profile information for the intermediate tensors at each layer of the initial IR. The resulting profile information can then be used during network quantization within a BYOC flow. This is an extension of the relay quantization work https://github.com/dmlc/tvm/blob/master/python/tvm/relay/quantize. Noting that the current annotation by TVM quantization covers specific layers only, as seen in /dmlc/tvm/blob/master/python/tvm/relay/quantize/_annotate.py, we propose to instrument all layers and generate a profile data file that can be parsed and selectively used at a later stage.

Our proposed profiling stage sits near the beginning of the TVM flow and instruments the initial IR. The profiling results are stored in a JSON file that contains, for each layer, the min and max values of each tensor. Currently we limit profiling to the tensor ranges of an fp32 precision model run within TVM; however, it can be extended to other parameters, such as histograms and scales.

As the profiling is done on the initial IR, and TVM transforms this initial IR into many consecutive IRs, a linkage between IR representations is required. We propose to introduce a profileID that is added to each layer of the initial IR and propagated through each IR transformation. The profileID is identical to the customID that was proposed as tvm_custom_dict in our pre-RFC ( /apache/tvm/pull/9730). We propose this profileID because we did not identify any tracking mechanism in TVM that could be used to create this explicit linkage. If such a mechanism exists and we missed it, please let us know and we will adopt it.
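As a rough illustration of the min/max profile described above, here is a minimal sketch in plain Python. It does not use TVM; the toy layer list, the `profile_id` field name, and the JSON layout are all assumptions made for this example, not the format of the proposed RFC.

```python
import json

# Hypothetical stand-ins for fp32 kernels operating on flat lists of floats.
def scale_by_two(xs):
    return [2.0 * x for x in xs]

def relu(xs):
    return [max(0.0, x) for x in xs]

# Hypothetical toy "initial IR": an ordered list of layers, each tagged
# with an ID so its statistics can be looked up later.
LAYERS = [
    {"profile_id": 0, "name": "scale", "fn": scale_by_two},
    {"profile_id": 1, "name": "relu", "fn": relu},
]

def profile_ranges(layers, calibration_inputs):
    """Run each calibration sample through all layers and keep the
    running min/max of every layer's output tensor."""
    stats = {}
    for sample in calibration_inputs:
        tensor = sample
        for layer in layers:
            tensor = layer["fn"](tensor)
            entry = stats.setdefault(
                str(layer["profile_id"]),
                {"name": layer["name"], "min": float("inf"), "max": float("-inf")},
            )
            entry["min"] = min(entry["min"], min(tensor))
            entry["max"] = max(entry["max"], max(tensor))
    return stats

calibration = [[-1.0, 0.5, 3.0], [-4.0, 2.0, 1.0]]
profile = profile_ranges(LAYERS, calibration)
profile_json = json.dumps(profile, indent=2, sort_keys=True)
print(profile_json)
```

A BYOC backend could then parse such a file and pick only the layers it cares about, which is the selective use the post describes.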

We believe that separating profile generation from the actual quantization and code generation stage, and providing general profiling support, will serve the wider community well and will allow HW-specific methods in the BYOC flow, including selecting which layers to quantize and how.

We look forward to discussing this further with the TVM community and to providing an RFC for the extended profiling functionality.

Best regards,

Ulf Hanebutte, Senior Principal Architect, Machine Learning, Marvell

masahi · August 4, 2022, 7:40am

Sounds great. Although a full-blown quantization framework within TVM would likely be an extremely challenging goal, I agree that it is a good idea to offer limited functionality that could still be useful for specific needs like BYOC.

I couldn’t get what “a linkage between IR representations” or “profileID” are about, but I’m looking forward to learning about them when the RFC becomes concrete.

leandron · August 5, 2022, 2:39pm

Hi. Thanks for the post. Perhaps this is a good topic for the Community Meetup? (cc @areusch)

You mentioned int8 quantization specifically in the title of the post. Is there a limiting issue that would prevent this work from also covering other int data types such as int16?

areusch · August 6, 2022, 12:16am

@ulfhanebutte Thanks for the post and happy to see further interest in quantization!

I’d be happy to have this discussed at the Community Meeting if a high-bandwidth channel is useful here. If you’d like to do this, feel free to add it to our agenda Google Doc, which you can find linked from the meeting announcement thread (see the Meetup category for those threads). This thread is a sufficient thing to link to where we can take notes.

I’ll take another look at this thread in more detail on Monday as well.

nber8341 · August 6, 2022, 12:46am

It would be great to talk about this RFC in a live meeting (the best schedule for us would be in about 2 weeks). In the meantime, here are a few clarifications regarding your questions.

@masahi Regarding your question about profileID: in this RFC, we propose adding a profileID to track which node of the original IR each node of the final IR originated from, after multiple IR transformations. This is useful for profiling, since we need to generate the profile with respect to the original IR graph. Some target architectures require profiling information on the original graph’s layer boundaries. The attached image illustrates this with a simple example in which the following transformations are applied: _transform.SimplifyInference(), _transform.FoldConstant(), and _transform.FoldScaleAxis(). The node_id and profileID are information that will be generated from the BYOC side. The profileIDs are unique to the original IR graph nodes and are essential for maintaining the linkage between nodes after multiple IR transformations.
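The linkage idea can be sketched without TVM. The snippet below is a toy model under stated assumptions: `Node`, `assign_profile_ids`, and `legalize_batch_norm` are hypothetical names invented for this example, and the rewrite is only a stand-in for a simplifying pass that turns batch_norm into multiply and add. It is not how Relay passes are implemented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str
    profile_id: int  # hypothetical tag assigned once, on the initial IR
    inputs: List["Node"] = field(default_factory=list)

def assign_profile_ids(ops):
    """Build a linear chain of nodes and give each original layer a unique ID."""
    chain, prev = [], None
    for pid, op in enumerate(ops):
        prev = Node(op, pid, [prev] if prev is not None else [])
        chain.append(prev)
    return chain

def legalize_batch_norm(chain):
    """Toy stand-in for a simplifying pass: batch_norm becomes multiply + add.
    Every new node inherits the profile ID of the layer it was derived from."""
    out, prev = [], None
    for node in chain:
        new_ops = ["multiply", "add"] if node.op == "batch_norm" else [node.op]
        for op in new_ops:
            prev = Node(op, node.profile_id, [prev] if prev is not None else [])
            out.append(prev)
    return out

initial = assign_profile_ids(["conv2d", "batch_norm", "relu"])
final = legalize_batch_norm(initial)
for node_id, node in enumerate(final):
    print(node_id, node.op, "profile_id =", node.profile_id)
```

In this sketch the node_id changes as the graph is rewritten, while the profile ID still points back to the original layer, so the output of the last node sharing an ID marks the original layer boundary.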

@leandron There should be no limiting issues for quantization schemes such as int16 (from your example), since the RFC only covers profiling, which is done in the fp32 domain and can be used for any kind of quantization/scaling in the custom BYOC.
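To make the bit-width independence concrete, here is a small sketch that derives quantization parameters from one profiled fp32 range for both int8 and int16. It assumes a standard signed affine scheme; `affine_quant_params` is a hypothetical helper written for this example, and the range values are made up.

```python
def affine_quant_params(rmin, rmax, num_bits=8):
    """Derive (scale, zero_point) for signed affine quantization
    from a profiled fp32 range [rmin, rmax]."""
    qmin = -(2 ** (num_bits - 1))
    qmax = 2 ** (num_bits - 1) - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    if rmax == rmin:
        return 1.0, 0  # degenerate all-zero tensor
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

profiled_range = (-8.0, 6.0)  # made-up min/max from an fp32 profiling run
for bits in (8, 16):
    scale, zp = affine_quant_params(*profiled_range, num_bits=bits)
    print(f"int{bits}: scale={scale:.6g} zero_point={zp}")
```

The same profile feeds both bit widths; only the integer limits change, which is why the profiling step itself does not need to know the target data type.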

areusch · August 9, 2022, 3:54am

I’ve added this to the Community Meeting’s backlog for Aug 17. I’ll ping again on Aug 15 to confirm that date works for you.

Andrew

areusch · August 19, 2022, 11:34pm

We discussed this this past Wednesday at the TVM Community Meeting. Here are notes:

@mbaret: you mentioned some errors when trying to collect quantization information at the Relay level. what were those errors?
- Nikhil: quantization in TVM isn’t done in batch_norm, instead on multiply and add
@mbaret: does batch_norm have a topi impl? can we run it directly rather than legalizing it to multiply/add?
- this would help to resolve the need for tracking tensors and unblock this quantization effort without that.
- Nikhil: note we’d need to verify that all relevant operators had TOPI impls before this would really be a solution.
Mikael: the presented conv2d/batch_norm/relu is a typical fusion pattern; just noting usually the intermediate tensors inside a fused block can be ignored. is this why it’s hard to collect the stats on each of these intermediate tensors?
- This quantization approach is meant to collect as much range information as it can by looking at all the ops available; you can ignore those intermediate ops as needed. this approach seeks to be a generic approach to collect quantization stats irrespective of quantization target.
@areusch: wondering if it’s possible to consider tracking the Relay variables rather than the operator nodes?
@ulfhanebutte : seems to be similar proposal
@areusch: looking at Relax proposal and other graph-level transforms, you might start introducing more complex optimizations that significantly mutate the structure of the graph, and after those you might not have the same structure, but you might have a tensor or a partial tensor that contains the data. we’re looking into this a bit at OctoML for debugging.
@ulfhanebutte : note that this is bigger than debugging–we just need a way to reference back to something from the originally-imported model in a way we all agree on.
@dchickles : is this the main shortcoming of the proposal? trying to understand if this topic is a problem we should solve with this proposal or a future work.
- @areusch: not sure if it’s a shortcoming, but wondering how profileID might be kept after scheduling. need to think a little bit more about this. IR changes tend to have a pretty big blast radius; everyone who’s writing a pass needs to think about questions like, “should we forward this ID, or erase this ID? does our transformed IR truly represent the original layer?” there are some cases I could see where it’d be straightforward to transform a layer, but there are others, e.g. without a feedforward graph, where it could be more challenging.
@dchickles: what are the next steps here?
- @areusch : it’d be good to think over the implications of needing to implement this in various passes. narrowing the scope of the RFC to just tracking values would probably help and allow us to recruit other use cases to validate the approach.
- @mbaret: most obvious blocking issue here is we should develop a clearer picture of the errors that happened previously that motivate the need for this approach. also, could we consider adding annotations or AnnotateOp or some sort of profile op between the original layers. then, those dummy ops would retain their position in the graph and then we could instrument off their locations.
- @areusch: similar–it seems like it’s easier to keep the annotation on the edge or dummy op.
- Senad: similar to begin/end compiler annotation?
- @mbaret: yes
Senad: that seems equivalent, so what are the next steps? look at original issue?
@ulfhanebutte : this is making the edges into a node-edge combination, so you can use that to follow what’s happening on the edges, so you have like an EdgeObserver.
- @mbaret: yeah, if the node’s job is just to forward its inputs. can have different types of nodes e.g. for debugging, printing tensors, etc. agree this adds complexity, but all of these ways have
- Mikael: re: quantization: when you read the ONNX graph, simulated quantize or fake-quantize nodes are there. when we read the graph, they’ll be fused together. would be helpful to track that fusion, don’t think we have that pass in TVM now. could see the tracking to be even more important because [this fusion] is often the source of error. if you have a fake-quantize graph, you’re obliged to fuse the nodes together and that would be useful. is that part of this RFC?
- @areusch agree this would be useful. thinking about the tracking: how can we add it in such a way that it doesn’t interfere with the typical transformation/optimization pass? should it be carried as part of an op or inserted into the graph or added to the variables we represent in the graph? whatever way we do it, it should be simple to carry forward.
  
  also regarding the transformations that are made to the graph: is it possible to reconstruct what you want if we track things at the edge level? or do we need more in the IR? in general we should aim to minimize the amount of state in the base IR node. from a complexity-reduction pov, we should be more explicit when possible, balancing that against adding too much state to the graph.
Mikael: could have the PassManager track transformations with a generic tracking mechanism.
- @areusch: need to think about this more.
- @ulfhanebutte: one pro of putting it in the IR though is that it’s easy to print without the need to reference PassManager. should think about ease-of-use.
- Senad: main point is to create a base framework that’s generic and decoupled, and then a separate use-case specific framework.
@mbaret: want to bring up device annotation as a case study, where we tried to add state to the base IR node before, but ran into trouble. originally tried to add device annotations to each op, but ultimately backed that out and went to an explicit on_device op, which was more naturally handled by passes.
@areusch: for next steps, would be great if Matthew or anyone else wants to look at the batch_norm problem, could unblock you guys. but do think that adding a tracking mechanism would be very helpful. if you guys are interested in moving forwards, let’s open a more targeted thread on the forum. can have another thread as well if we want to discuss ways to track graph transformations as well.
@ulfhanebutte: since TVM has grown dramatically, we’re likely running into these needs more frequently. are there other groups working on this in TVM?
@areusch: don’t have a specific working group, establishing one of those wouldn’t be the worst idea. create forum subcategory or mailing list? Relax is also intended to be a new graph-level IR, so has been some focus on dev tooling there. could be good to get those folks involved.

also not in the meeting, but an addition: consider consulting the UMA folks as well (cc @MJKlaiber )
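The dummy-op / EdgeObserver idea from the notes above can be sketched as a pass-through stage placed on each edge. This is a toy model in plain Python, not a Relay op: `Observer`, `insert_observers`, and `run` are hypothetical names, and the three lambda layers are invented for the example.

```python
class Observer:
    """Hypothetical pass-through op: forwards its input unchanged and
    records the running min/max of whatever flows through it."""
    def __init__(self, tag):
        self.tag = tag
        self.min = float("inf")
        self.max = float("-inf")

    def __call__(self, xs):
        self.min = min(self.min, min(xs))
        self.max = max(self.max, max(xs))
        return xs

def insert_observers(layers):
    """Place one observer on the output edge of every original layer."""
    instrumented, observers = [], []
    for name, fn in layers:
        obs = Observer(tag=name)
        instrumented += [fn, obs]
        observers.append(obs)
    return instrumented, observers

def run(pipeline, xs):
    """Evaluate the stages in order, feeding each output to the next stage."""
    for stage in pipeline:
        xs = stage(xs)
    return xs

layers = [
    ("scale", lambda xs: [0.5 * x for x in xs]),
    ("shift", lambda xs: [x - 1.0 for x in xs]),
    ("relu", lambda xs: [max(0.0, x) for x in xs]),
]
pipeline, observers = insert_observers(layers)
result = run(pipeline, [-2.0, 4.0, 6.0])
ranges = {obs.tag: (obs.min, obs.max) for obs in observers}
print(ranges)
```

Because the observer only forwards its input, a pass that rewrites the layers between two observers can leave the observers in place, which is the property the notes attribute to explicit annotation ops such as on_device.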

alter-xp · August 26, 2022, 2:40am

We have encountered the same problem. Can we consider writing the data range information of each expression into the Relay IR after calibration? In addition to the problems mentioned above, there is another point: some backends need the quantization information for both the input and the output of each op. However, the existing QNN ops carry only input information and no output information, so we need a pattern to obtain the output quantization information (as in python/tvm/relay/op/contrib/cmsisnn.py). Once the data range information is written into Relay, perhaps we would not need QNN to express it and could use the NN ops directly.
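One way to picture this suggestion is a per-expression record that carries both input and output ranges, so a backend does not have to pattern-match to recover the output side. The sketch below is only an assumption about what such an annotation could look like: the graph encoding, `attach_ranges`, and the range values are hypothetical, and the op names are plain strings rather than Relay calls.

```python
def attach_ranges(graph, profile):
    """graph: list of (name, op, input_names); profile: name -> (min, max).
    Returns one record per op holding both its input and output ranges."""
    annotated = []
    for name, op, inputs in graph:
        annotated.append({
            "name": name,
            "op": op,
            "input_ranges": [profile[i] for i in inputs],
            "output_range": profile[name],
        })
    return annotated

# Hypothetical two-op graph and made-up calibrated ranges.
graph = [
    ("conv0", "nn.conv2d", ["data"]),
    ("relu0", "nn.relu", ["conv0"]),
]
profile = {"data": (-1.0, 1.0), "conv0": (-8.0, 6.0), "relu0": (0.0, 6.0)}

annotated = attach_ranges(graph, profile)
for record in annotated:
    print(record["name"], record["input_ranges"], "->", record["output_range"])
```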

areusch · August 26, 2022, 3:13pm

@alter-xp I like that idea. We ultimately decided that the need here was about tracking Relay-level tensors through the compilation pipeline; on top of this need, the framework mentioned here can be elegantly built.

I think an RFC for tensor tracking needs to get written up here and we can debate the proper solution. Then we can proceed with the quantization efforts.

@ulfhanebutte @dchickles I checked with the source I was thinking of, and that work is related but not similar enough to matter for an RFC like this. Are you guys interested in writing the aforementioned RFC/will you have cycles to do so? We also will likely need this feature somewhat soon, so may be able to take this on.