[pre-RFC] TVM Explorer Infrastructure

chunit · September 16, 2022, 5:58am

Yes, we did. We added a few lines to our internally maintained Netron to handle the node names of NodeProto for frontends like PyTorch and ONNX.

Yet the core functionality, the interaction between the source model and the compiler, is implemented in our in-house web application. Here is a GIF for your reference.

[GIF: interaction]

FrozenGene · September 16, 2022, 6:30am

What I could say is just AMAZING! Really appreciate your job! @chunit

chunit · September 16, 2022, 7:40am

What I could say is just AMAZING! Really appreciate your job! @chunit

We are really glad to see you like it!
Zack, Hao-Wei and I have all worked hard to make this tool helpful for TVM development.

Would you mind giving us some comments on how to make it better? Or is there anything unclear? We can try to explain it more precisely.

Thanks again!

FrozenGene · September 16, 2022, 8:47am

I haven’t played with your app yet, but I have some thoughts about it. Since you already have the interactive mapping, I think we could dump the intermediate output of every layer, i.e. when we hover over a specific layer in Netron, we could dump the output of that layer (computed from the model input up to it). We could also show the layer’s execution time on specific hardware (CPU, GPU or whatever), and show the TIR or generated source code such as CUDA, LLVM IR or OpenCL, the way Compiler Explorer shows the compiler result (assembly).

For this pre-RFC, I am happy to spend more time reading it and may be able to give more comments.

fPecc · September 16, 2022, 12:53pm

Hi @chunit ,

Nice work! I noticed you talked about propagating the name of the layer in the frontend to the relay op. Would it be possible to also propagate the name of the parameters of the layer, so that we can trace it back from the relay?

Example:

A convolutional layer’s weights can have their layout altered or be merged with other constants. It would be nice to be able to trace back from the Relay operator to the original weight name in the frontend layer.

areusch · September 16, 2022, 8:53pm

Apologies for the delay in reply–I just needed to find some time to sit down and read the RFC all the way through. This is great work @chunit @haowhsu-quic and Zack ! I’m supportive of moving this forwards.

In An enabling framework for int8 quantization, we discussed how to effectively track frontend layers throughout the compiler. It seems like you guys have taken the same approach we discussed there: leveraging the graph edges (e.g. tensors) as the “stable” part of the graph and labelling the Relay ops between them as belonging to the same frontend layer (e.g. RecursivelyFillSpan). Right now, this needs to be done per-pass, but I wonder if we could get away with doing this once at the end of compilation if we also attach references to the frontend layer (or post-import variable) to each Relay Var.

It seems like by annotating Var we might be able to add this information.

One issue is that once we move outside of Relay (e.g. in AOT flow), it’s harder to fill span information back up through the compiler since the layer variables have changed. I’m curious if you guys tried to apply this to any TIR-based fusion?

Lastly, any idea how much additional memory this takes or performance impact?

zack-ch · September 17, 2022, 1:31am

Nice to see it ^^. Wondering if it is possible to have a website or GitHub repo to play around with?

chunit · September 20, 2022, 3:23am

Thank you very much for the suggestions for TVM Explorer! Here is some work we have done on the functionality you mentioned.

we could dump output of this layer (from input to it)…

TVM Explorer does have a feature, called Executor, that communicates with a device via RPC and obtains the inference result of a targeted Relay expression. It is not connected to Netron yet, but it is a good idea to think about how to connect them.

show tir or generated source code like CUDA/LLVM IR/OpenCL…

We aim to support this mapping in the GIF you saw, too. Currently the corresponding TIR and LLVM IR results can also be obtained through the Executor above. However, we are still working on Span (source information) propagation across pass transformations, because Span propagation has to be implemented pass by pass. So far we have done the span propagation for the necessary Relay passes in the build flow, based on our infrastructure. The TIR part is still in progress. Here is a table showing which passes have span filling so far, for your reference.

Relay pass (done)	TIR pass (done)	TIR pass (not yet done)
AlterOpLayout	LowerInitBlock	BF16Legalize
AutoSchedulerLayoutRewrite	LowerIntrin	CombineContextCall
CanonicalizeCast	MakePackedAPI	CompactBufferAllocation
CanonicalizeOps	MakeUnpackedAPI	ConvertBlocksToOpaque
CombineParallelBatchMatmul	NarrowDataType	FlattenBuffer
CombineParallelConv2D	PlanAndUpdateBufferAllocationLocation	HoistIfThenElse
CombineParallelDense	RemoveNoOp	InferFragment
DefuseOps	RewriteUnsafeSelect	InjectDoubleBuffer
DynamicToStatic	SplitHostDevice	InjectPrefetch
EliminateCommonSubexpr		InjectVirtualThread
EtaExpand		InstrumentBoundCheckers
FastMath		LoopPartition
FoldConstant		LowerCustomDatatypes
FoldScaleAxis		LowerDeviceStorageAccessInfo
FuseOps		LowerMatchBuffer
InferType		LowerTVMBuiltin
Inline		LowerThreadAllreduce
SplitArgs		LowerWarpMemory
LabelOps		MergeDynamicSharedMemoryAllocations
Legalize		Simplify
RemoveUnusedFunctions		StorageFlatten
SimplifyExpr		StorageRewrite
SimplifyInference		TextureFlatten
ToBasicBlockNormalForm		ThreadSync
relay::qnn::transform::Legalize		UnifyThreadBinding
		UnrollLoop
		VectorizeLoop
		VerifyMemory

For this pre-RFC, I am happy to spend more time reading it and may be able to give more comments.

Take your time please, we will wait for it!

chunit · September 21, 2022, 1:04am

Hi @fPecc,

Would it be possible to also propagate the name of the parameters of the layer, so that we can trace it back from the relay?

Should be YES.

Here is an example from a TFLite model. As you can see, the names of the weights in conv2d ops and the values of bias_add ops are attached to their inputs. (Note that we use “bind_params_by_name” in this example.)

Although it is possible, it requires more investigation in each frontend and a small modification to the printer for the Var type. We need more time to confirm it.

chunit · September 20, 2022, 3:16am

No worries. Thank you very much for helping us! If you don’t mind, I would like to share more material with you and ask some questions about the Var idea you mentioned.

Right now, this needs to be done per-pass

Yes, we attach spans pass by pass, based on “sequentialSpan” and “SIBuilder”. It is a time-consuming task. So far we have covered the following passes, all of which are invoked during the build flow. We will try to complete the remaining passes.

Relay pass (done)	TIR pass (done)	TIR pass (not yet done)
AlterOpLayout	LowerInitBlock	BF16Legalize
AutoSchedulerLayoutRewrite	LowerIntrin	CombineContextCall
CanonicalizeCast	MakePackedAPI	CompactBufferAllocation
CanonicalizeOps	MakeUnpackedAPI	ConvertBlocksToOpaque
CombineParallelBatchMatmul	NarrowDataType	FlattenBuffer
CombineParallelConv2D	PlanAndUpdateBufferAllocationLocation	HoistIfThenElse
CombineParallelDense	RemoveNoOp	InferFragment
DefuseOps	RewriteUnsafeSelect	InjectDoubleBuffer
DynamicToStatic	SplitHostDevice	InjectPrefetch
EliminateCommonSubexpr		InjectVirtualThread
EtaExpand		InstrumentBoundCheckers
FastMath		LoopPartition
FoldConstant		LowerCustomDatatypes
FoldScaleAxis		LowerDeviceStorageAccessInfo
FuseOps		LowerMatchBuffer
InferType		LowerTVMBuiltin
Inline		LowerThreadAllreduce
SplitArgs		LowerWarpMemory
LabelOps		MergeDynamicSharedMemoryAllocations
Legalize		Simplify
RemoveUnusedFunctions		StorageFlatten
SimplifyExpr		StorageRewrite
SimplifyInference		TextureFlatten
ToBasicBlockNormalForm		ThreadSync
relay::qnn::transform::Legalize		UnifyThreadBinding
		UnrollLoop
		VectorizeLoop
		VerifyMemory

I wonder if we could get away with doing this once at the end of compilation if we also attach references to the frontend layer (or post-import variable) to each Relay Var.

If it could be done at the end of compilation, that would be quite convenient! Sorry, I am not quite following this. Could you explain it again, perhaps with an example of:

What attaching references to the frontend layer looks like?
What should be attached to a Relay Var?

It seems like by annotating Var we might be able to add this information.

I would like some more explanation of this part. Besides Vars and params, this problem also appears in one-to-many conversions. Let me take the Pack op from TF as an example again. Currently we fill the layer name into the converted IR like this:

def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
    %0 = shape_of(%input, dtype="int32") /* Shape */;
    %1 = strided_slice(%0, …) /* strided_slice */;
    %2 = squeeze(%1) /* strided_slice */;
    # the Pack Op conversion start from here
    %3 = expand_dims(%2, axis=0) /* stack */;
    %4 = expand_dims(3, axis=0) /* stack */;
    %5 = expand_dims(3, axis=0) /* stack */;
    %6 = (%3, %4, %5) /* stack */;
    %7 = concatenate(%6) /* stack */;
}

And here is the result from the former patch:

def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
    %0 = shape_of(%input, dtype="int32") /* Shape */;
    %1 = strided_slice(%0, begin=[0], end=[1], strides=[1], axes=None) /* strided_slice_PART_0 */;
    %2 = squeeze(%1) /* strided_slice */;
    %3 = expand_dims(%2, axis=0) /* stack_PART_0 */;
    %4 = expand_dims(3, axis=0) /* stack_PART_1 */;
    %5 = expand_dims(3, axis=0) /* stack_PART_2 */;
    %6 = (%3, %4, %5) /* stack_PART_3 */;
    %7 = concatenate(%6) /* stack */;
}

In the former patch we could identify the computation output of the Pack op immediately, because we did not add a suffix to it. We have now removed the suffixes because we noticed that the “_PART_” suffix is annoying and misleading after pass transformations.

The drawback of the current version is that we cannot tell which expression is the computation output, because they all look the same. Perhaps we can do something like the following example, but we are still looking for a better solution.

def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
    %0 = shape_of(%input, dtype="int32") /* Shape */;
    %1 = strided_slice(%0, …) /* strided_slice */;
    %2 = squeeze(%1) /* strided_slice */;
    # the Pack Op conversion start from here
    %3 = expand_dims(%2, axis=0) /* stack */;
    %4 = expand_dims(3, axis=0) /* stack */;
    %5 = expand_dims(3, axis=0) /* stack */;
    %6 = (%3, %4, %5) /* stack */;
    %7 = concatenate(%6) /* stack_OUTPUT */;
}
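
The labelling idea can be sketched in a few lines of Python. This is for illustration only: `label_converted_ops` is a hypothetical helper (it is not part of our patch or of TVM), and it assumes that the last expression produced by a one-to-many conversion is the computation output.

```python
from typing import List, Tuple


def label_converted_ops(layer_name: str, ops: List[str]) -> List[Tuple[str, str]]:
    """Pair every converted op with the frontend layer name.

    Assumption: the last op in `ops` is the computation output, so only
    that one receives the "_OUTPUT" suffix.
    """
    labels = [layer_name] * len(ops)
    if ops:
        labels[-1] = f"{layer_name}_OUTPUT"
    return list(zip(ops, labels))


# The Pack (stack) conversion from the example above.
ops = ["expand_dims", "expand_dims", "expand_dims", "tuple", "concatenate"]
labelled = label_converted_ops("stack", ops)
for op, label in labelled:
    print(f"{op} /* {label} */")
```

Only one expression per frontend layer carries a suffix in this scheme, so the other labels stay identical to the layer name.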

it’s harder to fill span information back up through the compiler since the layer variables have changed. I’m curious if you guys tried to apply this to any TIR-based fusion?

We are still working on the TIR passes, as shown in the list above. Besides, we haven’t done the propagation from Relay to TE or TIR yet, which is another tough part. Things are not too complicated in Relay, but they become harder when we go down to lower-level IRs like TE and TIR. Currently we still rely on the layer name, yet we think that using row and column numbers could be more robust and more indicative.

If we have a precise definition of the line number information of an IRModule, we could at least have a better mapping relationship before and after “a pass”.

Lastly, any idea how much additional memory this takes or performance impact?

Yes. Taking mobilenet_v1_2018_08_02 as an example, here is the profiling result:

RunTime performance

function	Without span filling	With span filling	with span filling & schedule_record
relay.frontend.from_tflite()	133174.0 us	176468.0 us(↑32.51%)	177774.0 us(↑33.49%)
relay.build()	7480367.0 us	7558526.0 us(↑1.045%)	7580165.0 us(↑1.334%)

Memory usage

function	Without span filling	With span filling	with span filling & schedule_record
relay.frontend.from_tflite()	26.105 MiB	26.203 MiB(↑0.375%)	26.211 MiB(↑0.406%)
relay.build()	147.762 MiB	148.148 MiB(↑0.261%)	148.418 MiB(↑0.443%)

We also provide options to disable span filling and schedule recording if users don’t need them.
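
The percentages in the tables are the relative increase over the run without span filling. As a quick check, a small Python sketch (the helper name `overhead_pct` is made up for this post) recomputes the runtime figures from the raw numbers:

```python
def overhead_pct(baseline: float, measured: float) -> float:
    """Relative increase of `measured` over `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


# Runtime in microseconds, copied from the table above:
# (without span filling, with span filling, with span filling & schedule_record)
runtime_us = {
    "relay.frontend.from_tflite()": (133174.0, 176468.0, 177774.0),
    "relay.build()": (7480367.0, 7558526.0, 7580165.0),
}

for func, (base, with_span, with_span_and_record) in runtime_us.items():
    print(
        f"{func}: span filling +{overhead_pct(base, with_span):.3f}%, "
        f"span filling & schedule_record +{overhead_pct(base, with_span_and_record):.3f}%"
    )
```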

chunit · September 20, 2022, 3:17am

Hey Zack!

We are asking our legal team for support. You know, it takes time, haha. I will post an update once I hear back.

mgeek · September 20, 2022, 7:29am

Wow, such excellent work! I have always wanted an interactive debugging feature like this when playing around with TVM. You guys made it come true! Looking forward to the release.

areusch · September 21, 2022, 10:22pm

Cool, thanks for the explanations!

The Var thing I’m discussing here is not exactly a simple tweak to this proposal–it’s probably significant enough lift that it would deserve its own RFC. So just to clarify–I’m not necessarily asking you to change your approach. However, I did want to raise this question to a) build more support for the idea, b) see if it is potentially easier to pursue than adding SIBuilder support to the remaining passes, and c) think through whether it’d be easier to maintain in the long run.

The basic idea is like so: consider your one-to-many example conversion. A common challenge we face in TVM is determining which Relay Expr correspond to one another before and after a pass. To choose a concrete example, suppose we introduce a pass which outlines part of a function (suppose it outlines Pack from your previous example). Before executing the pass, suppose we start from your example:

chunit:

def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
    %0 = shape_of(%input, dtype="int32") /* Shape */;
    %1 = strided_slice(%0, …) /* strided_slice */;
    %2 = squeeze(%1) /* strided_slice */;
    # the Pack Op conversion start from here
    %3 = expand_dims(%2, axis=0) /* stack */;
    %4 = expand_dims(3, axis=0) /* stack */;
    %5 = expand_dims(3, axis=0) /* stack */;
    %6 = (%3, %4, %5) /* stack */;
    %7 = concatenate(%6) /* stack */;
}

Now suppose we run the outliner, and arrive at:

def @outlined_pack(%i1) {
  %0 = expand_dims(%i1, axis=0) /* stack */;
  %1 = expand_dims(3, axis=0) /* stack */;
  %2 = expand_dims(3, axis=0) /* stack */;
  %3 = (%0, %1, %2) /* stack */;
  %4 = concatenate(%3) /* stack */;
  %4
}

def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
    %0 = shape_of(%input, dtype="int32") /* Shape */;
    %1 = strided_slice(%0, …) /* strided_slice */;
    %2 = squeeze(%1) /* strided_slice */;
    # the Pack Op conversion start from here
    %3 = @outlined_pack(%2);
    %3
}

Now the question here is: after running the pass, does a new Relay var exist which contains %7? The answer is yes: it’s %3. In order to make this outline, an e.g. ExprMutator needed to capture the subgraph that contains %3 through %7, then replace it with a call to the new function and store the result in %3. This pass knows that %3 == %7, and (similarly to how Span information is filled here) when defining %3, could include some type of backreference to %7. This could even just be included as a Map:

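using VarMap = Map<Var, Var>;  // keys are originally-imported Var, values are the equivalent now inside f.
// "var_map" is a hypothetical attribute name used only for illustration.
Function f = Downcast<Function>(mod->Lookup("main"));
Optional<VarMap> var_map = f->GetAttr<VarMap>("var_map");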

This approach could be taken all the way back to the original import (e.g. or there could be an additional map from input framework layer to Relay var).

SIBuilder takes as input a set of Expr which bound the subgraph. Since most Relay programs are transformed in A-Normal form, the VarMap could substitute for these Expr. This won’t work for all optimizations, but I think for a decently large class of them, we could automatically apply SIBuilder by walking VarMap and applying Spans to the subgraphs with endpoints in VarMap. The advantage of this technique is that it could also be done with TIR with the same approach.

I think you’d need to assert that the Relay or TIR graph could be partitioned along VarMap for this to work, so I’m not saying it would work for all transforms. But I do think it would work for many. It’s also worth noting that this is a best-effort tracking scheme: it’s possible through e.g. operator fusion that some Vars could simply be eliminated. In these cases, the VarMap may not contain every Var from the original model.

chunit:

RunTime performance

function	Without span filling	With span filling	with span filling & schedule_record
relay.frontend.from_tflite()	133174.0 us	176468.0 us(↑32.51%)	177774.0 us(↑33.49%)
relay.build()	7480367.0 us	7558526.0 us(↑1.045%)	7580165.0 us(↑1.334%)

Memory usage

function	Without span filling	With span filling	with span filling & schedule_record
relay.frontend.from_tflite()	26.105 MiB	26.203 MiB(↑0.375%)	26.211 MiB(↑0.406%)
relay.build()	147.762 MiB	148.148 MiB(↑0.261%)	148.418 MiB(↑0.443%)

We also provide options to disable span filling and schedule recording if users don’t need them.

Thanks for providing this data! It seems reasonable as part of running with a debug option at least!

chunit · September 23, 2022, 6:33am

Thank you for this detailed explanation! We digested the content and tried to apply this concept to an existing pass. There are still many implementation details we have not figured out, but the following is how we imagine the Var mechanism would work. Please correct us if we have misunderstood anything.

Goal

Implement a pass that constructs a graph. The graph is a tracing map that records the transformation before and after a pass.

What the map should look like

Personally, I would prefer the keys to be the new equivalent Vars now inside f, and the values to be the original Vars. It should be more convenient for tracing back to the source. So it would look like:

Map<Var,Var>
// Keys are the equivalent now inside f
// Values are originally-imported Var.

After a sequence of pass transformations we have a final IRModule. If we select a certain expression in the final IRModule[“main”], we can trace it back to the source. If we used the originally imported Var as the key instead, we would perhaps have to iterate through the whole map to find the resulting Var after the transformations.
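
A toy Python sketch of why this key direction is convenient. It does not use TVM: Vars are plain strings, each pass is assumed to leave behind one dict from the Var after the pass to the Var before it, and `trace_back` is a made-up helper name.

```python
from typing import Dict, List, Optional


def trace_back(var: str, pass_maps: List[Dict[str, str]]) -> Optional[str]:
    """Follow `var` from the final module back to the imported one.

    pass_maps[i] maps a Var after pass i to the Var it came from before
    pass i. Returns None when the chain is broken (best-effort tracking).
    """
    current = var
    for mapping in reversed(pass_maps):  # walk from the last pass to the first
        if current not in mapping:
            return None
        current = mapping[current]
    return current


# Hypothetical maps for two passes; the names are only unique within one stage.
after_pass_1 = {"p1_%4": "orig_%0"}
after_pass_2 = {"p2_%2": "p1_%4"}

source = trace_back("p2_%2", [after_pass_1, after_pass_2])
print(source)
```

Each step is a single dictionary lookup. With the original Var as the key, every step would instead scan the values of the map.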

How to invoke

Considering the function GetPassPrefix in “src/relay/backend/utils.cc”, we insert an OutLiner pass between passes:

//...
pass_seqs.push_back(transform::SimplifyInference());
pass_seqs.push_back(OutLiner);
pass_seqs.push_back(transform::EliminateCommonSubexpr(fskip));
pass_seqs.push_back(OutLiner);
pass_seqs.push_back(transform::SimplifyExpr());
pass_seqs.push_back(OutLiner);
//...

Process looks like

Take the Relay pass SimplifyInference as an example: it unpacks certain Calls, such as the batch norm op. The following image shows part of the result after the SimplifyInference transformation in our Explorer.

It takes the batch_norm call and its TupleGetItem as source exprs and unpacks them into a set of basic operations.

Now the following is the process once we introduce the OutLiner pass:

Back to the IR pretty print, we would start from IR[“main”] here:

def main(...) {
  %0 = nn.conv2d(%input, %model.conv1.weight,...) /* si=torch._convolution_3 */;
  %1 = nn.batch_norm(%0,...) /* si=torch.batch_norm_8 */;
  %2 = %1.0 /* si=torch.batch_norm_8 */;
}

After SimplifyInference, IR[“main”] becomes:

def main(...) {
  %0 = add(%model.bn1.running_var, 1e-05f) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %1 = sqrt(%0) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %2 = divide(1f , %1) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %3 = multiply(%2, %model.bn1.weight) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %4 = nn.conv2d(%input, %model.conv1.weight,...) /* si=torch._convolution_3 */;
  %5 = expand_dims(%3, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %6 = negative(%model.bn1.running_mean) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %7 = multiply(%6, %3) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %8 = add(%7, %model.bn1.bias) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %9 = multiply(%4, %5) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %10 = expand_dims(%8, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %11 = add(%9, %10) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
}

Now it is time to invoke OutLiner. It generates another global function, outlined_bn_0.

def main(...) {
  %0 = nn.conv2d(%input, %model.conv1.weight,...) /* si=torch._convolution_3 */;
  %1 = @outlined_bn_0(%0,...)
}

def outlined_bn_0(%i1...) {
  %0 = add(%model.bn1.running_var, 1e-05f) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %1 = sqrt(%0) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %2 = divide(1f , %1) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %3 = multiply(%2, %model.bn1.weight) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %4 = expand_dims(%3, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %5 = negative(%model.bn1.running_mean) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %6 = multiply(%5, %3) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %7 = add(%6, %model.bn1.bias) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %8 = multiply(%i1, %4) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %9 = expand_dims(%7, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %10 = add(%8, %9) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
}

#Perhaps we would need the original main as reference
def main_before_SimplifyInference_0(){
  #...
}

At the same time, we maintain the tracing map like this (keys and values should be Vars, yet I am not sure how to express them in Var form).

# key: transformed result
# values: original things
map = {
    hash(outlined_bn_0): {%1-batch_norm, %2-%1.0}
}

Using the graph constructed from the tracing map, we should be able to trace an IR back to its original form. Perhaps the OutLiner could be implemented based on StructuralEqual, but we haven’t come up with a good idea for this yet. Still, if the OutLiner is implementable, it will be really convenient.

Questions

Here are some questions we have about this strategy:

Which IRModule would be used once the OutLiner is invoked? It should be IR1 rather than IR2, right?

IR1

def main(...) {
  %0 = nn.conv2d(%input, %model.conv1.weight,...) /* si=torch._convolution_3 */;
  %1 = @outlined_bn_0(%0,...)
}

def outlined_bn_0(%i1...) {
  %0 = add(%model.bn1.running_var, 1e-05f) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %1 = sqrt(%0) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %2 = divide(1f , %1) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %3 = multiply(%2, %model.bn1.weight) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %4 = expand_dims(%3, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %5 = negative(%model.bn1.running_mean) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %6 = multiply(%5, %3) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %7 = add(%6, %model.bn1.bias) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %8 = multiply(%i1, %4) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %9 = expand_dims(%7, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %10 = add(%8, %9) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
}

IR2

def main(...) {
  %0 = add(%model.bn1.running_var, 1e-05f) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %1 = sqrt(%0) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %2 = divide(1f , %1) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %3 = multiply(%2, %model.bn1.weight) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %4 = nn.conv2d(%input, %model.conv1.weight,...) /* si=torch._convolution_3 */;
  %5 = expand_dims(%3, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %6 = negative(%model.bn1.running_mean) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %7 = multiply(%6, %3) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %8 = add(%7, %model.bn1.bias) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %9 = multiply(%4, %5) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %10 = expand_dims(%8, axis=1, num_newaxis=2) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
  %11 = add(%9, %10) /* si=[ torch.batch_norm_8, torch.batch_norm_8 ] */;
}

If we choose IR1 and continue with the transformations of the remaining passes, it might end up in a nested form, and readability would become terrible. Perhaps an unpack pass for outlined_fn is required too, right?
Still on the nested form: if we use a nested form like IR1, many pattern-matching routines may need to be rewritten, because they now need to check the outlined_fn in the graph. The complexity of implementing a pass might increase.

Thank you for reading such a long post. It feels great that we can work out a better way to maintain the source information.

areusch · September 23, 2022, 9:06pm

Sure thing–I think you broadly understand my proposal. Let me clarify some things:

It could be a pass or it could be some other way (e.g. modify Expr constructor). The tracing map is the goal, though.

That seems reasonable, so long as the model is always in A-Normal Form. If it isn’t then we may need Map<Expr,Expr> here. I think this was stated earlier, just reiterating.

This could also be handled by PassManager, but yeah that’s the right idea, if we took a pass-based approach here. I’ll sketch some ideas I have below.

chunit:

At the same time, we maintain the tracing map like this (keys and values should be Vars, yet I am not sure how to express them in Var form).
# key: transformed result
# values: original things
map = {
    hash(outlined_bn_0): {%1-batch_norm, %2-%1.0}
}

This is pretty close to my suggestion, but let me tweak it slightly. The goal here would be to map a Var in the final Relay or TIR representation to a Var that represents it in the original program (assume the original program is expressed in A-Normal Form, and suppose we allow for trivial TupleGetItem Expr in this map, so %0.2 is a valid value). After running this pass, the map here would then look like:

# key: transformed Var
# values: Expr representing the original value
# keys not present where no mapping exists
map = {
    %input: %input,
    %model.conv1.weight: %model.conv1.weight,
    ...  # same for the rest of the inputs (not as trivial if the keys were instead TIR Var)
    %4: %0,  # I think I understood this transform properly, I think the reordering is due to A-Normal Form conversion after the rewrite, but that in the final program, %4 doesn't depend on %0, %1, %2, %3
    %1: %2  # or %1.0, if that was the only such representation of this.
}

Given this map, the Expr that could be used with SIBuilder then are just the keys of the map.

I think you could then implement a fairly simple algorithm to apply SIBuilder:

Invert the variable map (swap keys and values).
Step through the original program, and for each Relay Expr:
1. Identify the inputs and outputs (this is akin to building a connectivity graph in the final program, but we sort of get it for free from the original)
2. Lookup those values in the resultant program using the Map
3. Create SIBuilder with span equal to the Relay Expr. Run RecursivelyFillSpan(outputs, inputs).
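
To make the steps above concrete, here is a toy Python sketch. It does not use TVM or the real SIBuilder: Vars are strings, `fill_spans` is a hypothetical stand-in for creating an SIBuilder and calling RecursivelyFillSpan(outputs, inputs), and the data is hand-copied from the SimplifyInference example earlier in the thread (with batch_norm and its TupleGetItem treated as one original expr, and %11 assumed to be the counterpart of the original %2).

```python
def fill_spans(original, final_defs, var_map):
    """Assign a span to every Var of the final program (best effort).

    original:   list of (span, input_vars, output_var), one per original expr
    final_defs: final Var -> Vars it is computed from
    var_map:    final Var -> original Var it is equivalent to
    """
    inverted = {orig: new for new, orig in var_map.items()}  # swap keys and values
    spans = {}
    for span, inputs, output in original:
        if output not in inverted:
            continue  # no counterpart survived the transforms; skip
        stop = {inverted[i] for i in inputs if i in inverted}
        stack = [inverted[output]]
        while stack:  # walk from the output back towards the inputs
            var = stack.pop()
            if var in stop or var in spans or var not in final_defs:
                continue
            spans[var] = span
            stack.extend(final_defs[var])
    return spans


original = [
    ("torch._convolution_3", ["%input", "%model.conv1.weight"], "%0"),
    ("torch.batch_norm_8", ["%0"], "%2"),
]
final_defs = {
    "%0": ["%model.bn1.running_var"], "%1": ["%0"], "%2": ["%1"],
    "%3": ["%2", "%model.bn1.weight"], "%4": ["%input", "%model.conv1.weight"],
    "%5": ["%3"], "%6": ["%model.bn1.running_mean"], "%7": ["%6", "%3"],
    "%8": ["%7", "%model.bn1.bias"], "%9": ["%4", "%5"], "%10": ["%8"],
    "%11": ["%9", "%10"],
}
var_map = {
    "%input": "%input",
    "%model.conv1.weight": "%model.conv1.weight",
    "%4": "%0",
    "%11": "%2",
}
spans = fill_spans(original, final_defs, var_map)
```

In this toy run only the final %4 receives the convolution span, and every other defined Var receives the batch_norm span. A Var that was already labelled is not overwritten, so the order of the original exprs matters in this sketch.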

I haven’t thought about this enough, but I think this could run into some limitations maybe around loops and control flow, particularly if we apply the same approach to TIR. I’d need to think about it a bit further.

Building the map

As for how to build this map, here are some thoughts:

Modify Expr() constructor to take another arg Expr orig_expr. Modify all passes to pass orig_expr.
Change ExprMutator and kin to accept such a Map (or get it out of an IRModule attr). When Mutate_ returns a Node different than the one passed-in, modify the map.
Attempt to derive this from an analysis pass as you mentioned.
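
A rough Python sketch of the second option, on a toy expression type rather than the real Relay classes. All names (`Var`, `Call`, `TrackingMutator`, `SquareToMultiply`) are made up for illustration; the only point is that the base mutator records a mapping whenever a rewrite returns a node different from the one passed in.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Call:
    op: str
    args: Tuple[object, ...] = ()


class TrackingMutator:
    """Bottom-up rewriter that remembers which node each new node came from."""

    def __init__(self):
        self.origin = {}  # new node -> original node

    def rewrite(self, call):
        return call  # default: no change; subclasses override this

    def visit(self, expr):
        if not isinstance(expr, Call):
            return expr
        rebuilt = Call(expr.op, tuple(self.visit(a) for a in expr.args))
        new = self.rewrite(rebuilt)
        if new != expr:  # the node changed, so record where it came from
            self.origin[new] = expr
        return new


class SquareToMultiply(TrackingMutator):
    def rewrite(self, call):
        if call.op == "square":
            (x,) = call.args
            return Call("multiply", (x, x))
        return call


x = Var("x")
before = Call("negative", (Call("square", (x,)),))
mutator = SquareToMultiply()
after = mutator.visit(before)
```

The individual pass (`SquareToMultiply` here) does not touch the map at all, which is the appeal of this option. Passes that build new nodes outside the mutator hooks would still need manual updates.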

I think #1 or #2 may not cover all cases here, and some passes may also need to be updated. The reason I’m raising this here is it seems like equivalent work to track relationships between Vars, and if it was possible to get away with using that work to label Spans, we might be able to do this once. Finally, I’m thinking about how to apply SIBuilder to LowerTE, which is what generates TIR for Relay, and how to preserve that information when doing MetaSchedule-style transforms in a TensorIR world. It seems a bit more straightforward to propagate this Var relation rather than the Span info. Var tracking can also be useful in AOT for:

Identifying which TIR Vars represent which Relay Expr (e.g. implementing GraphExecutorDebug)
Profiling layers run in TIR, using those Vars as a hint for where a layer’s compute starts and stops.

Anyway, here I am curious to hear your thoughts on whether you think we could leverage this for Span annotations. The work here is helpful for the project either way, so I think we could also merge this now and, if we can improve the maintainability via Var tracking, we could make that improvement as a follow-on.

cc’ing some other folks who have been thinking about this at octo: @anwang @AndrewZhaoLuo @mbaret @mehrdadh

mikeseven · September 25, 2022, 4:20pm

This is awesome and super helpful work. Can’t wait to use it.

chunit · September 30, 2022, 9:04am

Hi @areusch

Sorry for the late reply. I now grasp the whole concept of the Relay Var proposal much better. Thank you for your patience! We have some initial thoughts about it. But just like you said, it deserves its own RFC if we want to introduce this tracing map. I will put the discussion of it at the end of this post.

Before that, may I ask whether it would be fine to prepare our PRs for this RFC if they look good to you? We can split the PRs into three independent parts:

Frontend span filling
Schedule recorder
Pass span filling*

Currently most of the discussion is about the pass span part. We can continue that discussion, and in the meantime, if frontend span filling and the schedule recorder look good to you, we will prepare their PRs and submit them soon. On the other hand, if pass span filling is a good enough midterm solution, we can also submit its PR later. Finally, based on our conclusion, we can create a new RFC about the Relay Var tracing map. Does this plan look good to you?

About the Var tracing map, I think it is a good mechanism, because we can always find where an IR expression comes from. Based on this idea, we tried to identify the obstacles we need to overcome. To me it is a really challenging topic. I totally agree with making a new pre-RFC for it. Haha

Data structure and what function to be called

Tracing map should be in a <var, Array<var>> form.

To serve n-to-n (n ≥ 1) conversions, we need an array to preserve their relations.

IRModule includes the historical maps and functions during transformation

Therefore it might look like:
[Image: Var_Map]

SIBuilder might not be necessary in this scenario

Since we can get the expression mapping by traversing the tracing map, we can assign the span to an expr directly, without having to find the inputs and outputs of a transformed expression.

Obstacles we might encounter

We might need to construct a new data structure according to the index of var.

I haven’t fully read the Doc Printer. But if there is an example that looks like this:
```
@fn () {
    %0 = ...
}
@main () {
    %0 = ...
}
```
Then we need to make our map be able to recognize which %0 we are talking about.
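
One possible shape for such a map, sketched in Python under the assumption that entries are identified by printed names (the `VarRef` type and `record` helper are hypothetical, not existing TVM structures): qualify each printed name with its enclosing function, and keep a list of originals per entry to cover n-to-n conversions.

```python
from collections import defaultdict
from typing import DefaultDict, List, NamedTuple


class VarRef(NamedTuple):
    func: str  # enclosing global function, e.g. "main"
    name: str  # printed name inside that function, e.g. "%0"


# transformed Var -> list of original Vars (one-to-many is allowed)
tracing_map: DefaultDict[VarRef, List[VarRef]] = defaultdict(list)


def record(new: VarRef, originals: List[VarRef]) -> None:
    """Remember that `new` was produced from `originals`."""
    tracing_map[new].extend(originals)


# "%0" in @fn and "%0" in @main are now different keys.
record(VarRef("fn", "%0"), [VarRef("main", "%1"), VarRef("main", "%2")])
record(VarRef("main", "%0"), [VarRef("main", "%0")])
```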

Annotating the original expr on the transformed expr is time consuming

This seems to be the most doable way, but it is almost the same as what we are doing for span filling. It would not be automatic enough, but at least it might be easier to achieve.

Modifying mutate_ of Mutator and Rewriter would involve a large number of changes

Almost all passes inherit from Mutator or Rewriter, so we would need to check them carefully.

Difficulty of writing an analysis pass

So far I have not figured out a workable method. The analysis becomes hard for passes with multiple sources or results.

Performance impact to be aware of

Once we have a sequence of maps and the original Relay functions, we need to traverse the maps for each expr in the end. The time complexity would be O(N*M), where N is the number of exprs and M is the number of maps.
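
One way this cost might be reduced, sketched in Python on plain dicts (an idea to evaluate, not something we have implemented): fold the M per-pass maps into a single final-to-source map once, so that each expr needs one lookup afterwards. `compose` is a made-up helper, and each per-pass map is assumed to go from a Var after the pass to the list of Vars before it.

```python
from typing import Dict, List


def compose(pass_maps: List[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Fold per-pass maps (new Var -> old Vars) into final Var -> source Vars.

    pass_maps[0] belongs to the first pass. A Var without an entry in some
    pass map is dropped, so untouched Vars need identity entries.
    """
    composed: Dict[str, List[str]] = {}
    for index, mapping in enumerate(pass_maps):
        if index == 0:
            composed = {new: list(olds) for new, olds in mapping.items()}
            continue
        step: Dict[str, List[str]] = {}
        for new, olds in mapping.items():
            sources: List[str] = []
            for old in olds:
                for src in composed.get(old, []):
                    if src not in sources:  # keep order, avoid duplicates
                        sources.append(src)
            step[new] = sources
        composed = step
    return composed


# Two hypothetical passes: a1..a3 are source Vars, b* and c* are later stages.
pass_maps = [
    {"b1": ["a1", "a2"], "b2": ["a3"]},
    {"c1": ["b1", "b2"], "c2": ["b2"]},
]
final_to_source = compose(pass_maps)
print(final_to_source)
```

The composition itself still visits every entry of every map once, so this only helps when many exprs are looked up afterwards.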

That’s all we can come up with for now. As a long-term solution, we think a tracing map would be a necessary mechanism, yet it should be planned carefully in case we run into too much trouble. Currently the pass span filling can provide a rough mapping after transformation. Perhaps we can keep using this feature for now and try to complete the tracing map for a better result.

Thank you again for reading this. We will stay in touch!

areusch · October 3, 2022, 3:51pm

Here are some replies to the first part of your post. I’ll get back to the rest of it in a few days.

Go for it!

Yeah that sounds great to me. Apologies for derailing this around the Var tracing proposal.

chunit · October 5, 2022, 12:32am

No problem

We will start with the frontend span filling. Based on the comments, spans for parameters will be added. Once that is finished, we will submit a PR for each frontend, one by one. Thank you!

zhaoyang-star · October 20, 2022, 4:06am

Great job! I am interested in the feature. When will it be available to us?