No worries.
Thank you very much for helping us!
If you don't mind, I would like to submit more materials and ask some questions about the Var idea you just mentioned.
> Right now, this needs to be done per-pass.
Yes, we attach spans per-pass based on `SequentialSpan` and `SIBuilder`. It is a time-consuming task. We have completed the following passes so far; all of them are invoked during the build flow, and we will work through the remaining ones.
| RelayPass | TIRPass | TIRPass (not yet done) |
|---|---|---|
| AlterOpLayout | LowerInitBlock | BF16Legalize |
| AutoSchedulerLayoutRewrite | LowerIntrin | CombineContextCall |
| CanonicalizeCast | MakePackedAPI | CompactBufferAllocation |
| CanonicalizeOps | MakeUnpackedAPI | ConvertBlocksToOpaque |
| CombineParallelBatchMatmul | NarrowDataType | FlattenBuffer |
| CombineParallelConv2D | PlanAndUpdateBufferAllocationLocation | HoistIfThenElse |
| CombineParallelDense | RemoveNoOp | InferFragment |
| DefuseOps | RewriteUnsafeSelect | InjectDoubleBuffer |
| DynamicToStatic | SplitHostDevice | InjectPrefetch |
| EliminateCommonSubexpr | InjectVirtualThread | |
| EtaExpand | InstrumentBoundCheckers | |
| FastMath | LoopPartition | |
| FoldConstant | LowerCustomDatatypes | |
| FoldScaleAxis | LowerDeviceStorageAccessInfo | |
| FuseOps | LowerMatchBuffer | |
| InferType | LowerTVMBuiltin | |
| Inline | LowerThreadAllreduce | |
| SplitArgs | LowerWarpMemory | |
| LabelOps | MergeDynamicSharedMemoryAllocations | |
| Legalize | Simplify | |
| RemoveUnusedFunctions | StorageFlatten | |
| SimplifyExpr | StorageRewrite | |
| SimplifyInference | TextureFlatten | |
| ToBasicBlockNormalForm | ThreadSync | |
| relay::qnn::transform::Legalize | UnifyThreadBinding | |
| | UnrollLoop | |
| | VectorizeLoop | |
| | VerifyMemory | |
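To make the per-pass work above more concrete: for each pass, the task is deciding which source spans every transformed expression inherits. The following is only a simplified plain-Python illustration of that bookkeeping; `Span`, `SequentialSpan`, and `merge_spans` here are stand-ins for the real TVM helpers, not their actual APIs.

```python
# Simplified illustration of per-pass span propagation: when a pass fuses
# or rewrites several expressions into one, the new expression keeps the
# ordered, concatenated spans of all of its sources.
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    source: str   # e.g. the frontend layer name
    line: int

@dataclass(frozen=True)
class SequentialSpan:
    spans: tuple  # ordered spans of every expression merged by the pass

def merge_spans(*exprs):
    """Collect the spans of all input expressions into one SequentialSpan."""
    collected = []
    for e in exprs:
        if isinstance(e, SequentialSpan):
            collected.extend(e.spans)
        else:
            collected.append(e)
    return SequentialSpan(tuple(collected))

# A fusion-like pass combines conv2d and bias_add into one fused op:
conv_span = Span("conv2d_1", 10)
bias_span = Span("bias_add_1", 11)
fused = merge_spans(conv_span, bias_span)
print([s.source for s in fused.spans])  # ['conv2d_1', 'bias_add_1']
```

The tedious part is that each pass merges expressions differently, so this decision has to be encoded pass by pass.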
> I wonder if we could get away with doing this once at the end of compilation if we also attach references to the frontend layer (or post-import variable) to each Relay Var.
It would be quite convenient if this could be done once at the end of compilation! Sorry, I am not quite following here. Could you explain again, perhaps with an example of:
- What does attaching references to the frontend layer look like?
- What exactly should be attached to a Relay Var?
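While waiting for your reply, here is our rough guess at what "attaching a reference to the frontend layer" might mean, using plain-Python stand-ins rather than the real Relay classes; please correct us if this is not what you had in mind.

```python
# Our guess (plain-Python stand-ins, NOT real Relay classes): each Relay Var
# carries a reference back to the frontend object it was imported from, so a
# single walk at the end of compilation can recover layer names from Vars.
from dataclasses import dataclass

@dataclass
class FrontendLayer:
    name: str      # e.g. the TF/TFLite layer name
    op_type: str

@dataclass
class RelayVar:
    name_hint: str
    frontend_ref: FrontendLayer = None  # the attached back-reference

# At import time the converter records where each Var came from:
layer = FrontendLayer(name="stack", op_type="Pack")
var = RelayVar(name_hint="v7", frontend_ref=layer)

# At the end of compilation the reference can simply be read back:
print(var.frontend_ref.name)  # stack
```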
> It seems like by annotating Var we might be able to add this information.
We would like some more explanation about this part. Besides Vars and params, the problem also occurs in one-to-many conversions. Take the Pack op from TF as an example again: currently we fill the layer name into the converted IR like this:
```
def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
  %0 = shape_of(%input, dtype="int32") /* Shape */;
  %1 = strided_slice(%0, …) /* strided_slice */;
  %2 = squeeze(%1) /* strided_slice */;
  # the Pack op conversion starts here
  %3 = expand_dims(%2, axis=0) /* stack */;
  %4 = expand_dims(3, axis=0) /* stack */;
  %5 = expand_dims(3, axis=0) /* stack */;
  %6 = (%3, %4, %5) /* stack */;
  %7 = concatenate(%6) /* stack */;
}
```
And here is the result from the former patch:
```
def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
  %0 = shape_of(%input, dtype="int32") /* Shape */;
  %1 = strided_slice(%0, begin=[0], end=[1], strides=[1], axes=None) /* strided_slice_PART_0 */;
  %2 = squeeze(%1) /* strided_slice */;
  %3 = expand_dims(%2, axis=0) /* stack_PART_0 */;
  %4 = expand_dims(3, axis=0) /* stack_PART_1 */;
  %5 = expand_dims(3, axis=0) /* stack_PART_2 */;
  %6 = (%3, %4, %5) /* stack_PART_3 */;
  %7 = concatenate(%6) /* stack */;
}
```
In the former patch we could immediately identify the computation output of the Pack op, because no suffix was added to it. We have since removed the suffixes, because we noticed that the "_PART_" suffix becomes annoying and misleading after pass transformations.
The drawback of the current version is that we cannot tell which expression is the computation output, because they all look the same. Perhaps we could do something like the following example, but we are still looking for a better solution.
```
def @main (%input: Tensor[(?, ?, 3, 1), float32]) {
  %0 = shape_of(%input, dtype="int32") /* Shape */;
  %1 = strided_slice(%0, …) /* strided_slice */;
  %2 = squeeze(%1) /* strided_slice */;
  # the Pack op conversion starts here
  %3 = expand_dims(%2, axis=0) /* stack */;
  %4 = expand_dims(3, axis=0) /* stack */;
  %5 = expand_dims(3, axis=0) /* stack */;
  %6 = (%3, %4, %5) /* stack */;
  %7 = concatenate(%6) /* stack_OUTPUT */;
}
```
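On the converter side, the "_OUTPUT" suffix idea could be implemented roughly as follows. This is only a plain-Python sketch of the tagging logic; `tag_spans` and the string expressions are illustrative, not actual TVM frontend APIs.

```python
# Sketch of tagging a one-to-many conversion: every intermediate expression
# produced for one frontend op gets the plain layer name, and only the final
# expression (the computation output) gets an "_OUTPUT"-suffixed span so it
# can still be identified after pass transformations.
def tag_spans(exprs, layer_name):
    """exprs: ordered list of expressions emitted for one frontend op."""
    tagged = []
    for i, expr in enumerate(exprs):
        is_output = (i == len(exprs) - 1)
        span = layer_name + "_OUTPUT" if is_output else layer_name
        tagged.append((expr, span))
    return tagged

# The five expressions emitted for the TF Pack op in the example above:
exprs = ["expand_dims_0", "expand_dims_1", "expand_dims_2",
         "tuple", "concatenate"]
for expr, span in tag_spans(exprs, "stack"):
    print(f"{expr} /* {span} */")
# last line printed: concatenate /* stack_OUTPUT */
```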
> It's harder to fill span information back up through the compiler since the layer variables have changed. I'm curious if you guys tried to apply this to any TIR-based fusion?
We are still working on the TIR passes, as shown in the list above. Besides, we have not yet done the propagation from Relay to TE or TIR, because that is also a tough part for us.
Things are not too complicated in the Relay environment, but they become harder as we go down to lower IRs like TE and TIR. Currently we still rely on the layer name, but we are thinking that using row and column numbers could be more robust and more indicative.
If we had a precise definition of the line-number information of an IRModule, we could at least have a better mapping relationship before and after a pass.
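As a toy illustration of the row-and-column idea: if the printed IRModule had a stable, well-defined line numbering, each pass could emit a mapping from output lines back to input lines, much like a source map. The following is purely illustrative plain Python (the "pass" just fuses two adjacent lines; the bookkeeping pattern is the point, not the pass itself).

```python
# Toy "source map" for an IR pass: record, for each line of the transformed
# module, which line(s) of the original module it came from.
def fuse_pass_with_map(lines):
    out_lines, src_map = [], {}
    i = 0
    while i < len(lines):
        if i + 1 < len(lines) and lines[i + 1].startswith("bias_add"):
            # fuse a producer with the following bias_add into one line
            out_lines.append(f"fused({lines[i]}, {lines[i + 1]})")
            src_map[len(out_lines) - 1] = [i, i + 1]
            i += 2
        else:
            out_lines.append(lines[i])
            src_map[len(out_lines) - 1] = [i]
            i += 1
    return out_lines, src_map

before = ["conv2d(%x)", "bias_add(%0)", "relu(%1)"]
after, mapping = fuse_pass_with_map(before)
print(after)    # ['fused(conv2d(%x), bias_add(%0))', 'relu(%1)']
print(mapping)  # {0: [0, 1], 1: [2]}
```

With such a map per pass, the maps could be composed across the whole pipeline to trace any final expression back to its original line.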
> Lastly, any idea how much additional memory this takes or performance impact?
Yes. Taking mobilenet_v1_2018_08_02 as an example, here are the profiling results:
Runtime performance

| function | without span filling | with span filling | with span filling & schedule_record |
|---|---|---|---|
| relay.frontend.from_tflite() | 133174.0 us | 176468.0 us (+32.51%) | 177774.0 us (+33.49%) |
| relay.build() | 7480367.0 us | 7558526.0 us (+1.045%) | 7580165.0 us (+1.334%) |
Memory usage

| function | without span filling | with span filling | with span filling & schedule_record |
|---|---|---|---|
| relay.frontend.from_tflite() | 26.105 MiB | 26.203 MiB (+0.375%) | 26.211 MiB (+0.406%) |
| relay.build() | 147.762 MiB | 148.148 MiB (+0.261%) | 148.418 MiB (+0.443%) |
We also provide options to disable span filling and schedule recording if users do not need them.