[MetaSchedule] [TensorCore] Please help check out this error

I have personally switched to Relax in the unity branch, and my bandwidth for supporting Relay is very limited at the moment, so I cannot guarantee whether Relay works or not.

In your specific case, it seems that the stage pipeline assumption is broken. Would you like to attach the log file for this workload so that we could investigate without using Relay?

Here are the tuning logs and print logs; I just uploaded them.

Thanks for your patience! Please check.

@junrushao The log files are in tune_tmp and log.log. Would you mind having a look?

Many thanks

Did you solve this problem? I met a similar problem when tuning fused_matmul_add.

Hello @MasterJianxing ,

I ran the resnet_meta.py code you had shared on my system, and I didn't get the same error as you; instead, I received this error:

 InternalError: Check failed: original_producers.size() == 1u (0 vs. 1) :

The full diagnostic is given below:

Can you please take a look and help figure out why this InternalError is occurring?

Thanks and regards,

Krishna

Hi @MasterJianxing, were you able to fix this? I read about a somewhat similar issue here; please take a look. Thanks.

@zxybazh @junrushao @AndrewZhaoLuo @comaniac Please help.

It’s been a long time… But now I can use MetaSchedule to tune BERT with the latest TVM version. You can update and have a try.

Okay, thank you for the response. Did you tune BERT in FP16 by targeting CUDA Tensor Cores?

Hi, sorry to bother you. Could you please share the meta-scheduling part of your code? I just wanted to know how you got MetaSchedule to work with BERT. It would greatly help me in my current work. Thanks in advance.

Hi @MasterJianxing ,

Thank you so much for sharing the code. I ran the transformer.py file for the BERT model, and I am currently facing this RuntimeError in the thread bindings:

RuntimeError: parallel_for_dynamic error with [16:48:39] /home/name/tvm/src/tir/transforms/unify_thread_binding.cc:112: Check failed: (ana.CanProveEqual(dom->extent, new_iter_var->dom->extent)) is false: ValueError: All loops that are bound to ``threadIdx.y`` should have the same extent. However, there are two loops with extent T.int64(6) and T.int64(2), which are not equal
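For context, the invariant that the UnifyThreadBinding pass enforces can be sketched in plain Python. This is an illustrative stand-in, not TVM API; the function name and input shape are my own:

```python
def check_thread_extents(bindings):
    """Illustrative check mirroring TVM's unify_thread_binding.cc:
    every loop bound to the same GPU thread axis (e.g. ``threadIdx.y``)
    must have the same extent.

    ``bindings`` is a list of (thread_axis, extent) pairs.
    """
    seen = {}
    for axis, extent in bindings:
        if axis in seen and seen[axis] != extent:
            raise ValueError(
                f"All loops that are bound to ``{axis}`` should have the "
                f"same extent. However, there are two loops with extent "
                f"{seen[axis]} and {extent}, which are not equal"
            )
        seen[axis] = extent
    return seen

# The failing schedule above bound threadIdx.y to loops of extent 6 and 2:
try:
    check_thread_extents([("threadIdx.y", 6), ("threadIdx.y", 2)])
except ValueError as err:
    print(err)
```

In other words, two stages in the generated schedule were tiled differently but bound to the same thread axis, so the error points at the generated schedule rather than at your model script.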

What am I doing wrong here? I have not made any modifications to your code; kindly let me know what I am missing and how I can rectify it.

TIA

Regards, Krishna

Hello, I ran into the same problem. Have you figured it out?

Hi, I haven't figured it out yet; it's been quite some time since I picked it up. I am still waiting for a response from the author on this.

Thank you for your response. Have you raised an issue on this?

Hi, I have not raised an issue. My use case here is actually with ResNet, not BERT itself.

The broken stage pipeline error goes away when we change the batch size to 16 or 32. I'm currently unsure why the batch size change fixes it, but it does.

I tried meta-scheduling the ResNet workloads as well as the resnet50 ONNX model on my GPU. Using MixedPrecisionPass() along with batch size = 16 results in the following error: "Block no longer exists in IRModule".

I troubleshot this and found this bug issue thread from @zxybazh: [Bug] Tensorization Failure During Multilevel Tiling with Tensor Intrin · Issue #16614 · apache/tvm · GitHub

TL;DR → the issue reports that tensorization fails during multi-level tiling with tensor intrinsics in MetaSchedule. This apparently still needs to be resolved. Will post here if this gains any traction.

Regards,

Krishna

Thank you again. Actually, I tried tensorizing resnet50, and it's fine with a batch size of 16. However, I came across this runtime error:

Check failed: (ana.CanProveEqual(dom->extent, new_iter_var->dom->extent)) is false: ValueError: All loops that are bound to ``threadIdx.y`` should have the same extent. However, there are two loops with extent T.int64(6) and T.int64(2), which are not equal

which you mentioned before. MetaSchedule has restrictions on the input sizes of convolutions when computing with Tensor Cores, so maybe that is the reason.

Hi, Thank you for your response.

I remember from this thread that convolutions require an input with batch size and color channels of 16 and 16 respectively. However, I am currently unsure how to perform said padding here. If you have any input on this, it would be greatly appreciated.
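In case a sketch helps: zero-padding the channel dimension of an NCHW tensor up to a multiple of 16 can be done with NumPy as below. This is only a minimal illustration; the function name and the zero-fill strategy are my assumptions, not code from this thread, and the convolution weights would need matching padding on their input-channel axis.

```python
import numpy as np

def pad_nchw_channels(x, multiple=16):
    """Zero-pad the channel dim of an NCHW tensor up to a multiple.

    Illustrative only: the name and zero-fill choice are assumptions.
    """
    b, c, h, w = x.shape
    pad_c = (-c) % multiple  # channels needed to reach the next multiple
    if pad_c == 0:
        return x
    # Pad only the channel axis; batch and spatial dims stay unchanged.
    return np.pad(x, ((0, 0), (0, pad_c), (0, 0), (0, 0)))

x = np.ones((16, 3, 224, 224), dtype="float16")  # ResNet-style input
y = pad_nchw_channels(x)
print(y.shape)
```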

Also, may I please know how you tensorized resnet50 with batch size 16? If you can share a code snippet, it would be helpful. Thanks again.

Regards,

Krishna

No, it does not need batch size and color channels of 16 and 16 respectively. You can find the requirement here: [RFC][Tensor Core] Optimization of CNNs on Tensor Core. It says that

4, TensorCore_common: Tensor Core instructions for conv2d and dense, including loading, and storing data between shared memory and register. Supporting wmma (Tensor Core instructions) for three input shapes, 8x16x32, 16x16x16, and 32x16x8.

These 3 so-called input shapes are calculated from the [B, C, H, W] dimensions. Please refer to this link: Convolutional Layers User's Guide - NVIDIA Docs.
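To make that mapping concrete, here is a small self-contained sketch of how a conv2d on an NCHW input lowers to implicit-GEMM dimensions (M, N, K), per the NVIDIA guide, and how one might check those against the three wmma shapes from the RFC. The helper names and the plain-divisibility criterion are my assumptions; MetaSchedule's actual legality check may be more involved:

```python
def conv2d_gemm_dims(batch, c_in, h, w, c_out, kh, kw, stride=1, pad=0):
    """Map a conv2d on an NCHW input to implicit-GEMM dimensions:
    M = batch * H_out * W_out, N = C_out, K = C_in * kh * kw.
    """
    h_out = (h + 2 * pad - kh) // stride + 1
    w_out = (w + 2 * pad - kw) // stride + 1
    return batch * h_out * w_out, c_out, c_in * kh * kw

# The three wmma (m, n, k) shapes quoted from the RFC above.
WMMA_SHAPES = ((8, 16, 32), (16, 16, 16), (32, 16, 8))

def fits_wmma(m, n, k, shapes=WMMA_SHAPES):
    """Assumed criterion: (M, N, K) must tile evenly by some wmma shape."""
    return any(m % wm == 0 and n % wn == 0 and k % wk == 0
               for wm, wn, wk in shapes)

# First conv of ResNet-50 at batch 16: 7x7, stride 2, pad 3, 3 -> 64 channels.
m, n, k = conv2d_gemm_dims(16, 3, 224, 224, 64, 7, 7, stride=2, pad=3)
print(m, n, k, fits_wmma(m, n, k))
```

Note that a 7x7 conv on 3 input channels gives K = 3 * 7 * 7 = 147, which does not divide evenly by 32, 16, or 8, which is why the first convolution of a network is often the awkward one in these discussions.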

My code is the same as the official one: https://github.com/apache/tvm/blob/main/tests/python/integration/test_auto_tensorize.py. You can see that a batch size of 16 is okay to tune; no channel padding is needed.

Thank you so much for this, will check!