[MetaSchedule] [TensorCore] Please help check out this error

I have personally switched to Relax in the unity branch, and my bandwidth for supporting Relay is very limited at the moment, so I cannot guarantee whether Relay works or not.

In your specific case, it seems that the stage pipeline assumption is broken. Would you like to attach the log file for this workload so that we could investigate without using Relay?

Here are the tuning logs and print logs; I just uploaded them.

Thanks for your patience! Please check.

@junrushao The log files are in tune_tmp and log.log. Would you mind having a look?

Many thanks

Did you solve this problem? I met a similar problem when tuning fused_matmul_add.

Hello @MasterJianxing ,

I ran the resnet_meta.py code you had shared on my system, and I didn't get the same error as you; instead, I received this error:

 InternalError: Check failed: original_producers.size() == 1u (0 vs. 1) :

The full diagnostic is given below:

Can you please take a look and help figure out why this InternalError is occurring?

Thanks and regards,

Krishna

Hi @MasterJianxing, were you able to fix this? I read about a somewhat similar issue here; please take a look. Thanks.

@zxybazh @junrushao @AndrewZhaoLuo @comaniac Please help.

It’s been a long time… But now I can use MetaSchedule to tune BERT with the latest TVM version. You can update and have a try.

Okay, thank you for the response. Did you tune BERT in FP16 by targeting CUDA Tensor Cores?

Hi, sorry to bother you. Could you please share the meta-scheduling part of your code? I just wanted to know how you got MetaSchedule to work with BERT. It would greatly help me in my current work. Thanks in advance.

Hi @MasterJianxing ,

Thank you so much for sharing the code. I ran the transformer.py file for the BERT model, and I am currently facing this RuntimeError in the thread bindings:

RuntimeError: parallel_for_dynamic error with [16:48:39] /home/name/tvm/src/tir/transforms/unify_thread_binding.cc:112: Check failed: (ana.CanProveEqual(dom->extent, new_iter_var->dom->extent)) is false: ValueError: All loops that are bound to ``threadIdx.y`` should have the same extent. However, there are two loops with extent T.int64(6) and T.int64(2), which are not equal
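For context, the invariant that the UnifyThreadBinding pass enforces can be sketched in plain Python. This is an illustrative stand-in, not TVM API; the function name and input shape are my own:

```python
def check_thread_extents(bindings):
    """Illustrative check mirroring TVM's unify_thread_binding.cc:
    every loop bound to the same GPU thread axis (e.g. ``threadIdx.y``)
    must have the same extent.

    ``bindings`` is a list of (thread_axis, extent) pairs.
    """
    seen = {}
    for axis, extent in bindings:
        if axis in seen and seen[axis] != extent:
            raise ValueError(
                f"All loops that are bound to ``{axis}`` should have the "
                f"same extent. However, there are two loops with extent "
                f"{seen[axis]} and {extent}, which are not equal"
            )
        seen[axis] = extent
    return seen

# The failing schedule above bound threadIdx.y to loops of extent 6 and 2:
try:
    check_thread_extents([("threadIdx.y", 6), ("threadIdx.y", 2)])
except ValueError as err:
    print(err)
```

In other words, two stages in the generated schedule were tiled differently but bound to the same thread axis, so the error points at the generated schedule rather than at your model script.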

What am I doing wrong here? I have not made any modifications to your code; kindly let me know what I am missing and how I can rectify it.

TIA

Regards, Krishna

Hello, I ran into the same problem. Have you figured it out?

Hi, I haven't figured it out yet; it's been quite some time since I picked it up. I am still waiting for a response from the author on this.

Thank you for your response. Have you raised an issue on this?

Hi, I have not raised an issue. My use case here is actually with ResNet, not BERT itself.

The broken stage pipeline error goes away when we change the batch size to 16 or 32. I'm currently unsure why the batch size change fixes it, but it does.

I tried meta-scheduling the ResNet workloads as well as the resnet50 ONNX model on my GPU. Using MixedPrecisionPass() along with batch size = 16 results in the following error: "Block no longer exists in IRModule".

I troubleshot this and found this bug issue thread from @zxybazh: [Bug] Tensorization Failure During Multilevel Tiling with Tensor Intrin · Issue #16614 · apache/tvm · GitHub

TL;DR → the issue reports that tensorization fails during multi-level tiling with tensor intrinsics in MetaSchedule. This apparently still needs to be resolved. Will post here if this gains any traction.

Regards,

Krishna

Thank you again. Actually, I tried tensorizing resnet50, and it's fine with a batch size of 16. However, I came across this runtime error:

Check failed: (ana.CanProveEqual(dom->extent, new_iter_var->dom->extent)) is false: ValueError: All loops that are bound to ``threadIdx.y`` should have the same extent. However, there are two loops with extent T.int64(6) and T.int64(2), which are not equal

which you mentioned before. MetaSchedule has restrictions on the input sizes of convolutions when computing with Tensor Cores, so maybe that is the reason.

Hi, Thank you for your response.

I remember from this thread that convolutions require an input with batch size and color channels of 16 and 16 respectively. However, I am currently unsure how to perform said padding here. If you have any input on this, it would be greatly appreciated.
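In case a sketch helps: zero-padding the channel dimension of an NCHW tensor up to a multiple of 16 can be done with NumPy as below. This is only a minimal illustration; the function name and the zero-fill strategy are my assumptions, not code from this thread, and the convolution weights would need matching padding on their input-channel axis.

```python
import numpy as np

def pad_nchw_channels(x, multiple=16):
    """Zero-pad the channel dim of an NCHW tensor up to a multiple.

    Illustrative only: the name and zero-fill choice are assumptions.
    """
    b, c, h, w = x.shape
    pad_c = (-c) % multiple  # channels needed to reach the next multiple
    if pad_c == 0:
        return x
    # Pad only the channel axis; batch and spatial dims stay unchanged.
    return np.pad(x, ((0, 0), (0, pad_c), (0, 0), (0, 0)))

x = np.ones((16, 3, 224, 224), dtype="float16")  # ResNet-style input
y = pad_nchw_channels(x)
print(y.shape)
```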

Also, may I please know how you tensorized resnet50 with batch size 16? If you can share a code snippet, it would be helpful. Thanks again.

Regards,

Krishna

No, it does not need batch size and color channels of 16 and 16 respectively. You can find the requirement here: [RFC][Tensor Core] Optimization of CNNs on Tensor Core. It says that

4, TensorCore_common: Tensor Core instructions for conv2d and dense, including loading, and storing data between shared memory and register. Supporting wmma (Tensor Core instructions) for three input shapes, 8x16x32, 16x16x16, and 32x16x8.

These 3 so-called input shapes are calculated from the [B, C, H, W] dimensions. Please refer to this link: Convolutional Layers User's Guide - NVIDIA Docs.
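To make that mapping concrete, here is a small self-contained sketch of how a conv2d on an NCHW input lowers to implicit-GEMM dimensions (M, N, K), per the NVIDIA guide, and how one might check those against the three wmma shapes from the RFC. The helper names and the plain-divisibility criterion are my assumptions; MetaSchedule's actual legality check may be more involved:

```python
def conv2d_gemm_dims(batch, c_in, h, w, c_out, kh, kw, stride=1, pad=0):
    """Map a conv2d on an NCHW input to implicit-GEMM dimensions:
    M = batch * H_out * W_out, N = C_out, K = C_in * kh * kw.
    """
    h_out = (h + 2 * pad - kh) // stride + 1
    w_out = (w + 2 * pad - kw) // stride + 1
    return batch * h_out * w_out, c_out, c_in * kh * kw

# The three wmma (m, n, k) shapes quoted from the RFC above.
WMMA_SHAPES = ((8, 16, 32), (16, 16, 16), (32, 16, 8))

def fits_wmma(m, n, k, shapes=WMMA_SHAPES):
    """Assumed criterion: (M, N, K) must tile evenly by some wmma shape."""
    return any(m % wm == 0 and n % wn == 0 and k % wk == 0
               for wm, wn, wk in shapes)

# First conv of ResNet-50 at batch 16: 7x7, stride 2, pad 3, 3 -> 64 channels.
m, n, k = conv2d_gemm_dims(16, 3, 224, 224, 64, 7, 7, stride=2, pad=3)
print(m, n, k, fits_wmma(m, n, k))
```

Note that a 7x7 conv on 3 input channels gives K = 3 * 7 * 7 = 147, which does not divide evenly by 32, 16, or 8, which is why the first convolution of a network is often the awkward one in these discussions.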

My code is the same as the official one: https://github.com/apache/tvm/blob/main/tests/python/integration/test_auto_tensorize.py. You can see that a batch size of 16 is okay to tune; no channel padding is needed.

Thank you so much for this, will check!