[MetaSchedule][TensorCore] Please help check out this error

I am trying to use Tensor Cores to tune a network. To do so, I set the datatype to "float16", and I hit this error. When I set the datatype to "float32" instead, it runs normally.

```
Traceback (most recent call last):
  File "resnet_meta.py", line 58, in <module>
    database = ms.tune.tune_tasks(
  File "/home/pan/tvm/python/tvm/meta_schedule/tune.py", line 117, in tune_tasks
    task_scheduler.tune(
  File "/home/pan/tvm/python/tvm/meta_schedule/task_scheduler/task_scheduler.py", line 132, in tune
    _ffi_api.TaskSchedulerTune(  # type: ignore # pylint: disable=no-member
  File "/home/pan/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm.tir.schedule.schedule.ScheduleError: Traceback (most recent call last):
```

```
ScheduleError: An error occurred in the schedule primitive 'compute-at' ... Error message: The scope tir.Block#0 is not a stage pipeline.
```

I imitated the testing file to write my ResNet MetaSchedule tuning script; here is the code:

```python
mod, params = testing.resnet.get_workload(
    num_layers=50,
    batch_size=batch_size,
    image_shape=image_shape,
    dtype="float16",
)

tune_tasks = ms.relay_integration.extract_tasks(mod, tgt, params)

tasks, task_weights = ms.relay_integration.extracted_tasks_to_tune_contexts(
    extracted_tasks=tune_tasks,
    work_dir=work_dir,
    space=ms.space_generator.PostOrderApply(
        sch_rules="cuda-tensorcore",
        postprocs="cuda-tensorcore",
        mutator_probs="cuda-tensorcore",
    ),
)

database = ms.tune.tune_tasks(
    tasks=tasks,
    task_weights=task_weights,
    work_dir=work_dir,
    max_trials_per_task=4,
    max_trials_global=150,
)
```

Please help me figure out why this error happens.

Many thanks.

Could you please share the TIR you are tuning?

Here it is. Please check @junrushao

Besides, here is the original file

I have personally switched to Relax in the unity branch, and my bandwidth for supporting Relay is very limited at the moment, so I cannot guarantee whether Relay works or not.

In your specific case, it seems that the stage-pipeline assumption is broken. Would you like to attach the log file for this workload so that we can investigate without using Relay?

Here are the tuning logs and print logs; I just uploaded them.

Thanks for your patience! Please check

@junrushao The log files are in tune_tmp and log.log. Would you mind having a look?

Many thanks

Did you solve this problem? I met a similar problem when tuning fused_matmul_add.


Hello @MasterJianxing ,

I ran the resnet_meta.py code you had shared on my system, and I didn't get the same error as you; instead, I received this error:

```
InternalError: Check failed: original_producers.size() == 1u (0 vs. 1) :
```

The full diagnostic is given below:

Can you please take a look and help figure out why this InternalError is occurring?

Thanks and regards,

Krishna

Hi @MasterJianxing Were you able to fix this? I read about a somewhat similar issue here; please take a look. Thanks.

@zxybazh @junrushao @AndrewZhaoLuo @comaniac Please help.

It's been a long time… But now I can use MetaSchedule to tune BERT with the latest TVM version. You can update and give it a try.


Okay, thank you for the response. Did you tune BERT in FP16 by targeting `cuda-tensorcore`?

Hi, sorry to bother you. Can you please share the MetaSchedule part of your code? I just want to know how you got MetaSchedule to work with BERT; it would greatly help my current work. Thanks in advance.


Hi @MasterJianxing ,

Thank you so much for sharing the code. I ran the transformer.py file for the BERT model, and I am currently facing this RuntimeError in thread binding. Error:

```
RuntimeError: parallel_for_dynamic error with [16:48:39] /home/name/tvm/src/tir/transforms/unify_thread_binding.cc:112: Check failed: (ana.CanProveEqual(dom->extent, new_iter_var->dom->extent)) is false: ValueError: All loops that are bound to ``threadIdx.y`` should have the same extent. However, there are two loops with extent T.int64(6) and T.int64(2), which are not equal
```
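For context, this check comes from TVM's `UnifyThreadBinding` pass, which merges all loops bound to the same thread axis (e.g. `threadIdx.y`) into a single binding, and therefore requires them all to have identical extents. The invariant itself can be illustrated with a small standalone Python sketch (the helper name here is hypothetical, not TVM's actual API):

```python
# Hypothetical illustration of the invariant enforced by TVM's
# UnifyThreadBinding pass: every loop bound to one thread axis
# must share the same extent before the loops can be merged
# into a single thread binding.

def check_thread_bindings(loops):
    """loops: list of (thread_axis, extent) pairs found in a kernel."""
    extents = {}
    for axis, extent in loops:
        if axis not in extents:
            extents[axis] = extent
        elif extents[axis] != extent:
            raise ValueError(
                f"All loops bound to ``{axis}`` should have the same "
                f"extent. Found {extents[axis]} and {extent}."
            )
    return extents

# A kernel like the one in the error above: two loops bound to
# threadIdx.y with extents 6 and 2 cannot be unified.
try:
    check_thread_bindings([("threadIdx.y", 6),
                           ("threadIdx.x", 32),
                           ("threadIdx.y", 2)])
except ValueError as err:
    print("rejected:", err)

# Loops with equal extents pass the check.
print(check_thread_bindings([("threadIdx.y", 6), ("threadIdx.y", 6)]))
```

In practice, a mismatch like `T.int64(6)` vs `T.int64(2)` usually means the generated schedule bound two stages to `threadIdx.y` with different tile sizes, producing an invalid candidate; updating TVM or retuning may avoid it.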

What am I doing wrong here? I have not made any modifications to your code; kindly let me know what I am missing and how I can rectify it.

TIA

Regards, Krishna