Cannot find config for target when workload is "dense_small_batch.cuda" or "dense_large_batch.cuda"!

  • I tuned a CNN model trained in TensorFlow, and after testing its performance, I found that the overall TVM inference time is far larger than TensorFlow's.

  • My questions are:

    1. I heard that TVM runs asynchronously before get_output(). But I don't understand why sync() takes so long; it delays my whole inference process. Is there any way to reduce the sync() time, or to move the synchronization before mod.set_input()?

    2. When I test cudaMemcpy(DeviceToHost) myself with a data size of 125x20x6600, it takes about 4 ms. So why does the cudaMemcpy(DeviceToHost) inside TVM take about 20 ms, which is very slow?

    3. Is there a way to define a placeholder for set_input, like in TensorFlow?

    4. How do I fix this missing-configuration case?
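To show what I mean in question 1, here is a minimal simulation of my understanding (this is plain Python standing in for TVM's CUDA stream, not TVM's actual API): the launch call returns immediately, and the real cost only surfaces at the blocking synchronization point.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an asynchronous GPU stream:
# submit() plays the role of run() (returns immediately),
# result() plays the role of sync() (blocks until the work is done).
pool = ThreadPoolExecutor(max_workers=1)

def fake_kernel():
    time.sleep(0.05)  # pretend 50 ms of GPU work

t0 = time.perf_counter()
future = pool.submit(fake_kernel)       # "run()": returns almost instantly
launch_time = time.perf_counter() - t0

t1 = time.perf_counter()
future.result()                         # "sync()": absorbs the real 50 ms
sync_time = time.perf_counter() - t1

pool.shutdown()
# The launch looks nearly free; the whole cost appears at the sync point.
print(f"launch: {launch_time*1000:.1f} ms, sync: {sync_time*1000:.1f} ms")
```

So when I time sync() alone it looks slow, but if I understand correctly it is just where the queued work is waited on, which is why I am asking whether that wait can be moved or reduced.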

$:

Extract tasks...
Compile...
Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('dense_small_batch.cuda', ('TENSOR', (2500, 512), 'float32'), ('TENSOR', (6600, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.