I tuned a CNN model trained with TensorFlow, and after testing its performance I found that the end-to-end TVM inference time is far longer than TensorFlow's.

My questions are:

1. I have heard that TVM runs asynchronously until get_output() forces a device synchronization. But I don't understand why the sync() takes so long; it delays my whole inference process.

   Is there any way to reduce the sync() time, or to move the synchronization so that it happens before mod.set_input()?

2. When I benchmark cudaMemcpy(DeviceToHost) directly for a 125x20x6600 tensor, the copy takes about 4 ms. Why does the DeviceToHost copy in TVM take about 20 ms, which is very slow?

3. Is there a way to define a placeholder for set_input, like TensorFlow's tf.placeholder?

4. How do I fix this "Cannot find config" case?
```
$:
Extract tasks...
Compile...
Cannot find config for target=cuda -keys=cuda,gpu -max_num_threads=1024 -model=unknown -thread_warp_size=32, workload=('dense_small_batch.cuda', ('TENSOR', (2500, 512), 'float32'), ('TENSOR', (6600, 512), 'float32'), None, 'float32'). A fallback configuration is used, which may bring great performance regression.
```
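
For context on question 1: with an asynchronous runtime, run() only enqueues work and returns almost immediately, so the kernel cost shows up later, at the synchronization point. This stdlib sketch (not TVM code; the `AsyncDevice` class here is a made-up stand-in) illustrates why sync() appears to "take" all the time:

```python
import threading
import time

class AsyncDevice:
    """Toy stand-in for an asynchronous device queue (hypothetical, not TVM)."""
    def __init__(self):
        self._pending = []

    def run(self, seconds):
        # Like an async runtime's run(): enqueue the work and return at once.
        t = threading.Thread(target=time.sleep, args=(seconds,))
        t.start()
        self._pending.append(t)

    def sync(self):
        # Like a device sync before get_output(): block until queued work finishes.
        for t in self._pending:
            t.join()
        self._pending.clear()

dev = AsyncDevice()

start = time.perf_counter()
dev.run(0.05)                    # "inference" that really takes 50 ms
run_time = time.perf_counter() - start

dev.sync()
total_time = time.perf_counter() - start

# run() returns almost immediately; the real cost appears at sync().
print(f"run() returned in {run_time*1000:.1f} ms, total {total_time*1000:.1f} ms")
```

The point is that sync() is usually not adding overhead of its own; it is simply where the asynchronous kernel time becomes visible, so the fair comparison against TensorFlow is run() plus sync() together, not run() alone.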
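
As a sanity check on the numbers in question 2: a 125x20x6600 float32 tensor is about 66 MB, and at a typical effective PCIe 3.0 x16 bandwidth of roughly 13 GB/s (an assumption about the hardware) that copy should indeed take only a few milliseconds, so the extra ~16 ms seen inside TVM likely comes from something other than the raw copy, e.g. unfinished kernels being waited on inside the transfer, or a pageable rather than pinned host buffer:

```python
# Back-of-the-envelope transfer-time estimate for the tensor in question.
elements = 125 * 20 * 6600          # tensor shape from the post
bytes_total = elements * 4          # float32 = 4 bytes per element
mb = bytes_total / 1e6

pcie_bps = 13e9                     # assumed effective PCIe 3.0 x16 bandwidth, bytes/s
est_ms = bytes_total / pcie_bps * 1e3

print(f"{mb:.0f} MB -> ~{est_ms:.1f} ms at {pcie_bps/1e9:.0f} GB/s")
```

Since the estimate lands near the measured 4 ms for a bare cudaMemcpy, profiling what else runs under TVM's copy path (rather than the copy itself) seems like the right next step.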
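
Regarding question 4: the warning means AutoTVM found no tuning record for that dense workload in the history log applied at build time (or no log was applied at all), so a fallback schedule is used. The fix is to tune the extracted tasks and confirm the resulting log actually contains an entry for `dense_small_batch.cuda` with those shapes, since it is easy to tune only the conv2d tasks and miss the dense one. A stdlib sketch for checking a log for such an entry (the line layout here is my assumption about AutoTVM's JSON log format, and `sample_log` is invented for illustration):

```python
import json

def tasks_in_log(lines):
    """Collect the task names recorded in an AutoTVM-style JSON tuning log.

    Assumes each line is a JSON object whose "input" field is a list like
    [target, task_name, args, ...] -- verify against your actual log file.
    """
    names = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        names.add(record["input"][1])
    return names

# Hypothetical log contents for illustration (not real tuning results).
sample_log = [
    '{"input": ["cuda", "conv2d_nchw.cuda", [], {}], "config": {}, "result": []}',
    '{"input": ["cuda", "dense_small_batch.cuda", [], {}], "config": {}, "result": []}',
]

names = tasks_in_log(sample_log)
print("dense_small_batch.cuda tuned:", "dense_small_batch.cuda" in names)
```

If the task is missing from your real log, re-run task extraction and tuning so the dense op is included, then rebuild with the updated log applied via AutoTVM's history-best context.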