HD Standard Models [Performance Issues]

Hi @eqy @tqchen @srkreddy1238

I am running inference on a standard MobilenetV1 on HD images with an input size of [1,1200,1920,3] (NHWC), and the performance is as below.

Inference times are in msec on a 1080 Ti GPU with CUDA 10.0, cuDNN 7.4.2 and TensorRT 5.0:

| Standard TF (CUDA + cuDNN) | TF + TensorRT (FP32) | TVM (without autotuning) | Autotuned TVM |
|---|---|---|---|
| 31 | 26 | 101 | 75 |

I let each autotuning task run for 1000 trials with early stopping at 400 trials.

In my view, autotuned TVM should at least match the standard TF inference time. Do you have any views on this? One reason could be that TVM is not able to use the cuDNN library. How do I ascertain that TVM is using the cuDNN backend?

Also, any other inputs would be appreciated !

Thanks

TVM should achieve better performance *without* cuDNN for this workload and GPU combination, because cuDNN does not provide the fused operators that graph-level optimization requires, while TVM's own CUDA backend does.
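For reference, whether cuDNN is used is controlled by the target string passed to `relay.build`; here is a minimal sketch (the `net`/`params` names are placeholders for whatever `from_tensorflow` returned, and the build call is left commented since it needs a GPU):

```python
# Selecting the backend via the target string (sketch):
target_cudnn = "cuda -libs=cudnn"  # offload supported ops to cuDNN
target_cuda = "cuda"               # TVM-generated CUDA kernels, operator fusion enabled

# with tvm.transform.PassContext(opt_level=3):
#     graph, lib, params = tvm.relay.build(net, target=target_cuda, params=params)
```

If `-libs=cudnn` is absent from the target, TVM will not call into cuDNN at all, so comparing the two target strings is one way to check which backend a measurement used.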

The NHWC data layout usually yields worse performance in this case; you can try converting to NCHW for GPU.

1000 trials with early stopping at 400 may be a bit thin for this setup; we would normally use something like 2000 trials with early stopping at 800 for GPU workloads.
If you share your model definition script, we can also try tuning it, as we have several 1080 Ti GPUs as well.
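As a sketch, those suggested budgets map onto the per-task AutoTVM tuning loop roughly like this (the tuner choice and measure options are placeholders, and the loop is commented since it needs the tuning tasks and a device):

```python
# Suggested AutoTVM budget per task (sketch; other options omitted):
tuning_option = {
    "n_trial": 2000,        # configurations to try per task
    "early_stopping": 800,  # stop a task if no improvement for this many trials
}

# for task in tasks:
#     tuner = autotvm.tuner.XGBTuner(task)
#     tuner.tune(n_trial=tuning_option["n_trial"],
#                early_stopping=tuning_option["early_stopping"],
#                measure_option=measure_option,
#                callbacks=[autotvm.callback.log_to_file("tuning.log")])
```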

IIRC, we cannot tune models with NHWC layout.
What kind of conversion did you use?


@eqy: Please look at the link where I have uploaded frozen models for HD MobilenetV1 and MobilenetV2. Each is a .pb file, and you just need to import it.

Github link:
A couple of things:

  1. Tensorflow natively doesn't support the NCHW format for its standard models, so there was no way for me to convert the standard models from NHWC to NCHW.
  2. I created a dummy network with 100 layers of depthwise convolution, in NHWC format and in NCHW format. The op count in NHWC was almost twice that of NCHW (as expected, because of an added transpose op after every layer). However, there was no significant difference in runtime between the two: 9.50 ms vs 9.30 ms.
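For concreteness, the conversion that gets inserted around each layer in that experiment is just a transpose; a minimal NumPy sketch with the HD input shape from the post above:

```python
import numpy as np

# HD MobileNet input in NHWC, as in the post above
x_nhwc = np.zeros((1, 1200, 1920, 3), dtype="float32")

# NHWC -> NCHW: move the channel axis to position 1
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))
print(x_nchw.shape)  # (1, 3, 1200, 1920)

# NCHW -> NHWC: the inverse permutation
x_back = np.transpose(x_nchw, (0, 2, 3, 1))
assert x_back.shape == x_nhwc.shape
```

NumPy transposes are views (no data copy), which is one plausible reason the extra transpose ops cost so little in the timing comparison; on GPU the cost depends on whether the framework materializes them.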

@merrymercy: Are you doing something like this ?

net, params = tvm.relay.frontend.from_tensorflow(graph_def, shape={'input': input_shape}, layout="NCHW")

To follow up on this thread: the TF converter now does not natively convert things back to NCHW. However, the community has been working on an internal layout conversion pass, https://github.com/dmlc/tvm/issues/3670, which could help convert the layout smartly without having to insert a conversion at every op.
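For anyone landing here later: that work became `relay.transform.ConvertLayout`. A rough sketch of its intended use (the exact API may differ across TVM versions, so the pass invocation is left commented):

```python
# Desired layouts per op; "default" keeps the kernel layout unchanged
desired_layouts = {"nn.conv2d": ["NCHW", "default"]}

# seq = tvm.transform.Sequential([
#     tvm.relay.transform.RemoveUnusedFunctions(),
#     tvm.relay.transform.ConvertLayout(desired_layouts),
# ])
# with tvm.transform.PassContext(opt_level=3):
#     mod = seq(mod)
```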

There is also an ongoing effort on better NHWC support.