HD Standard Models [Performance Issues]

Hi @eqy @tqchen @srkreddy1238

I am running inference on a standard MobilenetV1 on HD images with an input size of [1,1200,1920,3] (NHWC), and the performance is as below.

Inference times are in msec on a 1080 Ti GPU with CUDA 10.0, cuDNN 7.4.2 and TensorRT 5.0:

| Standard TF (CUDA + cuDNN) | TF + TensorRT (FP32) | TVM (without autotuning) | Autotuned TVM |
|---|---|---|---|
| 31 | 26 | 101 | 75 |

I let each autotuning task run for 1000 trials with early stopping at 400 trials.

In my view, autotuned TVM should at least match the standard TF inference time. Do you have any views on this? One reason could be that TVM is not able to use the cuDNN library. How do I ascertain that TVM is using the cuDNN backend?

Also, any other inputs would be appreciated !

Thanks

TVM should achieve better performance *without* cuDNN for this workload and GPU combination, because cuDNN does not provide the fused operators that graph-level optimization requires, while TVM's own CUDA backend does.
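For reference, whether cuDNN is used is controlled by the target string passed to `relay.build`; here is a minimal sketch (the `net`/`params` names are placeholders for whatever `from_tensorflow` returned, and the build call is left commented since it needs a GPU):

```python
# Selecting the backend via the target string (sketch):
target_cudnn = "cuda -libs=cudnn"  # offload supported ops to cuDNN
target_cuda = "cuda"               # TVM-generated CUDA kernels, operator fusion enabled

# with tvm.transform.PassContext(opt_level=3):
#     graph, lib, params = tvm.relay.build(net, target=target_cuda, params=params)
```

If `-libs=cudnn` is absent from the target, TVM will not call into cuDNN at all, so comparing the two target strings is one way to check which backend a measurement used.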

The NHWC data layout usually yields worse performance in this case; you can try converting to NCHW for GPU.

1000 trials with early stopping at 400 may be a bit thin for this setup; we would normally use something like 2000 trials with early stopping at 800 for GPU workloads.
If you share your model definition script, we can also try tuning it, as we have several 1080 Ti GPUs as well.
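As a sketch, those suggested budgets map onto the per-task AutoTVM tuning loop roughly like this (the tuner choice and measure options are placeholders, and the loop is commented since it needs the tuning tasks and a device):

```python
# Suggested AutoTVM budget per task (sketch; other options omitted):
tuning_option = {
    "n_trial": 2000,        # configurations to try per task
    "early_stopping": 800,  # stop a task if no improvement for this many trials
}

# for task in tasks:
#     tuner = autotvm.tuner.XGBTuner(task)
#     tuner.tune(n_trial=tuning_option["n_trial"],
#                early_stopping=tuning_option["early_stopping"],
#                measure_option=measure_option,
#                callbacks=[autotvm.callback.log_to_file("tuning.log")])
```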

IIRC, we cannot tune models with NHWC layout.
What kind of conversion did you use?


@eqy: Please look at the link where I have uploaded frozen models for HD MobilenetV1 and MobilenetV2. Each is a .pb file, and you just need to import it.

Github link:
A couple of things:

  1. Tensorflow natively doesn't support the NCHW format for its standard models, so there was no way for me to convert the standard models from NHWC to NCHW.
  2. I created a dummy network with 100 layers of depthwise convolution, in NHWC format and in NCHW format. The op count in NHWC was almost twice that of NCHW (as expected, because of an added transpose op after every layer). However, there was no significant difference in runtime between the two: 9.50 ms vs 9.30 ms.
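For concreteness, the conversion that gets inserted around each layer in that experiment is just a transpose; a minimal NumPy sketch with the HD input shape from the post above:

```python
import numpy as np

# HD MobileNet input in NHWC, as in the post above
x_nhwc = np.zeros((1, 1200, 1920, 3), dtype="float32")

# NHWC -> NCHW: move the channel axis to position 1
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))
print(x_nchw.shape)  # (1, 3, 1200, 1920)

# NCHW -> NHWC: the inverse permutation
x_back = np.transpose(x_nchw, (0, 2, 3, 1))
assert x_back.shape == x_nhwc.shape
```

NumPy transposes are views (no data copy), which is one plausible reason the extra transpose ops cost so little in the timing comparison; on GPU the cost depends on whether the framework materializes them.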

@merrymercy: Are you doing something like this ?

net, params = tvm.relay.frontend.from_tensorflow(graph_def, shape={'input': input_shape}, layout="NCHW")

To follow up on this thread: the TF converter now does not natively convert things back to NCHW. However, the community has been working on an internal layout conversion pass, https://github.com/dmlc/tvm/issues/3670, which could help convert the layout smartly without having to insert a conversion at every op.
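For anyone landing here later: that work became `relay.transform.ConvertLayout`. A rough sketch of its intended use (the exact API may differ across TVM versions, so the pass invocation is left commented):

```python
# Desired layouts per op; "default" keeps the kernel layout unchanged
desired_layouts = {"nn.conv2d": ["NCHW", "default"]}

# seq = tvm.transform.Sequential([
#     tvm.relay.transform.RemoveUnusedFunctions(),
#     tvm.relay.transform.ConvertLayout(desired_layouts),
# ])
# with tvm.transform.PassContext(opt_level=3):
#     mod = seq(mod)
```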

There is also an ongoing effort on better NHWC support.