lyq
April 27, 2018, 4:24am
1
I want to deploy a Cross & Deep network with NNVM on a server-class CPU, and I’m trying to build the graph incrementally.
As far as I know, the quality of the kernels has a big impact on overall performance.
So, what is the state of TOPI on server-class CPUs? How does it compare with OpenBLAS and NNPACK?
And could anyone give some advice on implementing the cross part? Should I write a custom operator?
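For context on the cross part: assuming the standard Deep & Cross Network cross layer, `x_{l+1} = x_0 * (x_l^T w_l) + b_l + x_l`, one way to avoid a custom operator would be to compose it from existing NNVM ops. A rough, untested sketch (the operator names follow the NNVM symbol API of that era and should be treated as assumptions):

```python
import nnvm.symbol as sym

def cross_layer(x0, xl, name):
    # Cross layer: x_{l+1} = x0 * (xl . w) + b + xl
    # (xl . w) is a per-sample scalar, computed here as a 1-unit dense layer
    # with no bias; the weight variable is created implicitly by NNVM.
    xlw = sym.dense(data=xl, units=1, use_bias=False, name=name + "_dense")
    cross = sym.broadcast_mul(x0, xlw)        # (batch, dim) * (batch, 1)
    b = sym.Variable(name + "_bias")          # learned bias vector, shape (dim,)
    return sym.elemwise_add(sym.broadcast_add(cross, b), xl)

# Stack a few cross layers on the same input x0.
x0 = sym.Variable("data")
x = x0
for i in range(3):
    x = cross_layer(x0, x, name="cross%d" % i)
```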
yzhliu
April 27, 2018, 6:19pm
2
For the question about performance on server-class CPUs: we made some optimizations for AVX2/AVX-512 recently. On an AWS EC2 c5.9xlarge instance, our solution (for ResNet-18/34/50/101/152, SSD, etc.) is about 1.7x faster than MXNet+MKL-DNN and 3x faster than MXNet+MKLML. Though not tested, I believe it brings even more speedup over OpenBLAS/NNPACK.
Will contribute it to TOPI soon.
While our optimization mainly focuses on convolution and inference, it shows the potential of TVM on CPUs.
How about SSE4.2? Our deployment target’s CPU (an Intel Atom) only supports SSE 4.2, no AVX. When can we see the new implementation? Thanks. We want to base our implementation on a single version of NNVM/TVM.
yzhliu
April 28, 2018, 12:23am
5
I’m not sure about the Intel Atom CPU, but I guess the basic optimization idea should be similar, though the SIMD width and cache sizes may differ, which implies different split block sizes.
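To illustrate the point about split sizes, here is a minimal sketch using the tensor expression API of NNVM-era TVM (pre-0.7 namespaces; the target strings and lane counts are illustrative): the same schedule idea works on both targets, only the inner split factor follows the SIMD width (16 fp32 lanes for AVX-512, 4 for SSE4.2).

```python
import tvm

def vectorized_add(n, simd_lanes):
    # Simple elementwise kernel; the inner loop is split by the SIMD width
    # of the target and mapped onto vector lanes.
    A = tvm.placeholder((n,), name="A", dtype="float32")
    B = tvm.placeholder((n,), name="B", dtype="float32")
    C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")
    s = tvm.create_schedule(C.op)
    outer, inner = s[C].split(C.op.axis[0], factor=simd_lanes)
    s[C].vectorize(inner)
    return s, [A, B, C]

# AVX-512 target: 512-bit registers -> 16 float32 lanes
s, args = vectorized_add(1 << 16, simd_lanes=16)
f_avx512 = tvm.build(s, args, target="llvm -mcpu=skylake-avx512")

# SSE4.2-only target (Atom-class CPU): 128-bit registers -> 4 float32 lanes
s, args = vectorized_add(1 << 16, simd_lanes=4)
f_sse42 = tvm.build(s, args, target="llvm -mcpu=silvermont")
```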
We have to send pull requests incrementally. We have made the necessary changes to NNVM and TVM:
master ← yzhliu:layout.0421 (opened 23 Apr 2018)
As title. Related: https://github.com/dmlc/nnvm/pull/447
master ← yzhliu:pool2d-general (opened 13 Apr 2018)
Enables pool2d to support an arbitrary layout, as long as there is an `H` and a `W` axis.
The … change here also assumes a layout convention (I will make a PR to NNVM): an upper-case `C` indicates an axis (e.g., channel), and lower case with a factor size, e.g. `16c`, indicates the split axis of `C` (a small numpy illustration follows below).
cc @kevinthesun @yidawang
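To make the convention concrete, here is a small numpy-only sketch of what the `NCHW16c` layout means in practice (the helper name is illustrative, not part of TVM/NNVM): `C` is split into chunks of 16, and the inner chunk becomes the last axis.

```python
import numpy as np

def nchw_to_nchw16c(x, factor=16):
    # NCHW -> NCHW16c: split C into (C // factor, factor) and move the
    # inner chunk to the last axis, giving shape (N, C//16, H, W, 16).
    n, c, h, w = x.shape
    assert c % factor == 0
    return x.reshape(n, c // factor, factor, h, w).transpose(0, 1, 3, 4, 2)

x = np.random.rand(1, 64, 56, 56).astype("float32")
print(nchw_to_nchw16c(x).shape)   # (1, 4, 56, 56, 16)
```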
master ← yzhliu:layout.0418 (opened 18 Apr 2018)
This PR allows NNVM to:
* Replace an operator via `AlterOpLayout`. For x86 it is more efficient to compute convolution in the `NCHW16c` layout. Here is an example of how it is used: https://github.com/yzhliu/topi-intel/blob/master/e2e_general_pack/schedule_pack/avx512_conv_fwd.py#L128-L155
Note that kernel (pre-)packing is supported as well: https://github.com/dmlc/nnvm/issues/303
* Infer and correct layout automatically
- Given the input data layout, or the layout of operators in the network (e.g., convolution, pooling...), the `InferCorrectLayout` pass can infer the layout for each operator (both inputs and outputs).
- If the required input layout of an operator is different from what it receives, a `__layout_transform__` operator will be inserted.
- Each operator registers a function `FInferLayout`. Once a model is imported, we run an `InferCorrectLayout` pass and store the *original layouts* for each operator. After `AlterOpLayout`, `InferCorrectLayout` runs again. This time each operator sees the *original layouts* it inferred before, which it can use to decide whether to keep them. For example, `softmax` still produces correct results after the input layout changes, while `flatten` does not. So `flatten` claims it needs the *original layout* and triggers a layout transform, which keeps the network producing correct results (see the small numpy sketch after this PR description).
- With this approach,
* the optimized layout (e.g., NCHW16c) can flow through the network as far as possible, with no pack/unpack overhead for each (e.g., convolution) operator;
* the model layout can be transparent to users. Even if a convolutional neural network is trained with the `NHWC` layout, users can still pass `NCHW` input as long as the input layout is specified; a layout transform happens automatically.
Moreover, the convolution kernel layout becomes much clearer: https://github.com/dmlc/nnvm/pull/372 . Now we have kernel layouts `OIHW` and `HWIO` for `NCHW` and `NHWC` respectively.
Will send a pull request for the corresponding TVM changes.
@yidawang @kevinthesun
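A small numpy-only sketch of why `flatten` needs the original layout while elementwise-style ops do not: flattening `NCHW` and `NCHW16c` data yields differently ordered feature vectors, so a layout transform back to `NCHW` has to be inserted before `flatten`.

```python
import numpy as np

x = np.arange(2 * 32 * 2 * 2).reshape(2, 32, 2, 2)           # NCHW
x_16c = x.reshape(2, 2, 16, 2, 2).transpose(0, 1, 3, 4, 2)   # NCHW16c

# Same values, but flattening gives a different feature ordering,
# so a downstream dense layer would see permuted inputs.
print(np.array_equal(x.reshape(2, -1), x_16c.reshape(2, -1)))  # False
```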
I hope I can PR our new schedules for ResNet this weekend.
lyq
April 28, 2018, 3:09am
6
How about the graph runtime implementation? Homebrewed, or the default graph_runtime?
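For reference, the default flow with the stock graph_runtime looks roughly like this (NNVM-era API; a minimal sketch with a toy graph standing in for the real network):

```python
import numpy as np
import nnvm.symbol as sym
import nnvm.compiler
import tvm
from tvm.contrib import graph_runtime

# Toy one-op graph; in practice this would be the imported model symbol.
data = sym.Variable("data")
net = sym.relu(data)

shape = {"data": (1, 3, 224, 224)}
graph, lib, params = nnvm.compiler.build(net, target="llvm", shape=shape)

# The default graph_runtime executes the compiled graph node by node.
module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.set_input("data", tvm.nd.array(
    np.random.rand(1, 3, 224, 224).astype("float32")))
module.run()
out = module.get_output(0, tvm.nd.empty((1, 3, 224, 224), "float32"))
```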