lyq
April 27, 2018, 4:24am
1
I want to deploy a Cross & Deep network with NNVM on a server-class CPU, and I’m trying to build the graph incrementally.
As far as I know, the quality of the kernels has a big impact on overall performance.
So, what is the state of TOPI on server-class CPUs? How does it compare with OpenBLAS and NNPACK?
And could anyone give some advice on implementing the cross part? Should I write a custom operator?
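For context on the cross part: assuming the standard Deep & Cross Network cross layer, `x_{l+1} = x_0 * (x_l^T w_l) + b_l + x_l`, one way to avoid a custom operator would be to compose it from existing NNVM ops. A rough, untested sketch (the operator names follow the NNVM symbol API of that era and should be treated as assumptions):

```python
import nnvm.symbol as sym

def cross_layer(x0, xl, name):
    # Cross layer: x_{l+1} = x0 * (xl . w) + b + xl
    # (xl . w) is a per-sample scalar, computed here as a 1-unit dense layer
    # with no bias; the weight variable is created implicitly by NNVM.
    xlw = sym.dense(data=xl, units=1, use_bias=False, name=name + "_dense")
    cross = sym.broadcast_mul(x0, xlw)        # (batch, dim) * (batch, 1)
    b = sym.Variable(name + "_bias")          # learned bias vector, shape (dim,)
    return sym.elemwise_add(sym.broadcast_add(cross, b), xl)

# Stack a few cross layers on the same input x0.
x0 = sym.Variable("data")
x = x0
for i in range(3):
    x = cross_layer(x0, x, name="cross%d" % i)
```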
yzhliu
April 27, 2018, 6:19pm
2
For the question about performance on server-class CPUs: we made some optimizations for AVX2/AVX-512 recently. On an AWS EC2 c5.9xlarge instance, our solution (for ResNet-18/34/50/101/152, SSD, etc.) is about 1.7x faster than MXNet+MKL-DNN and 3x faster than MXNet+MKLML. Though not tested, I believe it brings even more speedup over OpenBLAS/NNPACK.
Will contribute it to TOPI soon.
While our optimization mainly focuses on convolution and inference, it shows the potential of TVM on CPUs.
How about SSE4.2? Our deployment target’s CPU (an Intel Atom) only supports SSE 4.2, no AVX. When can we see the new implementation? Thanks. We want to base our implementation on a single version of NNVM/TVM.
yzhliu
April 28, 2018, 12:23am
5
I’m not sure about the Intel Atom CPU, but I guess the basic optimization idea should be similar, though the SIMD width and cache sizes may differ, which implies different split block sizes.
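To illustrate the point about split sizes, here is a minimal sketch using the tensor expression API of NNVM-era TVM (pre-0.7 namespaces; the target strings and lane counts are illustrative): the same schedule idea works on both targets, only the inner split factor follows the SIMD width (16 fp32 lanes for AVX-512, 4 for SSE4.2).

```python
import tvm

def vectorized_add(n, simd_lanes):
    # Simple elementwise kernel; the inner loop is split by the SIMD width
    # of the target and mapped onto vector lanes.
    A = tvm.placeholder((n,), name="A", dtype="float32")
    B = tvm.placeholder((n,), name="B", dtype="float32")
    C = tvm.compute((n,), lambda i: A[i] + B[i], name="C")
    s = tvm.create_schedule(C.op)
    outer, inner = s[C].split(C.op.axis[0], factor=simd_lanes)
    s[C].vectorize(inner)
    return s, [A, B, C]

# AVX-512 target: 512-bit registers -> 16 float32 lanes
s, args = vectorized_add(1 << 16, simd_lanes=16)
f_avx512 = tvm.build(s, args, target="llvm -mcpu=skylake-avx512")

# SSE4.2-only target (Atom-class CPU): 128-bit registers -> 4 float32 lanes
s, args = vectorized_add(1 << 16, simd_lanes=4)
f_sse42 = tvm.build(s, args, target="llvm -mcpu=silvermont")
```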
We have to send pull requests incrementally. We have made the necessary changes to NNVM and TVM:
master ← yzhliu:layout.0421 (opened 23 Apr 2018)
As title. Related: https://github.com/dmlc/nnvm/pull/447
master ← yzhliu:pool2d-general (opened 13 Apr 2018)
Enables pool2d to support an arbitrary layout, as long as there is an `H` and a `W` axis.
The … change here also assumes a layout convention (I will make a PR to NNVM): an upper-case `C` indicates an axis (e.g., channel), and lower case with a factor size, e.g. `16c`, indicates the split axis of `C` (a small numpy illustration follows below).
cc @kevinthesun @yidawang
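To make the convention concrete, here is a small numpy-only sketch of what the `NCHW16c` layout means in practice (the helper name is illustrative, not part of TVM/NNVM): `C` is split into chunks of 16, and the inner chunk becomes the last axis.

```python
import numpy as np

def nchw_to_nchw16c(x, factor=16):
    # NCHW -> NCHW16c: split C into (C // factor, factor) and move the
    # inner chunk to the last axis, giving shape (N, C//16, H, W, 16).
    n, c, h, w = x.shape
    assert c % factor == 0
    return x.reshape(n, c // factor, factor, h, w).transpose(0, 1, 3, 4, 2)

x = np.random.rand(1, 64, 56, 56).astype("float32")
print(nchw_to_nchw16c(x).shape)   # (1, 4, 56, 56, 16)
```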
master ← yzhliu:layout.0418 (opened 18 Apr 2018)
This PR allows NNVM to:
* Replace an operator via `AlterOpLayout`. For x86 it is more efficient to compute convolution in the `NCHW16c` layout. Here is an example of how it is used: https://github.com/yzhliu/topi-intel/blob/master/e2e_general_pack/schedule_pack/avx512_conv_fwd.py#L128-L155
Note that kernel (pre-)packing is supported as well: https://github.com/dmlc/nnvm/issues/303
* Infer and correct layout automatically
- Given the input data layout, or the layout of operators in the network (e.g., convolution, pooling...), the `InferCorrectLayout` pass can infer the layout for each operator (both inputs and outputs).
- If the required input layout of an operator is different from what it receives, a `__layout_transform__` operator will be inserted.
- Each operator registers a function `FInferLayout`. Once a model is imported, we run an `InferCorrectLayout` pass and store the *original layouts* for each operator. After `AlterOpLayout`, `InferCorrectLayout` runs again. This time each operator sees the *original layouts* it inferred before, which it can use to decide whether to keep them. For example, `softmax` still produces correct results after the input layout changes, while `flatten` does not. So `flatten` claims it needs the *original layout* and triggers a layout transform, which keeps the network producing correct results (see the small numpy sketch after this PR description).
- With this approach,
* the optimized layout (e.g., NCHW16c) can flow through the network as far as possible, with no pack/unpack overhead for each (e.g., convolution) operator;
* the model layout can be transparent to users. Even if a convolutional neural network is trained with the `NHWC` layout, users can still pass `NCHW` input as long as the input layout is specified; a layout transform happens automatically.
Moreover, the convolution kernel layout becomes much clearer: https://github.com/dmlc/nnvm/pull/372 . Now we have kernel layouts `OIHW` and `HWIO` for `NCHW` and `NHWC` respectively.
Will send a pull request for the corresponding TVM changes.
@yidawang @kevinthesun
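A small numpy-only sketch of why `flatten` needs the original layout while elementwise-style ops do not: flattening `NCHW` and `NCHW16c` data yields differently ordered feature vectors, so a layout transform back to `NCHW` has to be inserted before `flatten`.

```python
import numpy as np

x = np.arange(2 * 32 * 2 * 2).reshape(2, 32, 2, 2)           # NCHW
x_16c = x.reshape(2, 2, 16, 2, 2).transpose(0, 1, 3, 4, 2)   # NCHW16c

# Same values, but flattening gives a different feature ordering,
# so a downstream dense layer would see permuted inputs.
print(np.array_equal(x.reshape(2, -1), x_16c.reshape(2, -1)))  # False
```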
I hope I can PR our new schedules for ResNet this weekend.
lyq
April 28, 2018, 3:09am
6
How about the graph runtime implementation? Homebrewed, or the default graph_runtime?
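For reference, the default flow with the stock graph_runtime looks roughly like this (NNVM-era API; a minimal sketch with a toy graph standing in for the real network):

```python
import numpy as np
import nnvm.symbol as sym
import nnvm.compiler
import tvm
from tvm.contrib import graph_runtime

# Toy one-op graph; in practice this would be the imported model symbol.
data = sym.Variable("data")
net = sym.relu(data)

shape = {"data": (1, 3, 224, 224)}
graph, lib, params = nnvm.compiler.build(net, target="llvm", shape=shape)

# The default graph_runtime executes the compiled graph node by node.
module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.set_input("data", tvm.nd.array(
    np.random.rand(1, 3, 224, 224).astype("float32")))
module.run()
out = module.get_output(0, tvm.nd.empty((1, 3, 224, 224), "float32"))
```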