Automatically distributing mobilenet-v2-7

Hi everyone,

I’m trying to distribute “mobilenet-v2-7.onnx” across multiple CPU cores through TVM. I saw a sharding example in the “Sharding CONV Op” section here, but the sharding in that example is done manually.

Currently I am trying to use “R.dist.annotate_sharding()” to shard the tensor automatically, using a single convolution layer as a test case, but I get an error. Is it possible to use TVM to automatically shard the whole “mobilenet-v2-7.onnx”? The error message is:

InternalError: Check failed: (op_map_dist_infer_struct_info_.count(op)) is false:  Cannot find the dist.FInferStructInfo attribute registered to op: relax.nn.conv2d
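If I read the check correctly, it means that no distributed struct-info rule (“dist.FInferStructInfo”) is registered for relax.nn.conv2d in my build, so the sharding propagation cannot handle that op. A quick way to confirm this, assuming Op.get_attr returns None for an unregistered attribute:

import tvm

op = tvm.ir.Op.get("relax.nn.conv2d")
print(op.get_attr("dist.FInferStructInfo"))  # None here matches the error above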

My test case, a model with a single convolution layer, is:

from tvm.script import ir as I
from tvm.script import relax as R

@I.ir_module
class ConvolutionModule_1:
    I.module_attrs({"device_num": 2})
    I.module_global_infos(
        {
            "mesh": [
                R.device_mesh((2,), I.Range(0, 2)),  # mesh[0]: two devices
            ]
        }
    )

    @R.function
    def main(data: R.Tensor((1, 3, 224, 224), dtype="float32")) -> R.Tensor((1, 32, 112, 112), dtype="float32"):
        R.func_attr({"num_input": 1})
        with R.dataflow():
            # Ask for the input to be sharded along axis 1 (channels) over mesh[0]
            data = R.dist.annotate_sharding(data, device_mesh="mesh[0]", placement="S[1]")
            # metadata[...][0] is the conv weight constant from the printed metadata section
            lv: R.Tensor((1, 32, 112, 112), dtype="float32") = R.nn.conv2d(data, metadata["relax.expr.Constant"][0], strides=[2, 2], padding=[1, 1, 1, 1], dilation=[1, 1], groups=1, data_layout="NCHW", kernel_layout="OIHW", out_layout="NCHW", out_dtype="void")
            R.output(lv)
        return lv
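For completeness, this is roughly how I invoke the automatic sharding on the module above. PropagateSharding is the pass name I see under relax.distributed.transform in my checkout, and it is where the error is raised; treat the exact name (and the global-to-local lowering pass that would follow it) as assumptions that may differ between TVM versions:

import tvm
from tvm import relax

mod = ConvolutionModule_1
# PropagateSharding consumes the annotate_sharding hints and tries to infer
# a sharding for every op; it fails here because conv2d has no dist rule.
mod = relax.distributed.transform.PropagateSharding()(mod)
print(mod)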

Thanks for any help!

I don’t think multi-core CPUs need distributed inference, since every core can access the whole memory. Ordinary CPU parallelism is enough.
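For reference, a minimal sketch of single-process CPU inference with TVM’s thread pool; the ONNX frontend path, the input name "input", and the thread count are my assumptions, so adjust them to your model and build:

import os
os.environ["TVM_NUM_THREADS"] = "8"  # size of TVM's CPU thread pool; set before the first run

import numpy as np
import onnx
import tvm
from tvm import relax
from tvm.relax.frontend.onnx import from_onnx

onnx_model = onnx.load("mobilenet-v2-7.onnx")
mod = from_onnx(onnx_model, shape_dict={"input": (1, 3, 224, 224)})  # input name is an assumption
mod = relax.get_pipeline("zero")(mod)  # default relax lowering pipeline

ex = relax.build(mod, target="llvm")
vm = relax.VirtualMachine(ex, tvm.cpu())
x = tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"))
out = vm["main"](x)

The operator loops get parallelized across cores by TVM’s runtime thread pool, so no device mesh or sharding annotations are needed.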

Thanks for helping! I found that NCCL uses sockets together with CUDA and NVIDIA GPUs. I wonder if it’s possible to automatically distribute the convolution across multiple SoCs over a TCP/IP network, with the CPUs used only to execute the program?
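So far the closest built-in mechanism I have found is TVM’s RPC, which does use TCP/IP but drives one remote device per connection instead of sharding automatically. A rough sketch of what I mean, with the address, port, and file names as placeholders:

import tvm
from tvm import rpc

# On each SoC, start a server first, e.g.:
#   python -m tvm.exec.rpc_server --host 0.0.0.0 --port 9090
remote = rpc.connect("192.168.1.10", 9090)  # placeholder address of one SoC

remote.upload("mobilenet_arm.so")            # library cross-compiled for the SoC's CPU
lib = remote.load_module("mobilenet_arm.so")
dev = remote.cpu()

As far as I can tell, splitting a single convolution across several such servers would still need manual orchestration on top of this.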