Optimizing matrix multiplication for GPU

@comaniac, hopefully one last question: do you know how to save multiple modules into a single .so file? I want to save multiple versions of the function, as @tqchen suggested, but tvm.runtime.Module.export_library saves only one module, and tvm.contrib.cc.create_shared doesn’t link the host code with the target code the way export_library does. I am sure this can be done by hand with gcc or a similar tool, but I was wondering whether TVM already has a solution for this, given that it seems like a common use case.