Thanks!
I’m not familiar with the BitBLAS project. Please correct me if I’m wrong: in the code you showed, the IRModule pass that retrieves the thread block dimensions is get_annotated_device_mod.
I’m confused by how the CUDA source wrapper is initialized: an IRModule plus a source string is passed in? Don’t you typically get the source only after building the module?
Also, do you initialize the TileDevice class with remote.cl() or remote.cuda(), just as the TVM examples do?
Here’s a Python script that prints the generated source for a single conv2d (I omitted tuning for brevity). I still don’t know how to get the work-group sizes, though. Do you have any advice on how to use your method from BitBLAS here?
import numpy as np
import tvm
from tvm import relay, autotvm
import tvm.relay.testing

# Generate OpenCL device code, with an AArch64 Android host.
target_str = "opencl"
target = tvm.target.Target(target_str, host="llvm -mtriple=aarch64-linux-android")
dtype = "float16"
input_name = "input"
filter_name = "weight"
input_shape = (1, 25, 25, 64)  # NHWC
filter_shape = (3, 3, 64, 96)  # HWIO

# Random filter values, bound as a constant parameter at build time.
filter = np.random.rand(*filter_shape).astype(dtype)
input = tvm.relay.var("input", shape=input_shape, dtype=dtype)
weight = tvm.relay.var("weight", shape=filter_shape, dtype=dtype)
D = relay.nn.conv2d(input, weight, padding=(0, 0), data_layout="NHWC", kernel_layout="HWIO", out_dtype=dtype)
mod = relay.Function([input, weight], D)
params = {
    "weight": tvm.nd.array(filter)
}

with tvm.transform.PassContext(opt_level=3):
    graph, lib, params = relay.build_module.build(mod, target, params=params)

# The first imported module is the device module; print its kernel source.
print(lib.imported_modules[0].get_source())