Question about dlight: dl.gpu.Matmul() and Tensor Cores

I have seen dlight used in mlc-llm, but the default dl.gpu.Matmul() rule does not seem to use NVIDIA's Tensor Cores, which makes matrix multiplication quite slow. Are there any suggestions for getting dlight to schedule matmuls on Tensor Cores?