I have seen dlight in mlc-llm, but the default dl.gpu.Matmul() rule does not seem to use NVIDIA Tensor Cores, which makes matrix multiplication quite slow. Any good suggestions?