Best way to deal with kernel layout?

If you really want to add an op, I'd just call it matmul. An even better version would be a single matmul that supports all 4 possible transpose combinations (neither, A only, B only, or both inputs transposed), with dense being just one of them, but that would need many changes across the code base.
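To make the idea concrete, here is a rough sketch in plain Python (not actual TVM code). The names `matmul`, `transpose_a`, `transpose_b`, and the `dense` wrapper are hypothetical and only for illustration. It also assumes dense computes `data x weight^T` with the weight stored as (out_features, in_features), which would make it the `transpose_b=True` case.

```python
def transpose(m):
    """Return the transpose of a 2-D list-of-lists matrix."""
    return [list(row) for row in zip(*m)]


def matmul(a, b, transpose_a=False, transpose_b=False):
    """Hypothetical unified op covering all 4 transpose combinations."""
    if transpose_a:
        a = transpose(a)
    if transpose_b:
        b = transpose(b)
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    # Each output element is the dot product of a row of a and a column of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def dense(data, weight):
    """Assumed dense semantics: data x weight^T, weight is (out, in)."""
    return matmul(data, weight, transpose_a=False, transpose_b=True)


data = [[1, 2, 3], [4, 5, 6]]    # shape (batch=2, in=3)
weight = [[1, 0, 1], [0, 1, 0]]  # shape (out=2, in=3)
print(dense(data, weight))
```

With an op like this, dense would not need its own kernel definition; it would just pick one of the four flag combinations.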

cc @tqchen