[Discussion] About the weight layout of Dense/BatchMatmul

Currently TVM uses a transposed weight layout (i.e., the computation is an [N, K] data tensor times an [M, K] weight tensor) for both Dense and BatchMatmul.

In TensorFlow, however, the default weight layout is not transposed ([K, M]), so we end up with extra transpose ops in Relay models imported from TensorFlow. In most cases these transposes can easily be merged by constant folding, but in some models like BERT the second input of BatchMatmul (usually treated as the weight) is not a constant parameter, which means it cannot be simplified directly. We have seen performance problems caused by this.
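
For illustration, this is roughly the pattern that shows up after import when the second operand is not a constant (made-up shapes, a sketch rather than the exact frontend output):

```python
from tvm import relay

# Made-up shapes; a sketch of the imported pattern, not the exact frontend output.
x = relay.var("x", shape=(12, 128, 64))   # [batch, M, K]
y = relay.var("y", shape=(12, 64, 128))   # [batch, K, N], TensorFlow layout
# TVM's batch_matmul expects the second operand in [batch, N, K] layout, so an
# explicit transpose has to be inserted. When y is not a constant (as in BERT),
# constant folding cannot remove it.
out = relay.nn.batch_matmul(x, relay.transpose(y, axes=[0, 2, 1]))
print(relay.Function([x, y], out))
```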

Another point of confusion: in my understanding, K is the reduction axis, and having K as the innermost dimension of both operands is not friendly for vectorization.
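
To make that concrete, here is a rough TE sketch of the current dense compute (not the exact topi code, shapes made up): the reduction axis k is the innermost dimension of both data and weight.

```python
from tvm import te

# Rough sketch of the current dense compute (not the exact topi code).
# data: [N, K], weight: [M, K] (transposed layout), output: [N, M].
# K is both the reduction axis and the innermost dimension of both operands.
N, M, K = 128, 128, 128
data = te.placeholder((N, K), name="data")
weight = te.placeholder((M, K), name="weight")
k = te.reduce_axis((0, K), name="k")
dense = te.compute(
    (N, M),
    lambda i, j: te.sum(data[i, k] * weight[j, k], axis=k),
    name="dense",
)
```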

I guess the design of using a transposed weight came from MXNet? Since such an implementation does have some inconvenience, would the community agree to extend these ops to support a non-transposed weight directly? (E.g. adding extra attributes saying whether the input or the weight is transposed.)

cc @tqchen @junrushao @comaniac @FrozenGene

I would support the approach of adding a new matmul op with attributes indicating whether each of its inputs is transposed, since we already have batch_matmul. After that, dense could be an alias of matmul with (input not transposed, weight transposed).
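
To sketch what that could look like (the attribute names transpose_a / transpose_b and the helper below are just illustrative assumptions, not an existing TVM op):

```python
from tvm import te

# Hypothetical sketch only: the attribute names and this helper are assumptions
# for illustration, not an existing TVM API.
def matmul_compute(a, b, transpose_a=False, transpose_b=True):
    # a: [M, K] (or [K, M] if transpose_a), b: [K, N] (or [N, K] if transpose_b)
    K = a.shape[0] if transpose_a else a.shape[1]
    M = a.shape[1] if transpose_a else a.shape[0]
    N = b.shape[0] if transpose_b else b.shape[1]
    k = te.reduce_axis((0, K), name="k")
    return te.compute(
        (M, N),
        lambda i, j: te.sum(
            (a[k, i] if transpose_a else a[i, k])
            * (b[j, k] if transpose_b else b[k, j]),
            axis=k,
        ),
        name="matmul",
    )

# dense would then just be the (transpose_a=False, transpose_b=True) case:
A = te.placeholder((32, 64), name="A")   # [M, K]
W = te.placeholder((16, 64), name="W")   # [N, K], the current dense layout
out = matmul_compute(A, W, transpose_a=False, transpose_b=True)  # [32, 16]
```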

I think @altanh and @tkonolige have had similar findings recently too.

Yeah, this limitation of relay nn.dense has bitten us lately in terms of utilizing contrib BLAS libraries. Notably, we observed a large perf hit in BERT training workloads due to a combination of (1) bad BLAS library performance on some NT kernels (i.e. transa=False, transb=True) and (2) the need for 3 explicit transpose ops to compute the gradient of nn.dense. I fully support introducing either a new gemm op which supports transpose as attributes or updating the existing operator (internally we have a nn.gemm workaround op).
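
As a rough sketch (not the exact TVM gradient code): nn.dense(A, B) computes A · Bᵀ with data [N, K] and weight [M, K], so expressing both gradients in terms of nn.dense itself is where the three transposes come from.

```python
from tvm import relay

# Rough sketch, not the exact TVM gradient code. Made-up shapes.
data = relay.var("data", shape=(32, 768))          # [N, K]
weight = relay.var("weight", shape=(512, 768))     # [M, K]
grad_out = relay.var("grad_out", shape=(32, 512))  # [N, M]

# d_data = grad_out * weight  -> dense needs the weight transposed (1 transpose)
grad_data = relay.nn.dense(grad_out, relay.transpose(weight, axes=[1, 0]))
# d_weight = grad_out^T * data -> dense needs both operands transposed (2 more)
grad_weight = relay.nn.dense(
    relay.transpose(grad_out, axes=[1, 0]),
    relay.transpose(data, axes=[1, 0]),
)
```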

Thanks @altanh! Would you like to send PRs to extend them in the TVM main branch?

Otherwise, I am going to work on this following @comaniac's suggestion. :smiley:

The upstreaming might take a while (and currently I don’t have TVM codegen support/schedules for the transposed matmul ops, just calling into contrib BLAS libraries). Maybe you can move ahead with adding matmul? Thanks!

Should we also extend qnn.dense to a qnn.matmul to handle the transpose in quantized models? We are facing similar issues with quantized BERT models.

cc @comaniac @jcf94