There seems to be no need for the GEMM module to have an accumulator matrix with 32-bit integer values.
When I looked into the TensorGemm.scala code, I observed that when values are stored from the accumulator matrix to the output matrix, they are truncated from int32 to int8. This seems redundant, because the same output could be achieved even if the accumulator values were of type int8.
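To make the behaviour I am describing concrete, here is a minimal sketch of what happens at store time. This is a simplified plain-Scala model with hypothetical names (`acc`, `out`), not the actual Chisel code in TensorGemm.scala:

```scala
// Simplified plain-Scala model of the store path described above.
// This is NOT the actual Chisel code in TensorGemm.scala; it only
// mimics the datatypes involved (names are my own).
object TruncateOnStore extends App {
  // 32-bit accumulator holding a value that no longer fits in int8
  val acc: Int = 1000

  // The store path narrows the value to int8, keeping only the low 8 bits
  val out: Byte = acc.toByte

  println(s"acc (int32) = $acc, out (int8) = $out") // prints 1000 and -24
}
```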
There also seem to be some inadequacies in the matrix multiplication tutorial. The tutorial states that the accumulator width is set to 32 bits in order to avoid overflow during accumulation. However, in the same tutorial, due to the memory-store restrictions of the VTA architecture, the output matrix can only be stored to DRAM when its datatype matches that of the input (int8). The accumulator values are therefore truncated from int32 to int8, which eventually reintroduces the very overflow issue the wider accumulator was meant to avoid, making the claim that a 32-bit accumulator is needed appear redundant.
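To illustrate why the stored values come out the same either way, here is a small sketch of my own (not tutorial code), assuming wraparound (modulo-256) semantics for the 8-bit arithmetic: accumulating in int32 and truncating at the end yields the same low 8 bits as truncating every intermediate value to int8.

```scala
// Hypothetical dot product over int8 inputs, evaluated two ways.
// Values and object name are my own; this is not tutorial code.
object AccWidthComparison extends App {
  val a: Seq[Byte] = Seq(100, 120, -90, 50).map(_.toByte)
  val b: Seq[Byte] = Seq( 90, 110,  70, 60).map(_.toByte)

  // Current behaviour: accumulate in int32, truncate to int8 on store.
  val acc32: Int  = a.zip(b).map { case (x, y) => x * y }.sum
  val out32: Byte = acc32.toByte

  // Proposed behaviour: keep every intermediate value in int8,
  // relying on wraparound (modulo-256) arithmetic.
  val out8: Byte = a.zip(b)
    .map { case (x, y) => (x * y).toByte }          // truncate each product
    .foldLeft(0.toByte)((s, p) => (s + p).toByte)   // truncate each partial sum

  println(s"int32 accumulate, then truncate: $out32") // -44
  println(s"int8 accumulate with wraparound: $out8")  // -44 (same low 8 bits)
}
```

Since truncation to int8 preserves values modulo 256 and addition is compatible with that, the two paths agree on the final stored bits.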
This redundancy could be removed by setting the accumulator datatype to int8 as well. This would give the same results as the current GEMM module, would shrink the storage needed for the accumulator by 4x (8 bits instead of 32 bits per element), and would also improve the performance of both the multiplication and the subsequent pipelined addition.
More generally, the datatype of all matrices could be set to a single value (equal to the output datatype) for the sake of simplicity and functionality.