Hi,
I am reading “How to optimize GEMM on CPU — tvm 0.10.dev0 documentation” and was wondering whether the Array Packing optimization can be expressed more easily and efficiently using `cache_read`:
- more easily, because we don’t need an extra `te.compute` expression for the packed `B`. Instead we only add the following lines to the schedule:

      B_packed = s.cache_read(B, "global")
      s[B_packed].compute_at(s[C], no)

- more efficiently, because matrix `B` is not packed as a whole in memory, but only for the subpart that is accessed in the `no` loop, thus reducing memory consumption.
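To make the second point concrete, here is a small pure-Python sketch of what I have in mind. It does not use TVM at all. The sizes `M`, `K`, `N`, `bm`, `bn` and the helpers `pack_tile` and `gemm` are made up for illustration, and I am assuming the packed layout `packed[no][k][ni]` from the tutorial. It compares packing all of `B` up front with packing only the block read by one `no` iteration.

```python
# Pure-Python sketch (no TVM): whole-matrix packing vs. packing one tile per `no` step.
M, K, N = 4, 4, 8      # hypothetical small GEMM sizes
bm, bn = 2, 2          # hypothetical tile sizes for rows of A and columns of B

A = [[m + k for k in range(K)] for m in range(M)]
B = [[k * N + n for n in range(N)] for k in range(K)]


def pack_tile(no):
    """Copy the K x bn column block of B that one `no` iteration reads."""
    return [[B[k][no * bn + ni] for ni in range(bn)] for k in range(K)]


def gemm(pack_per_tile):
    C = [[0] * N for _ in range(M)]
    copied = 0  # total number of elements of B copied into packed buffers
    whole = None
    if not pack_per_tile:
        # Layout packed[no][k][ni], built once before the loop nest.
        whole = [pack_tile(no) for no in range(N // bn)]
        copied += K * N
    for mo in range(M // bm):
        for no in range(N // bn):
            if pack_per_tile:
                tile = pack_tile(no)  # only K * bn elements are live at once
                copied += K * bn
            else:
                tile = whole[no]
            for mi in range(bm):
                m = mo * bm + mi
                for k in range(K):
                    for ni in range(bn):
                        C[m][no * bn + ni] += A[m][k] * tile[k][ni]
    buffer_size = K * bn if pack_per_tile else K * N
    return C, buffer_size, copied


C_whole, size_whole, copied_whole = gemm(pack_per_tile=False)
C_tiled, size_tiled, copied_tiled = gemm(pack_per_tile=True)
```

In this sketch the per-tile variant needs a buffer of `K * bn` elements instead of `K * N`, but because `no` sits inside the `mo` loop, each tile is copied again for every `mo` iteration. I would expect the same trade-off if `compute_at` attaches the packing stage at `no` under an outer `mo` loop, but I have not measured it. Also, as far as I can tell `cache_read` expects the list of reader stages as a third argument, so my two schedule lines above may need something like `[C]` added.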
Many thanks in advance!