Hi,
I am reading “How to optimize GEMM on CPU — tvm 0.10.dev0 documentation” and was wondering whether the Array Packing optimization can be expressed more easily and efficiently using `cache_read`:
- more easily, because we don’t need an extra `te.compute` expression for the packed B. Instead, we only add the following lines to the schedule:

      B_packed = s.cache_read(B, "global", [C])  # readers argument added; cache_read expects it
      s[B_packed].compute_at(s[C], no)
- more efficiently, because matrix `B` is not packed as a whole in memory, but only for the subpart that is accessed in the `no` loop, thus reducing memory consumption.
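
To make the memory argument concrete, here is a minimal plain-Python sketch (not TVM code) comparing the two strategies. The sizes `K`, `N`, `bn` and the helpers `pack_whole` and `pack_block` are made up for illustration; the packed layout `[N/bn][K][bn]` follows the tutorial's packed B. It only counts elements and does not show what `cache_read` would actually generate.

```python
# Plain-Python illustration (not TVM): sizes and helper names are hypothetical.
K, N, bn = 4, 8, 4  # B is K x N; bn is the tile width along N

B = [[k * N + n for n in range(N)] for k in range(K)]

def pack_whole(B, bn):
    """Pack all of B up front as packed[no][k][ni] = B[k][no * bn + ni]."""
    K, N = len(B), len(B[0])
    return [[[B[k][no * bn + ni] for ni in range(bn)] for k in range(K)]
            for no in range(N // bn)]

def pack_block(B, no, bn):
    """Pack only the bn columns used by one iteration of the outer `no` loop."""
    return [[B[k][no * bn + ni] for ni in range(bn)] for k in range(len(B))]

whole = pack_whole(B, bn)   # lives for the whole computation
block = pack_block(B, 1, bn)  # would be rebuilt for each `no` iteration

whole_elems = sum(len(row) for blk in whole for row in blk)  # K * N elements
block_elems = sum(len(row) for row in block)                 # K * bn elements
print(whole_elems, block_elems)
```

With these sizes the full packed copy holds `K * N` elements, while the per-`no` copy holds `K * bn`, which is the saving I have in mind.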
Many thanks in advance!