Implementing Array Packing via `cache_read`

Hi,

I am reading “How to optimize GEMM on CPU — tvm 0.10.dev0 documentation” and was wondering whether the Array Packing optimization can be expressed more easily and efficiently using cache_read:

  • more easily, because we don’t need an extra te.compute expression for the packed B. Instead we only add the following lines to the schedule:

    B_packed = s.cache_read(B, "global", [C])  # cache_read also takes the list of reader tensors
    s[B_packed].compute_at(s[C], no)
    
  • more efficiently, because matrix B is not packed as a whole in memory, but only the part that is accessed in each iteration of the loop over no, thus reducing memory consumption.

Many thanks in advance!

I think te.compute is still needed if array packing changes the layout: we need te.compute to describe how the data is packed. Even with te.compute, we can still use compute_at to avoid packing the whole array.
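
To make the layout point concrete, here is a minimal sketch in plain Python (no TVM calls, so none of it is TVM API). The helper names pack_block and matmul_blockwise and the tiny sizes are made up for illustration. It uses the index mapping from the tutorial as I remember it, packedB[block, k, j] = B[k, block * bn + j], and packs only one column block per outer iteration, which is roughly what attaching the packing stage at the no loop with compute_at would do.

```python
# Plain-Python stand-in for the TVM schedule; no TVM calls are used here.
K, N, bn = 4, 8, 4  # small illustrative sizes; bn is the tile width along N

# B is K x N, stored as a list of rows.
B = [[k * N + n for n in range(N)] for k in range(K)]


def pack_block(B, block, bn):
    """Pack one bn-wide column block of B into a [K][bn] buffer.

    Uses the mapping packed[k][j] = B[k][block * bn + j], but materializes
    only the block that the current outer iteration needs.
    """
    return [[B[k][block * bn + j] for j in range(bn)] for k in range(len(B))]


def matmul_blockwise(A, B, bn):
    """Compute C = A @ B, packing B one column block at a time."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for no in range(N // bn):          # outer loop over column blocks
        Bp = pack_block(B, no, bn)     # packing happens inside the no loop
        for m in range(M):
            for k in range(K):
                for ni in range(bn):   # innermost loop reads Bp contiguously
                    C[m][no * bn + ni] += A[m][k] * Bp[k][ni]
    return C


A = [[m + k for k in range(K)] for m in range(3)]
C = matmul_blockwise(A, B, bn)

# Reference result computed directly from the unpacked B.
C_ref = [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
         for m in range(3)]
```

The index expression inside pack_block is the part that has to be stated explicitly somewhere, which is the role te.compute plays in the tutorial. Only a K x bn buffer is live at any time, so the memory saving from the question is still available.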