@egy,
Now we have: Body, Zero, Update
- Body (mandatory): does the computation, with the accumulators first initialized to zero.
- Zero (can be None): only initializes the accumulators to zero, no computation.
- Update (can be None): does the computation without any init (only accumulates).
Cases:
- In case Zero=None, a Body() followed by Update() will be issued.
- In case Update=None, only Body() is used everywhere.
See also: Update rule for tensorize
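To make the semantics of the three hooks above concrete, here is a minimal toy model in plain Python, including the proposed optional Store. All names are illustrative only and do not mirror any real TVM API; the "hidden accumulators" are just a Python list.

```python
# Toy model of the tensorize hooks discussed above: Body, Zero, Update,
# plus the proposed optional Store. Illustrative names, not a TVM API.

def make_intrinsic(n):
    acc = [0] * n  # hidden accumulators

    def zero():
        # Zero: only initialize the accumulators, no computation.
        for i in range(n):
            acc[i] = 0

    def update(a, b):
        # Update: accumulate without any init.
        for i in range(n):
            acc[i] += a[i] * b[i]

    def body(a, b):
        # Body: init to zero, then compute (Zero followed by Update).
        zero()
        update(a, b)

    def store(dst):
        # Proposed Store: flush the hidden accumulators to a destination buffer.
        for i in range(n):
            dst[i] = acc[i]

    return body, zero, update, store

body, zero, update, store = make_intrinsic(4)
body([1, 2, 3, 4], [1, 1, 1, 1])    # first reduction step: init + compute
update([1, 2, 3, 4], [1, 1, 1, 1])  # further steps: accumulate only
out = [0] * 4
store(out)                          # out == [2, 4, 6, 8]
```

The point of the sketch is that the accumulators never leave the intrinsic except through Store, which is exactly the "final memory destination" step being asked about.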
Question:
May I implement a separate Store() (as an optional 4th hook) in a PR (plus the corresponding test cases)?
Imagine hardware that needs a separate Store step as the final nail: copying from hidden accumulators to a final memory destination.
I think it would be useful for many hardware targets; at the moment I need it for tensorization on MARLANN.
I am afraid I don’t understand what the Store() would be used for.
Here is the basic workflow for a CUDA / Tensor Core GEMM or Conv2d:
- Load data from global memory to shared memory
- Load from shared memory to local memory (registers)
- Do the computation and cache the result in registers
- Write back from registers to global memory (or shared memory)
In this case, if we use Tensor Core, we need a special instruction for step 4, and we can just tensorize that copy step to do it. (Please see
the tutorial for details: https://tvm.apache.org/docs/tutorials/optimize/opt_conv_tensorcore.html)
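The four-step pipeline listed above can be sketched in plain Python (no CUDA), using a tiled GEMM where nested lists stand in for the memory hierarchy. Buffer and function names are illustrative only, not TVM or CUDA APIs.

```python
# Toy sketch of the four-step Tensor Core GEMM pipeline, modelled with
# plain Python lists. Illustrative only; no real GPU memory is involved.

def tiled_gemm(A, B, tile=2):
    n = len(A)
    C_global = [[0] * n for _ in range(n)]      # "global memory" output
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            # Step 3 analogue: accumulate the tile result in "registers".
            acc = [[0] * tile for _ in range(tile)]
            for k0 in range(0, n, tile):
                # Steps 1-2 analogue: stage tiles of A and B into fast memory.
                a_tile = [[A[i0 + i][k0 + k] for k in range(tile)]
                          for i in range(tile)]
                b_tile = [[B[k0 + k][j0 + j] for j in range(tile)]
                          for k in range(tile)]
                for i in range(tile):
                    for j in range(tile):
                        for k in range(tile):
                            acc[i][j] += a_tile[i][k] * b_tile[k][j]
            # Step 4 analogue: write the accumulated tile back to "global
            # memory" -- this is the copy one would tensorize on Tensor Core.
            for i in range(tile):
                for j in range(tile):
                    C_global[i0 + i][j0 + j] = acc[i][j]
    return C_global
```

Step 4 is a pure copy out of the accumulators, which is why it can be tensorized separately from the compute intrinsic.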
I’m not sure about the use case in MARLANN, and I cannot imagine what the Store
step would look like. It would be great if you could provide more information or an example. Thank you!
@Hzfengsy,
- Attaching cache_write() to the schedule does the trick.
Thank you for pointing me to the right place !
@Hzfengsy
If we use tensorized store copy instructions in step 4, does that mean we can’t fuse the GEMM with other injective ops?
By the way, I don’t want to use shared memory as intermediate storage for the result, since that would limit the amount of shared memory that can be used (-> limiting compute intensity).