What about using VTA on a DSP?

Hello all, I’m thinking about using VTA on a DSP and writing the GEMM and ALU kernels manually.
The reason is that VTA already has the mechanisms for using DMA and dispatching work from the host. But I have a question:
How many operators can be mapped onto the basic GEMM and ALU operations? According to the VTA paper, the first conv layer, max pooling, and the fc layer run on the CPU. Is this because they cannot be composed from GEMM and ALU operations? I also wonder whether softmax runs on the CPU. Would the same limitation apply on a DSP?