VTA Performance: FPS and Power

Hi Guys

Does VTA have performance benchmarks for FPS and power consumption on ImageNet?

Thanks

It depends on the FPGA, but on the Pynq (which uses a 2012-era SoC), the performance is about 0.4 s per inference at ~2 W on ResNet-18. You can reproduce this example if you have a Pynq board.

We are working on a list of hardware and software changes that will lower this inference time drastically, and also are supporting a much more recent FPGA which should deliver much improved performance. Stay tuned!

Our target is 200 GOPS at 8-bit inference for the Pynq, and 1 TOPS at 8-bit inference on the Ultra-96. We’ll make announcements when these performance targets are met. In addition, this project is community driven and open source; we welcome contributions!
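As a rough back-of-envelope, those throughput targets can be translated into frame rates. The sketch below assumes ResNet-18 costs roughly 1.8 GMACs (~3.6 GOPs) per 224x224 image; that figure is a commonly cited estimate, not a number from this thread.

```python
# Back-of-envelope: mapping the GOPS targets to ResNet-18 frame rate.
# ASSUMPTION (not from this thread): ResNet-18 costs ~1.8 GMACs, i.e.
# ~3.6 GOPs, per 224x224 image.
GOPS_PER_INFERENCE = 3.6

def ideal_fps(gops_target):
    """Ideal frames/sec if the target throughput is fully sustained."""
    return gops_target / GOPS_PER_INFERENCE

pynq_fps = ideal_fps(200)      # Pynq target: roughly 55 FPS
ultra96_fps = ideal_fps(1000)  # Ultra-96 target: roughly 278 FPS
```

Real frame rates would of course be lower, since no design sustains its peak throughput across a full network.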

Thanks a lot :slight_smile:

Hi Thierry

Could you please tell me more about the dataflow strategy used in VTA? For example, row-stationary, weight-stationary, or some other dataflow strategy?

Thanks

As far as I know, you can program VTA to implement any of these strategies, depending on how and where you want to cache the rows and weights.
Tianqi

You can think of data-reuse strategy at two levels: at the scheduling level you can decide what data to reuse (weights, activations, accumulators) in TVM. At the hardware level, however, the implementation of the GEMM is determined by HLS. Right now we have the hardware setup so that at every GEMM invocation, new input, weights, and accumulation tensors are loaded from SRAMs.
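A toy model of those two levels, in plain Python (purely illustrative, not VTA code): the "hardware" primitive reloads its tiles on every invocation, while the "schedule" decides to keep the weight tile resident across invocations.

```python
# "Hardware" level: each GEMM invocation loads fresh input, weight, and
# accumulator tiles, mirroring the HLS setup where nothing is kept
# stationary between invocations.
def gemm_invocation(inp_tile, wgt_tile, acc_tile):
    n, k, m = len(inp_tile), len(wgt_tile), len(wgt_tile[0])
    for i in range(n):
        for j in range(m):
            for kk in range(k):
                acc_tile[i][j] += inp_tile[i][kk] * wgt_tile[kk][j]
    return acc_tile

# "Scheduling" level: the schedule decides what to reuse. Here it keeps one
# weight tile resident and streams input row-tiles past it: weight reuse
# chosen purely in software, with the hardware primitive unchanged.
wgt = [[1, 2], [3, 4]]                 # 2x2 weight tile, reused every call
rows = [[[1, 0]], [[0, 1]], [[1, 1]]]  # three 1x2 input row-tiles
outs = [gemm_invocation(r, wgt, [[0, 0]]) for r in rows]
```

In TVM the analogous choice is made with scheduling primitives that stage tensors into on-chip buffers; the sketch only shows the reuse decision, not the real API.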

Hope this answers your question.

Hi @thierry, @tqchen

Thanks for your replies. This helps a lot.

I would like to compare VTA’s performance with MIT’s Eyeriss using mW / (FPS × GMACs) as a power-efficiency metric.
For Eyeriss, it’s about 94 mW / (17 FPS (224x224) × 2.66 GMACs) ≈ 2.08, based on their paper.
For VTA, it’s about 2000 mW / (2.5 FPS (224x224) × 1 GMAC (my estimation)) ≈ 800.
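Spelling out that arithmetic as a quick sanity check (as noted above, the VTA GMACs figure is my own estimate):

```python
# Figure of merit used above: power normalized by frame rate and per-frame
# compute, i.e. mW / (FPS * GMACs). Lower is better.
def mw_per_fps_gmac(power_mw, fps, gmacs):
    return power_mw / (fps * gmacs)

eyeriss = mw_per_fps_gmac(94, 17, 2.66)  # ~2.08 (numbers from the Eyeriss paper)
vta = mw_per_fps_gmac(2000, 2.5, 1.0)    # ~800 (GMACs figure is my estimate)
```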

If the data above is not correct, please let me know.

Do you think adding a dataflow strategy at the hardware (HLS) level would help VTA improve its power-consumption performance?

Thanks.

Very good question, Zack. The apples-to-apples comparison is a little tricky in this context, since we are comparing different systems based on different experimental setups.

VTA here is running on a legacy FPGA (ca. 2012), and we’ve programmed a small design clocked at 100 MHz on it to ship with the alpha version of VTA. In addition, we are measuring wall-clock time of full inference, including network communication over RPC; this surely adds overheads that are not accounted for in the Eyeriss evaluation.

So overall, on paper, the Eyeriss ASIC does look significantly better, and that’s certainly expected. With VTA you get a full stack to experiment with, and there are no hidden costs, since we let you measure end-to-end inference on a complete system. So the advantage lies in being able to do full-system evaluation.

And as I said earlier, there is plenty of room for improvement both in the software and hardware stack, so expect to see numbers go up soon :slight_smile:

Thierry

Hi Thierry,

Regarding the apples-to-apples comparison with Eyeriss, I am wondering if you have a rough estimate of VTA’s performance/power under the following assumptions:

  1. Assume VTA is implemented with the same foundry technology as Eyeriss
  2. Assume VTA has the same number of ALU / GEMM cores as Eyeriss
  3. Assume VTA has the same buffer size as Eyeriss across the memory hierarchy
  4. If a NoC were available for VTA, assume VTA has the same NoC capacity as Eyeriss

Please add your own assumptions to the list, as long as they make sense for a fair comparison.

Thanks very much :slight_smile:

I can’t give you a quotable assessment, since we haven’t done a close comparison. It would require pushing the RTL through Synopsys toolchains to get a fairly accurate power estimate. Is the Eyeriss RTL open source, so that we could get accurate power estimates for it as well?

Hi Thierry,

I don’t think Eyeriss is open source.

There’s an interesting paper (https://arxiv.org/abs/1805.02566) which uses the open-source cost model MAESTRO to analyze the performance/power/area of various accelerators, including Eyeriss and MAERI (from the same authors as MAESTRO).

It’s claimed that with MAESTRO, no RTL source code is required for the accelerator; only a dataflow description is needed as input.

Would it be possible to do the same assessment for VTA with MAESTRO, or will VTA have its own tool for design space exploration?

Thanks :slight_smile:

Thanks @kevinyuan for the reference. MAESTRO is a very cool tool, and it can help us assess different dataflow implementations of GEMM/GEVM that could be instantiated within a VTA-like design. I’m not aware of anyone using MAESTRO to analyze VTA; I’d certainly be interested to learn what insights it can yield on our design.

One of the issues is that VTA does not keep weights, activations, or accumulations stationary between GEMM invocations, for ease of compilation and for flexibility. That said, within a GEMM invocation we can rely on stationary weights, activations, or accumulations to implement an efficient GEMM circuit (e.g. systolic arrays). Hope that makes some sense.
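To illustrate the within-invocation case: the loop nest below holds each weight element fixed while the activations stream past it, the way a systolic processing element would hold its weight. This is an illustrative sketch, not the HLS code.

```python
# Weight-stationary GEMM sketch: each weight element is fetched once and
# held while all the activations that need it stream past, then the loop
# moves on to the next weight.
def weight_stationary_gemm(inp, wgt):
    n, k, m = len(inp), len(wgt), len(wgt[0])
    acc = [[0] * m for _ in range(n)]
    for kk in range(k):
        for j in range(m):
            w = wgt[kk][j]        # "stationary": loaded once per (kk, j)
            for i in range(n):    # activations stream past the held weight
                acc[i][j] += inp[i][kk] * w
    return acc
```

Swapping the loop order (e.g. holding `acc[i][j]` in the innermost reduction) would give an output-stationary variant instead; the point is that the stationary choice lives entirely inside one invocation.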

VTA does have a design space exploration tool, but at the moment it doesn’t consider different implementations of GEMM the way MAESTRO or Eyeriss do. I think this could definitely be an interesting future work direction. It would be a natural fit for the Chisel design that @vegaluis wrote, which has more modularity than the HLS design.