We could consider supporting the bfloat16 data type natively. More and more hardware will support this type starting this year. Because it is not a standard type in general-purpose programming environments (C++ / NumPy) or traditional compilers (GCC / LLVM), developers face a challenge when writing kernels with it. It would be a huge advantage if TVM could generate bfloat16 kernels efficiently, and TVM is designed to do this well.
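For readers unfamiliar with the format: bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an IEEE float32. The sketch below illustrates that relationship in plain Python. The function names are hypothetical, and round-to-nearest-even is assumed as the rounding mode; this is not how TVM lowers the type.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x (round-to-nearest-even assumed)."""
    # Reinterpret x as the 32-bit pattern of an IEEE float32.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # NaN: skip rounding and force a mantissa bit so the result stays a NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0:
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit, so exact ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float32(b: int) -> float:
    """Widen a bfloat16 bit pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]
```

Widening is exact, while narrowing loses precision: round-tripping 3.14159 through these helpers gives 3.140625.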
Thanks for all the suggestions so far. We will summarize the roadmap near the end of the month. The current goal is to aim for a three-month timeframe (April).
Besides the set of features we want to see, we would love to hear what our community members want to work on (either new proposals or some of the existing ones). It would help us coordinate and estimate the feasibility of these items.
I will mainly work on the automated quantization project in the next three months, see the RFC for details:
In summary, I hope that with this project we can provide an easy-to-use quantization tool for users, one that can be adapted quickly to different models and different hardware.
It would be very helpful to support dynamic batch size.
FYI:
Thanks everyone for a great discussion. A draft roadmap is posted here: https://github.com/apache/incubator-tvm/issues/4845
Can you add nvdla in tvm/vta too, as a milestone?
I see an increasing demand for replacing Relay's recursive visitors/mutators with non-recursive ones (due to the stack limit). Do you think it is doable in v0.7?
I am against this idea. Let me explain.
It is definitely doable by using continuation passing style/trampoline.
However, they require rewriting code in a much uglier manner. Also, as most calls are not tail calls, there won't be much memory saving; we would only trade stack space for non-contiguous heap space.
A better solution imo is to call setrlimit(RLIMIT_STACK, &R);
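As a rough illustration of the setrlimit suggestion: Python's standard resource module wraps the same POSIX call. This is a minimal sketch assuming a Unix-like system. The helper name and the 64 MiB target are hypothetical, and whether a raised limit helps threads that are already running is platform dependent.

```python
import resource


def raise_stack_soft_limit(target_bytes: int):
    """Raise the soft stack limit toward target_bytes, capped by the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        return soft, hard  # already unlimited, nothing to do
    # An unprivileged process may not raise the soft limit above the hard limit.
    new_soft = target_bytes if hard == unlimited else min(target_bytes, hard)
    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))
        except (ValueError, OSError):
            pass  # the platform or a policy refused; keep the existing limit
    return resource.getrlimit(resource.RLIMIT_STACK)


# Example: ask for a 64 MiB stack before running a deeply recursive pass.
limits = raise_stack_soft_limit(64 * 1024 * 1024)
```

The hard limit is left untouched, so the call cannot grant more stack than the administrator already allows.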
How is GNN going? I am thinking of doing some program optimization in GNN training.
But increasing the stack size might be against internal security policies.
There is the raw IRVisitor, which doesn't recurse. The difficulty is migrating all the passes… I imagine whoever is against setrlimit can help by migrating the passes.
How about we aim to have at least an RFC and the infrastructure landed in the next release cycle?
cc @jroesch @mbrookhart , I agree that it is important to introduce a non-recursive version for most cases, in particular the PostOrderRewriting case, where we can visit the dataflow part of the Expr and use a callback to rewrite it. As long as we manage the stack manually, there won't be a stack overflow problem.
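To make the explicit-stack idea concrete, here is a minimal sketch on a toy expression type. Var, Call, and post_order_rewrite are hypothetical stand-ins, not Relay's classes or API. Each node is pushed twice: once to schedule its children, and once to rebuild it after they are done.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Call:
    op: str
    args: Tuple["Expr", ...]


Expr = Union[Var, Call]


def post_order_rewrite(root: Expr, rewrite: Callable[[Expr], Expr]) -> Expr:
    """Rewrite an expression bottom-up using an explicit stack, not recursion."""
    memo = {}  # id(original node) -> rewritten node; also handles shared subtrees
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if id(node) in memo:
            continue  # already rewritten through another parent
        if isinstance(node, Call) and not children_done:
            # Revisit this node after all of its arguments have been rewritten.
            stack.append((node, True))
            for arg in node.args:
                stack.append((arg, False))
            continue
        if isinstance(node, Call):
            rebuilt = Call(node.op, tuple(memo[id(a)] for a in node.args))
        else:
            rebuilt = node
        memo[id(node)] = rewrite(rebuilt)
    return memo[id(root)]


# A chain far deeper than Python's default recursion limit of 1000.
expr = Var("x")
for _ in range(5000):
    expr = Call("relu", (expr,))
clipped = post_order_rewrite(
    expr, lambda e: Call("clip", e.args) if isinstance(e, Call) else e
)
```

The work list lives on the heap, so depth is bounded by available memory rather than by the thread's stack. The callback only ever sees a node whose arguments are already rewritten, which is the post-order contract described above.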
I agree that it is super ugly (and a great amount of work) to migrate, but it is an existential problem when we want to optimize a medium-sized network. I also agree that setrlimit is a good workaround if I am working alone on my personal laptop. However, industrial settings may require a different solution, as @yzhliu has mentioned.
If there is a better approach (less ugly, less work) to manually manage the stack, I think I would vote for it. So, why not think about it?
I am open to a better solution than CPS. I just personally don't know any.
I hope a name argument can be added to Relay ops. The absence of a name makes debugging difficult and loses the connection with ops in frontend frameworks.
Hi, I am working in the fork https://github.com/GaryYuyjl/incubator-tvm/tree/int4tensorcore for int4 computation with Tensor Cores. I found that packing int4 into int32 on the CPU cost too much time, so I wrote the packing process into the conv2d compute and schedule and got good results. However, packing the data still takes up at least 30% of the total convolution time. It may be because my compute and schedule code is bad. Do you have any good suggestions for packing data efficiently?
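For reference, the packing being discussed stores eight signed 4-bit values in one 32-bit word. Below is a plain-Python sketch of that bit layout only. The function names are hypothetical, and the nibble order (element i in bits 4*i to 4*i+3) is an assumption; the layout that Tensor Core instructions expect may differ, and this says nothing about how to schedule the packing efficiently.

```python
def pack_int4(values):
    """Pack signed int4 values (-8..7) into 32-bit words, eight per word.

    Element i of each group of eight lands in bits [4*i, 4*i + 4) (assumed order).
    """
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for base in range(0, len(values), 8):
        word = 0
        for i in range(8):
            v = values[base + i]
            if not -8 <= v <= 7:
                raise ValueError("value out of int4 range: %d" % v)
            # Keep the low four bits (two's complement) and shift into place.
            word |= (v & 0xF) << (4 * i)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover the signed int4 values from 32-bit words."""
    out = []
    for word in words:
        for i in range(8):
            nibble = (word >> (4 * i)) & 0xF
            # Nibbles 8..15 encode the negative values -8..-1.
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out
```

When the weights are constant, one option worth considering is to do this packing once ahead of time so that only activations need packing at run time; whether that applies here depends on the model.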