[Discussion] A Technical Approach to LLMs with TVM Unity

Hi community,

Large language models (LLMs) have been revolutionizing our world and enhancing our lives. We built the MLC-LLM project to bring LLMs to everyone’s local devices.

It is one of our vertical applications on TVM Unity and also our first step for LLMs. We would love to hear opinions and discussion from the community :)

Thank you @Hzfengsy! I would also love to hear more about possible technical approaches we should take for new foundational models.

TVM Unity is not limited to LLMs (Llama, Vicuna, RedPajama, etc.) and on-device deployment (ARM, Hexagon, Adreno, etc.); it also enables the fastest Stable Diffusion available on cloud GPUs. None of these accomplishments would be possible without the technical stack that TVM Unity provides.

Thanks, TQ, for the great question. We are working on dlight, a lightweight auto-scheduler for dynamic shape workloads. Once it is ready, users will be able to define their own models with different architectures.

Thanks for the convenience that TVM Unity provides. I have a few questions:

  1. Will TVM Unity support TensorFlow Lite front-end in the future?
  2. When will TVM Unity be merged into the main branch?
  3. Will the documentation of TVM Unity be updated? Currently, we can only learn about TVM Unity by reading the source code.

At present, because the TVM Unity documentation is incomplete and Unity's immaturity may mean bugs, we have had multiple internal debates on whether to use Relay or Relax.

Thank you for your question.

TVM Unity is currently being developed on the unity branch, https://github.com/apache/tvm/tree/unity, and is frequently updated with robust community support. See more background in the post "Establish TVM Unity Branch".

For emerging needs that involve dynamic shapes or Stable Diffusion, it is likely already the best (or only) robust path. From a maturity point of view, it is likely already mature for first-class dynamic shape use cases, and more mature than Relay there, since Relay was not designed for first-class dynamic shapes. So if you care about emerging needs (e.g. LLMs or SD) and dynamic shapes matter to your use cases, have a go with it.
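To make the dynamic shape point concrete, here is a small sketch in plain Python (no TVM calls) of what happens without it: if a model is compiled for one fixed sequence length, every input has to be padded up to that length. The function name `padded_waste` and the prompt lengths are made up for illustration only.

```python
def padded_waste(seq_lens, static_len):
    """Fraction of token slots that are pure padding when every
    sequence is padded up to one fixed, statically compiled length."""
    if not seq_lens:
        raise ValueError("seq_lens must not be empty")
    if any(n > static_len for n in seq_lens):
        raise ValueError("a sequence is longer than the static length")
    used = sum(seq_lens)                   # slots holding real tokens
    total = static_len * len(seq_lens)     # slots the static model computes
    return 1.0 - used / total


# Hypothetical prompt lengths; real workloads will differ.
prompt_lens = [12, 87, 430, 1024]
waste = padded_waste(prompt_lens, static_len=1024)
print(f"{waste:.1%} of the computed token slots are padding")
```

With first-class dynamic shapes the sequence length stays symbolic at compile time, so this padding (or compiling one variant per length) is not needed.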

Frontend support usually arrives in a community-driven way; right now Unity comes with ONNX and FX frontends. The ramp-up cost of a new frontend is reasonable, and we are open to the community bringing in more frontends. We also welcome contributions :)

Good point on the documentation. The ML compilation course at https://mlc.ai/ teaches the basic concepts, and there are also tutorials, some of which are posted under "unity - Apache TVM Discuss".

Thank you for your reply.

One more question: we are going to work on TVM for our MCU, and I have seen others ask about TVM Unity for a new NPU. Because Unity lacks integration examples for MCUs and NPUs, I wonder whether microTVM can also work properly on Unity, in addition to BYOC. Are there any materials on BYOC and microTVM besides the Relay-related documents?

On BYOC: yes, the Unity flow actually simplifies BYOC; see "[Unity][Tutorial] TVM Unity BYOC".

On MCU support: because much of the focus has been on emerging workloads, we have concentrated on platforms that run Linux/Windows/macOS, and we have not yet looked at MCUs whose memory is below the MB level. That said, all the foundations are there, and the current build also has a compiled mode that gets code into TIR.

The goal of Unity is actually to enable such modularization, so that the runtime backend can be reasonably decoupled from core compilation. For example, as long as we can lower an IRModule to TIR function calls that go to a set of C APIs, we should be able to run in the environment of interest. We welcome contributions on these fronts.

Okay, I got it. Thank you very much. Once we understand TVM better, we would be very interested in contributing to the community.

Will mlc-llm support the Hexagon backend in the future?

@Hzfengsy @tqchen ditto regarding Qualcomm:Hexagon backend support.

  • I noticed some old posts mentioning that TVM:Hexagon is experimental. Is that still the case?

  • ATM I’m resorting to Qualcomm:QNN (AI Direct Engine SDK). I noticed that they ship some TVM stuff under the hood. Are you familiar with their approach?

Hi @escorciav, I’ve tried Hexagon a bit; however, the hardware is a bit weak at the moment. To be specific, its memory bandwidth is only 1/3 of the Adreno GPU's, according to my limited knowledge.
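A rough sketch of why the bandwidth gap matters for LLMs: during token-by-token decoding, each new token has to read roughly all of the model weights once, so memory bandwidth divided by weight size gives a crude upper bound on tokens per second. The bandwidth figure and model size below are placeholders chosen for illustration, not measurements of any Snapdragon part; only the 1/3 ratio comes from the post above.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Crude upper bound on decode speed, assuming every generated token
    reads all weights from memory once and nothing else is a bottleneck."""
    if bandwidth_gb_s <= 0 or weight_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weight_gb


# Placeholder numbers (assumptions, not measured values).
gpu_bandwidth_gb_s = 60.0                      # hypothetical GPU bandwidth
npu_bandwidth_gb_s = gpu_bandwidth_gb_s / 3    # the 1/3 ratio mentioned above
weight_gb = 3.5                                # e.g. 7B parameters at 4 bits each

gpu_bound = decode_tokens_per_second(gpu_bandwidth_gb_s, weight_gb)
npu_bound = decode_tokens_per_second(npu_bandwidth_gb_s, weight_gb)
print(f"GPU bound: {gpu_bound:.1f} tok/s, NPU bound: {npu_bound:.1f} tok/s")
```

Under this simple model, one third of the bandwidth caps decode throughput at one third, regardless of how much compute the accelerator has.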

Thanks for chiming in 🙇🏽‍♂️. Just clarifying,

  • Did you test mlc-llm on Hexagon? Is there any public branch for it? :)

  • I wonder which hardware you used. I’m using an S23 (Snapdragon 8 Gen 2). I’m still hammering at it, so I haven’t measured the memory bandwidth yet.

There is no public branch yet. I spent some time developing it but did not finish.

8 Gen 2 too :)