Phasing out Legacy Components

tqchen · November 25, 2024, 2:18pm

As a next step, let us plan to phase out the micro flow, which is mostly based on legacy components. The component will remain available in 0.18.0 and earlier releases, and is not actively maintained as of now. We also hope to bring back some examples through the unity flow in the future if there are community members interested in that direction.

fPecc · December 6, 2024, 8:33am

I have to leave a comment to express my feelings about this.

I just saw the PR for this, and I have to say, this is a truly sad day for me. The thing that first brought me near TVM was microTVM, and the ability to target embedded devices with such a reduced runtime. I have been using it a lot during the last few years, and of course, will continue working with it.

My feeling is that, without it, TVM is not going to be used anymore in papers targeting custom accelerators, which was a very interesting niche that was previously mostly filled by TVM. Some of the features that could be used with it, like USMP or the AoT Executor, were truly amazing, and it is sad that I will not be able to take advantage of them through microTVM in the future. The phase-out of the VTA flow takes TVM in the same direction.

I hope we can pick up the development of microTVM again in the future, maybe by building some bridge to/from Relax.

tqchen · December 7, 2024, 5:31am

Thanks @fPecc, we would love to see a Relax-based approach for targeting accelerators in the future; hopefully the modularized flow makes it even easier to do so, both in-tree and out-of-tree. There is indeed a tradeoff here. However, at this point I also think that focusing on the modern approach is critical for us to regain momentum and be sustainable for future development. In the meantime, I would love to provide more input and support discussions on how Relax can help in some of these directions.

cbalint13 · December 7, 2024, 4:14pm

@fPecc, folks,

Let me add my humble experience with this topic; it is purely a personal point of view.

I used TVM in the past for custom micro stuff (including experiments with custom FPGA flows) and never relied on the current micro part. I believe one can achieve one's goals given the modularity of TVM: it is very easy to insert your own passes, to hook into any part of TVM's internal flow without even touching upstream code (no fork needed), or to declare a highly custom target with a weird runtime. For micro stuff I always ended up using the native C codegen backend and passing the results on for my own needs; this way it is possible to target even super-micro things like arbitrary 8-bit microcontrollers.

As another concrete example of custom HW acceleration, I always enjoyed that one can even insert verilated (from pure Verilog land) blobs of a block/micro-kernel and tensorize any ML operator with them without touching upstream code, just by simple declarations of a tensorizer in metaschedule for the tuning process. This is probably one of the neatest user-side features of metaschedule (autotensorize, with its very intuitive template declaration that auto-magically fits itself into operators).

As for the VTA part (again a personal opinion), I saw it as a super inflexible and rigid thing. The mentioned [verilog-hw-blocks]->[autotensorizer]->[metaschedule] approach yielded much more flexibility and performance for me, and the generated C code also directly handled both the CPU code and the HW acceleration parts on any custom soft-core CPU (with HW acceleration exposed as pure ISA extensions).

I also think that micro dragged in a lot (way too much) of non-ML-compiler things: specific micro-runtime related headers and libraries that are quite diverse and numerous.

TVM has pioneered and keeps pioneering lots of things, starting early with elegant IRs (where was MLIR at that time?) up to the very neat end-to-end flow of autotune/metascheduling. I hope TVM continues to keep the focus and raise the bar on these very things.

tqchen · December 7, 2024, 10:52pm

Thanks @cbalint13 for sharing your experience. This kind of modular experience is indeed something we hope to enable in the new Relax flow. I would love to continue working together and leverage the Relax pipeline to further modularize and enable more use cases like the ones you mentioned; perhaps they can also serve as good community tutorials for the general flow.

fPecc · December 17, 2024, 7:46am

Thanks @cbalint13 for this insight! Indeed, I have been interested in doing something like what you are describing for a long time. Do you have some paper or more information on what you have been working on? I would love to know more about it.

cbalint13 · December 17, 2024, 7:47pm

@fPecc, Cc @tqchen

Do you have some paper or more information on what you have been working on? I would love to know more about it.

I don’t think it would be worth a paper, but a small and clear tutorial might do.

I am thinking of publishing a small tutorial on this within TVM, with the main goal of highlighting the metascheduler’s autotensorization feature and how to use it to further tune kernels and nets in custom ways (e.g. it can showcase simple declarative older SSE2/SSE3 constructs as a sample). The highlights can be:

how to declare a TIR search template for the autotensorizer
how to declare the template’s call/implementation to tie it to the fast ISA/intrinsics
how to tune nnet operators (imported graph) with metaschedule’s autotensorizer enabled
how to inspect the IR within this metaschedule tuning process (in human-readable form)
how to check/select/filter the autotensorized variants (regardless of performance) of the tuned net

The autotensorizer can be used to insert more complex one-shot HW supported things too, not only classical fast ISA/instructions.

As a consolation for VTA and micro being gone, the last part of the tutorial can include a small showcase of how to construct a small custom “vector instruction/block” (e.g. an instantaneous HW dot-product) as a hypothetical ISA extension (e.g. a futuristic RISC-V extension/block), and how to declare the TIR search template for it together with its real or virtual implementation (in our case, to run on a local PC for simulation, a C equivalent or a verilated call/implementation function).

If you think this is a good idea and don’t mind, I will Cc you on the draft PR.

My apologies if I derailed the subject a bit, but I tried to offer an alternative for the missing VTA/micro stuff here.

tqchen · February 3, 2025, 2:05pm

Happy new year! We just landed the v0.19.0 branch, thanks to the community. This year is more exciting and more rapidly evolving than ever. Given the current landscape and the state of the project, I think it is the right time to phase out the legacy Relay flows.

To continue supporting community members who depend on legacy flows, the v0.19.0 branch will continue to contain these components.

This would allow us to focus a lot more on the new architecture and build up momentum, as @Hzfengsy mentioned:

1. Clean up the codebase: By removing outdated or redundant elements, we can significantly reduce complexity and improve maintainability.
2. Unify our focus: Concentrating our efforts on the new unity flow will allow for more efficient development and innovation.

LeiWang1999 · February 10, 2025, 6:56am

Some suggestions for phasing out python dependencies:

remove dependency attrs, as it is only used in 3rdparty/tvm/python/tvm/relay/transform/memory_plan.py to wrap a class Region, but it introduces an extra Python dependency (attrs). Instead, since Python 3.7 the standard library module dataclasses provides equivalent functionality
remove dependency decorator: maybe we can replace it with functools.wraps or copy decorator.py directly, as SciPy has done: decorator/src/decorator.py at master · micheles/decorator. Since decorator consists of a single Python file, maintaining it locally may be a viable option.
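
Both suggestions can be illustrated with a minimal standard-library sketch. The `Region` fields and the `logged` decorator below are hypothetical stand-ins chosen for illustration; they are not copied from memory_plan.py or from any TVM code.

```python
import functools
from dataclasses import dataclass


# Hypothetical stand-in for an attrs-decorated class; the field names
# are illustrative assumptions, not the real ones from memory_plan.py.
@dataclass
class Region:
    var: str
    size: int
    device: str = "cpu"


def logged(func):
    """Minimal decorator that keeps the wrapped function's metadata."""
    @functools.wraps(func)  # copies __name__, __doc__, sets __wrapped__
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@logged
def grow(region: Region, extra: int) -> Region:
    """Return a copy of `region` enlarged by `extra` bytes."""
    return Region(region.var, region.size + extra, region.device)


bigger = grow(Region("x", 16), 8)
```

`dataclass` generates `__init__`, `__repr__` and `__eq__`, which covers the common attrs use. `functools.wraps` copies metadata such as `__name__` and `__doc__`; whether that is enough depends on whether any caller relies on the signature-preserving behavior of the decorator package.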

tqchen · February 10, 2025, 2:34pm

These sound good. @LeiWang1999, do you mind sending PRs for that?

LeiWang1999 · February 17, 2025, 6:20pm

Another discussion point is the LLVM dependency. I think we currently enable LLVM by default because we typically generate LLVM host functions for different devices (such as CUDA). But generating C host code also seems to be a good option. Relying on LLVM introduces many system dependency issues, making it difficult for users to build a project from scratch. (For example, LLVM depends on some system libraries like libxml2, which users must install from source or from apt.)

tqchen · February 17, 2025, 6:22pm

I have thought a bit about the LLVM dependency. While it is possible to some extent to get rid of it (we even had a stackvm version earlier for the host that was not very commonly used), I think the benefits of the LLVM dependency outweigh its negatives. Conda usually has great LLVM installation support; perhaps we can have clear guides in the docs on how to do so.

LeiWang1999 · February 17, 2025, 6:26pm

I see, I misunderstood—I thought USE_LLVM was set to ON by default, but it’s actually disabled. I don’t mean to phase out the LLVM dependency entirely, but rather to disable it in certain cases.

LeiWang1999 · February 17, 2025, 6:52pm

But it looks like c_host currently only provides code generation and lacks runtime support. To reproduce:

import tvm
from tvm.script import ir as I
from tvm.script import tir as T

@I.ir_module
class Module:
    @T.prim_func
    def main(Q: T.Buffer((1, 4096, 32, 128), "float16")):
        T.func_attr({"target": T.target({"arch": "sm_89", "host": {"keys": ["cpu"], "kind": "llvm", "mtriple": "x86_64-unknown-linux-gnu", "tag": ""}, "keys": ["cuda", "gpu"], "kind": "cuda", "max_num_threads": 1024, "tag": "", "thread_warp_size": 32})})
        by = T.launch_thread("blockIdx.y", 32)
        v = T.launch_thread("threadIdx.x", 256)
        for i in T.vectorized(8):
            Q_shared = T.allocate([16384], "float16", "shared.dyn")
            Q_shared_1 = T.Buffer((16384,), "float16", data=Q_shared, scope="shared.dyn")
            Q_1 = T.Buffer((16777216,), "float16", data=Q.data)
            Q_shared_1[v * 8 + i] = Q_1[by * 128 + v * 8 + i]

mod = Module

rt_mod = tvm.build(mod, target="cuda", target_host="c")

print(rt_mod.get_source())
print(rt_mod.imported_modules[0].get_source())

import numpy as np
Q = tvm.nd.array(np.random.randn(1, 4096, 32, 128).astype("float16"), device=tvm.cuda())

rt_mod(Q)

'''output
Traceback (most recent call last):
  File "/root/tilelang/debug/unit_vectorize_test.py", line 28, in <module>
    rt_mod(Q)
  File "/usr/local/lib/python3.10/dist-packages/tilelang/3rdparty/tvm/python/tvm/runtime/module.py", line 201, in __call__
    return self.entry_func(*args)
  File "/usr/local/lib/python3.10/dist-packages/tilelang/3rdparty/tvm/python/tvm/runtime/module.py", line 128, in entry_func
    self._entry = self.get_function(self.entry_name)
  File "/usr/local/lib/python3.10/dist-packages/tilelang/3rdparty/tvm/python/tvm/runtime/module.py", line 176, in get_function
    raise AttributeError(f"Module has no function '{name}'")
AttributeError: Module has no function '__tvm_main__'
'''

tqchen · February 17, 2025, 8:10pm

Indeed, this is about support for the default main symbol, perhaps because the C host codegen did not add it. Try to see if we can look things up by function name. Also, if we only go through C codegen, module blob packing can no longer work.
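
A minimal sketch of the lookup-by-name idea follows. It only assumes an object with a `get_function(name)` method that raises `AttributeError` for unknown names, which is the behavior visible in the traceback above. `find_entry` and `FakeModule` are hypothetical helpers written for this illustration, not TVM APIs, and the sketch does not address whether the generated C source has been compiled and loaded.

```python
def find_entry(mod, candidates=("__tvm_main__", "main")):
    """Return (name, function) for the first candidate `mod` can resolve.

    `mod` is any object with a `get_function(name)` method that raises
    AttributeError when the name is unknown.
    """
    for name in candidates:
        try:
            return name, mod.get_function(name)
        except AttributeError:
            continue  # try the next candidate symbol
    raise AttributeError(f"none of {candidates} found in module")


class FakeModule:
    """Stand-in for a runtime module, used only to exercise the helper."""

    def __init__(self, funcs):
        self._funcs = funcs

    def get_function(self, name):
        if name not in self._funcs:
            raise AttributeError(f"Module has no function '{name}'")
        return self._funcs[name]


# This fake module has no default entry symbol, only a function named "main".
fake = FakeModule({"main": lambda q: f"ran main on {q}"})
name, fn = find_entry(fake)
```

With a real module the call would be `find_entry(rt_mod)`. Since the PrimFunc in the reproduction is named `main`, that is the name tried after the default entry symbol, under the assumption that the C host codegen exports it.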

ksaurabh-cadence · March 2, 2025, 3:45am

@tqchen I asked this in a separate thread as well, but with the removal of Relay I no longer see the calibration and quantization tools in the main branch, which were located in https://github.com/apache/tvm/tree/v0.19.0/python/tvm/relay/quantize. What kind of activation calibration/quantization tools are going to be available in the main branch? I would appreciate your input on this.

tqchen · March 2, 2025, 11:40pm

For the new flow we don’t have a calibration flow inside the framework, as many calibrations have now moved to the upper layer.

For example, frameworks like MLC-LLM usually run quantization first and then build up the quantized model through fused dequant mm operators.

ksaurabh-cadence · March 3, 2025, 7:45pm

@tqchen - Thanks a lot for your reply. If I’d still like to add a calibration/quantization pass in Relax (so that I can have a common quantization process for models coming from different frontends), is there a fundamental limitation in the design of Relax which would prohibit me from doing so? I was thinking of creating an equivalent of the simulate_quantize node, or adding observers as external library calls / Python functions in an IRModule. Do you see any problem with this approach?
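
To make the observer and simulated-quantize idea concrete, here is a framework-independent sketch in plain Python. It does not use any Relax API; `MinMaxObserver` and `simulate_quantize` are hypothetical names, and the asymmetric int8 min/max scheme is one common choice assumed for illustration. How such functions would be wired into an IRModule as external calls is not shown.

```python
class MinMaxObserver:
    """Tracks the running min/max of every activation batch it sees."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, values):
        self.lo = min(self.lo, min(values))
        self.hi = max(self.hi, max(values))

    def qparams(self, qmin=-128, qmax=127):
        """Asymmetric scale and zero point covering [lo, hi] and zero."""
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        # Fall back to 1.0 when the observed range is degenerate.
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = round(qmin - lo / scale)
        return scale, zero_point


def simulate_quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Quantize then dequantize, staying in float ("fake quant")."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


# Calibration: feed representative activations through the observer.
obs = MinMaxObserver()
for batch in ([-1.0, 0.25, 0.5], [0.0, 1.55, -0.4]):
    obs.observe(batch)
scale, zp = obs.qparams()

# Simulation: values inside the range move by at most half a step.
inputs = [-1.0, 0.3, 1.55]
approx = simulate_quantize(inputs, scale, zp)
```

Values outside the observed range are clamped to the nearest representable value, which is the effect a calibration pass tries to keep small by choosing representative data.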

tqchen · March 4, 2025, 6:24pm

I don’t think there should be any fundamental limitation to doing so, and I would love to learn about your experience.

xinetzone · April 1, 2025, 1:49am

I urgently need your help…

[Bug] relax ONNX Resize operator conversion with opset_version=11 is unsupported - Troubleshooting - Apache TVM Discuss