Modularizing LLVM codegen/JIT

Interesting. Thanks for checking!

Edit: If this approach works, I’ll change the plan to use that instead of dlopen/dlclose.

Can we wait a bit with this PR, until it’s clearer what mechanism we will need?

Of course. The PR is mainly to demonstrate the mechanism; we can wait until we agree on the right mechanism.


Sorry but I thought @kparzysz was pointing out that resetting the global state is sometimes impossible. I don’t think we are talking about dlopen() a library with just LLVM inside of it. The idea would be that the .so contains target.build.llvm PackedFunc plus anything else needed for the backend. I think this complements the UMA proposal quite well.

Yes, it is impossible to completely reset the global state since there is more than just the command line flags (e.g. statistics). The options were the main issue though, and given the resistance to dlopen/dlclose, I didn’t feel like spending more time defending that approach.

Edit: the quoted email (about non-redefinable flags) likely doesn’t apply when we have direct access to the registered options. As of now I’m not aware of anything that cannot be done by manipulating them directly, but I haven’t run any experiments yet.

It would still be good to ground the discussion a bit. In general, it is indeed true that if a program has global state, it is impossible to reset all of it. This (global state that cannot be completely reset) applies to most libraries in general. However, that does not mean that, for example, we have to reload the CUDA runtime every time we launch a kernel. The intended use-case of most libraries is that they are linked once and used throughout the lifespan of the process.

Most compilers evolved centered around compilation for a single backend. As a result, the rationale was that there is a single global configuration of options, set once, which stays constant across compilation of multiple functions.

Importantly, this does not mean that other global state gets reset here; for example, statistics, as @kparzysz mentioned, keep evolving, but they do not impact compilation.

Our discussion boils down to a very concrete use-case: using LLVM for just-in-time compilation of multiple functions over a longer time span.

From LLVM’s perspective, JITing has increasingly become a major use-case. There are already quite a lot of packages leveraging LLVM’s capability like this one; the ones in the ML ecosystem include Numba, PyTorch, and Julia.

In all of these use-cases, LLVM is used throughout the lifespan of the process, just like a normal library. There is also a whole JIT component in LLVM built for this purpose.

Note that global state like statistics still gets updated in all of those cases (and is not reset). It does not, however, affect compilation; it is fine, and even desirable, to accumulate global statistics across compilations.

This particular issue did not arise publicly in previous approaches. In our case, the added complexity is that we are targeting more than one heterogeneous backend (e.g. Hexagon and host), where we want to configure pass options differently per backend (as in the max-unroll-count example).

Our problem is an interesting one: it is a less frequent JITing case that was likely still within the intended use (with growing demand), but on the boundary of considerations. It is going to become increasingly popular, and is perhaps already in use, since the mentioned projects (Numba, PyTorch, Julia) all have a GPU JITing component. The problem likely has not arisen there because those projects have not yet tried to reconfigure options like max-unroll-count differently per pipeline.

This particular problem is caused only by our desire to change options that are specialized per pipeline, and the ability to reset the corresponding options would resolve it.

The ideal solution would be to configure passes completely independently of the global static options. The ability to reset the specific options we are interested in changing seems like a good middle ground, and it also keeps our usage in line with other widely used packages (Numba, PyTorch, Julia).

BTW, I think this is also a very interesting discussion that would benefit the LLVM community in general. Perhaps we should bring some of these insights to the LLVM community; JITing for heterogeneous backends is something that the compilation community should solve in general.

Cross-referencing the post created on the LLVM Discourse: [DISCUSS] Making Global cl::opt Friendly for JITing Hetro Computation - LLVM Project - LLVM Discussion Forums

I agree with you in principle, and the problem here is specifically with LLVM (considering the libraries that TVM uses). It’s a consequence of the evolution of LLVM, and of its use cases, something that likely does not apply to other libraries to the same extent.

I’m pretty sure that LLVM would be interested in a solution to the reconfiguration problem. The issue is that it would take a while to both design an appropriate solution, and to implement it. At the earliest, the fix would go in LLVM 15.0.0, and since TVM supports LLVM 4+, it means that we’d still have to deal with this problem for a very long time.

The dlopen/dlclose solution has the benefit of being clean: we start with a clean slate and do not leave any “leftovers” behind. If we treat the global state issue in LLVM as a “feature”, then the best practice would be to implement things in a way that completely avoids it. At the same time, with the ability to modify command-line options in place, the importance of that is greatly diminished, to the point where I’d agree that it is no longer needed.

I think that the current plan (i.e. access the option registry directly) is sufficient to accomplish our goals, but I’m open to further arguments. It doesn’t actually affect the need to “localize” LLVM configuration, so it’s more of an implementation detail rather than the basis of the plan.

I think this is to say that we haven’t found an explicit need for this yet. Needing to reload CUDA in this example would be particularly bad, but needing to reload a library each time some less frequent event occurs e.g. a config hot-reload is a very real type of thing that production engineers routinely band-aid.

I think the main concern from @tqchen is decreased throughput when LLVM is used for JIT. In that case, doesn’t the config remain static as the hardware is not changing? Perhaps it’s possible to load LLVM codegen per-Target and not unload it unless our config changes. Then we can balance between the need to fully separate configuration per-backend and being over-defensive at the cost of throughput.

Thanks for the discussion! It’s indeed quite insightful and I personally learned a lot 🙂

Just wanted to share my 2 cents:

Library reloading is clean. I completely agree, and if there is not much overhead, on-demand library loading could be even better when TVM is integrated with a frontend framework (e.g. PyTorch) that ships with its own LLVM.

Compilation speed matters. In our auto-tuning process, compilation is usually invoked more than 20k times, which is a major bottleneck. Therefore, if library reloading impacts performance, it would be less efficient for AutoTVM/AutoScheduler/MetaSchedule.

LLVM’s ongoing effort. As @kparzysz said, LLVM seems to be trying to fix this in 15.0.0. Well, it’s a bit slow, but given that LLVM is moving in this direction, I believe that some day in the future this feature could be turned on without resorting to library reloading.

My proposal. Given that TQ’s approach seems to suffice already, I would say we could move forward with his proposal for now. If in the end we still need library reloading, we could make it optional via an environment variable, turned on or off depending on the need for fast compilation versus static linking.

Trying to summarize my thoughts on a few fronts:

Consistency of intent in LLVM. LLVM itself clearly intends to support JITing; e.g. the MCJIT module provides the necessary features for JITing code, likely including things like caching and local architecture detection. Our current JITing path relies on this feature set.

Moving to library reloading effectively means using LLVM as an AOT command-line-driver-style compiler (not that different from the clang CLI) with a JIT layer built on top by us. That shifts some burden onto TVM, and we cannot benefit from improvements in MCJIT. Simply put, the current JITing path would no longer work and we would need to rework things.

Consistency with existing engineering practices. Most existing JITing solutions, such as Numba, PyTorch, and Julia, choose to link against LLVM. Consistency is usually good: consistent paths are better maintained and tested upstream/downstream, and more familiar to developers.

Limitations that come with reloading. Library reloading itself brings additional complexity, including: C++ ABI stability; memory allocated in one DLL being freed after that library has already been unloaded; and having to go through serialization/deserialization. The inability to statically link would also have implications downstream: in many cases a single DSO simplifies things like path discovery, code signing, or applications with specific rules for bundling.

These limitations can be considered solvable with effort, and our discussion has already suggested solutions to some of them. They do come with unknown-unknowns (for example, memory can be freed from a DLL that has been unloaded; it is hard to enumerate all the corner cases that can arise here) and their own engineering cost (such as maintaining additional features that are already covered by MCJIT).

Considering the design intent of LLVM, existing engineering practices, and the tradeoffs involved, I would avoid library reloading when possible.

A2 is a scoped workaround that solves our current needs before LLVM provides a more systematic way (likely in the direction of A1, based on the discussion in the LLVM forum). It would also reduce the refactoring effort when some form of A1 lands in LLVM.

I’d be okay with settling on @tqchen’s solution as a practical step forward for now. I’m not sure I’m convinced it’s a general solution, and I wouldn’t be surprised if it breaks when we add some new backend or someone tries to compile for a particular combination of targets later on.

A question I have for @kparzysz: I think we have spent some time in this thread trying to understand the intended way to use LLVM. I’d argue that while it’s good to understand LLVM’s design intent, the beaten path is the one that is well-tested in LLVM (particularly if we intend to be compatible with a wide range of LLVM versions). Therefore, I’m curious whether you know if LLVM is exercised in tests against multiple backends in the same process? I wouldn’t be surprised either way, but it seems like this global flag problem would have come up before if so.

Could you say more here? I think it’d be better to understand the specifics, otherwise we’re just sort of impeding potential evolution of the compiler.

Agreed that the memory and serialization issues would be new burdens imposed by reloading. I don’t know that I see the ABI stability issue: both libraries would ship from the same release, so the ABI could be considered “internal” to a degree. Finally, the serialization problem is also imposed by the Artifact refactor, though we discussed a way to avoid serializing during JIT-based compilation. I agree we would need to find a way around this.

I do think that in the case that a global LLVM flag needs to be set to two different values to compile the same IRModule for two different LeafTargets, and that flag cannot be adequately reset without unloading, it doesn’t seem possible to JIT without some serialization across a process boundary. I don’t want to build a slow compiler, but it would be great to understand this need a bit more. Is there a use case you could elaborate that might give us a way to judge what might break if we adopt serialization in the future?

Here I’m asking not necessarily from the POV of wanting to push the reloading solution forward now, but wondering what options we might have if the suggested solution now turns out to be inadequate. At the time we find out it’s not adequate, it seems like it will either be because someone is trying to integrate a new target and running into this problem, or because someone is trying to use TVM with two targets that need different LLVM options. Either case is basically a bug report that we would prefer to fix with some urgency.

Following up: based on a reading of the LLVM code, the RAII-based solution (A3) should be able to cover our current needs before LLVM lands a longer-term alternative.

We can also run some experiments to confirm that the unroll option takes effect. One way would be to instrument LLVM and look; alternatively, we can inspect the impact on the generated code. If new problems arise (likely ones that have nothing to do with global configuration), we can explore other solutions.

No, that has never been tested. It has not been a design goal of LLVM in the past, although, given the interest in heterogeneous targets, it may become one in the future. This is not to say that it won’t work, just that there hasn’t been any specific effort to make it work. If someone wanted to work on it, I bet it would be welcomed in the LLVM community.

The way that we use the JIT functionality is really quite close to AOT. The main benefit we get is that we can easily execute the code that was just generated. I’m assuming that your concern is with the execution part, because if we loaded LLVM dynamically for the duration of the lifetime of the LLVMModule, the library would remain loaded throughout both code generation and execution.

My plan is actually to separate the codegen step from the execution step, but for a different reason: some TVM targets use LLVM for codegen only, and having that isolated into its own entity would allow it to be cleanly reused between those targets, and anything that falls under the “llvm” target. This would require writing a dynamic loader for JIT execution, but that’s easy, since all the components are already provided by LLVM. This doesn’t address the loading/unloading of LLVM, however.

It is fairly clear that we want some form of isolation between different instances of LLVM. Most of that desire is motivated by the fact that the global state mostly affects code generation. The JIT execution part would also require some LLVM libraries, but those are mostly unaffected by the global state. If we wanted to load/unload, we could limit that to the code generation libraries, and for execution we’d link the required LLVM libraries statically, making the symbols invisible outside of libtvm.so (there is overlap between the LLVM libraries doing codegen and those needed for runtime loading/execution).

I agree with some of the suggested improvements to JIT, specifically separating JIT from AOT as a separate target (e.g. llvmjit vs llvm). They are somewhat orthogonal to the global config problem we discussed here, and we could address them independently.

A gentle ping to follow up, to see if folks have additional thoughts. It might be useful to start with the RAII solution, as it is strictly better than what we have now, and then continue to improve.

I’m working on a prototype now. I want to iron out any additional issues before we go on to discuss more details.


Quick update: I have a prototype now, and I’m committing some preparatory changes that are independent, and generally beneficial (IMO). I will have the draft PR and the RFC next week.

RFC is up.
