[pre-RFC] Language Specification for TIR

slyubomirsky · May 2, 2023, 5:08pm

Pre-RFC: Language Specification for TIR

Thanks to the assistance of several core TIR contributors at OctoML, I have written a draft of a language specification for TIR. I present it here to the community for further review. If the community is interested, I hope the specification in some form can be adopted by the project and used as documentation for TIR, which would be useful for those seeking to learn TIR as well as contributors to the compiler stack. The draft can be found here; comments are welcome, especially where they concern 1.) inconsistencies, 2.) factual errors, 3.) omissions, 4.) ambiguities, or 5.) unclear or difficult language. This proposal also takes some views on what a good approach for specifying TIR would be; these are meant to be subject to debate, since most of them are my own opinions or those of the TIR developers who assisted me in writing the text.

Motivation

TensorIR (TIR) is TVM’s lower-level intermediate representation (IR) for operations on tensors, allowing users to specify arithmetic operations on values that can be loaded from and stored into buffers. Many of TVM’s most valuable features for improving performance rely on TIR: in particular, scheduling transformations are defined on TIR programs and, similarly, TVM’s autotuning functionality (e.g., Metascheduler) also operates on TIR. Additionally, TIR is the simplest way to define a new tensor operator in TVM, which is often a necessary step for supporting the latest deep learning models.

Yet, despite the importance of TIR for the language, there is relatively little documentation on the language itself, which makes it difficult for new users to pick up the language. Some existing resources include Feng et al.'s recent ASPLOS paper on TIR and Tianqi Chen’s Machine Learing Compilation course. While these sources are valuable and provide an overview of TIR’s usage and utility, they do not (and are not intended to) provide a comprehensive review of the language and its features. The lack of a full language reference can be frustrating for new users, expert users, and compiler implementers alike:

New users may not fully understand the utility of constructs that they encounter in example code, creating an impediment to their understanding of the language.
Expert users may encounter edge cases where the intended functionality is unclear. This can lead to unease as to whether a program’s behavior in such a situation is a “hack” that relies on compiler implementation details or indeed intended behavior.
Compiler maintainers may encounter situations where it is unclear whether implementing a pass or making some change could alter the behavior a user would expect. Similarly, it may not always be clear whether some surprising behavior is a “bug” or, in fact, intended. A clear specification would eliminate the ambiguity in such situations.

The goal of writing a language specification would be to address these problems by writing a document that achieves the following goal: Someone who is not otherwise familiar with TIR can refer only to the specification and be able to describe what any given TIR program is intended to do (without running it), assuming all validity conditions are met (something that will be discussed below). That is, with a language specification, it should be unambiguous what any given TIR program is intended to do (or, alternatively, whether the TIR program is doing something that should be considered unsupported).

Additionally, by creating a single document that describes the core language mechanics, the specification would have the further benefit of making it simpler to propose revisions to the language design, since they could be described in terms of the specification and what revisisions would need to be made to it. The implications of a change to the language on different language features would be easier to enumerate, which could highlight from an early stage any difficulties that might arise in implementing such a change.

The Draft and How It Is Written

As aforementioned, the draft specification can be found here. Review and comments on it are welcome. The document is intended to be self-contained, so it should not be necessary to say much more about how to read and approach it (though if it is difficult to understand, that would be useful feedback).

The document is written as a technical reference, intended to describe the behavior of each language construct in a precise manner. By contrast, the document is not meant as a tutorial, so it does not generally describe very much of the intent behind the design. Personally, I do not think a language specification is the right document for describing that, though if there are some places where a brief word on intent would be helpful, we could perhaps consider doing that.

Additionally, the specification is concerned only with the functional behavior of TIR programs. That is, it describes only the visible behavior that results from executing a program. This is elaborated upon in the draft itself, so I do not want to belabor this point in the pre-RFC, but the main implication is that the specification is not intended to make guarantees about performance and which constructs are likely to result in better performance. The reasons for this approach are twofold: First, it is difficult to make any a priori guarantees about performance (most of the time, we use heuristics) and, second, this simplifies the specification and thus gives the compiler implementation more freedom to make changes related to performance.

Design Choices

The draft as I have written it makes the decision to specify only a subset of programs that are possible to write in TIR, designating certain behaviors (some of which may be in use) as deliberately unspecified. This choice may prove controversial, but my collaborators and I found it to be a useful measure to address an ambiguous aspect of TIR’s design. In particular, TIR’s AST exposes many things that are typically thought of as compiler internals—even though they are “compiler internals” in some sense, however, expert users can and do manually set them to take advantage of properties of the compiler implementation or of specific devices. While it is useful for expert users to be able to do this, it poses some problems for the specification: accounting for all these low-level implementation details across all devices TIR targets would greatly expand the scope of the specification and, more importantly, including this behavior in the specification would essentially “commit” the compiler maintainers to supporting it, which could restrict their ability to change other compiler implementation details. (It would not be very helpful for the specification to be so tied to the compiler implementation that it essentially repeats what the implementation says; moreover, if changing small details of the implementation requires frequent revisions to the specification that break backwards compatibility, then the document is not very useful as a reference either.)

To summarize, TIR has historically made the distinction between a “front-end user” and a compiler implementer rather ambiguous (which is unusual among most programming languages). This specification instead aims to provide a distinction between “high-level” behavior that is guaranteed at TIR’s front-end versus lower-level details that are left to the compiler implementation. The decision to stick to a high level of abstraction, I hope, will allow the specification to be simpler and remain more stable over time and give room for defining some of the lower-level invariants for the compiler later. However, I do acknowledge that there is a tradeoff here between supporting functionality and the simplicity of the specification. I lean towards the latter in part because even the TIR maintainers who helped me write the draft did not always agree on which low-level implementation tricks “should” work in TIR—I think it is reasonable for the specification not to make guarantees in such cases.

What Should the “Front End” Be?

As a working definition, this specification considers the following to be expected “front-end” users (as opposed to expert users who would be expected to make use of compiler internals):

Users trying to straightforwardly implement a tensor operator to be optimized by the autoscheduler.
Users of TE, which lowers into TIR.
In the case of the ongoing Unity work, users who want to express tensor operators in TIR so they can be invoked directly in Relax.

Behaviors That Are Not Supported in the Draft Specification

The draft specification abstracts over certain implementation details regarding how buffers are laid out in memory as well as hardware-specific details. The following are the additional restrictions the draft specification imposes on input programs:

Buffers may not alias each other (except in the case of match_buffers for Block nodes, which is specifically addressed in the draft).
Each allocation creates exactly one buffer (by contrast, lower levels of compilation may use a single allocation to represent multiple buffers by changing the strides and element offset appropriately).
Users must not manually specify the strides or element offsets for buffers.
No buffer allocated in the body of a PrimFunc may be exposed externally.
The program may not contain device-specific builtins or builtins that engage in pointer manipulation.

TIR programs that violate these rules are considered to be unspecified, which means that the specification makes no guarantees about their semantics. Otherwise, later stages of compilation are expected to violate these assumptions, but they are meant to be true of input programs. (It is not desirable for the TIR compiler to entirely reject input programs that do these things, but this behavior should left to expert users who are familiar with the compiler implementation and are prepared to update their code if the implementation details change.)

Portions of the Text that Should Receive Additional Attention

There are some areas of the text in which I lack confidence and request further review from those who are knowledgeable about these TIR constructs. They are listed below:

Vectorized loops: In the compiler implementation, vectorized loops are lowered into a sequence of vectorized operations using the VectorizeLoop pass, whose implementation is generally very complicated. It would be undesirable for the specification to simply describe such a large and complicated pass in text form, since it would likely also be more difficult to understand in that form compared to the source code. Instead, it would be preferable to have a simple description of the high-level intent of the vectorized loop. I am not certain that the current description in the text (describing them as having the same semantics as a parallel loop) is entirely accurate given how the loop vectorization pass really works.
The semantics of BlockRealize and BufferRealize would be good to review closely, as I was not entirely certain of how those worked. Additionally, it would be good to add some description of how some of the extra information is used by the compiler (even if it is not part of the semantics per se).
There are likely many additional implicit validity conditions for the different PrimExpr and Stmt nodes that are not noted in the type-checking rules in the current draft. Any omissions would be good to note.
Another tricky area in the current draft is dealing with varialbes that could be contained inside Buffer nodes. elem_offset, strides, and shape could all contain variables and constructs featuring Buffers should be careful to specify when these are assigned or not.

Adoption and Process for Updating the Specification

If the community agrees that a specification like this one or a revised version should be adopted, I think it should proceed to a formal RFC. This, however, still leaves the question of how the specification would actually be used by the community: the specification cannot be enforced by any mechanical means, so it would be incumbent upon users and maintainers of TIR to refer to the document and ensure the language behaves as described. It may be worth discussing how to ensure the specification is actually read; one way to do it might be link to it from prominent places in TVM’s documentation (e.g., from the documentation homepage) or make references to it in top-level header comments.

Additionally, we should determine the process for updating and maintaining the specification. The default option might be to require RFCs to update the specification and, accordingly, to have RFCs that add major functionality to TIR to also include how the specification would be revised as a part of the RFC process. My concern might be whether this would perhaps end up being cumbersome, though an out-of-date specification clearly would not be a very useful one. Another concern would be defining the distinction between a major revision to the specification like a new language feature and a minor one like correcting a small inconsistency—it would probably be reasonable to use a simple PR for minor changes.

Potential Future Work

If adopted, a high-level specification for TIR could serve as the first stage for potential other lines of work, like formalizing the notion of “dialects” in TIR that are used for different stages of compilation. Having specifications for each dialect would allow for clearly communicating the invariants expected at different stages of compilation, which would be helpful for future compiler development. (Such a project would, though, be of a larger scope and likely require reasoning about some of the low-level details that are elided in this draft.)

Another potentially useful project that the specification would enable would be the writing of a reference interpreter—that is, an interpreter that implements the behavior described in the specficiation in the most straightforward manner possible, with no eye towards optimization. A reference interpreter would serve as an “executable” specification and allow for directly checking whether the compiler follows the expectations of the specification by comparing outputs with the reference interpreter.

Special Thanks

While I wrote most of the text, the initial draft would not have been possible without the efforts of Wuwei Lin (@vinx13), Eric Lunderberg (@Lunderberg), and Junru Shao (@junrushao). I am grateful to them for patiently answering my questions about the language and giving their input on many language features. Additionally, I thank Christian Convey (@cconvey), Denise Kutnick (@denise), and Prakalp Srivastava (@psrivas2) for many insightful comments during the process of writing the initial draft.

elenkalda-arm · May 2, 2023, 3:18pm

This is awesome, thanks @slyubomirsky for putting this together, I think it’s really important to introduce some standardization around the various IRs in TVM! A question I have about the scope of this… From the document,

Instead, in this version of the specification, we account for high-level uses of TIR intended to correspond to very common applications:

The output of the TE (Tensor Expressions) library,

Does the TE library here refer to the whole set of TE schedules we have, i.e. all the TIR we currently generate out of the TE in TVM should be covered by the spec? As you mention as well, there are good and less good TIR programs and some things that the current schedules produce are certainly not healthy. As an example, until quite recently one of the schedules was vectorizing over a whole input channels axis, which then resulted in vectors that were sometimes like 2000 elements. It certainly isn’t in the spirit of vectors, I think. I see you also point it out in the document if there should be some restrictions on the extent of the loop that is vectorized… So is the spec supposed to describe everything TIR can do or will it be limited to some set of “proper usage” of it?

Maybe a bit tangential, but is the spec going to to interact with TVMScript some way? I don’t understand TVMScript very well or its scope, but as an example, are we going to claim that all “Python TIR” that conforms to the TIR spec should compile in TVM?

Hzfengsy · May 2, 2023, 3:36pm

TVMScript is a syntax (or text format) for TIR, i.e. we can print the TIR from cpp data structure to TVMScript (a readable syntax) and parse it back.

Taking Python as an comparison. This language specification is a bit more like Python AST docs (ast — Abstract Syntax Trees — Python 3.11.3 documentation), while TVMScript is a bit like Python syntax, which we are writing everyday

slyubomirsky · May 2, 2023, 4:47pm

Great questions, @elenkalda-arm

That is the intent. If there are constructs TE outputs that are not covered, those would be good to note.

I wrote the spec with the latter view in mind, where “proper usage” means “these are programs where we can make guarantees about what will happen when you run them.” If a user wishes to directly specify certain fields to coalesce allocations, etc., this spec does not guarantee what will result.

As for the case you mention of a very large vector, I’m very curious as to whether that worked. Everything my collaborators told me about how vectors are intended to work in TIR suggests that it really shouldn’t have!

As @Hzfengsy noted, TVMScript is just how we parse Python into TIR. I did not discuss parsing in this document and I’m not sure whether it belongs in the same document, though it definitely should be documented somewhere.

elenkalda-arm · May 3, 2023, 10:58am

Thanks @slyubomirsky and @Hzfengsy for your responses!

I imagine it depends on the backend, but it works for Arm CPUs! By “works” I mean it generates code that runs, not necessarily the best code. This scheduling was in TVM for years, I only cleaned it up earlier this year. What happened with these huge vectors in TIR was that they got directly turned into huge LLVM vectors that then got broken down into architecture compatible vectors in Arm codegen. However, the way this is done is by creating a scalar chain of vector instructions… So if we have a vector of 2000 float32 elements and Neon vector fits 4, this is a lot of instructions. Combine this with unrolling of the outer loop in the next line in this schedule and it can keep compiler busy for like 15 minutes to compile one conv2d… It would make more sense for LLVM to process these huge vectors in a loop, but since the other LLVM frontends usually don’t create these kind of vectors, this optimisation has not been implemented there (I don’t blame them). In general, I’d favour making it explicit in spec that these kind of vectors should not be created, even if they result in correct low level code. (Maybe even enforcing it in TIR, but I guess fixing TIR is out of scope for this work)

Another thing about vectors… I think it is rather confusing that something can be a vector by the virtue of its dtype->lanes > 1 or by being a Ramp node. That gets especially confusing if buffer->dtype is vector type… As an example, this TIR lowered form a schedule in TVM:

conv = T.allocate([24], "float32x8", "global")

[some TIR...]

conv_1 = T.Buffer((24,), "float32x8", data=conv)

[more TIR...]

for owi_init in range(8):
    conv_1[oco * 8 + owi_init] = T.Broadcast(T.float32(0), 8)

By just looking at the compute nest in the bottom, it looks like 8 elements will be stored into one element, so you have to scroll up to find the buffer definition to realize that it is indexing over float32x8 elements… It makes sense, but I don’t think it is particularly intuitive representation… I am not sure why we allow Buffer to have such exotic user defined types? Does it have any benefits? There is the extra headache (which you have also documented) that if the indices of BufferStore/BufferLoad are itself vector types, then the total number of lanes will be the multiple of lanes of the buffer data type and lanes of the index. It’s another opportunity to accidentally create very big vectors So I wonder if the use of vectors should be standardised somehow as well in the “proper use” spec?

Let me clarify my TVMScript confusion… I thought initially that all TIR is scriptable, but some weeks back I was lowering various bits of TE into TIR, scripting the results and trying to parse it and it worked about 60% of times, so I concluded that not all TIR is supposed to be scriptable. Is this true? I could have also stumbled upon a string of hard to expose bugs because I was using schedule_to_module for lowering, which I think is not used much in the main compilation pipeline. But I suppose the problem is similar, there are “good” scriptable TIR programs and “bad” non-scriptable ones and it is not clear what constitutes as a good TIR program. So I was wondering if there can be claims made there about the scriptability in relation to this spec, e.g. “all TIR that conforms to this spec is also scriptable”?

Hzfengsy · May 3, 2023, 11:31am

You are right. All TIR that conforms to this spec should be scriptable.

I’m surprised to hearing that. As we have been using TVMScript for development and debugging for a long time, it works perfectly and saves our time.

Could you please share some codes that are not scriptable? We can have a look at it and make TVMScript better. PS: Let’s start a new thread for TVMScript bugs if you want more discussion and make this thread focus on the Language Specification.

slyubomirsky · May 3, 2023, 9:21pm

I agree with @Hzfengsy, any parsing issues (especially potential bugs) would be good to note but that should likely be in a separate thread. I assume that it is the intent for TVMScript to be able to express any valid TIR program.

elenkalda-arm:

By just looking at the compute nest in the bottom, it looks like 8 elements will be stored into one element, so you have to scroll up to find the buffer definition to realize that it is indexing over float32x8 elements… It makes sense, but I don’t think it is particularly intuitive representation… I am not sure why we allow Buffer to have such exotic user defined types? Does it have any benefits? There is the extra headache (which you have also documented) that if the indices of BufferStore/BufferLoad are itself vector types, then the total number of lanes will be the multiple of lanes of the buffer data type and lanes of the index. It’s another opportunity to accidentally create very big vectors So I wonder if the use of vectors should be standardised somehow as well in the “proper use” spec?

I can certainly see the potential for tricky situations to arise. I’m not certain what the specification could or should say about this, though. As for the reasons for these features existing in the first place, I would defer to those more knowledgeable about the development of TIR—my understanding is that buffers of vectors are permitted in order to make use of hardware support for the storage of vectors.

Lunderberg · May 3, 2023, 6:21pm

This is necessary for targets such as SPIR-V, where pointer type casts are forbidden, even between pointers to types that differ only by the number of lanes. As a result, if you want to perform a vectorized load of 8 float32, it must be done by indexing into an array of float32x8.

I’d see this as analogous to a int[4][8] arr in C. If I access with a single index arr[i], I’m accessing the memory location at static_cast<char*>(arr) + i*sizeof(int[8]), because arr can be viewed as an array of 4 elements, where each element has type int[8].

slyubomirsky · May 5, 2023, 5:22pm

I’m planning to discuss this topic live at a community meeting, please join if you’re interested!

slyubomirsky · May 10, 2023, 4:33pm

Meeting recording: TVM Community Meeting May 10, 2023 - YouTube

Discussion notes:

BufferRealize: Indicates to compiler what region is going to be used, later lowered into allocation
- More of a hint, implementation is free to allocate only the region in the bounds
- Could be turned into an attribute in the future
Prefetch: Used only for TE, but maybe we should give semantics to it
The question of whether the RFC should include a reference interpreter:
- Maybe we could have a simple LLVM code generator for it, essentially using unscheduled version (this approach already exists in TIR)
- Opinion: Reference interpreter should be independent from the core compiler implementation
- Additional benefit: Debugging! If we don’t know the right answer, but we would run into an issue if it doesn’t support some builtins

slyubomirsky · June 1, 2023, 1:00am

The public RFC thread is now up: https://github.com/apache/tvm-rfcs/pull/101