[RFC] TVM Object Schema DSL

Introduction

The TVM object system provides a convenient way to share objects between the backend (C++) and frontends (Python/Java/Rust/etc.). For example, one can construct a variable in Python and pass it to functions written in C++, and vice versa.

However, adding one object node to the TVM stack requires manually adding code in several places in both Python and C++. For example, consider how tvm::tir::IntImm is implemented and registered.

This RFC advocates generating the C++ implementation directly from the Python class definition and registry. Moreover, since we still allow users to write C++ code manually in order to bring in more complex features, the object transpiler will provide basic validation for this manually written C++ code.

Here is an example of how an object can be described in Python and what the generated C++ code looks like:

@declare
class BaseExprNode(Object):
    """
    Base type of all the expressions.

    See Also
    --------
    BaseExpr
    """
    type_key = "BaseExpr"
    default_visit_attrs = False
    default_sequal_reduce = False
    default_shash_reduce = False

@declare
class IntImmNode(PrimExprNode):
    """
    Constant integer literals in the program.

    See Also
    --------
    IntImm

    Attributes
    ----------
    value
        The internal value.
    """
    type_key = "IntImm"
    value: ty.int64_t

/*!
 * \brief Base type of all the expressions.
 * \sa BaseExpr
 */
class BaseExprNode : public Object {
 public:
  TVM_DECLARE_BASE_OBJECT_INFO(BaseExprNode, Object);
};

/*!
 * \brief Managed reference to BaseExprNode.
 * \sa BaseExprNode
 */
class BaseExpr : public ObjectRef {
 public:
  TVM_DEFINE_OBJECT_REF_METHODS(BaseExpr, ObjectRef, BaseExprNode);
};

/*!
 * \brief Constant integer literals in the program.
 */
class IntImmNode : public PrimExprNode {
 public:
  /*! \brief The internal value. */
  int64_t value;
  void VisitAttrs(AttrVisitor* v) {
    v->Visit("dtype", &dtype);
    v->Visit("value", &value);
  }
  bool SEqualReduce(const IntImmNode* other, SEqualReducer equal) const {
    return equal(dtype, other->dtype) && equal(value, other->value);
  }
  void SHashReduce(SHashReducer hash_reducer) const {
    hash_reducer(dtype);
    hash_reducer(value);
  }
  static constexpr const char* _type_key = "IntImm";
  TVM_DECLARE_BASE_OBJECT_INFO(IntImmNode, PrimExprNode);
};

/*!
 * \brief Managed reference class to IntImmNode.
 *
 * \sa IntImmNode
 */
class IntImm : public PrimExpr {
 public:
  TVM_DEFINE_OBJECT_REF_METHODS(IntImm, PrimExpr, IntImmNode);
};

We name this the TVM Object Schema DSL, or tschema. In summary, tschema will bring several benefits to the TVM architecture:

  • Reduce boilerplate code;
  • Verify definitions to catch missing registrations such as TVM_REGISTER(...);
  • Enable deployment in all kinds of environments, even those without C++;
  • Automatically generate fields like type_child_slots for optimization;
  • Allow users to define objects in Python, then build and export them to a .o/.so file;
  • Provide more type information at runtime, enabling some optimizations in TIR compilation.
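To make the type_child_slots point concrete, here is a hedged sketch (the names and hierarchy data are illustrative, not tschema's actual API) of how the generator could derive the slot count by counting transitive children in the parsed hierarchy:

```python
# Hypothetical sketch: deriving `type_child_slots` from a parsed schema.
# `parent_of` maps each node name to its parent; the data is illustrative.
from collections import defaultdict

parent_of = {
    "PrimExprNode": "BaseExprNode",
    "RelayExprNode": "BaseExprNode",
    "IntImmNode": "PrimExprNode",
    "GlobalVarNode": "RelayExprNode",
}

children = defaultdict(list)
for child, parent in parent_of.items():
    children[parent].append(child)

def count_descendants(name):
    """Total number of transitive children; a lower bound for type_child_slots."""
    return sum(1 + count_descendants(c) for c in children[name])

print(count_descendants("BaseExprNode"))  # 4
```

Because the schema sees the whole hierarchy at once, such counts can be computed automatically instead of being hand-maintained.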

High-level Object Compilation Pipeline

  • Define TVM objects in Python. The object definition Python files live in a separate directory (which will not be part of PYTHONPATH) outside python/tvm/.
  • Run the Python class parser to generate the related .h and .cc files. This step can be triggered manually or via cmake. The generated files will be checked into the codebase so that code completion tools can locate them.
  • Compile TVM using cmake as usual.

Notice that the second step happens during (or before) compiling TVM itself. We provide a standalone tool to parse the Python code.
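As a rough illustration of that standalone step (the helper name below is an assumption, not tschema's actual API), the tool could import a schema file directly from its path, without the schema directory ever being on PYTHONPATH:

```python
# Hypothetical sketch: load a schema .py file by path so the schema
# directory never needs to be on PYTHONPATH.
import importlib.util
import os
import tempfile

def load_schema_module(path):
    """Import a Python file as a module from an arbitrary path."""
    spec = importlib.util.spec_from_file_location("tschema_input", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Demo with a throwaway schema file.
with tempfile.TemporaryDirectory() as d:
    schema_path = os.path.join(d, "expr.py")
    with open(schema_path, "w") as f:
        f.write("type_key = 'IntImm'\n")
    mod = load_schema_module(schema_path)
    assert mod.type_key == "IntImm"
```

The generator would then walk the loaded module, collect the declared classes, and emit the .h/.cc files.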

TSchema DSL

Take IntImm as an example; its schema definition looks like this:

@declare
class IntImmNode(PrimExprNode):
    """
    Constant integer literals in the program.

    See Also
    --------
    IntImm

    Attributes
    ----------
    value
        The internal value.
    """
    type_key = "IntImm"
    value: ty.int64_t

Several things need to be parsed:

  • Object name. In the above example it is IntImmNode, so class IntImmNode will be generated.
  • Type key. In the above example it is IntImm, so the reference class IntImm will be generated.
  • Parent class. In the above example it is PrimExprNode.
  • Member variables. In the above example there is one:
    • value, with its type annotation int64_t.
  • The constructor arguments in C++ will be generated in the same order as the fields in the Python class definition.
  • We will also generate default VisitAttrs, SEqualReduce, and SHashReduce methods unless the user sets the corresponding flag (e.g., default_visit_attrs) to False.
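The extraction above can be sketched with runtime introspection (the real parser may instead work on the Python AST; the helper below is purely illustrative):

```python
# Illustrative sketch of what the parser extracts from a schema class.
class PrimExprNode:
    """Stand-in parent so the example is self-contained."""

class IntImmNode(PrimExprNode):
    """Constant integer literals in the program."""
    type_key = "IntImm"
    value: "ty.int64_t"  # string annotation so `ty` need not be importable here

def extract_schema(cls):
    return {
        "name": cls.__name__,
        "type_key": cls.type_key,
        "parent": cls.__bases__[0].__name__,
        # __annotations__ preserves declaration order, which fixes the
        # order of the generated C++ constructor arguments.
        "fields": dict(cls.__annotations__),
    }

schema = extract_schema(IntImmNode)
print(schema["fields"])  # {'value': 'ty.int64_t'}
```

From this dictionary the generator has everything it needs to emit the node class, the reference class, and the default methods.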

Inplace C++ Source File Modification

As we mentioned before, there are cases where users need to implement complex functions manually. To leverage the convenience of Python declaration and automatic code generation in such cases, we provide an option to modify the C++ source file in-place, and give users the control to specify which part of the file can be modified.

We provide a comment parser for .h and .cc files, in which users can wrap the auto-generated sections with comments, e.g.,

// tschema: ObjectName

// The lines between `tschema: ObjectName` and `tschema: end`
// will be manipulated by tschema.

// tschema: custom-begin

// Users can also mark sections that should be left unchanged by tschema.
// This section will be inserted at the end of the class definition,
// right before the closing brace.

// tschema: custom-end
// tschema: end

Here is an example:

Before generation

// tschema: GlobalVarNode
// tschema: custom-begin
bool SEqualReduce(const GlobalVarNode* other, SEqualReducer equal) const {
  return equal(name_hint, other->name_hint) && equal.FreeVarEqualImpl(this, other);
}
void SHashReduce(SHashReducer hash_reduce) const {
  hash_reduce(name_hint);
  hash_reduce.FreeVarHashImpl(this);
}
// tschema: custom-end
// tschema: end

TSchema Definition

@declare
class GlobalVarNode(RelayExprNode):
    """
    Global variable that lives in the top-level module.

    A GlobalVar only refers to function definitions.
    This is used to enable recursive calls between functions.

    See Also
    --------
    GlobalVar

    Attributes
    ----------
    name_hint
        The name of the variable, this only acts as a hint.
    """
    type_key = "GlobalVar"
    default_sequal_reduce = False
    default_shash_reduce = False
    name_hint: ty.String

Generated Code

// tschema: GlobalVarNode
class GlobalVarNode : public RelayExprNode {
 public:
  String name_hint;
  void VisitAttrs(AttrVisitor* v) {
    v->Visit("span", &span);
    v->Visit("checked_type_", &checked_type_);
    v->Visit("name_hint", &name_hint);
  }
  static constexpr const char* _type_key = "GlobalVar";
  TVM_DECLARE_BASE_OBJECT_INFO(GlobalVarNode, RelayExprNode);
  // tschema: custom-begin
  bool SEqualReduce(const GlobalVarNode* other, SEqualReducer equal) const {
    return equal(name_hint, other->name_hint) && equal.FreeVarEqualImpl(this, other);
  }
  void SHashReduce(SHashReducer hash_reduce) const {
    hash_reduce(name_hint);
    hash_reduce.FreeVarHashImpl(this);
  }
  // tschema: custom-end
};
// tschema: end
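The marker handling described above could be implemented along these lines (a minimal sketch assuming the comment syntax shown; not tschema's actual implementation):

```python
# Sketch: extract the user-owned section between the custom markers so it
# can be preserved when the surrounding generated code is rewritten.
import re

CUSTOM_RE = re.compile(
    r"//\s*tschema:\s*custom-begin\n(.*?)//\s*tschema:\s*custom-end",
    re.DOTALL,
)

def extract_custom_section(source):
    """Return the text between custom-begin/custom-end, or "" if absent."""
    match = CUSTOM_RE.search(source)
    return match.group(1) if match else ""

source = """// tschema: GlobalVarNode
// tschema: custom-begin
bool SEqualReduce(...) const { ... }
// tschema: custom-end
// tschema: end
"""
print(extract_custom_section(source))
```

On regeneration, the tool would splice the extracted section back into the freshly generated class body, right before the closing brace.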

@tqchen @yzhliu @jwfromm @jroesch @junrushao1994 , also thanks Yizhi for the initial idea and RFC writing.


Thanks for the RFC and it looks super useful! I have two questions from the RFC:

  1. Are the generated .h and .cc files supposed to be tracked in the repo, or are they more like build files?

  2. For the in-place modification, I am a bit confused by the customized C++ code (the "Before generation" example). I imagine the TSchema definition is a standalone Python file; then where should this piece of C++ code be specified?

Thanks.

Hi @comaniac,

They will be tracked in the repo, but users should write the tschema for objects instead of writing the C++ files directly, except for the tschema custom sections.

Sorry that I did not make it clear: there are actually no such “before generation” files. Ultimately we will keep the generated code in our codebase and normal users will build it directly. I just used the code snippets to explain what tschema’s job is.

Thanks for the clarification 🙂 So this is more like a tool to facilitate the development process. After the C++ code has been generated, we can continue working on it as always.


Thanks, this is really interesting work! For those who need to add their own modifications to use TVM, it will be very helpful!

I’m just thinking about how frequently this new feature will be used. IMHO, advanced users who may benefit from it are more likely to write their C++ code directly, while other users may not really need it.

Another problem is that I guess it will be hard for an IDE or editor (e.g., VSCode, Vim with CTags) to track the code and provide navigation?

Yeah, this could be a useful tool to generate the generic templates or the code with fixed patterns, which is actually the major part of a node. For some other members, e.g. SEqualReduce and SHashReduce, we may still need users to manually check/add them, since they are not always Equal(this->a, other->a) && Equal(this->b, other->b);

Hi @jcf94,

First, this is not only for C++ code generation. In the future, we will extend it for Python/Rust code generation, which is helpful for unifying object definitions between different languages.

Second, some object fields are hard to fill in even for advanced users, e.g. type_child_slots, the number of an object’s children.

And last but not least, by defining objects with tschema, we will have more in-memory information about the objects themselves, for example, the type hierarchy between objects, the memory layout of an object, etc. This will enable more compilation optimizations in TIR and help us improve TIR’s type system (my next step).

Since we will keep the generated C++ code in the codebase, it will make no difference from the current code in terms of code navigation.

@zhiics Yep, we have an option to turn off the default method generation and allow users to fill in their customized code snippets.

Hey @ziheng! I think this is a great idea. As someone who is pushing on the Rust bindings right now (along with @jroesch), I love the idea of deduplicating work.

One design choice I see is whether to centralize or decentralize code generation. It seems like your original design leans towards centralizing it. I would like to start a little discussion on why/whether this is the right idea.

Decentralizing code generation could have some benefits; here’s how I see it looking. The schemas themselves live in some central location like /schema, and they are defined simply as data (perhaps JSON). Each “backend”, including C++ and Python, is then responsible for reading this data and generating code for itself. The downside is that there may be some duplicated logic in the code generation. But on the upside, each backend gets to use different tooling to do the codegen; for example, it would be nice to use Rust (using syn and quote) to generate the Rust code. This could also simplify the story for implementing additional code for the methods: each backend just handles it itself, with no need to toggle or parse anything.

Here’s an example of what the JSON could look like:

[
    {
        "name": "IntImmNode",
        "key": "IntImm",
        "ref": "IntImm",
        "parent": "PrimExprNode",
        "fields": { "value": "int64" }
    },
    ...
]

You could imagine grouping these schemas into namespaces or something too, if you want.
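To make the decentralized idea concrete, here is a hedged sketch of one backend consuming such JSON and emitting an abbreviated C++ class (the field names follow the JSON example above; the type mapping and output shape are assumptions):

```python
# Sketch: a per-language generator reading the shared JSON schema.
import json

schema_json = """
[{"name": "IntImmNode", "key": "IntImm", "ref": "IntImm",
  "parent": "PrimExprNode", "fields": {"value": "int64"}}]
"""

CPP_TYPES = {"int64": "int64_t"}  # assumed mapping from schema types to C++

def emit_cpp(entry):
    """Emit an abbreviated C++ node class for one schema entry."""
    fields = "\n".join(
        f"  {CPP_TYPES[t]} {name};" for name, t in entry["fields"].items()
    )
    return (
        f"class {entry['name']} : public {entry['parent']} {{\n"
        f" public:\n{fields}\n"
        f"  static constexpr const char* _type_key = \"{entry['key']}\";\n"
        f"}};"
    )

for entry in json.loads(schema_json):
    print(emit_cpp(entry))
```

A Rust backend would read the same JSON but use its own tooling (syn/quote) for emission, which is exactly the duplication-versus-flexibility trade-off described above.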

On the topic of checking the generated code in, I’m not sure why that is necessary. As long as the files are generated by the build system, shouldn’t autocomplete and stuff work fine?

Hi @mwillsey, decentralizing code generation sounds like a good idea technically! We chose Python mainly for user-friendliness. I would also like to know @tqchen’s opinion here.

We could make an automated build pipeline, but checking in the code directly keeps the project codebase clearer. After all, not all users need to know those details.

I like the idea of using Rust to generate the Rust side. In the meantime, a Python syntax for data structure setup can be useful in the future when we want to design custom data types from the Python side. One potential solution is to keep the Python schema frontend and generate a JSON exchange format that the Rust generator can take. Essentially a separation of frontend, IR representation, and backend.