[RFC] TVM Object Schema DSL

I agree with most of the points.

The most interesting point to be discussed is whether we do:

  • K0: inplace generation (like clang-format)
  • K1: complete generation

When we have a case where things are completely self-contained, then I agree that K1 (and not checking in the code) could be the right way.

On the other hand, the reality is that:

  • We cannot do a quick complete switch, as most of the objects are manually written and it is desirable to have a gradual path of migration.
  • There are other member functions which are already written in Node or Ref, and they need to be added.

One valid concern about checking in generated code is consistency and git history. If the generated code is quite complicated (like a parser generator's output) then I agree it would be a concern. In this case, our goal is to generate readable code. Additionally, running the generator like clang-format in the linting stage would address the consistency issue.

So in summary, I would view the tool more as a developer auxiliary tool (like clang-format) than as a generator. This view would also allow us to gradually opt in to generation, while still manually writing objects in cases where that is necessary.

@tqchen I think there is also the point of whether we are going to generate JSON IR now to avoid accidentally depending on the schema, or whether we are just going to proceed with the PR as is. I would favor generating the JSON and thereby separating backend and frontend now, so that organic code growth doesn’t run into problems later on.

wrt K0 vs K1: shouldn’t it be possible to generate complete classes and derive from them in C++ where we have additional features (i.e. helper functions) that cannot be translated into this language? I agree that clang-format would mitigate the effects on git history and consistency, but I’m not sure I see why we are unable to do a gradual switch with K1.
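To make the "generate complete classes, then derive" idea concrete, here is a minimal Python sketch. The names (`GeneratedVarNode`, `VarNode`, `same_as`) are hypothetical illustrations, not actual TVM classes:

```python
# --- fully generated from the schema, never edited by hand ---
class GeneratedVarNode:
    """Plain data layout emitted by the generator (hypothetical)."""

    def __init__(self, name_hint: str, dtype: str):
        self.name_hint = name_hint
        self.dtype = dtype


# --- hand-written, lives permanently in the codebase ---
class VarNode(GeneratedVarNode):
    """Adds helper functionality the schema language cannot express."""

    def same_as(self, other: "GeneratedVarNode") -> bool:
        # reference equality, as an example of a hand-written helper
        return self is other
```

Under this split, regenerating the schema output never touches the hand-written subclass, which is why a full K1 switch could still be gradual.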

I agree with the need for a JSON serialization format for exchange and general separation. On the Python side, though, it might be useful for both frontend and backend to use the same internal repr (while making sure they are serialized to and deserialized from JSON).

K0 vs K1: while in theory it is possible to do a clean separation of a class, there are more cases we do not (due to Hash and SEqual customization). So K0 is still somewhat easier.

Using the same internal repr makes sense to me in Python (i.e. having one library to serialize and deserialize).

What do you mean by “there are more cases we do not”? My proposal for customizing Hash and SEqual is to just declare stubs and separate the implementations.
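As a hypothetical sketch of "declare stubs and separate the implementations" (the class and function names below are made up for illustration): the generator emits only the field layout and stub declarations, and the bodies live in a hand-written module that is attached afterwards.

```python
# --- generated declaration: fields plus SEqual/SHash stubs ---
class MapNode:
    """Hypothetical generated node with customized equality/hashing."""

    def __init__(self, data: dict):
        self.data = data

    def sequal(self, other: "MapNode") -> bool:
        raise NotImplementedError("implemented in a hand-written module")

    def shash(self) -> int:
        raise NotImplementedError("implemented in a hand-written module")


# --- hand-written implementations, kept in a separate permanent file ---
def _map_sequal(self, other):
    return self.data == other.data

def _map_shash(self):
    return hash(frozenset(self.data.items()))

MapNode.sequal = _map_sequal
MapNode.shash = _map_shash
```

The generated file stays purely mechanical, while the custom Hash/SEqual logic never needs to be mixed into generated output.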

Although I like how robust @areusch’s design is, I’m not sure we need an IR. It seems like we could just skip the parser -> IR step and go straight from schema to language-specific generator. Isn’t this how all other similar projects work (protobuf, cap’n proto)?

If we still want to go the schema -> parser -> ir -> language-specific generator route, why don’t we just start with the IR -> language-specific generator part? That way we can get something working and see what the pain points are.

Regarding the python schema here are some issues I foresee:

  • How do we add documentation to fields?
  • How do we add attributes to fields?

I’m still against using Python here. I think it will be easy to get trapped by the limitations of the Python AST when we want to add features in the future.

I agree with @areusch on separating declarations from implementations. Reading this thread, I think everyone is in agreement that the “TVM Object Schema DSL” should clearly and easily define the data layout of all Objects in TVM (in the sense of plain-old-datatype) with type information.

I think the current debate (apart from choosing a schema language, which is IMO less of a real problem) is how we should deal with functionality. I agree that mixing generated and hand-written code is undesirable, and ideally we should separate out all implementation to a separate file that can live in the codebase more permanently. I’m not sure what this would look like (and perhaps this requires another RFC), but instead of using

default_visit_attrs = False

we could declare that the base Object class has a function void VisitAttrs(AttrVisitor *v) which each language backend defines an implementation for. I think generally, this would help with e.g. Rust which doesn’t have inheritance. I guess my point is that the part where code starts being mixed should be handled by the language backend (in @areusch’s design), whereas we can expose a more general function API/schema.
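A hypothetical Python rendering of this idea (the C++ declaration above would be the real thing; all names here are illustrative): the schema declares that every `Object` has a visit-attrs entry point, and each language backend supplies the implementation.

```python
from abc import ABC, abstractmethod


class AttrVisitor(ABC):
    """Visitor interface declared by the schema (hypothetical)."""

    @abstractmethod
    def visit(self, name: str, value) -> None: ...


class Object:
    # Declared once on the base class; each backend defines the body.
    def visit_attrs(self, v: AttrVisitor) -> None:
        raise NotImplementedError


class IntImmNode(Object):
    """Example generated node with one field."""

    def __init__(self, value: int):
        self.value = value

    def visit_attrs(self, v: AttrVisitor) -> None:
        v.visit("value", self.value)


class CollectVisitor(AttrVisitor):
    """Example backend-side visitor that records visited fields."""

    def __init__(self):
        self.seen = []

    def visit(self, name, value):
        self.seen.append((name, value))
```

Because `visit_attrs` is part of the declared function API rather than generated inline, a backend without inheritance (e.g. Rust) could implement the same contract with a trait instead.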


I think the main question here is that the IR itself may not be meant to be written manually (per the previous post’s discussion about recursive types), while we already have a concise syntax in the Python DSL (per @junrushao’s point).

About the frontend language for the schema: while it could be tempting to design a separate language syntax for it (schemalang++), that would also mean an extra thing for the developer to learn, which adds a nontrivial amount of overhead.

Additionally, it is important to consider the Python-first design principle and the potential need to express ADTs collectively with TVMScript; having a clear schema syntax in Python helps streamline that process.

To address the two particular questions about attributes, we could reuse existing solutions in the Python ecosystem (rather than creating our own):

  • Documentation: adopt numpy doc to document the fields
  • Attributes to the fields: it is unclear whether attributes are needed beyond the type signature, since most properties of an Object field are already reflected in its typing. But in case we really want to introduce such a thing, we could still have a global static assignment (or a map), depending on how rare the attribute is. Right now I think most of the attribute info should be enclosed in typing.
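As one possible sketch of the two points above — numpydoc-style field documentation plus a global side map for the rare attribute that typing cannot capture. None of these names (`TensorInfo`, `FIELD_ATTRS`) exist in TVM; they are assumptions for illustration:

```python
from typing import Optional


class TensorInfo:
    """Tensor metadata (hypothetical schema class).

    Parameters
    ----------
    shape : list of int
        Shape of the tensor.
    dtype : str, optional
        Data type of the tensor elements.
    """

    # Field properties live in the type annotations themselves.
    shape: list
    dtype: Optional[str]


# Rare extra attributes go into a global side map rather than new syntax.
FIELD_ATTRS = {
    ("TensorInfo", "dtype"): {"default": "float32"},
}
```

The annotations carry the type information for the generator, the docstring carries per-field documentation in the style TVM already uses, and the side map handles the uncommon cases without inventing attribute syntax.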

In most cases the member functions only need one implementation, in C++. I agree that we could have a strategy where the declaration and implementation are generated separately, although that does not preclude us from having a mixed section for custom functions when needed.

I think @tkonolige’s comment on documentation makes sense, which hits another of Python’s fundamental quirks: it doesn’t even have a standard syntax for per-argument documentation. It means we need to parse the docstring to actually get the doc for each argument.

To alleviate the problem, numpy provides an official docstring parsing tool: numpydoc. It works well with the exact same docstring style we have been using in the TVM repo. An example adapted from Stack Overflow:

class Photo():
    """
    Array with associated photographic information.


    Parameters
    ----------
    x : type
        Description of parameter `x`.
    y
        Description of parameter `y` (with type not specified)

    Attributes
    ----------
    exposure : float
        Exposure in seconds.

    Methods
    -------
    colorspace(c='rgb')
        Represent the photo in the given colorspace.
    gamma(n=1.0)
        Change the photo's gamma exposure.

    """

    def __init__(self, x, y):
        print("Snap!")

from numpydoc.docscrape import NumpyDocString

doc = NumpyDocString(Photo.__doc__)
print(doc["Summary"])
print(doc["Parameters"])
print(doc["Attributes"])
print(doc["Methods"])

We can see that it is a straightforward library to use, while staying consistent with our current style.

BTW, I don’t think JSON allows any comments/documentation by design. Of course, we could argue that docs can be filled in as a JSON field, but that does not really look human-readable, especially when we work with multi-line documentation, where we have to write escape characters manually…
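To illustrate the point with the standard library: a multi-line docstring stored as a JSON string field collapses into a single line full of escape characters.

```python
import json

# A typical numpydoc-style fragment of per-field documentation.
docstring = """Parameters
----------
x : int
    Description of parameter `x`.
"""

# JSON has no multi-line strings: every newline becomes a literal \n.
encoded = json.dumps({"doc": docstring})
print(encoded)
```

The result is one long escaped string, which is exactly why JSON works well as an exchange format but poorly as the format humans author documentation in.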

I agree. Thanks @tkonolige @junrushao for pointing it out :)

Python syntax itself certainly has limitations, but because of the rich ecosystem, there are reference solutions that work around these potential drawbacks (and that developers are already familiar with). Some of the drawbacks could be worked around by designing another frontend language; on the other hand, not taking a position on frontend language design is perhaps a good thing here, and would benefit future TVMScript integration.

Thanks @tqchen! Just to be clear: we should avoid relying on the Python AST (which is unstable) if possible, and use Python’s built-in inspection instead.
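A quick sketch of what "built-in inspection" looks like in practice, using the stable `typing.get_type_hints` and `inspect.getdoc` APIs instead of parsing source with `ast` (the `Var` class here is a made-up example, not a TVM class):

```python
import inspect
from typing import get_type_hints


class Var:
    """A variable in the IR."""

    # Field types declared as class-level annotations.
    name_hint: str
    dtype: str


# Read the schema through introspection: no source parsing required.
hints = get_type_hints(Var)
doc = inspect.getdoc(Var)
```

Unlike the AST, which can change shape between Python versions, these introspection APIs are part of the documented standard library surface.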
