[RFC] Relay Containers Array/Map/String

xiandi · August 10, 2020, 10:39am

I am an MLSys engineer who mainly provides server-side support for my company's NLP models. In our NLP business scenarios we found that TVM provides superior tensor computing performance, so we intend to promote TVM within the company.

Motivation:

In addition to tensor computing, we want to integrate more data processing operations into TVM Relay to provide end-to-end solutions.

TVM already supports data types such as String/Array/Map at the runtime level, but there is no corresponding interface in the Relay IR. We would like to add corresponding support at the IR level.

Example

This is an example of a dictionary lookup that converts a text sequence into an id array and then runs model inference. To implement this, we may need to support String/List/Map and some of their operations at the Relay level.

import tvm
from tvm import relay

py_vocab = {"hello": 0}

# relay.ListType, relay.StringType and relay.lookup are the additions proposed in this RFC.
def term2idx():
    vocab = tvm.runtime.container.Map(py_vocab)
    terms = relay.var("terms", relay.ListType(relay.StringType()))
    idxes = relay.lookup(vocab, terms)  # lookup is an op
    return relay.Function([terms], idxes, ret_type=relay.ListType(relay.scalar_type("int32")))
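
For reference, the behaviour we have in mind for the lookup op can be written in plain Python. This is only a sketch of the intended semantics, not the op itself. The function name `lookup_ids` and the `unk_id` fallback for terms missing from the vocabulary are assumptions made for illustration.

```python
from typing import Dict, List


def lookup_ids(vocab: Dict[str, int], terms: List[str], unk_id: int = -1) -> List[int]:
    """Map each term to its id; terms not in vocab get unk_id (assumed convention)."""
    return [vocab.get(term, unk_id) for term in terms]


py_vocab = {"hello": 0}
print(lookup_ids(py_vocab, ["hello", "world"]))  # [0, -1]
```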

Other Application

Converting PyTorch models also requires these data structures, for example:

Roadmap

We want to do this in three stages:

1. Add String IR types and support basic String operations
2. Add Array IR types and support basic Array operations
3. Add Map IR types and support basic Map operations

We have written a demo that adds the String type. We hope to get community support and feedback and, if possible, to merge it into TVM master, which would reduce our branch maintenance costs.

PR LINK: [relay][ir] add string type to relay ir by cloud-mxd · Pull Request #6242 · apache/tvm · GitHub

Good day!

xutianming · August 10, 2020, 12:42pm

Good idea. I am compiling a torch model that takes a Dict input, and TVM raised the error above.

xiandi · August 10, 2020, 1:10pm

@tqchen @MarisaKirisame @jroesch

junrushao · August 10, 2020, 4:53pm

Would you like to enhance the relay type system, adding types like string, array and map?

xiandi · August 10, 2020, 5:02pm

Yes, that will help convert TorchScript models and integrate more business logic into the TVM stack. We will need to discuss the best way to support them.

junrushao · August 10, 2020, 5:14pm

I can understand the motivation for adding string support for NLP-related models. Just a few questions out of curiosity:

1. While embedding lookup can be implemented using a map lookup, TVM does not provide tokenization. Does this imply that tokenization is done outside but the lookup has to be done inside the TVM module? Is there any specific reason that it must be done inside the TVM module?
2. TVM String uses “char” as storage and is not aimed at Unicode handling. Does this imply that for now we only focus on ASCII strings?
3. What is the plan for the array and map types in Relay? Do you introduce an ADT, or expand the type system to support arrays and maps, for example with type inference over arrays and maps? Are the types inside arrays and maps homogeneous?
4. How does it lower to TIR? Into an intrinsic call that invokes array indexing and map lookup?
5. Map is not in the runtime right now due to concerns about binary size, but to use Map in generated code it seems that we need to move it into the runtime. Is that correct?

xiandi · August 10, 2020, 6:11pm

First of all, thank you very much for your questions. In my opinion:

1. Tokenization and lookup can be implemented within TVM by adding a tokenizer op or substring operations, similar to TorchScript. We place all NLP logic inside TVM mainly for convenience of deployment and to promote adoption.
2. We now focus on UTF-8 strings.
3. Relay already implements List as an ADT. However, we intend to extend the type system because the ADT list may have performance issues. The item types of arrays and maps are homogeneous, which already meets our business needs.
4. We plan to add some Relay VM instructions to support operations on the string, array and map containers.
5. Yes, we need Map in the runtime.
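
To make the performance concern about the ADT list concrete, here is a rough plain-Python analogy, not Relay code. The `Cons` class and the `from_list` and `nth` helpers are hypothetical names for illustration. Reaching element i of a cons-style list means following i links, while a contiguous array is indexed directly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Cons:
    """One cell of a singly linked (ADT-style) list; a tail of None plays the role of Nil."""
    head: int
    tail: Optional["Cons"]


def from_list(items: List[int]) -> Optional[Cons]:
    """Build the linked list back to front so the element order is preserved."""
    node: Optional[Cons] = None
    for item in reversed(items):
        node = Cons(item, node)
    return node


def nth(node: Optional[Cons], index: int) -> int:
    """Follow `index` links to reach the element: linear in the index."""
    for _ in range(index):
        if node is None:
            raise IndexError("index out of range")
        node = node.tail
    if node is None:
        raise IndexError("index out of range")
    return node.head


items = list(range(1000))
adt_list = from_list(items)
# Same answer, but 999 link hops for the linked list versus one direct index for the array.
assert nth(adt_list, 999) == items[999]
```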

The above is still in the discussion stage. We hope to get more support and feedback from the community. Thank you!

junrushao · August 10, 2020, 7:11pm

Thank you for your quick response. It certainly makes sense to me!

jknight · August 10, 2020, 10:54pm

We now focus on UTF-8 strings.

I don’t follow. Are you planning on expanding support to include utf8 as well?

Yes, we need Map in the runtime.

Runtime binary size implications are increasingly important to e.g. uTVM. CC @areusch. Could this support be optionally compiled into the runtime? Though that might get confusing when models run in some runtimes but not others. Is there any way for a runtime to advertise what capabilities it supports and then validate models against their needed set of capabilities?
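
As a rough sketch of what I mean by advertising and validating capabilities (plain Python, purely illustrative): the capability names, `missing_capabilities` and `validate` are all hypothetical and not part of TVM.

```python
from typing import FrozenSet, Set

# Hypothetical capability sets that two differently built runtimes might advertise.
MINIMAL_RUNTIME: FrozenSet[str] = frozenset({"tensor", "string"})
FULL_RUNTIME: FrozenSet[str] = frozenset({"tensor", "string", "array", "map"})


def missing_capabilities(required: Set[str], advertised: FrozenSet[str]) -> Set[str]:
    """Return the capabilities the model needs that the runtime does not advertise."""
    return set(required) - advertised


def validate(required: Set[str], advertised: FrozenSet[str]) -> None:
    """Fail at load time, with a clear message, instead of failing mid-inference."""
    missing = missing_capabilities(required, advertised)
    if missing:
        raise RuntimeError(f"runtime lacks required capabilities: {sorted(missing)}")


model_needs = {"tensor", "map"}
validate(model_needs, FULL_RUNTIME)  # passes silently
print(missing_capabilities(model_needs, MINIMAL_RUNTIME))  # {'map'}
```

With something like this, a model that needs Map would be rejected up front by a runtime built without it.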

junrushao · August 10, 2020, 11:05pm

Yeah, I agree with the point about runtime binary size, and that is the reason why I decided not to put tvm::Map into the runtime before. In the case of uTVM, it implements its own pure C runtime IIRC @areusch, so it won’t be affected by this proposal.

jknight · August 10, 2020, 11:07pm

But if TVM’s various runtime capabilities start to diverge, isn’t that a bad thing for TVM user experience?

junrushao · August 10, 2020, 11:11pm

Good point, I agree! Another approach might be a compilation flag that moves the related functionality in or out of the runtime, for example a CMake flag “USE_MAP_IN_RUNTIME”. If we need minimal binary size, we can turn the flag off.

xiandi · August 11, 2020, 2:01am

Good point, FYI

xiandi · August 11, 2020, 2:24am

Maybe we can refer to the Lua extension lutf8 and provide some UTF-8 functions in Relay?

Follow @junrushao's proposal.

Thank you and best regards!

junrushao · August 11, 2020, 8:28am

UTF-8 string operations are highly non-trivial. If no other operations are needed besides look-up, a better way might be to treat the strings as opaque buffers.
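
A minimal plain-Python sketch of the opaque-buffer idea, assuming the vocabulary keys are encoded once outside the model. The helper names `build_vocab` and `lookup_bytes` and the `unk_id` fallback are hypothetical. The lookup only hashes and compares bytes and never decodes them.

```python
from typing import Dict, List


def build_vocab(terms: List[str]) -> Dict[bytes, int]:
    """Encode each term once, up front; after this the keys are plain byte strings."""
    return {term.encode("utf-8"): idx for idx, term in enumerate(terms)}


def lookup_bytes(vocab: Dict[bytes, int], tokens: List[bytes], unk_id: int = -1) -> List[int]:
    """Look up tokens by byte-wise equality only; no Unicode handling is involved."""
    return [vocab.get(tok, unk_id) for tok in tokens]


vocab = build_vocab(["hello", "\u4f60\u597d"])
tokens = ["\u4f60\u597d".encode("utf-8"), b"hello", b"missing"]
print(lookup_bytes(vocab, tokens))  # [1, 0, -1]
```

One caveat: byte-wise equality does not account for Unicode normalization, so any normalization would have to happen before the strings reach the lookup.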

wwwxxxhhh · August 11, 2020, 8:36am

I modified the two files, but it didn’t work. The problem is still there, and I think I will go crazy.

xiandi · August 11, 2020, 8:37am

An opaque buffer is also OK with me.

wwwxxxhhh · August 11, 2020, 8:39am

About the previous problem: I cannot reply, because I can only reply three times. Do you have an email address? I need your help. I used the method and modified the two files, but it does not work. I think I will go crazy. I have tried lots of methods, but none of them succeeded.

xiandi · August 11, 2020, 10:01am

Maybe we can also add a UnicodeString type and then add two conversion functions:

encode(str: UnicodeString) -> String
decode(str: String) -> UnicodeString
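
As an analogy in plain Python, where `str` plays the role of the proposed UnicodeString and `bytes` plays the role of the char-based String, the two conversions might behave like this. This is a sketch that assumes UTF-8 as the encoding and is not the Relay interface.

```python
def encode(text: str) -> bytes:
    """UnicodeString -> String: turn code points into UTF-8 bytes."""
    return text.encode("utf-8")


def decode(data: bytes) -> str:
    """String -> UnicodeString: interpret the bytes as UTF-8 code points."""
    return data.decode("utf-8")


text = "h\u00e9llo"          # 5 code points; the second one is U+00E9
raw = encode(text)
assert len(text) == 5
assert len(raw) == 6         # U+00E9 takes two bytes in UTF-8
assert decode(raw) == text   # the round trip is lossless
```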

tqchen · August 13, 2020, 12:03am

Thanks for the proposal. There are a few things that we need to figure out besides the types themselves:

P0: mutability of the container
- Right now all the containers are immutable, which also makes Relay analysis simpler. My understanding is that immutable containers are good for our use case (as we mainly do lookup), but please confirm whether that is the case.
P1: the type signature of the types themselves. There are a few alternatives. Having a concrete Type for Array/Map/String is certainly one possibility, but we will need to confirm that we won’t add additional types besides these, and that any additional ones should be added through ADT.
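
To illustrate the immutability point in P0 with a plain-Python analogy (not TVM code; `with_entry` is a hypothetical helper): a lookup-only workload never needs in-place updates, and an "update" can instead produce a new container.

```python
from types import MappingProxyType
from typing import Mapping


def with_entry(mapping: Mapping[str, int], key: str, value: int) -> Mapping[str, int]:
    """Return a new read-only map with one extra entry; the input is left untouched."""
    updated = dict(mapping)
    updated[key] = value
    return MappingProxyType(updated)


vocab = MappingProxyType({"hello": 0})        # read-only view of the dictionary
ids = tuple(vocab[t] for t in ("hello", "hello"))
print(ids)  # (0, 0)

try:
    vocab["world"] = 1                        # in-place mutation is rejected
except TypeError:
    print("vocab is immutable")

bigger = with_entry(vocab, "world", 1)
assert "world" not in vocab and bigger["world"] == 1
```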