[DISCUSS] Runtime Array Containers Array/ADT/String

wweic · November 7, 2019, 11:01pm

Sure. I’m happy to scope the work and send out a PR.

wweic · November 14, 2019, 9:03pm

@tqchen I believe most ADT objects have no more than 3 fields, the memory cost should be ok without move support. We can also have the ability to move large ADT objects by having 2 modes of ADTObject, if size_ is small(less than 3), data_ points to memory directly following size_. Otherwise it points to heap memory allocated somewhere else.

tqchen · November 14, 2019, 10:06pm

to keep things simple, let us just use one mode(no move)

wweic · November 15, 2019, 1:57am

Sure. It won’t be an issue as well.

tqchen · December 11, 2019, 4:23pm

Thanks for getting the ADT landed, any interest in pushing for the rest? Array/String?

wweic · December 11, 2019, 5:49pm

Yes will do. Will look into the two.

wweic · December 20, 2019, 3:41am

@tqchen As for Array, what’s the use cases you expect? It might affect the design decisions. Questions:

Why support move from std vector? We might need to have 2 variants(c++ classes) of array, one for std, one for in place array. It doesn’t make sense for std based variant to inherits from InplaceArrayBase.
Use int32 or int64. Unless it’s needed for some use case, I think int32 is sufficient.

tqchen · December 20, 2019, 3:52am

The need for move is an interesting one, we could debate whether or not we want it. It really depends on whether a user likes to have the native(std::vector) container a lot or not.

For string it definitely makes sense as people likes native std::string, and it can be used to exchange string among languages.

I am not that sure about Array, on one hand, if the Array provides enough feature like a vector, perhaps it is not strictly necessary. On the other hand, having the move support(with one additional cost of 8 bytes) seems to be not too bad as a deal.

Hmm, if we are going to support the explicit data_ pointer, it is somehow no longer an InplaceArray, although there are some similar functions that we might want to reuse. Most of the push_back functions are on the Array case(they are un-aware that the underlying container is from a vector or other cases).

I do not have strong opinion of int32 vs int64 in this case

tqchen · December 20, 2019, 3:55am

Would be great to get others’ thoughts @FrozenGene @haichen @junrushao

wweic · December 20, 2019, 5:24am

Thanks! While waiting for Array feedbacks. As for String, I think we can simplify it to be a read-only container, it keeps reference to data created by std::string. It’s just for cross language sharing. Does this make sense?

FrozenGene · December 20, 2019, 11:40am

My opinion is we don’t need to support to move from std::vector. When we design it, we should encourage developers using this. From the experience, we like to use std::string, but we like to use XXX::Array / XXX::SmallVector and so on.

wweic · December 28, 2019, 8:22pm

Thanks @FrozenGene for your input! Looks like you vote for a full feature array implementation. I think it will also cause less confusion if we can program directly in tvm’s containers as much as possible.

@tqchen For vector/string, if we want to support move from std containers without copying, we have to assume the original container object’s lifecycle is longer than our reference object. But I don’t think it’s always true. For example, std::string is usually short lived. We might have to copy the content anyway, but we can reuse InplaceArrayBase actually now to reduce one extra indirect memory access.

tqchen · December 28, 2019, 9:25pm

Re: life-cycle management in move semantics, we will need to explicitly ask user to move, which means std::move(mystr) would results in a move and the original container is moved to FromStd container and thus its lifecycle will be the same as the object. Otherwise, it will results in a copy, and we invokes the normal inplacearray allocation(but we will need to store the same pointer for compact reasons). So the main question is whether the additional 8 bytes cost worth it.

Here are some additional thoughts that I think worth discussing.

Another thing about String, which might worth considering is whether we want to lazily memoize hash, given string is immutable here.

Given that both Array and ADT are immutable(under the hood, as we do copy on write for array). It is interesting to ask whether we want to simply use Array to represent a Tuple. We will face this issue when we enhance the language frontend to do the conversion. For example, when a tuple is passed from python side, it gets converted into ADT, but when the implementation might requests an Array instead. Having a different container means we need to do forced conversion(either copy or move). The only complication might be to teach the relay vm to recognize these additional runtime objects

wweic · December 28, 2019, 11:13pm

Re: move support, yes we can get rvalue reference of std::string for example, we can not simply save the data pointer though, we also need to set the rvalue’s data pointer to null, otherwise the rvalue’s destructor will return the memory back to allocator. But the data pointer is not public I think, so we can not set it to null. Let me know if this explains the problem.

If space efficiency is not a concern, we can add cached hash value.

For ADT object, we need to know the tag of the object(to identify which variant it belongs to). This is not needed by Array. Unifying the two will have waste for Array objects I think.

tqchen · December 29, 2019, 12:44am

I do not mean unify ADT and array, but use Array to store tuples(as opposed to ADT with tag=0), for the rvalue move, we can simply move the std::string to a local container

Here is an example code that supports move(string should be similar) for Array

class ArrayObj::FromStd : public ArrayObj {
 private:
   std::vector<ObjectRef> data_;

   explicit FromStd(std::vector<ObjectRef>&& data)
      : data_(std::move(data)) {
      data_ = data_.data();
      CHECK_LE(data_.capacity(), UINT_MAX);
      size_ = static_cast<uint32>(data_.size());
      capacity_ = static_cast<uint32>(data_.capacity());
   }
   friend ObjectPtr<ArrayObj> ArrayFromVector;
};

wweic · December 29, 2019, 1:16am

Oh, I think it makes sense to use Array to store tuples. They are often used interchangeably.

Thanks for the clarification, I was under the impression we want to get rid of std containers in our containers object. So we just need to ensure the head of our object to be POD-C compatible.

tqchen · December 29, 2019, 6:32pm

yes, because after the construction, the reference will no longer see the non-header part, nor do they need to(as they can always call the deleter to delete)

tqchen · December 30, 2019, 4:43pm

Combining the opinions so far, seems that we start to agree on building a fully featured Array and not support move(given it is not a common usecase). I also start to agree on this.

I am still not too sure about String given that there are quite a few cases where string are still constructed natively.

Supporting move from string allows interesting cases like constructing a long std::string and move into it, or move language native string(like python string) into the container.

On the other hand, directly support an inlined string is also not bad for most cases where number of chars are smaller than 8 or 16 bytes. Note that we need to cost these extra bytes(or even more) anyway to store the original data structure, we might as well just copy the content over. So the benefit for move might only occur for very large bytearrays.

Finally, having a data* field to support move won’t prevent us to use a smarter create array inline and copy the content or move on the size of the string being constructed. Although it maybe overkill. So the may difference would be whether we want to pay 8 bytes for such flexibilibity.

Given that this decision won’t affect the user API, but mainly the ABI(the behavior when we constructed a compiler that directly mainpulates the String) perhaps we can make a call and move forward for now. Also cc @junrushao @haichen @ajtulloch @jroesch for their thoughts.

wweic · January 2, 2020, 3:02am

So for String, let’s implement the move only version, the inline string can be added later if needed.

tqchen · January 2, 2020, 4:13am

sounds good, if we decided to pay the cost of data*, then the followup decision of which variation can be deferred.