DataLoader -- an API to wrap datasets from other machine learning frameworks

In this RFC, I propose a new feature in TVM called the DataLoader. The DataLoader is a soft wrapper around dataset classes from other machine learning frameworks.

Motivation:

There are a wide variety of datasets in the machine learning framework ecosystem, and each has its own API. Since TVM does not have its own datasets, we must write code that uses datasets from other frameworks (most often PyTorch, TensorFlow, Keras, and MXNet). Because these datasets all have different APIs, it is difficult to write generalized code that uses a dataset without assuming it comes from one of these specific frameworks.

For example, when quantizing Relay models with data-aware quantization, it is useful to have a unified API that wraps datasets so we don’t have to handle each type of dataset separately.

Existing APIs:

The most popular frameworks have very different APIs for their datasets.

PyTorch datasets support indexing by overriding __getitem__, so you can index directly into the dataset object:

for i in range(len(dataset)):
    sample = dataset[i]
    # Do something with the data

MXNet provides the option to create a Python iterable out of the dataset:

iterable_dataset = iter(dataset)
for data in iterable_dataset:
    sample = data.asnumpy()
    # Do something with the data

TensorFlow is very similar to MXNet; however, converting an individual datapoint to numpy uses the .numpy() method instead of .asnumpy(), so it would look something like this:

iterable_dataset = iter(dataset)
for data in iterable_dataset:
    sample = data.numpy()
    # Do something with the data

Keras datasets are actually provided as a single large numpy array, so to use a Keras dataset you have to iterate over the batch axis yourself:

for i in range(0, len_dataset, batch_size):
    data = dataset[i : i + batch_size]
    # Do something with the data

Proposed solution:

I propose writing a class that iterates over an existing dataset and also exposes the batch size and the number of batches in the dataset, information that is useful to software that consumes the DataLoader.

Here is the abstract DataLoader class (subclasses will implement the DataLoader for each framework).

class DataLoader:
    """Wrapper class for data loader or data set classes implemented by other machine learning
    frameworks. Use this class when you want to use different machine learning framework datasets
    interchangeably."""

    def __iter__(self):
        """Returns the DataLoaderIterator."""
        return self

    def __next__(self):
        """Returns the next batch of data.

        Returns
        -------
        inputs : List of ndarray
            The inputs to be provided to the graph.
            The list is of the form [batched_input_1, batched_input_2, ..., batched_input_n]

        labels : List
            The expected outputs of the graph.
            The length of labels should be equal to the batch size. If the DataLoader doesn't
            have labels, labels will be None.
        """
        raise NotImplementedError

    def get_num_batches(self):
        """Returns the number of batches the DataLoader has.

        Returns
        -------
        num_batches : int
            The number of batches the DataLoader contains.
        """
        raise NotImplementedError

    def get_batch_size(self):
        """Gets the batch size.

        Returns
        -------
        batch_size : int
            The size of the batch returned by the DataLoader.
        """
        raise NotImplementedError

Because the DataLoader is its own iterator, you can loop over it directly, which makes the API similar to the TensorFlow and MXNet dataset APIs and familiar to users who want to use it directly. The other fields store information that is useful for things like calculating accuracy or averaging across batches, namely the batch size and the total number of batches.
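For example, consuming a DataLoader directly would look something like this (a sketch; MyDataLoader and do_something_with are hypothetical):

dataloader = MyDataLoader(batch_size=8)
for inputs, labels in dataloader:
    # inputs is a list of batched ndarrays, one entry per graph input;
    # labels holds the expected outputs for the batch (or None).
    do_something_with(inputs, labels)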

Here’s a link to the PR that introduces this code: https://github.com/apache/tvm/pull/7710

In the PR, I also implement the RandomDataLoader, MxnetDataLoader, TFDataLoader, and NumpyDataLoader (which loads Keras and other datasets stored in numpy format). The RandomDataLoader provides random numpy data of a given shape and dtype for testing purposes. The TFDataLoader takes a tensorflow dataset as input, the MxnetDataLoader takes an mxnet dataset, and the NumpyDataLoader takes any data formatted as a numpy array (Keras datasets provide data in this format).
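For reference, here is a minimal sketch of the shape a NumpyDataLoader-style subclass could take (illustrative only, not the exact code from the PR):

class NumpyDataLoader(DataLoader):
    """Illustrative sketch: wraps data and labels stored as numpy arrays."""

    def __init__(self, data, labels, batch_size):
        assert len(data) == len(labels)
        self.data = data
        self.labels = labels
        self.batch_size = batch_size
        self.idx = 0

    def __next__(self):
        # Single pass over the full batches; re-create the loader to iterate again.
        if self.idx >= self.get_num_batches():
            raise StopIteration
        start = self.idx * self.batch_size
        self.idx += 1
        batch = self.data[start : start + self.batch_size]
        labels = self.labels[start : start + self.batch_size]
        return [batch], labels

    def get_num_batches(self):
        return len(self.data) // self.batch_size

    def get_batch_size(self):
        return self.batch_size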

Writing a PytorchDataLoader will be future work. I think it will be similar to the implementation of the NumpyDataLoader, and will not be difficult to implement.

Thanks @electriclilies for initiating the proposal. Here are some possible aspects that are worth thinking about:

Dependency

Ideally we do not want to introduce a dependency on all the libraries, so we should think about incurring the minimum amount of dependencies on the other frameworks (e.g. the ability to pass in other data loaders and wrap them is a good design choice, as seen in the POC).

Namespacing and Code Organization

Looking at TF and PyTorch, the current naming conventions for the base classes are:

  • tf.data.Dataset
  • torch.utils.data.DataLoader

In PyTorch’s terminology, a dataset corresponds to the original raw data source, while a DataLoader is a wrapper around it. So it could be more appropriate to use DataLoader here. In terms of namespacing, it might be useful to think about the following namespace choices:

  • tvm.utils.data: consistent with the PyTorch style; tvm.utils would then correspond to a set of stable utilities that we support in the system
  • tvm.data: could hint at a more heavy-lifting effort, e.g. a real data loading utility that loads data from disk
  • tvm.contrib.data: the current collection of “utils” in tvm that might be subject to future changes

Array Data Type

Right now the API returns numpy.ndarray. An alternative would be to return a tvm.runtime.NDArray, given that the data loader is TVM specific.

The main advantage of the latter is that we might be able to get zero-copy data via DLPack (see the “Data interchange mechanisms” proposal in the Python array API standard 2021.01-DRAFT documentation). Additionally, considering the possibility of a data loader that does pre-processing on GPU, we might need to return GPU arrays, which the numpy interface does not support.

Of course, tvm.NDArray does not support numpy-style array manipulation, so if we want to further manipulate the result as numpy, we still need to call result.asnumpy().
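For context, the round trip between the two array types uses the existing tvm.nd APIs:

import numpy as np
import tvm

np_data = np.random.uniform(size=(4, 3)).astype("float32")

# Wrap the numpy array in a tvm.runtime.NDArray.
tvm_data = tvm.nd.array(np_data)

# Convert back to numpy for numpy-style manipulation or comparison.
np.testing.assert_allclose(np_data, tvm_data.asnumpy())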

Property Getter

Properties like batch_size can also be exposed as @property fields, omitting the get_xyz interface. It would be useful to look into how existing data loaders (PyTorch, TF) expose these properties and try to make sure our experience is consistent.
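Concretely, the getters would become something like:

class DataLoader:
    @property
    def batch_size(self):
        """The size of each batch returned by the DataLoader."""
        raise NotImplementedError

    @property
    def num_batches(self):
        """The number of batches the DataLoader contains."""
        raise NotImplementedError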

Thanks for writing this up, Lily. I think standardizing how we handle external datasets is highly valuable.

To comment on some of the points Tianqi raised, I quite like that this approach is fundamentally separated from any dependencies, as it allows users to wrap any dataset or dataloader they want.

I agree with Tianqi that we should consider returning ndarray instead of numpy as it’s more tightly integrated with TVM. The point about zero-copy through DLPack is quite interesting and could be cool follow-up work if we go with the ndarray standardization.

I also like using @property for batch_size and num_batches since it’ll look a little cleaner.

In terms of naming, I think the proposed DataLoader is a better description of the functionality than Dataset and lean towards tvm.utils.data being the best namespace for this work.

That said, these are all pretty minor points; this work overall is great. Thanks Lily!

Thanks for the feedback @tqchen @jwfromm. I’ll move the code to the namespace tvm.utils.data, and expose batch_size and num_batches through the @property decorator.

I do agree that future support of zero copy through DLPack is interesting, so it’s worth considering using tvm.runtime.ndarrays instead of numpy arrays. One question I have about this, though, is whether we should store labels as tvm.runtime.ndarrays as well as the data. If I provide a tvm.runtime.ndarray as input to a graph runtime module (or one of the other ways to run a relay module), is the output also a tvm.runtime.ndarray?

I want to make sure that the datatype of f(data) matches the datatype of the labels so users can directly compare them.

Also, it appears that tvm.runtime.ndarray only has one method for comparing ndarrays, same_as, and same_as checks object identity equality, not value equality.

If the output of running a relay mod is a tvm.runtime.ndarray, and the labels are also a tvm.runtime.ndarray, it seems that the user will not have a good way to compare the output to the labels without converting the ndarrays to numpy using the asnumpy method.

@electriclilies that is correct, the output of TVM’s code is always going to be a tvm.runtime.ndarray.

To run further comparisons (e.g. eval metrics), we will need to convert them to numpy, or bake the evaluation into the computation (the relay program) itself.
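For example, a value comparison would look something like this (a sketch, assuming graphmodule is a graph runtime module and labels is also a tvm.runtime.ndarray):

import numpy as np

output = graphmodule.get_output(0)  # a tvm.runtime.ndarray
np.testing.assert_allclose(output.asnumpy(), labels.asnumpy(), rtol=1e-5)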

I guess having the user transform them into numpy before comparison is OK for now, and to be consistent I’ll make both data and labels tvm.runtime.ndarrays. I can put a note in the documentation that they need to convert them to numpy arrays before comparing them.

It would be nice if there was a way to directly compare the values of tvm.runtime.ndarrays, though.


Commenting to agree that I like the approach, and strongly believe this will be useful (e.g. for reducing the boilerplate involved with setting up datasets for TVM training, since common datasets already exist in PyTorch or TF). Also agree with Tianqi about the NDArray/DLPack interfacing as we want to eliminate any unnecessary data copying especially in the training workflow.

Perhaps this is more specific to the PR, but I’m a bit wary of assuming a specific input and target/label shape (e.g. NCHW and integer) for some of the loaders, since this seems overfit to vision (how would we support a BERT dataset, for example?). Is knowledge of the layout really required? I’m also not sure about __next__ returning a list of ndarrays, since when batching inputs we want them to be in a single contiguous array of shape (batch_size, ...). Hope this makes sense, and I’d be happy to formally review the PR once you’ve had time to incorporate the other feedback!

@altanh Thanks for the input. I think you’re right, knowledge of the layout is not required, and I can remove that.

With regard to your concern about the list of ndarrays: the ndarrays in the list are meant to be batched (I should make this clearer in the documentation, though). The intention is to allow DataLoaders to be used with relay mods that take more than one input. So if we have a list that is [ndarray1, ndarray2], ndarray1 is the first input to the relay mod, and ndarray2 is the second. For a mod that takes a single batched input, the list would look like this: [ndarray1], where ndarray1 has dimensions (batch_size, ...).

Then running the graph runtime module would look something like this:

for inputs, labels in dataloader:
    for i, inp in enumerate(inputs):
        graphmodule.set_input(i, inp)
    graphmodule.run()

With regard to whether the batch size is necessary: one of the algorithms commonly used to pick scales and zero points uses the batch size because it calculates an average across batches. (An interesting and related question here is how we would use this calibration method with BERT, since it doesn’t have a batch size.) I thought it was cleaner to package the batch size with the data coming into the function rather than requiring a user to figure out what it is and pass it in directly.

Additionally, any time you are averaging or calculating accuracy, it is useful to have the batch size easily available without having to slice it out of your tensor with a layout-dependent index.
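For instance, an accuracy computation could use those fields directly (a rough sketch, assuming the @property form adopted above, an argmax-style classifier output, and integer labels):

import numpy as np

correct = 0
for inputs, labels in dataloader:
    for i, inp in enumerate(inputs):
        graphmodule.set_input(i, inp)
    graphmodule.run()
    predictions = np.argmax(graphmodule.get_output(0).asnumpy(), axis=1)
    correct += np.sum(predictions == np.asarray(labels))

# batch_size and num_batches give the total count without layout-dependent slicing.
accuracy = correct / (dataloader.batch_size * dataloader.num_batches)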

But for non-batched data, I agree that it doesn’t make sense to have a batch size. I’m not sure what the best solution is here. One option is for DataLoader to have a subclass called BatchedDataLoader that has the batch_size property. I’m open to other suggestions, though.
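As a sketch, the subclass could be as simple as:

class BatchedDataLoader(DataLoader):
    """Hypothetical: a DataLoader whose batches have a fixed, known size."""

    @property
    def batch_size(self):
        raise NotImplementedError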

Thanks, the batched inputs thing makes sense, I misunderstood! I also didn’t mean to imply that batch size itself is unnecessary; I do think it’s a fairly universal concept for data loading (except perhaps for dynamic models where the input shape changes for each instance, but in that case you could dynamically batch the same shapes and/or set the batch size to 1 and manually aggregate results). In any case, supporting non-batched data seems out of scope for DataLoader, so I wouldn’t worry about it too much.

Thanks for clarifying!