Thanks for writing this up, Lily. I think standardizing how we handle external datasets is highly valuable.
To comment on some of the points Tianqi raised: I quite like that this approach doesn't depend on any particular framework, since that lets users wrap any dataset or dataloader they want.
I agree with Tianqi that we should consider returning TVM ndarrays instead of NumPy arrays, since they're more tightly integrated with TVM. The point about zero-copy through DLPack is quite interesting and could be cool follow-up work if we go with the ndarray standardization.
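To make the zero-copy idea concrete, here's a rough sketch of what a DLPack exchange looks like. It uses NumPy on both sides purely as a stand-in (assuming NumPy 1.22 or newer, where `np.from_dlpack` is available); the TVM side isn't shown, and the variable names are just for illustration.

```python
import numpy as np

# Producer side: any array that implements the DLPack protocol.
source = np.arange(6, dtype="float32")

# Consumer side: import through DLPack. No data is copied; the new
# array is a view over the same underlying buffer.
view = np.from_dlpack(source)

# A write through the original array is visible through the imported one.
source[0] = 42.0
shares = np.shares_memory(source, view)
```

If the loader handed back ndarrays, the same kind of exchange could avoid a copy when the data comes from another framework.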
I also like using @property for batch_size and num_batches, since it'll look a little cleaner.
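Something like the following is what I have in mind. This is only a minimal sketch, not the RFC's actual class: the constructor arguments and the `_batches` attribute are assumptions I made up for illustration, and only `batch_size` and `num_batches` come from the proposal.

```python
class DataLoader:
    """Hypothetical wrapper around any iterable of batches."""

    def __init__(self, batches, batch_size):
        # Assumed constructor: store the wrapped batches and their size.
        self._batches = list(batches)
        self._batch_size = batch_size

    @property
    def batch_size(self):
        # Read-only attribute access instead of a get_batch_size() call.
        return self._batch_size

    @property
    def num_batches(self):
        return len(self._batches)

    def __iter__(self):
        return iter(self._batches)


# Wrap a plain Python list split into batches of 4.
samples = list(range(8))
loader = DataLoader([samples[i:i + 4] for i in range(0, 8, 4)], batch_size=4)
```

Callers then write `loader.batch_size` and `loader.num_batches` without parentheses, and the wrapped data can come from any framework.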
In terms of naming, I think the proposed DataLoader is a better description of the functionality than Dataset, and I lean towards tvm.utils.data being the best namespace for this work.
That said, these are all pretty minor points; the work overall is great. Thanks Lily!