[RFC] Support for large tensors

As in this PR, we plan to support indexing with int64 variables, so that large tensors with more than 2^31 elements can be supported. We plan to support llvm and cuda first, and it has been tested on these two backends by adding two large tensors.

As tests with large tensors are costly in terms of both memory and time, I am not sure about where to add the tests. Any advice is welcome.

@yzhliu @tqchen

@tqchen anything in mind that we need to take care of other than what @hzfan has implemented?

It is a non-intrusive change that is helpful for very large computations. I think this idea is practical and useful :wink:

This is a legacy issue (inherited from HalideIR) that we have wanted to resolve for a while.

Historically i32 was used to index all tensor accesses. Ideally we want to change all the indices to i64. Of course, in many cases (like in CUDA) we still benefit from using i32 (or even i16 in the case of mobile GPUs) for indexing when we can.

The idea is to introduce a pass that analyzes the constant integer bounds of each index expression and checks whether we can narrow the type into a smaller one. Once we introduce this pass, we can switch all the indices to i64 by default.
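
The narrowing idea can be sketched in a few lines of plain Python (illustrative only, not the actual TVM pass; `narrow_index_type` is a made-up name):

```python
INT16_MAX = 2**15 - 1
INT32_MAX = 2**31 - 1

def narrow_index_type(lower, upper):
    """Pick the smallest signed integer type that can hold every value
    of an index expression whose constant bounds [lower, upper] are
    known. Falls back to i64 when nothing smaller fits."""
    if -2**15 <= lower and upper <= INT16_MAX:
        return "int16"
    if -2**31 <= lower and upper <= INT32_MAX:
        return "int32"
    return "int64"

# A flattened index for a (1024, 1024) buffer stays within i32...
print(narrow_index_type(0, 1024 * 1024 - 1))  # int32
# ...but an index that can reach 2^36 - 1 needs i64.
print(narrow_index_type(0, 2**36 - 1))        # int64
```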

Also CC @were who might have some ideas on this

I also look favorably upon using i64 for indices in the IR, since addresses themselves are i64.

For example, in LLVM, if you use i32 as indices, many redundant cast instructions are generated to prevent i32 overflow in mixed i32/i64 computations. This blocks many aggressive optimizations.

Anyway, I think TVM should be as LLVM-friendly as possible.

Not sure if it is the simplest way to go. Why not leave it to developers to specify the type of their index variables?

In most cases code is generated from a high-level description (e.g. a neural network), and the index variable types are built into the system. To follow up on @were’s point, it indeed depends on the device class. For example, we know that on NV GPUs i32 is sometimes preferred, and on the Apple iPhone GPU i16 is the best native type. But there are indeed cases (e.g. 64-bit CPUs) where i64 is preferred.
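
The device-dependent choice above could look roughly like this (a hypothetical sketch; the preference table and `pick_index_type` are illustrative, not TVM API):

```python
# Assumed per-target preferences, per the discussion: NV GPUs often
# prefer i32, some mobile GPUs prefer i16, 64-bit CPUs prefer i64.
PREFERRED = {"cuda": "int32", "mobile_gpu": "int16", "cpu64": "int64"}
SIGN_BITS = {"int16": 15, "int32": 31, "int64": 63}  # usable magnitude bits

def pick_index_type(target, max_index):
    """Use the target's preferred index type when max_index fits,
    otherwise widen just enough to hold it."""
    for ty in [PREFERRED.get(target, "int64"), "int16", "int32", "int64"]:
        if max_index <= 2 ** SIGN_BITS[ty] - 1:
            return ty
    raise ValueError("index does not fit in int64")

print(pick_index_type("cuda", 10**6))        # int32: preferred type fits
print(pick_index_type("mobile_gpu", 10**6))  # int32: i16 preferred but too small
```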

I see. Just to clarify: do you mean something like a Relay pass, converting the indices into a certain type (conditional on the size of the iteration domain)?

It should be a pass in the low-level TIR. The pass should do a const int bound analysis using arith::Analyzer and then decide the conversion. The related bound information is usually lost at the lower level (e.g. LLVM), so it makes sense to do such conversion before we lower the code.

I see. It sounds reasonable then :slight_smile:

Thanks for sharing the idea. Just to rephrase to see if I understand correctly: before codegen, a pass determines the type of each index by its bound. During codegen, different backends may do different things with these typed indices. For example, if some index is bounded by i32, llvm may use i64 while cuda uses i32, both for efficiency. But for a mobile GPU we cannot use i16, as an i32-bounded index does not fit in i16.

Also, there is a concern that the type of an index cannot always be determined solely by its own bound. For example, we cannot use i32 for buffer[2^20][2^16], because the flattened index can be as large as 2^36, which does not fit in i32.
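
The concern checks out with a few lines of arithmetic (plain Python, just illustrating the numbers above):

```python
INT32_MAX = 2**31 - 1

rows, cols = 2**20, 2**16
# Each dimension individually fits comfortably in i32...
assert rows <= INT32_MAX and cols <= INT32_MAX
# ...but the flattened index i * cols + j can reach 2^36 - 1.
max_flat = rows * cols - 1
print(max_flat > INT32_MAX)  # True
```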

We can run the pass after StorageFlatten and only rewrite the indices of Load and Store, which will address your concern. Instead of doing it in the codegen phase, we can do it as a rewriting pass that rewrites one IR into another before codegen.

so we always make symbolic indices i64, no matter what type the user specified?

we can still support i32 indices, but in the relay codegen we can switch to i64 by default

I see. My understanding is: for constant indices we do an analysis pass to decide the correct type; for symbolic indices we still rely on what the frontend tells us, while we can change relay to use i64 by default, right?

I think the pass works for non-symbolic shapes. However, consider the following symbolic-shape case:

import tvm

# Symbolic extents, explicitly declared as int32 by the user.
n = tvm.var('n', dtype='int32')
m = tvm.var('m', dtype='int32')
# An (n, m) tensor; flattening its access yields indices like i * m + j.
X = tvm.compute((n, m), lambda *idx: tvm.const(1, dtype='float32'))

In this example, the ConstIntBoundAnalyzer gives -2^31 <= n <= 2^31 - 1, so n * m goes beyond i32 and i64 will serve as the index type. This does not seem backward compatible.

As @yzhliu mentions, we may handle constant indices and symbolic indices separately.

The pass should work for most of the symbolic indices that have a constant shape, which means we can deduce the maximum value. For the case of a purely symbolic shape without a bound, I think it might be fine to directly use i64.

I see. I will start working on the pass based on the above discussions.

However, there will be two impacts:

  • dtype in tvm.var() is almost ignored. As long as there are two or more vars in a shape, they will be promoted to i64 even if explicitly specified as i32.
  • Currently, pure symbolic shapes with all vars being i32 work well. After this pass is added, some codegens (like llvm) will not work with this case, as the vars will be promoted to i64 and these codegens do not support i64 for now.

I’m not sure whether it is a good idea to automatically promote var_i32 + var_i32 to var_i64; as @hzfan mentioned, it essentially overrides the user’s dtype choice. I would be surprised if I were the user. On the other hand, promoting var_i32 + var_i64 to var_i64 makes more sense to me.
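
The promotion rule suggested above could be sketched as follows (illustrative Python, not TVM code; `promote` is a made-up helper):

```python
WIDTH = {"int16": 16, "int32": 32, "int64": 64}

def promote(a, b):
    """Promote mixed-width operands to the wider type, but leave
    same-width operands alone: i32 + i32 stays i32 (preserving the
    user's dtype choice), while i32 + i64 becomes i64."""
    return a if WIDTH[a] >= WIDTH[b] else b

print(promote("int32", "int64"))  # int64: widen the mixed case
print(promote("int32", "int32"))  # int32: user's dtype is respected
```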