[RFC][Tensorcore] INT4 end-to-end inference

I changed my GPU to a GTX 2080 and the problem is solved… thanks anyway.

No problem. I think the GPU you ran on before doesn’t have Tensor Cores, so TVM doesn’t find the corresponding schedule to use.

I found out that the GTX 1050 and GTX 3090 do not support the corresponding CUDA schedule. I think a 20-series card is needed (at least the GTX 2080 does support it).

Hi zachzzc,

I’m wondering what is needed to lower Relay IR to the appropriate LLVM-target TVM IR (i.e., with target device llvm). It seems that just changing the target does not produce appropriate TVM IR.

Can I get any clue on where to look or where to start? Thank you.

So you want to run on CPU instead of GPU? What’s the error you are seeing after changing the target? The inference script in my repo won’t work because I think some of the convolution layouts are not supported yet in the CPU x86 compute. If you want to run the quantized NNs on CPU, you may need to tweak the data layout depending on what’s supported now.
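For example, something along these lines might work as a starting point (a rough sketch, not my repo’s script; the “NCHW” target layout is just an assumption about what the x86 schedules accept for your dtype):

```python
import tvm
from tvm import relay

# Hypothetical starting point: `mod` and `params` hold the quantized network.
# ConvertLayout rewrites conv2d into a layout the CPU schedules implement.
desired_layouts = {"nn.conv2d": ["NCHW", "default"]}
seq = tvm.transform.Sequential([
    relay.transform.RemoveUnusedFunctions(),
    relay.transform.ConvertLayout(desired_layouts),
])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)

lib = relay.build(mod, target="llvm", params=params)
```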

Thank you so much. Based on the error code, your assumption seems right.

Can you let me know why some layouts of quantized NNs are not supported on CPU?

We chose some special layouts that run faster on GPU, but they may not run as fast on CPU.


Hi @zachzzc,

I was looking into int4 support in TVM and came about this thread. Maybe you can help me clarify some doubts.

  1. Is the current “int4” support only possible via a native Relay-level definition of the workload? In other words:
    • I see in the linked HAWQ repository that the networks are in fact built from Relay operators directly, and since HAWQ is not “natively” supported by TVM, this int4 support cannot be used out-of-the-box for networks imported from other frameworks.
    • How did you avoid problems on the weights, specifically regarding constant-folding passes? (I have a hunch that it’s because you never declare the weights as relay.const, but I am unsure.)
  2. How did non-convolution operators handle the int4 data type? I can’t seem to find their implementations; or was there some int4->int8/32 upcasting right after conv/dense in order to use the “standard” implementations?
  3. Can you provide any intuition on the following couple of lines of code in the TOPI implementation? i.e., why are those values set as they are for the int4 case?
  4. Why did you need to upcast the operands of this te.compute to int32?

Hi @cron,

Thanks for your interest in our work.

Is the current “int4” support only possible via a native Relay-level definition of the workload? In other words:

You are right, we didn’t create a pass to import networks from other frameworks. It would require more work on debugging and development.

When we do constant folding, int4 weights are treated as int32. I also did some hacky things to avoid errors raised when importing the NumPy array. For example, the array dimensions will mismatch since the NumPy array is an int32 array but we import it into a TVM int4 array; the int32 array dimensions will be 8 times smaller than the int4 array.
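As a rough illustration of that mismatch (a NumPy sketch with a made-up helper name and nibble order, not the actual TVM code), eight 4-bit values share one int32 word, so the packed array’s last dimension is 8 times smaller than the logical int4 tensor:

```python
import numpy as np

def pack_int4_to_int32(w):
    """Pack signed int4 values (in [-8, 7]) along the last axis, 8 per int32."""
    nibbles = w.astype(np.uint32) & 0xF              # 4-bit two's-complement nibbles
    nibbles = nibbles.reshape(*w.shape[:-1], -1, 8)  # group 8 nibbles per word
    shifts = np.arange(8, dtype=np.uint32) * 4
    packed = np.bitwise_or.reduce(nibbles << shifts, axis=-1)
    return packed.view(np.int32)

w_int4 = np.random.randint(-8, 8, size=(64, 64))
w_packed = pack_int4_to_int32(w_int4)
print(w_int4.shape, "->", w_packed.shape)            # (64, 64) -> (64, 8)
```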

How did non-convolution operators handle the int4 data type? I can’t seem to find their implementations; or was there some int4->int8/32 upcasting right after conv/dense in order to use the “standard” implementations?

All the results from convolution are int32. We apply the non-convolution operators to the int32 results, then downcast to int4 before feeding into the next layer. Due to hardware limitations, int4 addition is not natively supported.
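As a rough sketch of that dataflow (the shapes and the NCHW/OIHW layouts below are simplified placeholders, not the HWNC layout the Tensor Core schedule actually uses):

```python
from tvm import relay

data   = relay.var("data",   shape=(1, 64, 56, 56), dtype="int4")
weight = relay.var("weight", shape=(64, 64, 3, 3),  dtype="int4")
bias   = relay.var("bias",   shape=(64,),           dtype="int32")

# conv2d consumes int4 operands and accumulates into int32
conv = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1),
                       channels=64, out_dtype="int32")
out = relay.nn.bias_add(conv, bias)   # element-wise ops stay in int32
out = relay.nn.relu(out)
out = relay.cast(out, "int4")         # downcast before the next int4 layer
```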

Can you provide any intuition on the following couple of lines of code in the TOPI implementation? i.e., why are those values set as they are for the int4 case?

This follows NVIDIA’s Tensor Core requirements. On the T4 GPU, we can only compute a GEMM of size 8x8x32 for the int4 data type.
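Roughly, that choice looks like this (a sketch with assumed names, not the exact TOPI code): the hardware only exposes fixed warp-level GEMM tile shapes per data type, and for int4 the only shape is m8n8k32.

```python
def wmma_shape(in_dtype):
    """Warp-level Tensor Core GEMM tile (m, n, k) per input dtype (illustrative)."""
    if in_dtype in ("int4", "uint4"):
        return 8, 8, 32      # the only shape the hardware offers for int4
    if in_dtype in ("int8", "uint8"):
        return 16, 16, 16    # 32x8x16 and 8x32x16 also exist for int8
    return 16, 16, 16        # float16 default

wmma_m, wmma_n, wmma_k = wmma_shape("int4")   # -> 8, 8, 32
```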

Why did you need to upcast the operands of this te.compute to int32?

The calculation result of the NVIDIA Tensor Core is int32. The expression here just matches the assembly intrinsics. We don’t do extra upcasting after we get the result from the hardware; instead, we downcast to int4 before entering the next layer.
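A minimal te.compute sketch of that pattern (not the exact TOPI kernel): the int4 operands are cast to int32 inside the multiply-accumulate so the expression matches what the WMMA intrinsic computes.

```python
from tvm import te

n, m, k = 64, 64, 64
A = te.placeholder((n, k), dtype="int4", name="A")
B = te.placeholder((m, k), dtype="int4", name="B")
rk = te.reduce_axis((0, k), name="rk")

# The accumulator is int32, matching the Tensor Core output type.
C = te.compute(
    (n, m),
    lambda i, j: te.sum(A[i, rk].astype("int32") * B[j, rk].astype("int32"),
                        axis=rk),
    name="C",
)
```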

Thanks for the reply, I really appreciate it.

At the risk of being lazy, could you point me to where you did these hacky things? :slight_smile:

the int32 array dimensions will be 8 times smaller than the int4 array.

Don’t you mean one of the dimensions will be 8 times smaller?

Don’t you mean one of the dimensions will be 8 times smaller?

Yes, the last dimension will be 8 times smaller.

could you point me to where you did these hacky things?

  1. The actual size in bytes needs to be updated if the data type is smaller than 8 bits (1 byte)… For example, a vector of 16 int4 values is 4 bits * 16 / 8 = 8 bytes (see the sketch after this list).
  2. The data dimension assertion needs an exception for int4: the last dimension of the TVM array has to be 8 times that of the NumPy array, since the NumPy array is stored as int32.
  3. And one line in the graph runtime.
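For point 1, a small stand-alone helper (hypothetical, not the TVM code) showing the same size rule:

```python
import math

def tensor_nbytes(shape, dtype_bits):
    """Bytes needed for a dense tensor whose elements are `dtype_bits` wide,
    assuming sub-byte elements are packed tightly."""
    total_bits = math.prod(shape) * dtype_bits
    return (total_bits + 7) // 8   # round up to whole bytes

print(tensor_nbytes((16,), 4))     # 16 int4 values  -> 8 bytes
print(tensor_nbytes((16,), 32))    # 16 int32 values -> 64 bytes
```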

I may have missed something since I haven’t worked on it for a while. Let me know if you see any errors in your development.
