I’m sorry, I made it a bit confusing.
I meant, for the first trial: git clone --branch int4_direct_HWNC http://github.com/zachzzc/incubator-tvm.git /tvm
Did you finish the section “Install zachzzc's TVM” and see no errors?
Yes I did finish all of that section and no, I didn’t see any errors there.
I also checked all the versions (like CUDA) and upgraded them where needed, as the page says.
I tried installing again from scratch but didn’t see the errors. Did you set the path to my TVM repo? It may point to another version of TVM you have installed.
My TVM_HOME path is correctly linked to the HAWQ TVM.
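A generic way to confirm which TVM the interpreter actually picks up (just standard Python, nothing from the repo):

```python
# sanity check: make sure Python imports the int4_direct_HWNC checkout rather
# than some other installed TVM
import tvm

print(tvm.__file__)     # should point inside the cloned incubator-tvm directory
print(tvm.__version__)
```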
For number 6 (you might have already realized this from the error message above), the issue is in /tvmhome/python/tvm/relay/backend/compile_engine.py.
In the select_implementation function there is the line all_impls = get_valid_implementations(...), which returns nothing, so nothing runs inside select_implementation, and finally outputs[best_plevel_impl] raises an error because best_plevel_impl is None. get_valid_implementations behaves the same way, and it calls a similar function, fstrategy(...), which is an API function, so I did not trace any further.
I thought this problem was related to issue number 3 not working properly, so I started working on that again.
In the file hawq_utils_resnet50.py, lines 483–485, my machine can’t find any keys with those parameters. I looked into the PyTorch model and have some idea of what the code intended, but apparently those keys don’t exist. The dictionary keys that ‘model’ has are just the keys of the checkpoint (epoch, arch, state_dict, best_acc1, optimizer).
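For reference, the inspection looks roughly like this (the checkpoint filename is just a placeholder for whatever file you load):

```python
# hedged sketch: inspect what the loaded checkpoint actually contains.
# "checkpoint.pth.tar" is a placeholder name, not a file from the repo.
import torch

checkpoint = torch.load("checkpoint.pth.tar", map_location="cpu")
print(checkpoint.keys())            # epoch, arch, state_dict, best_acc1, optimizer
state_dict = checkpoint["state_dict"]
print(list(state_dict.keys())[:5])  # the per-layer tensors the script presumably looks up
```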
There might be something wrong with my PC, so I am working on running this on one of our lab servers. Anyway, can you let me know your Python and LLVM versions? That might be the problem (I’m not sure, though). Also, some of the issues above have been filed as an issue on the GitHub page.
Thanks a lot for your care
Anyway, can you let me know your Python and LLVM versions? That might be the problem (I’m not sure, though).
Python 3.7.4 and LLVM 10.0.1.
Hello zachzzc, I’m having a hard time trying to run this implementation.
Can you print out strategy.specializations at op/relay/compile_engine.py:120 and check that it is not empty while running your code?
Now I’ve fixed the hard-coded hawq_utils_resnet50.py file to fit ResNet-18, and I get the exact same error from test_resnet_inference_time.py that I mentioned above.
I’ve been working on this for days but couldn’t figure out what strategy.specializations is or what it should contain.
Thank you in advance,
It is not empty in my run. If I print out impl.name, it shows conv2d_hwnc_tensorcore_direct.cuda for the convolutions and other implementations like injective.cuda and pool.cuda. It should not be completely empty.
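The temporary print is roughly like this, placed right after the strategy is obtained in get_valid_implementations (attribute names are from memory, so double-check them against your checkout):

```python
# temporary debug print inside get_valid_implementations in compile_engine.py,
# right after the strategy object is created: dump every specialization and
# the implementations it carries
for spec in strategy.specializations:
    print("specialization:", spec, "condition:", spec.condition)
    for impl in spec.implementations:
        print("  impl:", impl.name, "plevel:", impl.plevel)
```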
Yes, it should not be empty, but in my run the reason it ends up empty is that strategy.specializations is empty.
If I print out strategy, it shows “relay.OpStrategy(0x6b23fb0)”. Also, fstrategy shows “GenericFunc(0x34h8af4)”.
But strategy.specializations is totally empty. Do you have any idea what these specializations are, and can you let me know what is in yours?
Thank you.
I changed my GPU to an RTX 2080 and the problem is solved… thanks anyway.
No problem. I think the GPU you ran on before doesn’t have Tensor Cores, so TVM doesn’t find the corresponding schedule to use.
I found out that the GTX 1050 and RTX 3090 do not support the corresponding CUDA schedule. I think a 20-series card is needed (at least the RTX 2080 does work).
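If it helps anyone else, a quick way to check what your card reports (using PyTorch only because it is already a HAWQ dependency; reading sm_75 as the requirement is my own interpretation, not something stated in the schedule):

```python
# print the CUDA compute capability of the current GPU; Turing cards (RTX 20xx)
# report sm_75, which is the generation reported to work with this schedule
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")
```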
Hi zachzzc,
I’m wondering what is needed to lower the Relay IR to the appropriate LLVM target IR (TVM IR with the target device set to llvm). It seems that just changing the target does not produce the TVM IR correctly.
Can I get a clue on where to look or how to start? Thank you.
So you want to run on CPU instead of GPU? What’s the error you are seeing after changing the target? The inference script in my repo won’t work because I think some of the convolution layouts are not yet supported for CPU x86 computation. If you want to run the quantized NNs on CPU, you may need to tweak the data layout depending on what’s supported now.
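As a hedged sketch of what “tweak the data layout” could look like (shown with a toy float32 conv only, since the int4/HWNC layouts themselves are exactly what may lack x86 schedules right now):

```python
# minimal sketch: ask Relay to convert conv2d layouts to something the x86
# schedules support, then build for the llvm target. The real quantized model
# would go where this toy conv is.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
net = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function(relay.analysis.free_vars(net), net))

desired_layouts = {"nn.conv2d": ["NCHW", "default"]}
seq = tvm.transform.Sequential([relay.transform.ConvertLayout(desired_layouts)])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
    lib = relay.build(mod, target="llvm")
```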
Thank you so much. Based on the error message, your assumption seems right.
Can you let me know why some layouts of quantized NNs are not supported on CPU?
We chose some special layouts that run faster on GPU, but they may not run as fast on CPU.
Hi @zachzzc,
I was looking into int4 support in TVM and came about this thread. Maybe you can help me clarify some doubts.
(I guess this would be done via relay.const, but I am unsure.)
Hi @cron,
Thanks for your interest in our work.
Is the current “int4” support only possible via a native Relay-level definition of the workload? In other words, is there no way to import networks from other frameworks?
You are right, we didn’t create a pass to import networks from other frameworks. It would require more debugging and development work.
When we do constant folding, int4 weights are treated as int32, and I did some hacky things to avoid the errors raised when importing the numpy array. For example, the array dimensions will mismatch, since the numpy array is an int32 array but we import it into a TVM int4 array; the int32 array dimension will be 8 times smaller than the int4 array’s.
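As a small illustration of that mismatch (made-up values and plain numpy, not the actual code from the repo):

```python
# eight 4-bit values share one int32 word, so the packed int32 array's last
# dimension is one eighth of the logical int4 array's
import numpy as np

int4_vals = np.arange(16) % 8                 # 16 logical int4 values (kept small)
packed = np.zeros(16 // 8, dtype=np.int32)    # 2 int32 words
for i, v in enumerate(int4_vals):
    packed[i // 8] |= int(v) << (4 * (i % 8))

print(int4_vals.shape, packed.shape)          # (16,) vs (2,)
```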
How do non-convolution operators handle the int4 datatype? I can’t seem to find their implementations. Or is there some int4 -> int8/32 upcasting right after conv/dense in order to use the “standard” implementations?
All the results from convolution are int32. We run the non-convolution operators on the int32 results, then downcast to int4 before feeding the next layer. Due to hardware limits, int4 addition is not natively supported.
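A hedged Relay-level sketch of that dataflow, with made-up shapes and variable names (it assumes a TVM build that accepts the int4 dtype):

```python
# the conv output is int32, elementwise work stays in int32, then the result
# is cast back to int4 before it feeds the next int4 layer
from tvm import relay

conv_out = relay.var("conv_out", shape=(1, 64, 56, 56), dtype="int32")  # stand-in for a conv result
bias = relay.var("bias", shape=(64,), dtype="int32")
y = relay.nn.bias_add(conv_out, bias)   # non-convolution op on the int32 result
y = relay.cast(y, "int4")               # downcast before the next layer
```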
Can you provide any intuition about the following couple of lines of code in the TOPI implementation, i.e. why are those values set as they are for the int4 case?
This follows the NVIDIA Tensor Core requirements. On the T4 GPU, we can only compute a GEMM of size 8x8x32 for the int4 data type.
Why did you need to upcast the operands of this te.compute to int32?
The calculation result of the NVIDIA Tensor Core is int32. The expression here is just to match the assembly intrinsics. We don’t do extra upcasting after we get the result from the hardware; instead, we need to downcast to int4 before we enter the next layer.
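For intuition, a minimal te.compute sketch of that pattern (not the repo’s actual schedule, and it assumes an int4-enabled TVM build):

```python
# int4 operands, int32 accumulation: the astype("int32") in the compute rule is
# there so the expression matches what the Tensor Core intrinsic produces
import tvm
from tvm import te

n, m, k = 8, 8, 32   # the only GEMM shape the T4 supports for int4
A = te.placeholder((n, k), dtype="int4", name="A")
B = te.placeholder((m, k), dtype="int4", name="B")
rk = te.reduce_axis((0, k), name="rk")
C = te.compute(
    (n, m),
    lambda i, j: te.sum(A[i, rk].astype("int32") * B[j, rk].astype("int32"), axis=rk),
    name="C",
)
```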
Thanks for the reply, I really appreciate it.
At the risk of being lazy, could you point me to where you did these hacky things?
the int32 array dimension will be 8 times smaller than the int4 array’s.
Don’t you mean one of the dimensions will be 8 times smaller?
Don’t you mean one of the dimensions will be 8 times smaller?
Yes, the last dimension will be 8 times smaller.
could you point me to where you did these hacky things?
I may be missing something since I have not worked on it for a while. Let me know if you see any errors in your development.