I’m sorry, I made it a bit confusing.
I meant, for the first trial: git clone --branch int4_direct_HWNC http://github.com/zachzzc/incubator-tvm.git /tvm
Did you finish the section “Install zachzzc's TVM” and see no errors?
Yes I did finish all of that section and no, I didn’t see any errors there.
I also checked all the versions (like CUDA) and upgraded them where needed, as the page says.
I tried installing again from scratch but didn’t see the errors. Did you set the path to my TVM repo? It may point to another version of TVM you have installed.
My TVM_HOME path is correctly linked to the HAWQ TVM.
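A generic way to confirm which TVM the interpreter actually picks up (just standard Python, nothing from the repo):

```python
# sanity check: make sure Python imports the int4_direct_HWNC checkout rather
# than some other installed TVM
import tvm

print(tvm.__file__)     # should point inside the cloned incubator-tvm directory
print(tvm.__version__)
```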
For number 6 (you might have already realized this from the error message above), the issue is in /tvmhome/python/tvm/relay/backend/compile_engine.py.
In the select_implementation function there is the line all_impls = get_valid_implementations(...), which returns nothing, so nothing runs inside select_implementation, and finally outputs[best_plevel_impl] raises an error because best_plevel_impl is None. get_valid_implementations behaves the same way, and it calls a similar function, fstrategy(...), which is an API function, so I did not trace any further.
I thought this problem was related to issue number 3 not working properly, so I started working on that again.
In the file hawq_utils_resnet50.py, lines 483–485, my machine can’t find any keys with those parameters. I looked into the PyTorch model and have some idea of what the code intended, but apparently those keys don’t exist. The dictionary keys that ‘model’ has are just the keys of the checkpoint (epoch, arch, state_dict, best_acc1, optimizer).
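For reference, the inspection looks roughly like this (the checkpoint filename is just a placeholder for whatever file you load):

```python
# hedged sketch: inspect what the loaded checkpoint actually contains.
# "checkpoint.pth.tar" is a placeholder name, not a file from the repo.
import torch

checkpoint = torch.load("checkpoint.pth.tar", map_location="cpu")
print(checkpoint.keys())            # epoch, arch, state_dict, best_acc1, optimizer
state_dict = checkpoint["state_dict"]
print(list(state_dict.keys())[:5])  # the per-layer tensors the script presumably looks up
```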
There might be something wrong with my PC, so I am working on running this on one of our lab servers. Anyway, can you let me know your Python and LLVM versions? That might be the problem (I’m not sure, though). Also, some of the issues above have been filed as an issue on the GitHub page.
Thanks a lot for your care
Anyway, can you let me know your Python and LLVM versions? That might be the problem (I’m not sure, though).
Python 3.7.4 and LLVM 10.0.1.
Hello zachzzc, I’m having a hard time trying to run this implementation.
Can you print out strategy.specializations at op/relay/compile_engine.py:120 and check that it is not empty while running your code?
Now I’ve fixed the hard-coded hawq_utils_resnet50.py file to fit ResNet-18, and I get the exact same error from test_resnet_inference_time.py that I mentioned above.
I’ve been working on this for days but couldn’t figure out what strategy.specializations is or what it should contain.
Thank you in advance,
It is not empty in my run. If I print out impl.name, it shows conv2d_hwnc_tensorcore_direct.cuda for the convolutions and other implementations like injective.cuda and pool.cuda. It should not be completely empty.
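The temporary print is roughly like this, placed right after the strategy is obtained in get_valid_implementations (attribute names are from memory, so double-check them against your checkout):

```python
# temporary debug print inside get_valid_implementations in compile_engine.py,
# right after the strategy object is created: dump every specialization and
# the implementations it carries
for spec in strategy.specializations:
    print("specialization:", spec, "condition:", spec.condition)
    for impl in spec.implementations:
        print("  impl:", impl.name, "plevel:", impl.plevel)
```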
Yes, it should not be empty, but in my run the reason it ends up empty is that strategy.specializations is empty.
If I print out strategy, it shows “relay.OpStrategy(0x6b23fb0)”. Also, fstrategy shows “GenericFunc(0x34h8af4)”.
But strategy.specializations is totally empty. Do you have any idea what these specializations are, and can you let me know what is in yours?
Thank you.
I changed my GPU to an RTX 2080 and the problem is solved… thanks anyway.
No problem. I think the GPU you ran on before doesn’t have Tensor Cores, so TVM doesn’t find the corresponding schedule to use.
I found out that the GTX 1050 and RTX 3090 do not support the corresponding CUDA schedule. I think a 20-series card is needed (at least the RTX 2080 does work).
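If it helps anyone else, a quick way to check what your card reports (using PyTorch only because it is already a HAWQ dependency; reading sm_75 as the requirement is my own interpretation, not something stated in the schedule):

```python
# print the CUDA compute capability of the current GPU; Turing cards (RTX 20xx)
# report sm_75, which is the generation reported to work with this schedule
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")
```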
Hi zachzzc,
I’m wondering what is needed to lower the Relay IR to the appropriate LLVM target IR (TVM IR with the target device set to llvm). It seems that just changing the target does not produce the TVM IR correctly.
Can I get a clue on where to look or how to start? Thank you.
So you want to run on CPU instead of GPU? What’s the error you are seeing after changing the target? The inference script in my repo won’t work because I think some of the convolution layouts are not yet supported for CPU x86 computation. If you want to run the quantized NNs on CPU, you may need to tweak the data layout depending on what’s supported now.
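As a hedged sketch of what “tweak the data layout” could look like (shown with a toy float32 conv only, since the int4/HWNC layouts themselves are exactly what may lack x86 schedules right now):

```python
# minimal sketch: ask Relay to convert conv2d layouts to something the x86
# schedules support, then build for the llvm target. The real quantized model
# would go where this toy conv is.
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
net = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
mod = tvm.IRModule.from_expr(relay.Function(relay.analysis.free_vars(net), net))

desired_layouts = {"nn.conv2d": ["NCHW", "default"]}
seq = tvm.transform.Sequential([relay.transform.ConvertLayout(desired_layouts)])
with tvm.transform.PassContext(opt_level=3):
    mod = seq(mod)
    lib = relay.build(mod, target="llvm")
```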
Thank you so much. Based on the error message, your assumption seems right.
Can you let me know why some layouts of quantized NNs are not supported on CPU?
We chose some special layouts that run faster on GPU, but they may not run as fast on CPU.
Hi @zachzzc,
I was looking into int4 support in TVM and came about this thread. Maybe you can help me clarify some doubts.
(I guess this would be done via relay.const, but I am unsure.)
Hi @cron,
Thanks for your interest in our work.
Is the current “int4” support only possible via a native Relay-level definition of the workload? In other words, is there no way to import networks from other frameworks?
You are right, we didn’t create a pass to import networks from other frameworks. It would require more debugging and development work.
When we do constant folding, int4 weights are treated as int32, and I did some hacky things to avoid the errors raised when importing the numpy array. For example, the array dimensions will mismatch, since the numpy array is an int32 array but we import it into a TVM int4 array; the int32 array dimension will be 8 times smaller than the int4 array’s.
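As a small illustration of that mismatch (made-up values and plain numpy, not the actual code from the repo):

```python
# eight 4-bit values share one int32 word, so the packed int32 array's last
# dimension is one eighth of the logical int4 array's
import numpy as np

int4_vals = np.arange(16) % 8                 # 16 logical int4 values (kept small)
packed = np.zeros(16 // 8, dtype=np.int32)    # 2 int32 words
for i, v in enumerate(int4_vals):
    packed[i // 8] |= int(v) << (4 * (i % 8))

print(int4_vals.shape, packed.shape)          # (16,) vs (2,)
```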
How do non-convolution operators handle the int4 datatype? I can’t seem to find their implementations. Or is there some int4 -> int8/32 upcasting right after conv/dense in order to use the “standard” implementations?
All the results from convolution are int32. We run the non-convolution operators on the int32 results, then downcast to int4 before feeding the next layer. Due to hardware limits, int4 addition is not natively supported.
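A hedged Relay-level sketch of that dataflow, with made-up shapes and variable names (it assumes a TVM build that accepts the int4 dtype):

```python
# the conv output is int32, elementwise work stays in int32, then the result
# is cast back to int4 before it feeds the next int4 layer
from tvm import relay

conv_out = relay.var("conv_out", shape=(1, 64, 56, 56), dtype="int32")  # stand-in for a conv result
bias = relay.var("bias", shape=(64,), dtype="int32")
y = relay.nn.bias_add(conv_out, bias)   # non-convolution op on the int32 result
y = relay.cast(y, "int4")               # downcast before the next layer
```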
Can you provide any intuition about the following couple of lines of code in the TOPI implementation, i.e. why are those values set as they are for the int4 case?
This follows the NVIDIA Tensor Core requirements. On the T4 GPU, we can only compute a GEMM of size 8x8x32 for the int4 data type.
Why did you need to upcast the operands of this te.compute to int32?
The calculation result of the NVIDIA Tensor Core is int32. The expression here is just to match the assembly intrinsics. We don’t do extra upcasting after we get the result from the hardware; instead, we need to downcast to int4 before we enter the next layer.
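For intuition, a minimal te.compute sketch of that pattern (not the repo’s actual schedule, and it assumes an int4-enabled TVM build):

```python
# int4 operands, int32 accumulation: the astype("int32") in the compute rule is
# there so the expression matches what the Tensor Core intrinsic produces
import tvm
from tvm import te

n, m, k = 8, 8, 32   # the only GEMM shape the T4 supports for int4
A = te.placeholder((n, k), dtype="int4", name="A")
B = te.placeholder((m, k), dtype="int4", name="B")
rk = te.reduce_axis((0, k), name="rk")
C = te.compute(
    (n, m),
    lambda i, j: te.sum(A[i, rk].astype("int32") * B[j, rk].astype("int32"), axis=rk),
    name="C",
)
```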
Thanks for the reply, I really appreciate it.
At the risk of being lazy, could you point me to where you did these hacky things?
the int32 array dimension will be 8 times smaller than the int4 array’s.
Don’t you mean one of the dimensions will be 8 times smaller?
Don’t you mean one of the dimensions will be 8 times smaller?
Yes, the last dimension will be 8 times smaller.
could you point me to where you did these hacky things?
I may be missing something since I have not worked on it for a while. Let me know if you see any errors in your development.