I am trying to deploy LLaMA int4 with TVM.
- I have converted llama-7B to ONNX as `llama.onnx`
- Converted the ONNX model to TVM with `relay.vm` (a minimal sketch of this step follows this list)
- Added a feature to GPTQ-for-LLaMa to export the quantization table in toml+np format
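For reference, the conversion step looks roughly like this. This is only a sketch: the input name, shapes, and target are placeholders for my actual setup.

```python
import onnx
import tvm
from tvm import relay

# Load the exported ONNX graph (the path is a placeholder).
onnx_model = onnx.load("llama.onnx")

# Input names/shapes of the exported graph; adjust to the real model.
shape_dict = {"input_ids": (1, 32)}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict, freeze_params=True)

# Compile with the Relay VM rather than the graph executor, since the VM
# copes better with dynamic shapes and control flow.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target=target, params=params)

dev = tvm.device(target, 0)
vm = tvm.runtime.vm.VirtualMachine(vm_exec, dev)
# result = vm.invoke("main", tvm.nd.array(input_ids_np, dev))
```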
Now I am blocked at the inference stage with some questions.
- How do I inject this quantization table into TVM? (The first sketch after this list shows the only workaround I can think of so far.)
- Since GPTQ-for-LLaMa uses the RPTQ method, is there any tutorial on implementing an "RPTQ-matmul"? (The second sketch below shows what I mean.)
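For question 1, the only approach I can think of so far is to dequantize the GPTQ weights on the host and bind them back as ordinary float params before compiling, which of course gives up the int4 memory savings. A sketch of that workaround, where the file names, key layout, and affine dequantization are my own conventions, not a TVM or GPTQ-for-LLaMa API:

```python
import numpy as np
import toml
import tvm

# Load the exported quant table: toml for layer metadata, .npz for tensors.
# The file names and key layout here are my own convention.
meta = toml.load("quant_table.toml")
tensors = np.load("quant_table.npz")

dequant_params = {}
for name in meta["layers"]:
    q = tensors[name + ".qweight"].astype(np.float32)  # int4 values unpacked to int8
    scale = tensors[name + ".scale"]                   # per-channel scale, shape (out, 1)
    zero = tensors[name + ".zero"]                     # per-channel zero point, shape (out, 1)
    # Affine dequantization back to float: w = scale * (q - zero)
    dequant_params[name] = tvm.nd.array(scale * (q - zero))

# Then bind the recovered float weights at compile time, e.g.:
#   vm_exec = relay.vm.compile(mod, target="llvm", params=dequant_params)
```

For question 2, what I mean by "RPTQ-matmul" is roughly a matmul that dequantizes the weights on the fly. A plain TE sketch of the generic quantized-matmul part, without the activation-channel reordering that RPTQ adds:

```python
from tvm import te

m, n, k = 1, 4096, 4096
A = te.placeholder((m, k), dtype="float16", name="A")        # activations
W = te.placeholder((n, k), dtype="int8", name="W")           # quantized weights
scale = te.placeholder((n,), dtype="float16", name="scale")  # per-output-channel scale
zero = te.placeholder((n,), dtype="float16", name="zero")    # per-output-channel zero point

r = te.reduce_axis((0, k), name="r")
# Dequantize each weight element on the fly inside the reduction.
out = te.compute(
    (m, n),
    lambda i, j: te.sum(A[i, r] * (W[j, r].astype("float16") - zero[j]) * scale[j], axis=r),
    name="dequant_matmul",
)
```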
I also noticed that `relay.quantize.quantize` only supports the calibration flow (`QuantizeCalibrate`); there is no one-shot/zero-shot quantization method. The TVM tutorials should also give an introduction to `relay.vm` and ONNX dynamic shapes.
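For context, the calibration flow I am referring to looks like this. The calibration dataset is a placeholder, and `mod`/`params` come from the `from_onnx` step above:

```python
import numpy as np
from tvm import relay

def calibrate_dataset():
    # Placeholder calibration batches; use real prompts in practice.
    for _ in range(8):
        yield {"input_ids": np.random.randint(0, 32000, size=(1, 32)).astype("int64")}

# mod and params come from relay.frontend.from_onnx above.
with relay.quantize.qconfig(calibrate_mode="kl_divergence", weight_scale="max"):
    qmod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())
```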