How can I inject my quantization table into TVM?

I am trying to deploy LLaMA int4 with TVM.

  1. Converted llama-7B to ONNX (llama.onnx); a rough export sketch follows this list
  2. Compiled the ONNX model to TVM with relay.vm (second sketch below)
  3. Added a feature to GPTQ-for-LLaMa to export the quantization table in toml + numpy format (third sketch below)
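For step 1, this is roughly what my export looks like; a minimal sketch assuming the Hugging Face transformers LlamaForCausalLM checkpoint, with the checkpoint path, input names, and opset chosen for illustration:

```python
import torch
from transformers import LlamaForCausalLM

# Checkpoint path is a placeholder; return_dict=False makes the traced
# model return plain tensors, which the ONNX exporter needs.
model = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf", return_dict=False)
model.eval()

dummy_ids = torch.ones(1, 32, dtype=torch.long)  # (batch, seq_len) placeholder

torch.onnx.export(
    model,
    (dummy_ids,),
    "llama.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {1: "seq_len"}, "logits": {1: "seq_len"}},
    opset_version=17,
)
```

Note that 7B weights exceed ONNX's 2 GB protobuf limit, so the exporter has to spill weights to external data files (on older torch versions this required passing use_external_data_format=True); in my experience this is a common cause of export failures.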
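For step 2, a sketch of the relay.vm flow, assuming a single input_ids input with the dynamic sequence length expressed via relay.Any():

```python
import onnx
import tvm
from tvm import relay
from tvm.runtime.vm import VirtualMachine

onnx_model = onnx.load("llama.onnx")

# relay.Any() marks the sequence dimension as dynamic; relay.build cannot
# compile dynamic shapes, which is why the Relay VM path is needed here.
shape_dict = {"input_ids": (1, relay.Any())}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    vm_exec = relay.vm.compile(mod, target="llvm", params=params)

# Run with the VM runtime rather than the graph executor.
vm = VirtualMachine(vm_exec, tvm.cpu())
```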
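And for step 3, a sketch of what my toml + numpy dump does; the quantizers dict and the qweight/scales/zeros/groupsize attribute names are stand-ins for GPTQ-for-LLaMa internals, so treat them as assumptions:

```python
import os
import numpy as np
import toml

def export_quant_table(quantizers, out_dir="quant_table"):
    # quantizers: {layer_name: quantizer} collected after GPTQ runs;
    # the attribute names below are assumptions about its internals.
    os.makedirs(out_dir, exist_ok=True)
    meta = {}
    for name, q in quantizers.items():
        np.save(os.path.join(out_dir, f"{name}.qweight.npy"), q.qweight.cpu().numpy())
        np.save(os.path.join(out_dir, f"{name}.scales.npy"), q.scales.cpu().numpy())
        np.save(os.path.join(out_dir, f"{name}.zeros.npy"), q.zeros.cpu().numpy())
        meta[name] = {"bits": 4, "groupsize": q.groupsize}
    with open(os.path.join(out_dir, "meta.toml"), "w") as f:
        toml.dump(meta, f)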

Now I am blocked at the inference stage with some questions.

  • How do I inject this quantization table into TVM? (The only fallback I can see so far is sketched after this list.)
  • Since GPTQ-for-LLaMa uses the RPTQ method, is there any tutorial on implementing an "RPTQ matmul"?
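For the first question, the fallback I am considering is to dequantize offline and bind the results as ordinary Relay params before compiling. That validates numerics but gives up the int4 memory savings, so it is not a real answer. Everything below follows my export format above and is an assumption, including the group-wise layout of scales/zeros and the mapping from table entries to Relay param names:

```python
import os
import numpy as np
import toml
import tvm

def dequantize(qweight, scales, zeros, groupsize):
    # qweight: (out_features, in_features) unpacked int4 values stored as ints;
    # scales/zeros: (out_features, num_groups). This layout is an assumption.
    w = qweight.astype("float32")
    for g in range(w.shape[1] // groupsize):
        cols = slice(g * groupsize, (g + 1) * groupsize)
        w[:, cols] = (w[:, cols] - zeros[:, g:g + 1]) * scales[:, g:g + 1]
    return w

meta = toml.load("quant_table/meta.toml")
params = {}
for name, info in meta.items():
    qweight = np.load(os.path.join("quant_table", f"{name}.qweight.npy"))
    scales = np.load(os.path.join("quant_table", f"{name}.scales.npy"))
    zeros = np.load(os.path.join("quant_table", f"{name}.zeros.npy"))
    # Keys must match the Relay param names produced by from_onnx; getting
    # that mapping right is the part I have not solved.
    params[name] = tvm.nd.array(dequantize(qweight, scales, zeros, info["groupsize"]))
```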

I also noticed that relay.quantize.quantize only supports the calibration-based flow (QuantizeCalibrate); there is no one-shot/zero-shot quantization method. The TVM tutorials should also give an introduction to relay.vm and ONNX dynamic shapes.
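For reference, the built-in flow looks like this (a sketch using the global_scale calibration mode, which avoids needing a dataset; mod and params are the Relay module and weights from from_onnx above). Nothing here resembles GPTQ's one-shot weight quantization:

```python
from tvm import relay

# Post-training quantization via calibration only.
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params)
```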

Is https://github.com/mlc-ai/web-llm/blob/1906d9020bde53c094332c7144e2f5e368122522/web_llm/transform/quantization.py#L113 a good answer?

Hi,

How did you manage to convert llama-7B to ONNX? I tried several approaches, and all of them ended up failing.