How to load params when using VirtualMachine in Python?

I am using `relax.VirtualMachine` to run a custom LLaVA model and `tvmjs.load_ndarray_cache` to load the params, but it reports "expect Tensor with ndim 2 but get 1". When I run the original Llama model instead, it reports "expect Tensor with dtype float16 but get uint32". It seems the loaded params are not in the correct format or order. `ChatModule` runs fine with the same converted weights in the C++ runtime.

    import os

    import torch
    import tvm
    from transformers import AutoTokenizer
    from tvm import relax
    from tvm.contrib import tvmjs

    primary_device = tvm.device(args.primary_device)
    # Load the converted weight shards from the ndarray cache
    params, meta = tvmjs.load_ndarray_cache(args.artifact_path, primary_device)
    const_params = []
    for k, v in params.items():
        const_params.append(v)

    # Load the compiled model library and create the Relax VM
    ex = tvm.runtime.load_module(os.path.join("dist/libs/Llama-2-3b-chat-hf-q4f16_1-MLC.so"))
    vm = relax.VirtualMachine(ex, primary_device)
    tokenizer = AutoTokenizer.from_pretrained(args.artifact_path, trust_remote_code=True)

    inputs = tvm.nd.array(
        tokenizer(args.prompt, return_tensors="pt").input_ids.to(torch.int32).numpy(),
        primary_device,
    )
    vm["embed"](inputs, const_params)
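One thing I am unsure about: I collect the weights by iterating `params.items()`, so the list order depends on dict iteration order rather than the index the compiled functions expect. A minimal sketch of index-ordered collection, assuming the cache keys are named `param_0` … `param_{n-1}` and that `meta["ParamSize"]` gives the parameter count (both are assumptions on my part):

```python
def load_params_ordered(params, meta):
    """Collect cached weights as a list in index order.

    Assumes the ndarray cache names its entries "param_0" .. "param_{n-1}"
    and that meta["ParamSize"] holds n (hypothetical key names).
    """
    return [params[f"param_{i}"] for i in range(meta["ParamSize"])]
```

Is this the ordering the VM expects? A mismatch here seems like it could explain the ndim/dtype errors above.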