thread_warp_size=1 for opencl

TVM defaults to thread_warp_size=1 for OpenCL targets: https://github.com/apache/tvm/blob/7f7762d53a2cf073e55e88e3cb7550a6a60cba3d/src/target/target_kind.cc#L346

If we are using TVM on, say, a Mali G610, where each warp contains 16 threads, should we update thread_warp_size to 16 when compiling our model? (I'm compiling with python -m mlc_llm compile.) I tried this: the model compiles, but the decoded tokens are gibberish.
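
For reference, here is a minimal sketch (plain TVM Python, not the mlc_llm CLI) of what I mean by overriding the attribute; the -device=mali key and the value 16 are just my assumptions for the G610, and how mlc_llm forwards this into its target is exactly what I'm unsure about:

```python
import tvm

# Default OpenCL target: thread_warp_size falls back to the value
# registered in target_kind.cc (1).
default_target = tvm.target.Target("opencl")
print(default_target.attrs["thread_warp_size"])  # 1

# Same target kind with the warp size overridden in the target string
# (16 is assumed here for a Mali G610-class GPU).
mali_target = tvm.target.Target("opencl -device=mali -thread_warp_size=16")
print(mali_target.attrs["thread_warp_size"])  # 16
```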