Hi @chenugray,
The default TIR PrimFuncs emitted by EmitTE do not contain any thread-binding constructs, so Relax currently relies on MetaSchedule to do the thread binding when targeting GPU. You just need to run one tuning trial per task (subgraph) to get a valid GPU schedule.
You can find a test case here: https://github.com/tlc-pack/relax/blob/relax/tests/python/relax/test_autotir_integration.py#L131.