How to tune in parallel on a multi-GPU server?

I am trying to tune a GPU kernel on an 8-card server and want to parallelize the tuning to speed it up. Following an old post (How to tune CNN networks with multiple gpu devices? - #3 by nicklhy), I start a separate rpc_server per device as follows:

CUDA_VISIBLE_DEVICES=1 python3 -m tvm.exec.rpc_server --key 1080ti --tracker ...
CUDA_VISIBLE_DEVICES=7 python3 -m tvm.exec.rpc_server --key 1080ti --tracker ...
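(For reference, the servers above register with a separate rpc_tracker process; the host/port values below are placeholders rather than my exact ones, and query_rpc_tracker can be used to confirm that all servers show up under the 1080ti key.)

python3 -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
python3 -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190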

My tuning configuration is:

rpc_runner = auto_scheduler.RPCRunner(
    device_key,
    host=rpc_host,
    port=rpc_port,
    timeout=30,
    repeat=1,
    n_parallel=4,
    min_repeat_ms=200,
    enable_cpu_cache_flush=True,
)

tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # change this to 20000 to achieve the best performance
    runner=rpc_runner,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
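After this, the search is launched with the usual call (a minimal sketch; tasks, task_weights, and log_file are defined earlier in my script, and the builder is left at its default local builder):

# Run the search; measurement requests go through the tracker,
# which dispatches them to the registered 1080ti rpc_servers.
tuner.tune(tune_option)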

However, only device 0 is utilized.

cc @merrymercy