Auto-scheduler failed to extract tasks with a heterogeneous target, as reported in https://discuss.tvm.apache.org/t/can-autoscheduler-support-tuning-for-multiple-targets/1048. I did some investigation and found that we need to manually call target = relay.build_module.build_target_by_device_type_map(target) to first transform the target into a dict. However, even with that change, I still got an error in the TE compiler.
The error is:
TVMError: No target is provided for device llvm
where the input targets for UpdateMainWorkspaceSize are:
1: llvm -keys=cpu -link-params=0
2: cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32
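For context, here is a minimal sketch of the setup being described (the exact form of the heterogeneous target dict, and mod/params coming from a frontend importer, are assumptions on my side); this is the workaround that still hits the error above:

```python
from tvm import relay, auto_scheduler

# Assumed heterogeneous target: one target string per device kind.
target = {"cpu": "llvm", "cuda": "cuda"}

# Workaround mentioned above: normalize the target into a
# device-type -> Target map before task extraction.
target = relay.build_module.build_target_by_device_type_map(target)

# mod / params are assumed to come from a frontend importer,
# e.g. relay.frontend.from_tensorflow.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
```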
I have no idea why the TE compiler is looking for llvm at 0 instead of 1. The weirdest thing is that this error won't happen if we build the Relay module directly.
@comanic there is some strange code in there that uses 0 as a default, iirc; I remember debugging a very familiar-looking bug like this at some point in the past. I now can't remember what the resolution was. I half remember changing this code, realizing there was some strange invariant, and putting it back. One of my next goals is to clean up target handling in this piece of code, as it's currently a mess imo.
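To make that failure mode concrete, here is a hypothetical illustration (not the actual TE compiler code) of what a 0-as-default lookup would do against the target map printed above:

```python
# Targets keyed by device type, as printed above (1 = cpu/llvm, 2 = cuda).
targets = {
    1: "llvm -keys=cpu -link-params=0",
    2: "cuda -keys=cuda,gpu -max_num_threads=1024 -thread_warp_size=32",
}

# Hypothetical default: an expression with no device annotation falls back
# to device type 0, which is not in the map, so the lookup fails even
# though an llvm target was provided under key 1.
device_type = 0
if device_type not in targets:
    raise RuntimeError(f"No target is provided for device type {device_type}")
```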
I've also found that the number of extracted tasks seems different after the TE compiler PR. For example, TF2 SSD MobileNet V2 previously extracted 57 tasks, but now it extracts 130. Investigating…
For graph codegen, it seems the TE compiler doesn't repeatedly call auto_schedule_topi_compute on identical workloads, so each unique task is extracted only once and the weight is correctly updated by the te_compiler_update_weights call. This doesn't apply to VM codegen, where the same task is extracted multiple times. But since the key is different now, the workload cache lookup fails and identical tasks are returned as distinct tasks.
Yes, I believe a workload hash should be sufficient for the key, since previously identical workloads shared the same func name anyway (but now func names are unique).
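To illustrate the keying issue, here is a minimal sketch (not the actual extraction code): keying the workload cache by the hash alone collapses repeated call sites into one task with an accumulated weight, while keying by the now-unique func name splits them into duplicates.

```python
from collections import defaultdict

# call_sites: (func_name, workload_hash) pairs observed during lowering.
call_sites = [
    ("fused_conv2d", "hash_a"),
    ("fused_conv2d_1", "hash_a"),  # same workload, unique func name
    ("fused_dense", "hash_b"),
]

def count_tasks(key_fn):
    weights = defaultdict(int)
    for func_name, workload_hash in call_sites:
        weights[key_fn(func_name, workload_hash)] += 1
    return dict(weights)

# Keyed by workload hash: {'hash_a': 2, 'hash_b': 1} -- one task, weight 2.
print(count_tasks(lambda name, h: h))

# Keyed by (unique) func name: three "distinct" tasks with weight 1 each,
# which is the duplication described for the VM codegen path.
print(count_tasks(lambda name, h: name))
```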
I will follow up directly on GitHub; I think we will need to iterate on task extraction as we clean up the lowering, etc. I am preparing an RFC on unifying the lowering.