Target and Attributes

Motivation

We have started to introduce quite a few target-related concepts as we add more flexibility to code generation, autotvm, and tensorization. One particular example is the recently introduced "compiler" concept. While it is useful to add concepts as we grow the features, it is also important to group them around a few key data structures, so that developers get first-class customization as a result of the infrastructure design.

In particular, this RFC proposes to revisit the target namespace and discusses how customizations can be built around it.

Considerations

Target Data Structure

A target object represents a collection of information needed to customize the compilation for a specific device (or a collection of devices when we deal with a heterogeneous environment).

  • R0: Each target should have a TargetKey (e.g. cuda, llvm, vulkan, dnnl)
    • The target key can be used to index target-specific behaviors (which code to call for codegen)
  • R1: A primitive target needs to have a string representation that users can point to (e.g. "cuda")
  • R2: A target needs to have a list of attributes (e.g. llvm -mcpu=avx2); see the sketch after this list.
  • R3: We need a specific attribute about the hardware type, so that it can be used by AutoTVM for indexing.
  • R4: For most device targets, we also need a target_host that describes how to compile the host-side driver part of the program (which calculates the device launching parameters).
  • R5: We will need to provide a list of targets and a target_host for heterogeneous compilation (which could bring the possibility of a CompositeTarget (pending name)).
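
As a rough illustration of R0-R4, here is a minimal Python sketch. It follows the spirit of the current tvm.target API, but the exact constructor, attribute names, and property spellings may differ between TVM versions and are not part of this proposal:

    import tvm

    # R0/R1: the target key "cuda" indexes target-specific behavior and has a
    # string form users can write directly.
    # R2/R3: attributes, including one (-model, proposed to be renamed
    # -hardware in S1 below) that AutoTVM can use for indexing.
    dev_target = tvm.target.Target("cuda -model=gtx2080ti -max_num_threads=1024")

    # R4: a separate primitive target describes how to compile the host-side
    # driver code that computes the device launching parameters.
    host_target = tvm.target.Target("llvm -mcpu=core-avx2")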

Hardware

Hardware is a unique string that identifies the target device. A hardware string can imply a list of targets and target hosts. It is important to keep a simple, concise string format for hardware, so that our users can directly select from a built-in list when possible. We can also use the built-in names for benchmarking purposes.

Some example hardware strings:

  • rasp4b: implies llvm -mcpu=cortex-a72 -hardware=rasp4b
  • nvidia/gtx2080ti
  • aws/c4.xlarge
  • rk3399/gpu: use the gpu on rk3399 board
  • rk3399/bigcpu: use the big cores

There are many ways to name a hardware, and some of them are hierarchical. For example, two phones could have different names but correspond to the same SoC. Our current approach is to canonicalize the names to an agreed-upon name (e.g. the SoC name) and use that as a key for autotvm.

Importantly, a hardware string is not a target key itself; it can imply a composite collection of targets that are needed to perform the compilation. One way to do so is to allow target creation to take in [hardware-str] [additional-attributes], and manually maintain the default configuration in a file.

Strawman proposal for hardware

  • S0: Introduce target/hardware.py that maintains the mapping (hardware → target) and hierarchy (e.g. rasp4b → soc-name → arm-board); a sketch follows below.
  • S1: Rename -model to -hardware in the target string.
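
A hypothetical sketch of what S0 could look like. The module name comes from the strawman above; the table contents, helper name, and the -hardware attribute (the S1 rename) are made up for illustration and do not exist in TVM today:

    # target/hardware.py (proposed, does not exist yet)
    import tvm

    # hardware name -> the target string it implies
    HARDWARE_TABLE = {
        "rasp4b": "llvm -mcpu=cortex-a72 -hardware=rasp4b",
        "nvidia/gtx2080ti": "cuda -hardware=nvidia/gtx2080ti",
        "rk3399/gpu": "opencl -device=mali -hardware=rk3399/gpu",
    }

    # hierarchy/aliases: canonicalize more specific names to an agreed-upon key
    HARDWARE_HIERARCHY = {
        "rasp4b": "bcm2711",     # board -> SoC
        "bcm2711": "arm-board",  # SoC -> family
    }

    def create_target(hardware, extra_attrs=""):
        """Resolve [hardware-str] [additional-attributes] into a Target."""
        base = HARDWARE_TABLE[hardware]
        return tvm.target.Target(" ".join([base, extra_attrs]).strip())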

Target Attributes

In order to consolidate all the target-aware customization into the target, we will need to introduce target-specific attributes. Here is a list of possible attributes that a target could have:

  • A0: Intrinsic lowering rules for ops
  • A1: Ability annotation pass (for relay annotation to suggest supported features)
  • A2: Rewriting passes for the specific target (relay or TIR level)
  • A3: runtime::Module generation function for relay or TIR (bring your own codegen)
  • A4: Memory hierarchy information (alignments for special registers in accelerators)

For example, to implement the bring-your-own-codegen DNNL example, we will need to introduce a dnnl target and register A1, A2, and A3.

There are a few ways to achieve the target attribute registration.

  • B0: register via a specific PackedFunc callback
    TVM_REGISTER_GLOBAL("tvm.target.dnnl.relay_codegen")
    .set_body(DNNLCodegen);
  • B1: register via a columnar attribute table (as in Op)
    TVM_REGISTER_TARGET("dnnl")
    .set_attr<FRelayCodegen>("FRelayCodegen", DNNLCodegen);
  • B2: register via a row-wise table
    TVM_REGISTER_TARGET("dnnl")
    .set_relay_codegen(DNNLCodegen);

Both B1 and B2 require us to introduce a target registry similar to the op registry (quite a bit of the existing code can be repurposed).

B2 would require a TargetInfo data structure that centralizes all the possible target attributes in typed form. B1 is more flexible in terms of growing the list of attributes, just like the op_attrs_type.h file. Note that we will likely need to extend the target attributes as we add new specialized hardware targets.

Option B0 is slower to look up, but it is still very useful when we try to dispatch against an Op and Target combination (e.g. lowering an intrinsic rewriting rule for exp under the cuda target).
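
For instance, a B0-style dispatch could compose the global function name from the target key. The naming convention below is only an assumption based on the B0 example above:

    import tvm

    def lookup_relay_codegen(target_key):
        # B0: the codegen for a target key is found by name in the global
        # PackedFunc registry; this string lookup is what makes B0 slower.
        fname = "tvm.target.%s.relay_codegen" % target_key
        f = tvm.get_global_func(fname, allow_missing=True)
        if f is None:
            raise RuntimeError("no relay codegen registered for " + target_key)
        return f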

Discussions

Please share your thoughts. In particular, it would be helpful to discuss:

  • What would be a good string format for target? Is the current format good enough?
  • Do we need to introduce a CompositeTarget for heterogeneous cases?
  • Hardware choices wrt S0, S1
  • Whether to introduce target attributes, B0 vs B1 vs B2

cc @thierry @comaniac @zhiics @junrushao @jroesch @mbaret

For hardware strings, I think they may contain <vendor-id>/<device-id>, for instance: broadcom/bcm2711, rockchip/rk3399, nvidia/gtx2080ti. An optional <core-id> or <arch-id> can be added to specify the target core or architecture on such a device.

For hardware settings, I think we can probably use a format similar to the clang triple format as well (http://clang.llvm.org/docs/CrossCompilation.html#target-triple). I prefer introducing target/hardware.py to have a centralized description of them.

I think CompositeTarget is quite useful. With it, we can clean up the code for heterogeneous execution.

For the target attribute, I would vote for B1 as it is consistent with the op attribute and quite flexible for adding new attributes.

Refined this description based on post 6 by @tqchen.

  • What would be a good string format for target? Is the current format good enough?

Before presenting the hardware format I prefer, I’d like to mention two related side functions:

  1. A utility function to list all supported hardware and attributes (and probably a guide on how to choose hardware attributes; many people have asked how to determine -mcpu=?, for example).

  2. Enforce/canonicalize not only the names but also the attributes. In other words, we should make each combination of hardware and attributes unique and representative. For example, the -model attribute in CUDA is now just a label. AFAIK, it is only used by AutoTVM to match the records. However, users may use t4 or nvidia-t4 for the same GPU.

Based on 2, I suggest escalating the attributes we want to enforce to the hardware string, following @zhiics's proposal of a triple-like format, as it is straightforward for checking and hashing. Specifically, the format could be <backend>/<backend-specific-triple-format> (e.g., cuda/nvidia-t4).

Meanwhile, I also like the idea of aliases that @tqchen mentioned in post 6. The aliases do not have to follow the triple format, as their purpose is to reduce user effort, so we can use the format @liangfu suggested. In this case, I'd expect >80% of users to use only aliases, and the community should actively maintain the alias list to support the latest devices.

  • Do we need to introduce a CompositeTarget for heterogeneous cases

I think this is definitely required. For example, the TensorRT codegen would prefer to use CUDA instead of LLVM for the unsupported ops so that it can guarantee the same or better performance than plain CUDA. This is also considered a heterogeneous case, as the two use different codegen flows.

  • Hardware choices wrt S0, S1

Vote for S0 for better flexibility and extensibility.

  • Whether to introduce target attributes, B0 vs B1 vs B2

Vote for B1 for its flexibility. One minor question: if we register the target as in B0-B2, which happens on the C++ side, can we implement A0 and A1 on the Python side? For example, the composite patterns and customized annotation pass may be implemented in Python.


Re Python support: as with all TVM APIs, we will have first-class Python support in all of the proposals.
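
For example, since B0 attributes are just PackedFuncs, a codegen hook could be supplied from Python by registering a callback under the same global name used in the B0 example. This is a sketch only; the function name and signature are taken from that example and are not a fixed API:

    import tvm

    @tvm.register_func("tvm.target.dnnl.relay_codegen")
    def dnnl_relay_codegen(func):
        # A real implementation would translate the Relay function into a
        # runtime::Module; this stub only shows where the Python hook plugs in.
        raise NotImplementedError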

We are already using the target triple for LLVM as part of the attributes, so it is mainly a question of whether we can capture additional things. Or are we suggesting to extend the triple format?

To make the discussion more concrete, I would recommend that everyone give some examples of a target string that a user could write.

One thing to note on the hardware side is that it will usually be more refined than a typical triple. For example, broadcom/bcm2711 could correspond to a typical ARM triple that another device with cortex-a73 cores could also correspond to.

The main intention of introducing hardware is to allow most users to avoid setting the attributes (e.g. mcpu), because a hardware string usually implies all of that information.

Alias

It is also interesting to ask about aliases: broadcom/bcm2711 and broadcom/rasp4b could mean the same thing, but the latter is more intuitive to users.

Support Cloud Instance

Finally, we could ask whether we want to include cloud instances as a type of hardware. aws/c4.xlarge is a pretty useful flag that cloud users can use directly.

Support Multiple Device Choices

On certain devices, e.g. rockchip/rk3399, there are both a GPU and CPUs, so it would be useful to see whether we can improve the hardware string to cover the choice of device, for example rockchip/rk3399:gpu.

I prefer S0 and B1 for the interface design. I personally prefer a hierarchical target string over the clang triple format as it is more flexible and easier to understand.

CompositeTarget is definitely useful and can be extended to third-party libraries as well.

In cases where we are running compiled TVM models in environments that cannot run the Python tvm module (i.e. embedded devices), we will need to use a different runtime. This also changes our strategy for interacting with such devices, e.g. for autotvm and for debugging the compiled code.

It would be nice to have a way to specify the runtime as a target attribute as well, e.g. runtime=host, runtime=standalone, etc.
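
Something like the following could express that, assuming a -runtime attribute were added to the target parser. The attribute and its values here are hypothetical, taken from the suggestion above, and are not existing flags:

    # Hypothetical target strings with a runtime attribute:
    host_style = "llvm -runtime=host"                      # device driven by the Python tvm module
    standalone = "c -mcpu=cortex-m4 -runtime=standalone"   # bare-metal, self-contained runtime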

I’m a bit late to this, but here are my thoughts:

The hardware string shouldn’t be the only way to define target, it should be a convenient shortcut of a more complex definition. I think that’s what S0 does—we can maintain a mapping from a set of predefined hardware “keywords” into an actual target configuration, but not make it the only way to specify a target.

Target configuration should allow target-specific parameters (for example, a string with target-specific content). Those parameters would be accessible to the target's codegen and runtime, and only the code specific to that target would know how to interpret them.

I like the idea of targets having attributes, specifically having an attribute specifying the code generator for that target. For example, “llvm” could be a known value of that attribute for some targets. “llvm” by itself should not be a target. As a matter of fact, conceptually, the fact that a target uses LLVM as code generator should not be visible outside of that target at all. However, we allow target-specific intrinsics (specifically LLVM intrinsics), so we need some sort of visibility of it outside of the target code. Now, instead of having “call_llvm_intrinsic”, we could have a “call_target_intrinsic”, with an extra parameter stating what codegen it is for. It would be up to the author to make sure that if a target intrinsic is used directly in the TVM IR, it will match the target’s code generator.
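
A sketch of the contrast. The generic form is purely hypothetical; only the LLVM-specific intrinsic call exists in TIR today:

    from tvm import tir

    x = tir.Var("x", "float32")
    # Existing, LLVM-specific form (exact signature elided):
    #   e = tir.call_llvm_intrin("float32", "llvm.exp.f32", ..., x)
    # Proposed, codegen-neutral form, with the code generator named explicitly:
    #   e = call_target_intrinsic("float32", codegen="llvm", intrin="llvm.exp.f32", args=[x])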

Lastly, we shouldn’t really use target triples specifically. The LLVM code generator will need to know how to construct one from TVM’s target desciption, but triples themselves are outdated as a concept, and are still around because it would be next to impossible to phase them out of use.


Trying to capture the discussions here and here is a strawman [RFC] TVM Target Specification

Hi tqchen,

If there are multiple NVIDIA GPUs on my platform, how can I choose which one (I want to use only one of them) to schedule and run on?

thanks.

In this case, you can use tvm.cuda(device_id) to set the device in the various executors and to construct ndarrays.
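
For example (a sketch using the current Python API; older releases spell these tvm.gpu and ctx= instead of tvm.cuda and device=):

    import numpy as np
    import tvm

    dev = tvm.cuda(1)  # select GPU 1 instead of the default GPU 0
    a = tvm.nd.array(np.zeros((3, 4), dtype="float32"), device=dev)
    # For a compiled library: module = graph_executor.GraphModule(lib["default"](dev))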

Thanks for your answer.

But when using auto_scheduler or meta_schedule to tune the model, can I set the target as "cuda 0" or "opencl 0"?

Besides, if there are many CPU cores on the platform and I want to choose only one or some of them for tuning, can I set "cpu 0" or "cpu 1,2,5" when tuning?