[RFC] Canonicalizing AutoTVM Log Format

Please also refer to the earlier topic that moved the AutoTVM log version to 0.2, where some of these issues have already been discussed: AutoTVM log format

@mdw-octoml I don’t think there’s currently enough interest to justify adding Protobuf as a dependency in TVM. TVM users are used to readable json for their autotvm logs. If there is more interest from the broader community, we can revisit this.

@tqchen I feel we may not need to re-discuss the schema heavily since it was already discussed very recently in the RFC @comaniac shared and, as you say, ansor will likely introduce modifications. Maybe instead of schema specifics this RFC should be more about adding code structure to log production so that there is a single source for future log modifications (I have no problems with preserving the current log format exactly as it is). I will clarify this in the original post.


I agree with @tqchen. Probably we should wait and see what the Ansor log looks like and include it in the design. We could have @merrymercy comment on this.

At a high level, I suggest we have five fields: target, workload, config, results, version. The only change is taking the target out of the original input field, while having the workload describe the computation.

I agree that protobuf is good for this purpose. But I’d prefer that we still output the log into a text format so that it’ll be easy to quickly check the details.

My thoughts are that the suggested change to add Python structure shouldn’t necessarily depend on what the log format will look like, so I don’t think there is a need to wait for the Ansor log format. (I imagine Ansor coders have their hands full, and that they’d prefer to consider polish later in their process).

The main value add of this proposal is to enable clearer conversations about schema changes in the future.

For example, @haichen is this an accurate summary of your suggested changes?

class AutoTVMLog:
  target: str                     # added
  workload: Workload              # modified from "input: Input"
  config: Config
  result: Result
  version: str
  tvm_version: str

class Workload:                   # added
  task_name: str
  args: List[Argument]
  kwargs: Dict[str, Any]
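For concreteness, the summary above can be written as runnable Python dataclasses. This is only a sketch: the `Argument` alias and the `Any`-typed `config`/`result` fields are placeholders of mine, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union

# Placeholder for the argument types discussed later in the thread.
Argument = Union[str, int, float, tuple]

@dataclass
class Workload:                      # added
    task_name: str
    args: List[Argument] = field(default_factory=list)
    kwargs: Dict[str, Any] = field(default_factory=dict)

@dataclass
class AutoTVMLog:
    target: str                      # added
    workload: Workload               # modified from "input: Input"
    config: Any                      # stands in for a Config class
    result: Any                      # stands in for a Result class
    version: str
    tvm_version: str
```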

Probably we can canonicalize the target (e.g., as a protobuf message) instead of a string as well. For the target format, we can refer to [RFC] TVM Target Specification. @tqchen

I’ve thought about this some more, and I’m changing my stance with respect to ProtoBuf. While adding a Python class schema is a less invasive change than introducing ProtoBuf and allows us to stick to the current log format exactly, protos do have the added benefit of being language-neutral. It also seems likely, moving forward, that sticking to “industry standard” practices (as @mdw-octoml indicated) will enable even more clarity around schema changes and enforce, to some extent, more backwards compatibility than we’ve seen so far.

To that end, here is a restatement of the proposed schema in .proto, with comments marking the modifications. Note this will certainly require a schema format update from 0.2 -> 0.3, and implementation details may change slightly. I would also send a PR to tophub accordingly if people agree to this change.

syntax = "proto3";
package autotvm.log;
import "google/protobuf/any.proto";

message Target {
  // For now this is the string representation of a target; e.g. "llvm -mcpu=broadwell"
  // This should be replaced once the rfc "TVM Target specification" is finalized
  string target_string = 1;
}

message AutoTVMLog {
  Target target = 1;
  Workload workload = 2;
  Config config = 3;
  Result result = 4; 
  string version = 5;
  string tvm_version = 6;
}

message Workload {
  string task_name = 1;
  repeated Argument args = 2;
  // kwargs is no longer included as it is unused
}

message Argument {
  oneof arg {
    Tensor tensor = 1;
    // Possible tuple values are not well specified and may require more sorting out
    // https://github.com/apache/incubator-tvm/blob/master/python/tvm/autotvm/task/task.py#L43-L63
    Tuple tuple = 2;
    string value = 3;
  }
}

message Tensor {
  string name = 1;
  repeated uint32 shape = 2;
  string dtype = 3;
}

message Tuple {
  repeated google.protobuf.Any values = 1;
}

message Config {
  string code_hash = 1;
  repeated Entity entities = 2;
  uint32 index = 3;
}

message Entity {
  // Entities are previously output as `[["tile_ow", "sp", [-1, 1]], <other_entities>]`
  // The proposed encoding clarifies entity type in the schema itself instead of as a string
  string knob_name = 1;
  oneof entity {
    SplitEntity split = 2;
    ReorderEntity reorder = 3;
    AnnotateEntity annotate = 4;
    OtherOptionEntity other_option = 5;
  }
}

message SplitEntity {
  repeated int32 size = 1;
}

message ReorderEntity {
  repeated uint32 order = 1;
}

message AnnotateEntity {
  repeated string annotations = 1;
}

message OtherOptionEntity {
  google.protobuf.Any value = 1;
}

message Result {
  repeated float costs = 1;
  int32 error_no = 2;
  float all_cost = 3;
  float timestamp = 4;
}

As an example, the json will look like

{
  "target": {
    "target_string": "llvm -mcpu=broadwell"
  },  
  "workload": {
    "task_name": "conv2d_x86_64",
    "args": [{"tensor": {"name": "tensor_name","shape": [1,2,3],"dtype": "float32"}}]
  },  
  "config": {
    "code_hash": "codehashtest",
    "entities": [{"knob_name": "tile_ic","split": {"size": [4,32]}}],
    "index": 1
  },  
  "version": "0.3",
  "tvm_version": "todo get tvm version"
}

To avoid breaking workflows that assume readable log output by default, I suggest we simply add “protobuf” as an encode/decode/file logging option in https://github.com/apache/incubator-tvm/blob/master/python/tvm/autotvm/record.py. The default serialization format will still be “json”, but all serialization schemes will be backed with the proto-generated schema. @haichen @jroesch @tqchen what do you think?

The proposal looks good. Notably, the config will need to evolve as we migrate to Ansor, so perhaps we could try to keep it opaque, or find a way to upgrade it later.

I think the main benefit of keeping the ProtoBuf opaque is avoiding the unnecessary effort of fleshing out a schema that will change very soon. However, since I already have a full specification described here, I prefer to go ahead with it, unless there are other concerns I have missed.

I suggest that the process for upgrading this schema should be opening an RFC like this one (ideally linking a PR with the desired .proto changes).

I would also like to point out some caveats with ProtoBuf usage.

  • It’s highly encouraged that proto fields are never removed, but instead marked with a “deprecated” flag, unless you are aware that removal will break backwards compatibility.

For the ansor changes, if we are deprecating autotvm 1.0 entirely, I think it would be ok to remove fields as needed. If so, a fully specified schema as the resolution of this RFC makes even more sense, as it would give people an explicit schema to refer to for pre-ansor logs.

cc @merrymercy @zhiics @haichen @FrozenGene @comaniac @ajtulloch @antinucleon @junrushao

The proto representation looks good to me. I have a couple of suggestions based on prior experience designing proto-based data formats.

  • I recommend the use of enums rather than strings for values that are constrained to a small, fixed-size set. For example, the dtype field in the Tensor message should probably (I think!) be an enum.

  • I don’t know the use case for the google.protobuf.Any fields in the spec, but in general I would recommend making these specific message types or ‘oneof’ fields whenever possible.

  • There may be places that you wish to tighten up the semantics of the existing log format, rather than simply encoding the existing format as a proto. For example, I would recommend being explicit about the meaning of the ‘version’ field (e.g., should this be a SemVer-type version string?). Likewise, use of a float value for timestamps can lead to imprecision, unless timestamp means something different here than it does in most other systems – uint64 storing microseconds since the epoch, or a string holding an ISO-8601 formatted timestamp would be better.

  • For the case of the Config message, if you believe it will soon change or differ based on new functionality coming along, consider using a oneof field with a single submessage for the existing Config.
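To illustrate the timestamp point above: assuming the current field holds a UNIX timestamp in seconds, the lossless string form could be produced like this (a sketch only, the helper name is mine).

```python
from datetime import datetime, timezone

def to_iso8601(unix_seconds: float) -> str:
    """Convert a UNIX timestamp to an ISO-8601 string in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()

# A 32-bit proto float has roughly 7 significant digits, so second-level
# precision is already lost at current epoch values (~1.6e9 seconds);
# the ISO-8601 string form keeps full precision.
to_iso8601(1600000000.0)  # '2020-09-13T12:26:40+00:00'
```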

Some comments on the dtype: the dtype field in Tensor is actually quite flexible (it goes beyond an enumeration, since arbitrary vector lengths, bit widths, and customized data types are also allowed). So perhaps a string, or a structured variant, makes sense. We can continue to use a string for simplicity and consistency with the Python-side repr; alternatively, one could design a composite encoding, but that would involve parsing and printing of the type string, which could be overkill here.
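For reference, the parsing in question is lightweight. Here is a sketch of my own (it deliberately ignores TVM's custom datatypes) splitting a dtype string into its components:

```python
import re

# Parse a numpy/TVM-style dtype string such as "float32" or "int8x16"
# into (type_code, bits, lanes). Custom datatypes are out of scope here.
_DTYPE_RE = re.compile(r"^(?P<code>[a-z_]+)(?P<bits>\d+)(?:x(?P<lanes>\d+))?$")

def parse_dtype(dtype: str):
    m = _DTYPE_RE.match(dtype)
    if m is None:
        raise ValueError(f"unrecognized dtype string: {dtype!r}")
    return m.group("code"), int(m.group("bits")), int(m.group("lanes") or 1)

parse_dtype("float32")  # ('float', 32, 1)
parse_dtype("int8x16")  # ('int', 8, 16)
```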


I see. In my experience, it is worth making this a structured type, even if it seems painful at first. In the long run, having to maintain custom parsing logic for just one of your fields (where the others are all structured) ends up being a maintenance burden. I’m a strong advocate for using structured types as they were intended to be used.

In this case the parsing is already necessary and built in, because the numpy convention uses a string for the dtype. So we are trying to build compatibility for interoperating with something that already exists. The types on the C++ side are structured.

Gotcha. In that case I think it’s important to document that the format of the field is the type string used by numpy.

Difference between the logs for Ansor and AutoTVM

There are two major differences between ansor’s log and autotvm’s log

  1. The workload for Ansor is a subgraph defined by multiple tvm.compute calls, while the workload for autotvm is a single operator. To index logs quickly, Ansor stores a hash value of the subgraph as the workload key.
  2. Ansor saves the whole serialized schedule as config (in json format), while autotvm only stores the parameters.

However, Ansor’s new log format can still fit into @tqchen’s design of top-level fields.

Other thoughts

  1. The current log file is an appendable text file, where one line corresponds to one log item, so I can edit it with a text editor. If we use a binary format, I want this property to be preserved.
  2. If we make the log longer and more readable, there will be a lot of redundancy in the file. For example, for a single tuning job, the same long target string will appear in all lines. Do we have methods to compress it?

General Comments

IMHO, @merrymercy’s comments on log files are valuable. Many users now look into the log file for the information they need, and even manually modify some logs for experiments or optimizations. This is possible because 1) the log files are in text format, and 2) one config (line) in a log file is of reasonable length. As a result, at a high level I agree with @anwang’s proposal that keeps the log file in JSON format but uses the proto-generated schema to (de)serialize it. IIUC, this approach still allows users to modify the log file manually if needed.

On the other hand, one concern I have with the current proposal is the workload. In terms of semantics, the workload mentioned in the proposal is more like a task, as it has task_name and args. A workload should be a list of input tensors, which is independent of tasks. Here is a complete example of a conv2d task:

"task": {
  "task_name": "conv2d_NCHWc.x86",
  "args": [{"tensor": {"name": "data","shape": [1,3,224,224],"dtype": "float32"}},
           {"tensor": {"name": "weight","shape": [32,3,3,3],"dtype": "float32"}},
           [1, 1], [1, 1, 1, 1], [1, 1], "NCHW", "NCHW", "float32"]
}, 

In addition, one problem is that args is just a list of task arguments, so it’s hard for people to understand their actual meaning. It’d be great if we could also improve the task initialization process to take keyword arguments instead of positional arguments. As a result, we could have:

"task": {
  "task_name": "conv2d_NCHWc.x86",
  "args": {"data": {"tensor": {"name": "data","shape": [1,3,224,224],"dtype": "float32"}},
           "weight": {"tensor": {"name": "weight","shape": [32,3,3,3],"dtype": "float32"}},
           "strides": [1, 1],
           "pooling": [1, 1, 1, 1],
           "dilation": [1, 1],
           "data_layout": "NCHW",
           "output_layout": "NCHW",
           "dtype": "float32"}
}, 

Ansor’s Log Format

As @merrymercy mentioned, since Ansor targets a subgraph instead of a single operator, the task_name would be an issue. The current approach of using a hashed subgraph is definitely not user friendly, and we cannot re-establish the subgraph by interpreting its hash value. A better solution would be to provide a utility to serialize the compute DAG as a string, and another to deserialize the string back into the compute DAG.

Addressing @mdw-octoml’s points:

  • I will add a comment addressing the semantics of the dtype field in the proto.
  • I will further refine the spec to avoid Any. I originally included google.protobuf.Any to capture the current tuple argument semantics, which seemingly supports arbitrary nesting here https://github.com/apache/incubator-tvm/blob/master/python/tvm/autotvm/task/task.py#L43-L63. It looks like stakeholders prefer to improve the format rather than use it as a snapshot, so this will warrant further discussion.
  • re: tightening up the proto semantics. I will add comments to the proto to elucidate the following: version refers to log format schema version as a SemVer string. An example of tvm_version is “0.7.dev1” and afaik that doesn’t follow SemVer, but I will comment on the expected format. I agree that timestamp should be an ISO-8601 formatted timestamp and will make this change.
  • It looks like Config will have some drastic changes, so I will convert the Config message to containing a oneof field.

@comaniac Regarding changing workload to task: since Ansor is not op-based, I’m inclined to keep the workload naming to prepare for Ansor’s log format changes.

I agree that the list-based representation of arguments is less than ideal – currently it’s hard to understand the semantics of any particular argument. If we go with a “kwargs” approach I think we should not support “arbitrary” kwargs, since the proto would necessarily need to look like

message Task {
  string task_name = 1;
  map<string, google.protobuf.Any> args = 2;
}

or

message Task {
  string task_name = 1;
  map<string, Argument> args = 2;
}

The “arbitrary kwarg” approach doesn’t restrict the type of a particular argument in any meaningful way, and I feel the point of formalizing a schema is to add these restrictions. I think it would be better to have a full enumeration of the possible arguments for the task. @comaniac what do you think? Is the example you provided an exhaustive representation of possible arguments? If not and you agree that we should restrict possible arguments, could you provide or point me to where I can find the right enumeration?

@anwang Ansor also has a “task” concept. A task is not necessarily limited to a single operator; it just means a “tuning” task. As a result, I still vote for task.

In addition, I don’t think a full enumeration is suitable, for several reasons.

  1. A full enumeration would lose flexibility when adding new tasks.

  2. It would make the log too long and tedious, because the task arguments (attributes) differ greatly between tasks. For example, these are the task arguments for conv2d_NCHWc.x86:

And this is dense.nopack.x86:

You can basically search for autotvm.register_topi_compute in TOPI to see all task function arguments. Unless we can also canonicalize the task arguments, it seems impractical to have a full enumeration of the argument list.

Consequently, IMHO, supporting arbitrary kwargs arguments would be more practical.

I see. Thanks for clarifying @comaniac, I agree with your comments.

Addressing @merrymercy’s points:

  • One possible solution to the redundancy of repeated items such as the target string would be to encode something like message AutoTVMLogs { string target = 1; repeated AutoTVMLog logs = 2; ... }, where the inner AutoTVMLog no longer carries the target string. However, this change would make it harder to adhere to the “one record per line” json convention AutoTVM currently holds. For simplicity I prefer keeping the redundancy, but since I haven’t worked very closely with the logs myself, I will defer to others’ takes.
  • The proposed implementation will allow manipulation of readable json.
  • The major differences you indicated can modify the proto as desired when ansor is ready.
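For illustration only, the grouping alternative mentioned in the first bullet could be sketched in plain JSON lines like this (the helper is hypothetical, not a committed design):

```python
import json

def write_grouped(path: str, target: str, records: list) -> None:
    """Write the shared target once as a header line, then one record per line."""
    with open(path, "a") as f:
        f.write(json.dumps({"target": target}) + "\n")
        for record in records:
            f.write(json.dumps(record) + "\n")
```

The trade-off is visible immediately: a reader must now track header state across lines, which breaks the property that each line is a self-contained record.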

Here is an updated proposal of the protobuf given everyone’s feedback.

syntax = "proto3";
package autotvm.log;
import "google/protobuf/any.proto";

message Target {
  // For now this is the string representation of a target; e.g. "llvm -mcpu=broadwell"
  // This should be replaced once the rfc "TVM Target specification" is finalized
  string target_string = 1;
}

message AutoTVMLog {
  // The compilation target
  Target target = 1;
  // Represents a tuning task
  Task task = 2;
  // The configuration used by this task
  Config config = 3;
  // Tuning results
  Result result = 4; 
  // SemVer string describing the AutoTVM log format version
  string version = 5;
  // SemVer string with qualifiers attached as a suffix. e.g. "0.7.dev1"
  string tvm_version = 6;
}

message Task {
  // Human-readable task name
  string task_name = 1;
  // Map of keyword arguments where the key indicates argument name
  map<string, Argument> args = 2;
}

message Argument {
  oneof arg {
    Tensor tensor = 1;
    // Possible tuple values are not well specified and may require more sorting out
    // https://github.com/apache/incubator-tvm/blob/master/python/tvm/autotvm/task/task.py#L43-L63
    Tuple tuple = 2;
    string value = 3;
  }
}

message Tensor {
  repeated uint32 shape = 1;
  // Indicates a numpy dtype
  string dtype = 2;
}

message Tuple {
  repeated google.protobuf.Any values = 1;
}

// Config for AutoTVM v1
message Config_v1 {
  // code hash
  string code_hash = 1;
  repeated Entity entities = 2;
  uint32 index = 3;
}

message Config {
  oneof config {
    Config_v1 config_v1 = 1;
  }
}

message Entity {
  // Entities are previously output as `[["tile_ow", "sp", [-1, 1]], <other_entities>]`
  // The proposed encoding clarifies entity type in the schema itself instead of as a string
  string knob_name = 1;
  oneof entity {
    SplitEntity split = 2;
    ReorderEntity reorder = 3;
    AnnotateEntity annotate = 4;
    OtherOptionEntity other_option = 5;
  }
}

message SplitEntity {
  repeated int32 size = 1;
}

message ReorderEntity {
  repeated uint32 order = 1;
}

message AnnotateEntity {
  repeated string annotations = 1;
}

message OtherOptionEntity {
  google.protobuf.Any value = 1;
}

message Result {
  // The measured runtime costs of this configuration
  repeated float costs = 1;
  // The error type defined by MeasureErrorNo
  int32 error_no = 2;
  // End-to-end cost of benchmarking, including rpc, compilation, test runs
  float all_cost = 3;
  // ISO-8601 formatted timestamp
  string timestamp = 4;
}

One further question I have is regarding the Tuple argument. It is serialized arbitrarily in branches that include possible recursion here https://github.com/apache/incubator-tvm/blob/master/python/tvm/autotvm/task/task.py#L53-L54 and it’s unclear to me what these different serializations should map to in logical structures. Could someone (perhaps @haichen) clarify what each branch is meant to represent? Everything that I’ve marked Tuple below represents a structure that is unclear to me.

if isinstance(x, tensor.Tensor):  # message Tensor { shape, dtype }
    return ('TENSOR', get_const_tuple(x.shape), x.dtype)
if isinstance(x, (tuple, list, container.Array)):  # message Tuple { repeated Any } 
    return tuple([_encode(a) for a in x])
if isinstance(x, (str, int, float, np.int, np.float, expr.Var)):  # message Tuple { repeated Any } 
    return x
if isinstance(x, (expr.StringImm, expr.IntImm, expr.FloatImm)):  # message Tuple { repeated Any }
    return x.value
if isinstance(x, runtime.container.String):  # string value
    return str(x)
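A dependency-free paraphrase of that recursion (with the TVM-specific branches dropped) shows why Tuple values end up heterogeneous and possibly nested:

```python
def encode_arg(x):
    """Simplified stand-in for autotvm's _encode; containers recurse."""
    if isinstance(x, (tuple, list)):
        return tuple(encode_arg(a) for a in x)
    if isinstance(x, (str, int, float)):
        return x
    raise TypeError(f"unsupported argument type: {type(x)}")

# Nested containers produce nested tuples, so a proto Tuple message must
# hold arbitrarily nested heterogeneous values -- hence the Any question.
encode_arg([1, ("a", [2.5, 3])])  # -> (1, ('a', (2.5, 3)))
```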