[BYOC, CUTLASS] Dealing with Constants in C source-gen based BYOC

The recently merged CUTLASS BYOC relies on the C-source-gen based BYOC infra to JIT-generate and compile C++ template classes.

Currently it doesn’t support Constants embedded in an external function; instead, it requires all weight and bias parameters etc. to be passed in at runtime. This caused a problem for me when I applied CUTLASS BYOC to a real model: I need to run constant folding to turn fp32 bias parameters into fp16, both for pattern matching and for sending fp16 tensors to CUTLASS. For that, I need to bind parameters to the module with bind_params_by_name, which embeds constants into the external functions like this, which CUTLASS BYOC does not support right now:

def @tvmgen_default_cutlass_main_267(%cutlass_267_i0: Tensor[(1024, 1024), float16], %cutlass_267_i1: Tensor[(4096, 1024), float16], Inline=1, Compiler="cutlass", global_symbol="tvmgen_default_cutlass_main_267", Primitive=1) -> Tensor[(1024, 4096), float16] {
  %9 = fn (%FunctionVar_8_0: Tensor[(1024, 1024), float16], %FunctionVar_8_1: Tensor[(4096, 1024), float16], %FunctionVar_8_2: Tensor[(4096), float16], PartitionedFromPattern="nn.dense_add_multiply_cast_erf_cast_multiply_add_multiply_", Composite="cutlass.dense_bias_gelu_fp16") -> Tensor[(1024, 4096), float16] {
    %1 = nn.dense(%FunctionVar_8_0, %FunctionVar_8_1, units=None, out_dtype="float16") /* ty=Tensor[(1024, 4096), float16] */;
    %2 = add(%1, %FunctionVar_8_2) /* ty=Tensor[(1024, 4096), float16] */;
    %3 = multiply(%2, meta[relay.Constant][0] /* ty=float16 */) /* ty=Tensor[(1024, 4096), float16] */;
    %4 = cast(%3, dtype="float32") /* ty=Tensor[(1024, 4096), float32] */;
    %5 = erf(%4) /* ty=Tensor[(1024, 4096), float32] */;
    %6 = cast(%5, dtype="float16") /* ty=Tensor[(1024, 4096), float16] */;
    %7 = multiply(%6, meta[relay.Constant][1] /* ty=float16 */) /* ty=Tensor[(1024, 4096), float16] */;
    %8 = add(%7, meta[relay.Constant][2] /* ty=float16 */) /* ty=Tensor[(1024, 4096), float16] */;
    multiply(%8, %2) /* ty=Tensor[(1024, 4096), float16] */
  };
  // meta[relay.Constant][3] is the bias constant, not supported by CUTLASS BYOC for now
  %9(%cutlass_267_i0, %cutlass_267_i1, meta[relay.Constant][3] /* ty=Tensor[(4096), float16] */) /* ty=Tensor[(1024, 4096), float16] */
}

So I now need to deal with Constants. I think embedding all constants into the C source is infeasible for models like BERT-large, which I’m working with. The alternative I can think of is to somehow “unbind” constants after constant folding, but this requires modifying the signatures of the external functions and passing additional parameters inside the main module, and I don’t see an easy way to achieve that.

My questions:

UPDATE: For the particular case I’ve been working with, replacing one is_constant() in my pattern with wildcard() let me avoid running constant folding before pattern matching. So for now, I’m unblocked.
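To make the difference concrete, here is a toy sketch in plain Python (not the actual tvm.relay.dataflow_pattern API; the tuple-based "expressions" are made up for illustration) of why wildcard() matches where is_constant() does not: before constant folding, the bias argument is still an expression such as a cast of an fp32 constant, not a Constant node.

```python
# Toy stand-ins for pattern nodes; the real TVM pattern language
# (tvm.relay.dataflow_pattern) works analogously but on Relay exprs.
def is_constant():
    # Matches only a literal constant node.
    return lambda expr: expr[0] == "const"

def wildcard():
    # Matches any expression.
    return lambda expr: True

# Before constant folding, the fp16 bias is still `cast(const_fp32)`.
bias_before_folding = ("cast", ("const", "float32"))
bias_after_folding = ("const", "float16")

assert not is_constant()(bias_before_folding)  # match fails before folding
assert is_constant()(bias_after_folding)       # only matches after folding
assert wildcard()(bias_before_folding)         # wildcard matches either way
```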

But I still wonder if it is realistic not to support Constants at all in a BYOC codegen…

Your solution makes sense to me. This mechanism is used for the case where a BYOC backend wants to manage the constant values with certain processing, such as layout transforms. It works well for other codegens (e.g., JSON), but as you pointed out, we never really solved this problem for the C codegen.

IMHO, we could have a specialized mechanism for the C codegen to manage constants. For example, we could let the C codegen serialize the constants to a separate artifact file, encapsulate it along with the generated/compiled engines, and load them into memory at the first execution.
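A minimal sketch of that idea in plain Python (the names here are hypothetical, not an existing TVM API): at codegen time the constants are written to a side artifact next to the compiled engine, and the runtime loads them into memory lazily on the first execution.

```python
import json, os, tempfile

def save_constants(path, constants):
    # "Codegen" side: serialize named constant tensors to a side artifact.
    with open(path, "w") as f:
        json.dump(constants, f)

class LazyConstants:
    """Runtime side: load the artifact into memory at the first execution."""
    def __init__(self, path):
        self.path = path
        self._consts = None

    def __getitem__(self, name):
        if self._consts is None:          # first execution only
            with open(self.path) as f:
                self._consts = json.load(f)
        return self._consts[name]

# usage: hypothetical artifact produced next to the compiled engine
path = os.path.join(tempfile.mkdtemp(), "cutlass_consts.json")
save_constants(path, {"bias_fp16": [0.5, 1.5]})
consts = LazyConstants(path)
assert consts["bias_fp16"] == [0.5, 1.5]
```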

On the other hand, the reason that BYOC backends may need to manage constants by themselves is that the processed constants may violate the typing (e.g., layout or data type), so another approach is to let the C codegen register/update constants in the metadata module. This should be done via the constant updater:

For the second question, which uses the JSON runtime, the flow may look like the following, which is similar to TensorRT:

  1. In codegen, simply output JSON graph and constants.
  2. In the runtime, at the first iteration, run the C codegen according to the JSON graph and the input data, and profile/compile the generated C code into executable kernels. As you can imagine, the first iteration may be very slow in this case.
  3. Cache and execute kernels.
  4. In the remaining iterations, simply use the compiled kernels as they are.
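The four steps above can be sketched as a small runtime class (plain Python; exec-ing generated source text stands in for profiling/compiling the generated C code, and the class name is made up):

```python
class JITModule:
    """Compile on first run, then serve cached kernels (steps 1-4 above)."""
    def __init__(self, json_graph):
        self.json_graph = json_graph   # step 1: codegen emitted graph/source
        self.kernels = {}              # step 3: kernel cache

    def run(self, name, *args):
        if name not in self.kernels:
            # Step 2: first iteration is slow; "compile" the kernel here.
            src = self.json_graph[name]          # generated source text
            env = {}
            exec(src, env)                       # stand-in for nvcc + load
            self.kernels[name] = env["kernel"]
        # Step 4: later iterations reuse the compiled kernel as-is.
        return self.kernels[name](*args)

# usage with a hypothetical generated kernel
graph = {"dense_add": "def kernel(x, b):\n    return [v + b for v in x]"}
mod = JITModule(graph)
assert mod.run("dense_add", [1.0, 2.0], 0.5) == [1.5, 2.5]
assert "dense_add" in mod.kernels  # cached after the first iteration
```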

CUTLASS does seem to support specialized layouts for gemm / conv2d. If we want to make use of them and if the layout transform cannot be expressed by relay.transform, then I think we really need to take in Constants.

This makes sense. The problem I imagine is that the C codegen doesn’t really have a proper runtime, so I guess all the logic for managing constants would be written into one big string and compiled together with the actual offload calls. I don’t want to write code like that :sweat_smile:

So I think we need to consider switching to the JSON runtime if a need for proper handling of Constants ever comes up. The difficulty I see is that, unlike C-codegen based BYOC or TensorRT, we would have to manage compilation and loading of the compiled lib ourselves (calling nvcc directly from the JSON runtime and using dlopen etc. to retrieve a handle to a compiled function).
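A hedged sketch of what that self-managed compilation could look like (the nvcc flags and helper names are assumptions, not existing TVM code); the command construction is split out so it can be checked without a GPU toolchain present:

```python
import ctypes, os, subprocess, tempfile

def nvcc_command(src_path, out_path):
    # Assumed flags for building a shared object the runtime can dlopen.
    return ["nvcc", "-shared", "-Xcompiler", "-fPIC", "-o", out_path, src_path]

def compile_and_load(cuda_source):
    """Call nvcc directly from the runtime, then dlopen the result."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "kernels.cu")
    lib = os.path.join(workdir, "kernels.so")
    with open(src, "w") as f:
        f.write(cuda_source)
    subprocess.check_call(nvcc_command(src, lib))
    return ctypes.CDLL(lib)   # retrieve compiled functions via lib.symbol

# inspect the command without actually invoking nvcc
cmd = nvcc_command("kernels.cu", "kernels.so")
assert cmd[0] == "nvcc" and "-shared" in cmd
```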

Maybe what we need is something like NVRTC for cutlass.

Yeah, I can see the difficulty you mentioned, and it is possible that nvcc is not available at runtime if the model is deployed to an edge device.

A combined approach would be to leverage the third BYOC option: custom codegen/runtime. Specifically, we still generate the C/CUDA kernels and compile them using NVCC at compile time, but instead of using the C source module you’re currently using, we treat the generated/compiled kernels as “graphs”. Meanwhile, we also serialize the constants to a JSON file. Thus, our artifacts include the compiled kernels (in binary) and the constants (in JSON). This is sort of similar to the Xilinx Vitis-AI and Arm Ethos-N backends, which generate a binary/bit-stream in the desired format and use their own runtime for execution.

In addition, we make a runtime engine that loads the compiled kernels and deserializes the constants. In this way, the runtime could still be lightweight and should be easy to implement, because all it needs to do is invoke the corresponding kernel by its symbol and feed it the right data entries. We don’t need a JSON interpreter that traverses the JSON subgraph and generates the engine the way TensorRT does.
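The lightweight engine described above might look roughly like this in plain Python (hypothetical names; real kernels would come from the compiled binary rather than a dict):

```python
import json

class KernelEngine:
    """Invoke precompiled kernels by symbol; no JSON graph interpretation."""
    def __init__(self, kernels, const_json):
        self.kernels = kernels                 # symbol -> compiled kernel
        self.consts = json.loads(const_json)   # deserialized constants

    def invoke(self, symbol, *inputs):
        # Feed the right data entries: inputs first, then this kernel's
        # deserialized constants.
        return self.kernels[symbol](*inputs, *self.consts[symbol])

# usage with a stand-in kernel and a serialized scale constant
kernels = {"tvmgen_dense": lambda x, w: [xi * w for xi in x]}
engine = KernelEngine(kernels, '{"tvmgen_dense": [2.0]}')
assert engine.invoke("tvmgen_dense", [1.0, 3.0]) == [2.0, 6.0]
```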

btw, I’m also curious how @Laurawly deals with the specialized weight layouts in the C codegen.


@masahi

There is another option you could take here.

The wildcard() actually works here because the constant remains in the @main function of the IRModule. In the partition_for_* function, where the full IRModule is visible (along with @main and the external functions), you could actually mutate the constants within the external function and hoist them out of it prior to calling relay.build(…).

In my understanding, all the relay.Constants that remain in @main will be handled by the executor codegen, and they will be passed in when the external function is called.
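A toy sketch of that hoisting idea on a dict-based stand-in for an external function (not the Relay API; in TVM this would be an ExprMutator run inside partition_for_*): each embedded constant becomes a fresh parameter, and the constant value moves out to the call site in @main.

```python
def hoist_constants(func):
    """Replace embedded constants with new params; return hoisted values."""
    new_params = list(func["params"])
    hoisted = []           # constants to pass at the call site in @main
    new_body = []
    for expr in func["body"]:
        args = []
        for arg in expr["args"]:
            if isinstance(arg, tuple) and arg[0] == "const":
                name = f"hoisted_{len(hoisted)}"   # fresh parameter name
                new_params.append(name)
                hoisted.append(arg[1])
                args.append(name)
            else:
                args.append(arg)
        new_body.append({"op": expr["op"], "args": args})
    return {"params": new_params, "body": new_body}, hoisted

# a toy external function with one embedded bias constant
func = {"params": ["x", "w"],
        "body": [{"op": "nn.dense", "args": ["x", "w"]},
                 {"op": "add", "args": ["out0", ("const", [0.5])]}]}
new_func, call_args = hoist_constants(func)
assert new_func["params"] == ["x", "w", "hoisted_0"]
assert call_args == [[0.5]]   # now passed at the call site in @main
```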

Interesting, maybe I can do something before partitioning is applied.

Hi @masahi, we are using constant extraction in the CMSIS-NN flow, as @manupa-arm has suggested above. Please take a look here:
