In that case, consider the BYOC (Bring Your Own Codegen) flow. A useful reference is a recent effort that integrates NVIDIA CUTLASS through the BYOC flow and the C codegen: