Default schedule for custom target

We have a custom target and a few custom strategies/schedules, but we want to fall back to the generic strategy for most operators (softmax, for example). For such an operator, since there is no custom strategy, TVM reverts to “default_strategy”, which contains this check:

    if target.kind.name not in ("llvm", "c"):
        raise RuntimeError("schedule not registered for '%s'" % target)

Sure enough, this error gets thrown since my target is not “llvm” or “c”. I tried relay.build with the target set to a few other existing targets (hexagon, stackvm) and hit the same error.
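For reference, here’s roughly the repro. “mydsp” is a stand-in for our registered custom target kind, and softmax is just an operator with no custom strategy:

    import tvm
    from tvm import relay

    x = relay.var("x", shape=(1, 16), dtype="float32")
    mod = tvm.IRModule.from_expr(relay.Function([x], relay.nn.softmax(x)))

    # Assuming the "mydsp" target kind has been registered, this falls back
    # to the generic strategy, which raises:
    #   RuntimeError: schedule not registered for 'mydsp'
    lib = relay.build(mod, target="mydsp")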

Am I doing something wrong, or is it just a requirement that any target that’s not c or llvm must register strategies for every operator instead of using the defaults?
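For concreteness, the per-operator registration we would apparently need looks something like this sketch, which just reuses the generic TOPI compute and schedule (again, “mydsp” is a stand-in, and I’m assuming it shows up in the target’s keys):

    from tvm import topi
    from tvm.relay.op import op as _op
    from tvm.relay.op.strategy.generic import (
        softmax_strategy,
        wrap_compute_softmax,
        wrap_topi_schedule,
    )

    # Register a softmax strategy for the hypothetical "mydsp" target key,
    # reusing the generic TOPI compute and schedule.
    @softmax_strategy.register("mydsp")
    def softmax_strategy_mydsp(attrs, inputs, out_type, target):
        strategy = _op.OpStrategy()
        strategy.add_implementation(
            wrap_compute_softmax(topi.nn.softmax),
            wrap_topi_schedule(topi.generic.schedule_softmax),
            name="softmax.mydsp",
        )
        return strategy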

Also, it seems to me that the idea of a ‘target’ and the code generation method for a target are somewhat conflated. Our target generates C that’s compiled by our offline C compiler, but we have customized the C backend, so we can’t just make the target “c -mcpu=whatever”; to have a custom backend you need a distinct target kind.

-Alan

hi @adavis,

It seems like you might need to define a new DeviceAPI implementation for your custom PE, or take the BYOC or tensorize route to model the custom PE in TVM.

> Also, it seems to me that the idea of a ‘target’ and the code generation method for a target are somewhat conflated.

This is unfortunately a bit true.

> Our target generates C that’s compiled by our offline C compiler, but we have customized the C backend, so we can’t just make the target “c -mcpu=whatever”; to have a custom backend you need a distinct target kind.

Could you say more about why/how you’ve customized the C backend? It seems like you don’t want to do that if you still need to emit code for the generic operator strategies.

Andrew

Thanks Andrew.

Sorry, what is PE?

I’ll briefly describe our flow; this is all experimental and exploratory, of course. Our target is a fully programmable DSP. Of the current upstream TVM targets, it’s closest to Hexagon. We’re using TVM to generate C for our proprietary compiler. The reason we went to TVM in the first place is all the power and flexibility of tensor expressions and TIR lowering passes, so we’d like to leverage that for code generation rather than going the BYOC route. We will have custom schedules for some operators, but it would be nice to build that up incrementally and use the defaults otherwise.

As for the backend, our target has some unconventional ways of doing DMA, and a specialized prefetch mechanism where the access pattern is pre-programmed outside the loop, eliminating any address calculation from the inner loop. Both of these require some specialized setup and are exposed through a C++ API that uses C++ templates to move some of the setup computation to compile time. That’s the main reason we need a custom backend: we want to generate C++, and TIR’s type system has no representation for anything beyond basic C types. So the lowering passes insert intrinsics, in the form of call_extern, which the backend translates to C++ declarations and expressions. Conceivably we could use C macros or something, but there may be other things we’d like to do in the backend; our compiler is somewhat picky about the code shapes it needs to do a good job.
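To illustrate the pattern (dsp_dma_setup is a made-up name; the real setup calls carry more arguments), a pass of this shape prepends an extern call that the backend then prints as a C++ declaration/expression:

    import tvm
    from tvm import tir

    @tvm.tir.transform.prim_func_pass(opt_level=0)
    def insert_dma_setup(func, mod, ctx):
        # Prepend a (hypothetical) extern setup call; the customized C
        # backend prints it as a templated C++ API call.
        setup = tir.Evaluate(tir.call_extern("int32", "dsp_dma_setup"))
        return func.with_body(tir.SeqStmt([setup, func.body]))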

-Alan

@adavis

sorry -- PE: physical execution unit (just a generic name for a CPU, accelerator, etc.)

Thanks for the clarifying explanation. I think you should follow this discussion on splitting BYOC’s lower and generate steps apart. This is something we’re working on but don’t have yet.

I almost suggested you implement a CUDA-like codegen which inherits from CodeGenC but can generate the C++ primitives you want. I don’t think this would ultimately work out that well, because you’d need to model the C++ target as a separate device, and I don’t think scheduling would fall back on the host device properly. You might be able to get that to work if you hack at it enough, but it likely wouldn’t be upstreamable in that form.

Outside of the TVM C++ runtime, we are missing a specification for interacting with DMA, so as you mentioned it needs to happen via call_extern for now. Some initial discussion of heterogeneous compute with the AOT/C runtime is happening here. I expect we will resolve this in the near future, but it needs an RFC and community feedback to be added properly. I’ll definitely be sure to loop you in as that work gets traction. If you can share, it would be helpful to understand the interface your DMA engine presents, and whether you’ve been successful using the built-in TVM prefetch.
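For clarity, by built-in prefetch I mean the te schedule marker, roughly like this sketch (my understanding is that Stage.prefetch records a hint that lowering turns into a TIR prefetch the backend can act on):

    import tvm
    from tvm import te

    n = te.var("n")
    A = te.placeholder((n,), name="A", dtype="float32")
    B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
    s = te.create_schedule(B.op)

    # Ask for A to be prefetched 2 iterations ahead along B's loop axis.
    s[B].prefetch(A, s[B].op.axis[0], 2)
    print(tvm.lower(s, [A, B], simple_mode=True))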

-Andrew