[RFC][OpenCLML] OpenCL ML integration into TVM as BYOC

Summary:

OpenCL ML is an extension (cl_qcom_ml_ops) over the OpenCL spec, developed by Qualcomm to accelerate machine learning at the operation level. The SDK is publicly available as "OpenCL Machine Learning Acceleration on Adreno GPU" on the Qualcomm Developer Network. OpenCL ML leverages deep knowledge of the Adreno GPU for significant performance benefits. It offers a C-based DNN API compatible with most standard frameworks, uses standard OpenCL features like command queues, buffers, and events, and supports FP16 and FP32 data types. CLML API calls can be interleaved with other OpenCL kernels (i.e., TVM generated kernels) and dispatched to the same command queue. The extension is compatible with existing OpenCL extensions for importing memory, controlling performance, and data access.

Motivation:

The current OpenCL backend of TVM is generic and not well optimized for the performance capabilities of Adreno GPUs. Adreno GPUs have quite a few proprietary and standard OpenCL optimization paths, and the OpenCL ML extension exposes accelerated ML operations through an SDK interface.

With TVM providing the complete framework of frontends and graph-level optimizations, and OpenCL ML providing kernels that perform best on Adreno GPUs, this work aims to integrate the OpenCLML SDK into TVM as a BYOC backend. This brings the best of both worlds: TVM handles the high-level optimizations, subgraphs supported by OpenCL ML are scheduled onto it, and operators it does not support fall back to TVM's default OpenCL path. Conveniently, the two paths do not need separate OpenCL workspaces or command queues; they can share the same command queue. Data (DLTensor) transfer across subgraphs is also seamless with the OpenCL ML APIs.

Guide-level Explanation:

This RFC aims to introduce an OpenCLML runtime as a BYOC option in TVM. In terms of usage, it is very similar to the other BYOC integrations we have in TVM.

Along with all the other options we use for the OpenCL target, we introduce the build options below in config.cmake:

USE_CLML # (ON/OFF) Enables CLML codegen for compilation

USE_CLML_GRAPH_EXECUTOR # (ON/OFF) Enables the CLML runtime

Note that the OpenCLML SDK provides a replacement for the default libOpenCL.so. Hence, we don't need a separate option pointing to the OpenCLML SDK; we just point USE_OPENCL at the OpenCLML SDK path.

The RFC introduces a frontend helper API, "tvm.relay.op.contrib.clml", which partitions the graph and annotates the subgraphs for the OpenCL CLML target.

Given mod and params representing a TVM module and its parameters, the API below partitions the graph based on OpenCLML operator support.

mod = clml.partition_for_clml(mod, params)

After the above partitioning we just follow the standard relay.build process, as in the sketch below.
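A minimal end-to-end sketch (assuming mod and params are already loaded from a frontend such as relay.frontend.from_onnx; the target string and output filename are illustrative, and an Android deployment would pass the NDK compiler to export_library):

import tvm
from tvm import relay
from tvm.relay.op.contrib import clml

# Partition: subgraphs supported by CLML are annotated for the CLML codegen.
mod = clml.partition_for_clml(mod, params)

# Standard relay.build; OpenCL stays the device target while the CLML
# subgraphs are compiled by the codegen registered as relay.ext.clml.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="opencl", params=params)
lib.export_library("model_clml.so")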

Regarding the runtime: compiling the OpenCL ML runtime is the same as compiling the OpenCL runtime for an Android target, except that USE_OPENCL points to the OpenCL ML SDK.

Reference-level Explanation:

Like any other BYOC implementation, this RFC introduces a frontend helper API for partitioning, a codegen for CLML, and a CLML runtime.

Frontend:

The frontend implements the tvm.relay.op.contrib.clml user APIs partition_for_clml and is_clml_runtime_enabled for partitioning the Relay graph onto the OpenCLML path. It also contains the CLML-specific pattern table definition and other transform helpers required for the CLML target.
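A hedged usage sketch gating partitioning on runtime support (the guard itself is illustrative):

from tvm.relay.op.contrib import clml

# Only annotate subgraphs for CLML when the runtime is actually enabled;
# otherwise the whole graph takes TVM's default OpenCL path.
if clml.is_clml_runtime_enabled():
    mod = clml.partition_for_clml(mod, params)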

Codegen:

The CLML codegen is built over JSONSerializer, which provides all the infrastructure here so one can focus purely on target-specific parsing and JSON node generation. The codegen exports relay.ext.clml and relay.op.is_clml_runtime_enabled into the TVM global function registry.
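Since these are registered as global functions, their presence can be checked from Python; a small sketch (allow_missing avoids raising on builds without CLML):

import tvm

# Returns None instead of raising when the build lacks the CLML codegen.
codegen = tvm.get_global_func("relay.ext.clml", allow_missing=True)
print("CLML codegen available:", codegen is not None)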

Runtime:

The OpenCLML runtime is again extended from JSONRuntimeBase and implements OpenCL ML initialization and the CLML API invocations corresponding to the CLML-annotated layers.

OpenCLML runtime support is verified by looking for cl_qcom_ml_ops in the OpenCL extension list.

OpenCLML does not define a new OpenCL context; instead it reuses the context defined by the OpenCL runtime through the global API device_api.opencl.

OpenCLML has its own tensor object, cl_ml_tensor_memory_desc_qcom. The runtime defines copy APIs between OpenCL and CLML tensors within the same OpenCL workspace, without bringing the data back to the host.

OpenCLML supports tuning too, which produces a tuning cache file that is reused on later runs. This implementation looks for the environment variable CLML_IS_TUNING_RUN (set to 0/1) to decide whether to run in tuning mode, and CLML_TUNING_CACHE to set the tuning cache file location.
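A hedged sketch of the tuning flow (the cache path is illustrative; on an Android target the variables must be exported in the environment of the process that runs the model, e.g. via adb shell):

import os

# First run: enable tuning mode so the runtime generates a tuning cache.
os.environ["CLML_IS_TUNING_RUN"] = "1"
os.environ["CLML_TUNING_CACHE"] = "/data/local/tmp/clml.cache"
# ... execute the module once here to populate the cache ...

# Later runs: disable tuning and reuse the cache file at the same path.
os.environ["CLML_IS_TUNING_RUN"] = "0"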

Drawbacks:

OpenCLML is supported only on Snapdragon devices that expose the cl_qcom_ml_ops extension. Seamless copy between OpenCL and CLML is currently supported for cl buffers only; using image objects in TVM may pose challenges for direct copy within the OpenCL context.

Rationale and alternatives

OpenCL ML uses Adreno-specific proprietary and public optimization paths and outperforms TVM-generated OpenCL kernels by a significant margin.

Prior Art:

There is ongoing development for texture memory support on Adreno devices: [RFC] Texture memory support.

Unresolved questions

How do we deal with subgraphs containing tiny layers? This is the case where not offloading the tiny layer performs better than running it on the accelerator.

Future Possibilities:

Integrating OpenCLML into TVM gives an end-to-end compiler stack for Snapdragon platforms with the Adreno GPU target. Operator support will evolve along with the OpenCL ML SDK releases from Qualcomm.


I tried to use the cl_qcom_ml_ops extension on a Xiaomi 11. However, calling clQueryMLInterfaceVersionsQCOM just returns CL_OUT_OF_HOST_MEMORY. How can I fix it?

CLML compatible devices start from Snapdragon 8 Gen 1 onwards.

Oh, so Snapdragon 888 is not supported? That’s the one I have…

I heard that on a rooted Snapdragon 888 or Snapdragon 8 Gen 1 with setenforce set to 0, the call to clQueryMLInterfaceVersionsQCOM may return CL_SUCCESS. However, I don’t have a rooted phone or a Snapdragon 8 Gen 1.

Hi @srkreddy1238 ,

When we try to build TVM with CLML support, the following error is thrown. Is this an issue with any recent changes, or are we missing anything?

tvm/include/tvm/runtime/data_type.h:191:24: error: expected unqualified-id before ‘int’
  191 |   static DataType Bool(int lanes = 1) { return DataType::UInt(1, lanes); }
      |                        ^~~
tvm/include/tvm/runtime/data_type.h:191:24: error: expected ‘)’ before ‘int’
  191 |   static DataType Bool(int lanes = 1) { return DataType::UInt(1, lanes); }

CLML version: 3.0, CMake version: 3.26

Thanks & Regards,

Kuladeep.

Hi Kuladeep,

Can you share a detailed log?

Is it caused by the CLML source files (clml_runtime.cc, clml/codegen.cc, etc.)? Also, are you building through the ci_adreno docker?

https://tvm.apache.org/docs/how_to/deploy/adreno.html?highlight=adreno#development-environment

Check this for a successful compilation. You need the latest OpenCL SDK (CLML SDK) from the Qualcomm Developer Network as a dependency here.

Good afternoon @srkreddy1238,

Thanks for sharing this. We will try this.

Best regards, Kuladeep.


Hello @srkreddy1238, thanks for sharing the steps. Now we are able to run the resnet50 model with OpenCL ML BYOC. Can you please suggest how we can profile the performance metrics? We followed the steps given in tvm/apps/cpp_clml at main · apache/tvm · GitHub, but there is no timing info there.

The cpp_clml tool is just a debug tool; it helps to debug issues related to OpenCLML.

I am yet to document the performance and profiling aspects of CLML, but I can outline a few of them here.

We may get a single CLML subgraph for the entire network, or sometimes the network may get split across CLML and native OpenCL. This can be inspected by looking into graph.json, e.g. with the sketch below.
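A hedged sketch for inspecting the partitioning (assumes lib is a built graph-executor factory module):

import json

graph = json.loads(lib.get_graph_json())
for node in graph["nodes"]:
    if node["op"] == "tvm_op":
        # CLML subgraphs typically show up with "clml" in the function name.
        print(node["attrs"]["func_name"])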

Regarding profiling, we can profile the individual layers of the graph (both native OpenCL and CLML). Native OpenCL layer profiling is possible by running through the tvmc tool; alternatively, you can use the debug_executor (python/tvm/contrib/debugger/debug_executor.py) from the Python interface. OpenCLML offloaded layers can be profiled by exporting CLML_PROFILING on the device.
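A minimal debug_executor sketch (assuming lib is the built module, dev is the OpenCL device, and inputs holds the input tensors; names are illustrative):

from tvm.contrib.debugger import debug_executor

# The debug executor runs the graph node by node and records per-layer timing.
m = debug_executor.create(lib.get_graph_json(), lib, dev)
m.set_input(**inputs)
m.run()  # per-node timings are written to the dump root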

Now, for the overall best performance, we need to make sure that both the CLML layers and the native OpenCL layers are tuned to their best.

If we see any tunable (AutoTVM) layers in the graph JSON that are planned on the native OpenCL path, we need to tune them using AutoTVM (we can reuse the tuning cache of the same network built without CLML offloading).

CLML also supports tuning. The CLML tuning cache is controlled by the environment variables CLML_IS_TUNING_RUN (0 or 1) and CLML_TUNING_CACHE (cache file path on disk). We can run the sample first with tuning enabled, and later just provide the tuned cache via CLML_TUNING_CACHE. A better solution is on its way to automate and handle the CLML tuning cache implicitly during build.

Given tuned caches for both the native OpenCL and CLML paths, altering the network (if possible) to avoid unnecessary context switches, or falling back to native OpenCL for very simple layers, would result in the best-performing network with CLML.

Btw, the current BYOC solution always gives priority to CLML while planning any layer's execution. Heterogeneous execution using the Collage framework helps produce the best-performing split across multiple paths. Part of this work is already mainlined: https://github.com/apache/tvm/pull/13450. This uses buffers for the native path instead of textures; texture support here is planned to be mainlined soon.