TVM Monthly - May 2024

As discussed by the TVM PMC, our goal is to provide a monthly summary of the project so that users and developers can better follow what is happening in the TVM community.

Feedback and suggestions are welcome so that we can further improve these updates.

Community

  • #17018 - New committer: Balint Cristian

RFCs

This month's RFC proposes adding the Khronos Neural Network Exchange Format (NNEF) as a frontend to TVM Relay and Relax.

  • #108 - [RFC] Add NNEF frontend

We continue to improve Relax, TIR, the frontends, and the runtimes.

Disco

  • #17035 - [QoL] Implement broadcast/scatter methods for Session
  • #16997 - [Runtime] Restore checks for hangup of disco pipe
  • #16992 - [Bugfix] Handle NDArray larger than OS buffer for pipe
  • #16978 - Implement num_workers property for disco.Session
  • #16989 - Treat hangup of disco worker process as kShutdown
  • #16993 - Allow allocation that only exists on worker0
  • #16979 - Expose disco.Session.shutdown through the python API

Dlight

  • #17026 - Perf improvement for low_batch_gemv on Metal
  • #17016 - Update Adreno GEMV Rules
  • #16972 - [GPU] Enhance opencl thread limit for schedules
  • #16973 - [GPU] Improved gemv outer fallback schedule
  • #16958 - Check for target in function attributes

Frontend

  • #16961 - [Bugfix][ONNX] Improve broadcast and batch_matmul conversion

LLVM

  • #16966 - [SVE] Add support for representing and creating buffer-level predicates
  • #17001 - [SVE] Use only powers of two as possible vscale values
  • #16962 - [SVE] Add codegen support for vscale_range() function attribute
  • #16968 - Stringref API deprecation fixes
  • #16965 - [SVE] Add get_active_lane_mask builtin

Metal

  • #17025 - Support metal device profiling

ROCm

  • #17037 - [Runtime] Enable ROCm host memory support

Relax

  • #17033 - [Bugfix] Apply FuseOps to nested DataflowBlock
  • #17032 - [Bugfix] Annotate ComputePrimValue output as host function
  • #17034 - [Bugfix] Bind symbolic variables in R.match_cast
  • #16960 - [UnitTest] Validate IRModule with multiple targets
  • #16995 - [KVCache] Support KVCache decode from forked sequence and pop more tokens
  • #16959 - [Transform] Handle identical PrimFunc with distinct VDevice
  • #16589 - [Unity] Check for transpose and dynamic shape in AdjustMatmulOrder
  • #16988 - [KVCache] Fix the aux data syncing order of paged KV cache
  • #16922 - [Bugfix] Change FuseOpsByPattern strategy to pattern-match maximal subgraph
  • #16982 - [Unity][BYOC] Use arith.Analyzer to check batch equality of matmul in cublas
  • #16955 - Implement relax.op.view
  • #16971 - Support nested ModuleList in nn.Module
  • #16826 - Express dynamic arguments of strided_slice as arguments

Relay

  • #16996 - Fix TupleGetItem to inherit the previous span

Runtime

  • #17045 - Fix PagedKVCache for PopN and enhance tests
  • #16998 - Compatibility with dmlc::Stream API changes
  • #17036 - Use preferred host memory (pinned memory) in KV cache
  • #16994 - Allow query of available device memory through DeviceAPI

TIR

  • #17039 - Fix Bug in VectorizeLoop
  • #17030 - Fix Shuffle rewrite
  • #16947 - Support narrow dtype for let binding

TOPI

  • #17040 - Fix SME conv2d schedule import and intrin argument
  • #17003 - [SME] Add conv2d NHWC SME fp32 schedule
  • #17007 - [Testing] Enable conv2d NHWC fp16 topi testing for arm_cpu
  • #16977 - Remove blockIdx.z in topi sort

TVMScript

  • #16967 - Fix error reporting inside Macro func

cuda & cutlass & tensorrt

  • #16980 - [Cuda] Skip FreeDataSpace when CUDA driver is in inconsistent state

web

  • #17031 - Fix string to uint8 array for special characters
  • #17028 - Add dtype and offset for CreateView in runtime

Misc

  • #16981 - [SME] Add scalable fp16->fp32 dense schedule
  • #17029 - [Contrib] Implement NDArray cache update
  • #17027 - [picojson] Let objects be ordered when serializing
  • #17021 - [WebGPU] Update error messages to be more user-friendly
  • #17010 - Support multinomial_from_uniform dispatch
  • #16999 - [USMP] add missing const specifier for global_const_workspace
  • #17005 - [WebGPU] Handle device OOM in createBuffer
  • #16921 - [SME] Introduce scalable fp32 dense schedule
  • #16957 - chore: remove repetitive words
  • #16909 - [QoL][IR] Provide std::hash and std::equal_to for IR Variable types
  • #16987 - [JVM] Automatic Compatibility of JVM AttachCurrentThread
  • #16974 - [CUBLAS][FP8] Enable R.matmul + R.multiply offloading
  • #16896 - [CUBLAS] Enable offloading of R.matmul + R.dequantize