TVM Monthly - March 2024

As discussed by the TVM PMC, our goal is to provide a monthly summary of the project so that users and developers can get a better sense of what is happening in the TVM community.

Feedback and suggestions are welcome so that we can further improve these updates.

RFCs

This new RFC explores how TVM can be used to generate code for the Scalable Matrix Extension (SME) ISA, improving inference performance on supported Arm®-based hardware that implements the SME extension.

  • #107 Scalable Matrix Extension enablement

We continue to improve Relax, TIR, the frontends, and the runtimes.

AOT

  • #16749 - [SME] Add Fixed Virtual Platform (FVP) functional testing infrastructure

Arith

  • #16735 - [Fixup] Require feature flag for tighter inequality bounds
  • #16588 - Provide tighter ConstIntBounds for special cases

BugFix

  • #16820 - [Fix] PAPI docs
  • #16793 - [Fix] fix for numpy 2.0 compatibility
  • #16789 - [Cutlass] Remove a typo in cutlass build
  • #16790 - [Fix] Fix build errors with VS2022
  • #16780 - [Fix] Fix numpy dtype map
  • #16773 - [Fix] Fix the purity flag of “vm.call_tir_dyn” and “kill” ops
  • #16775 - [Fix][Dlight] (Low-batched-)GeMV on small spatial loops
  • #16770 - [Hotfix] Revert driver API pass ordering that breaks MLC, mark failing test
  • #16771 - [Fix] Remove redundant “remove_all_unused” in IPC memory lowering
  • #16742 - [TIR] Fix cache_read update buffer region
  • #16752 - [Fix] Lazy import of “psutil” in disco process pool
  • #16746 - [Fix][Builtin] Fix “GetQueryPosition” of PagedKVCache
  • #16726 - [TIR] Avoid overwrite of unmanaged buffer allocations
  • #16728 - [Fix] Introduce TVM_DEBUG_WITH_ABI_CHANGE to warn ABI changes in debug mode
  • #16714 - [Fix] PagedKVCache fetching compute stream when copy stream is needed
  • #16704 - [Fix][Arith] Fix canonical simplification of LE
  • #16703 - [Fix][Relax] Fix top-p/top-k sampling kernel
  • #16684 - [SLM] Produce well-formed Relax for nn.modules.KVCache
  • #16682 - [TIR] Handle AttrStmt of upcoming tir.Var in ConvertSSA
  • #16660 - [TIR] Fix duplicate AllocateConst in CacheReadWrite schedule primitive
  • #16659 - Add the default value for DFT in ONNX frontend

CI

  • #16765 - [AOT][Testing] Improve output mismatch information on test failure
  • #16661 - Add merge_with_main in unity

Community

  • #16695 - Add new key for release signing

Docker

  • #16755 - [SME] Add Fixed Virtual Platform (FVP) and toolchain install

Docs

  • #16792 - [Doc] Fix set_axis_separator example

Frontend

  • #16711 - [Relax] Add op tanh, exp, negative, and permute
  • #16669 - [Relax][Onnx] Add sum and globalavgpool 1d/3d op
  • #16681 - [Relax][Onnx] support MaxPool1/2/3D and AveragePool1/2/3D
  • #16651 - [PaddlePaddle] PaddlePaddle model with NCHW data format that supports quantization
  • #16654 - [Relax][NN] Add support for Conv3D

Hexagon

  • #16762 - [VM] Cache operations when bypass mode is enabled
  • #16706 - [VM] Add buffers to dma_wait builtin

LLVM

  • #16812 - Fix compilation failure due to minor change
  • #16808 - [Runtime] Fix errors during loading of target tags
  • #16748 - Lack of DWARF type is not an error
  • #16696 - [SVE] Add codegen support for scalable buffer accesses
  • #15964 - [RUNTIME] Add optional LLVM ORCJIT runtime executor

MetaSchedule

  • #16725 - Make the opt_level of tune_relay() adjustable

Metal

  • #16713 - [RUNTIME] Provide richer runtime when error happens

OpenCL & CLML

  • #16672 - [CLML] Fix build TVM with CLML on MacOS

Relax

  • #16815 - Enable capturing symbolic shapes in cuda graph
  • #16642 - Allow R.Prim('bool') in relax::If and assert_op
  • #16796 - Unit-test for structural equal of recursive function
  • #16732 - Allow composition of DFPattern replacements
  • #16783 - Improve CanonicalizeBindings in DataflowVar edge case
  • #16721 - Implement operators to inspect DLTensor::strides and offset
  • #16730 - Refactor PatternRewriter into separate Block/Expr mutators
  • #16756 - [IR] Improve highlighting in assert_structural_equal
  • #16779 - Improve malformed error message
  • #16569 - [Unity][Parser] Check well-formedness in the parser
  • #16759 - [Pass] Lowering passes for GPU IPC memory and allreduce
  • #16697 - Implement relax.transform.TopologicalSort
  • #16658 - Normalize use of void-type variable to inline R.tuple()
  • #16691 - CUDA graph rewrite treating StringImm as static
  • #16685 - Implement StructInfoPattern for dataflow pattern matching
  • #16584 - [Unity][TIR] Clear struct info when specializing PrimFunc
  • #16676 - Remove the legalization of cumsum/cumprod
  • #16674 - Eager free original weights in transform_params
  • #16675 - Add sample_indices in sampling
  • #16648 - [Runtime] Support Unpack API for NDArrayCache

Runtime

  • #16804 - Introduce MSCCLPP with NCCL equivalent interface
  • #16809 - Add “TVM_DLL” to NVTX header
  • #16768 - [OPENCL] Bugfix for ciImage create with host ptr
  • #16750 - CUDA IPC Memory support and custom allreduce kernels
  • #16738 - [Refactor] Always specify device in allocator interface
  • #16716 - Ensure NDArray.CopyTo(Device) always syncs
  • #16705 - Add TVM_DLL to memory manager functions
  • #16692 - PagedKVCache execute data copy on a separate stream
  • #16647 - [RPC] Fix FreeObject in minrpc server
  • #16667 - [Builtin] Using float32 accumulation in attention kernel

TIR

  • #16767 - [Driver] Use BindTarget to specify target for FP8 legalization
  • #16723 - Implement max/min_value for fp8 data types
  • #16655 - Improve well-formed check’s handling of match buffer
  • #16673 - Support Vector Reinterpret Calls
  • #16560 - Enhance and fix tensorize schedule for some cases

TOPI

  • #16652 - Improve inclusive_scan for thrust

TVMScript

  • #16641 - Allow use of relax.Expr with void type as a statement
  • #16663 - Infer T.reads() for DeclBuffer nodes

cuda & cutlass & tensorrt

  • #16818 - [Cutlass] Fix usage of cuda stream for group gemm
  • #16788 - [Cutlass] Add check for group gemm param shapes
  • #16787 - [Codegen, Cuda] Add overload for fp8x4 e5m2 <-> half4 conversion
  • #16751 - [Cutlass] Add group gemm kernels
  • #16736 - [Target][CUDA] Allow non-numeric arch as needed for latest gpu
  • #16548 - [TIR][CUDA] Add native FP8 support to codegen

microNPU

  • #16266 - [microNPU][ETHOSU] Add fixed point for tanh
  • #16680 - [microNPU][ETHOSU] Fix LUT size for int16 activations

web

  • #16791 - Add kv_state and rnn_state to wasm_runtime
  • #16722 - Implement linear congruential generator, make runtime seedable
  • #16650 - Separate parallel shard download and iterative shard loading
  • #16694 - Initial support for asyncify

Misc

  • #16800 - [Bug][TIR] Fix error merging shared memory for ptx_cp_async
  • #16822 - [VM] Recycle VMFrame
  • #16813 - [KVCache] Support forking sequence at specific position
  • #16786 - [Codegen] Add check to disable invalid reinterpret
  • #16816 - [Cmake] Allow using custom CCCL path for thrust
  • #16784 - [SLM] Add unit tests for SLM to Relax exporter
  • #16814 - Fix includes of custom allreduce kernel
  • #16806 - [Debug] Improve error message in VMShapeLower
  • #16802 - [Debug] Improve error messages in LiftTransformParams
  • #16425 - [Target] Use LLVM target parser for determining Arm(R) A-Profile Architecture features
  • #16715 - [Disco] Propagate structlog/logging config to workers
  • #16797 - [3rdparty] AUTO mode for custom all-reduce strategy
  • #16761 - [SME] Add support for inserting processor state annotations
  • #16778 - [Analysis] Allow calls to GlobalVar in @R.function
  • #16745 - [IR] Default to empty attributes, instead of NULL
  • #16777 - Revert “[SLM] Allow modules to define pre-processing of weights”
  • #16776 - [Contrib] Remove thrust “built but not used” warning
  • #16757 - [SLM] Allow modules to define pre-processing of weights
  • #16763 - [CONTRIB] Add nm symbol dump
  • #16717 - Enable Shared Function in LiftTransformParam Pass
  • #16731 - [Dlight] Fix GeMV shared memory estimation
  • #16729 - [Builtin] Sliding window and sink support for PagedKVCache
  • #16724 - Fix cpp_rtvm cmake build on Windows
  • #16513 - [Target] Automatically detect system triple when not specified by the user
  • #16710 - [CMake] Add “USE_FLASHINFER” to libinfo
  • #16702 - [MSC][M5.2] Enable quantize && prune with gym by wrapper
  • #16699 - [Transform] Remove R.Object parameters after LazyTransformParams
  • #16701 - [Dlight] Add fallback for low batch gemv with outer reduction
  • #16618 - [Disco] Propagate structlog configuration to disco workers
  • #16668 - [MSC][M5.1] Build wrapper to support compression
  • #16693 - [Contrib] Support NDArray cache taking generator
  • #16412 - [Lint] Add check to prevent usage of #include
  • #16678 - [Dlight] LowBatchGemv rule only apply to function with spatial symbolic var
  • #16689 - [DeviceAPI] Support “GetCurrentStream”
  • #16690 - Use target name instead of node name as function name
  • #16683 - [skip ci] Fix wasm exception flag
  • #16609 - Minor update docs instructions
  • #16656 - Simplify Windows CMake Command
  • #16666 - [KVCache] Fix the reference counter in sequence fork
  • #16662 - Fixing workload comment