Compiled Vulkan output does not match

angrycrab · October 28, 2022, 6:09am

Hi, I found that the output of a model deployed with the Vulkan backend is wrong. The models I am using are from https://github.com/tianweiy/CenterPoint, in ONNX format.

GPU:

Fri Oct 28 06:05:09 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   43C    P8    18W / 290W |   1431MiB /  8192MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1170      G   /usr/lib/xorg/Xorg                102MiB |
|    0   N/A  N/A      3728      G   /usr/lib/xorg/Xorg                558MiB |
|    0   N/A  N/A      3858      G   /usr/bin/gnome-shell               84MiB |
|    0   N/A  N/A      5043      G   ...AAAAAAAAA= --shared-files       73MiB |
|    0   N/A  N/A      5323      G   ...181820525847910195,131072      494MiB |
|    0   N/A  N/A      7175      G   ...AAAAAAAAA= --shared-files      103MiB |
+-----------------------------------------------------------------------------+

Vulkan:

==========
VULKANINFO
==========

Vulkan Instance Version: 1.2.135


Instance Extensions: count = 18
===============================
	VK_EXT_acquire_xlib_display            : extension revision 1
	VK_EXT_debug_report                    : extension revision 10
	VK_EXT_debug_utils                     : extension revision 2
	VK_EXT_direct_mode_display             : extension revision 1
	VK_EXT_display_surface_counter         : extension revision 1
	VK_KHR_device_group_creation           : extension revision 1
	VK_KHR_display                         : extension revision 23
	VK_KHR_external_fence_capabilities     : extension revision 1
	VK_KHR_external_memory_capabilities    : extension revision 1
	VK_KHR_external_semaphore_capabilities : extension revision 1
	VK_KHR_get_display_properties2         : extension revision 1
	VK_KHR_get_physical_device_properties2 : extension revision 2
	VK_KHR_get_surface_capabilities2       : extension revision 1
	VK_KHR_surface                         : extension revision 25
	VK_KHR_surface_protected_capabilities  : extension revision 1
	VK_KHR_wayland_surface                 : extension revision 6
	VK_KHR_xcb_surface                     : extension revision 6
	VK_KHR_xlib_surface                    : extension revision 6

Layers: count = 6
=================
VK_LAYER_KHRONOS_validation (Khronos Validation Layer) Vulkan version 1.2.135, layer version 1:
	Layer Extensions: count = 3
		VK_EXT_debug_report        : extension revision 9
		VK_EXT_debug_utils         : extension revision 1
		VK_EXT_validation_features : extension revision 2
	Devices: count = 1
		GPU id = 0 (NVIDIA GeForce RTX 3070 Ti)
		Layer-Device Extensions: count = 3
			VK_EXT_debug_marker     : extension revision 4
			VK_EXT_tooling_info     : extension revision 1
			VK_EXT_validation_cache : extension revision 1

VK_LAYER_LUNARG_api_dump (LunarG API dump layer) Vulkan version 1.2.135, layer version 2:
	Layer Extensions: count = 0
	Devices: count = 1
		GPU id = 0 (NVIDIA GeForce RTX 3070 Ti)
		Layer-Device Extensions: count = 1
			VK_EXT_tooling_info : extension revision 1

VK_LAYER_LUNARG_device_simulation (LunarG device simulation layer) Vulkan version 1.2.135, layer version 1:
	Layer Extensions: count = 0
	Devices: count = 1
		GPU id = 0 (NVIDIA GeForce RTX 3070 Ti)
		Layer-Device Extensions: count = 1
			VK_EXT_tooling_info : extension revision 1

VK_LAYER_LUNARG_monitor (Execution Monitoring Layer) Vulkan version 1.2.135, layer version 1:
	Layer Extensions: count = 0
	Devices: count = 1
		GPU id = 0 (NVIDIA GeForce RTX 3070 Ti)
		Layer-Device Extensions: count = 1
			VK_EXT_tooling_info : extension revision 1

VK_LAYER_LUNARG_screenshot (LunarG image capture layer) Vulkan version 1.2.135, layer version 1:
	Layer Extensions: count = 0
	Devices: count = 1
		GPU id = 0 (NVIDIA GeForce RTX 3070 Ti)
		Layer-Device Extensions: count = 1
			VK_EXT_tooling_info : extension revision 1

VK_LAYER_LUNARG_vktrace (Vktrace tracing library) Vulkan version 1.2.135, layer version 1:
	Layer Extensions: count = 0
	Devices: count = 1
		GPU id = 0 (NVIDIA GeForce RTX 3070 Ti)
		Layer-Device Extensions: count = 0

Presentable Surfaces:
=====================

Device Groups:
==============
Group 0:
	Properties:
		physicalDevices: count = 1
			NVIDIA GeForce RTX 3070 Ti (ID: 0)
		subsetAllocation = 0

	Present Capabilities:
		NVIDIA GeForce RTX 3070 Ti (ID: 0):
			Can present images from the following devices: count = 1
				NVIDIA GeForce RTX 3070 Ti (ID: 0)
		Present modes: count = 1
			DEVICE_GROUP_PRESENT_MODE_LOCAL_BIT_KHR


Device Properties and Extensions:
=================================
GPU0:
VkPhysicalDeviceProperties:
---------------------------
	apiVersion     = 4206786 (1.3.194)
	driverVersion  = 2140487808 (0x7f954080)
	vendorID       = 0x10de
	deviceID       = 0x2482
	deviceType     = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName     = NVIDIA GeForce RTX 3070 Ti

The script I was using:

import onnx
import numpy as np
import onnxruntime as ort
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor

model_encoder = "/path_to_model/pts_voxel_encoder_centerpoint.onnx"
model_head = "/path_to_model/pts_backbone_neck_head_centerpoint.onnx"
onnx_encoder = onnx.load(model_encoder)
onnx_head = onnx.load(model_head)

x = np.ones((40000,32,9), dtype=np.float32)
# x = np.zeros((1,32,560,560), dtype=np.float32)

ort_sess = ort.InferenceSession(onnx_encoder.SerializeToString())
out_onnx = ort_sess.run(None, {'input_features': x})


target = "vulkan"

input_name = "input_features"
shape_dict = {input_name: x.shape}

mod, params = relay.frontend.from_onnx(onnx_encoder, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

dtype = "float32"
module.set_input(input_name, x)
module.run()
output_shape = (40000, 1, 32)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

print(tvm_output.shape)
# for i in range(len(out_onnx)):
# idx = "output_" + str(0)
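# Element-wise difference between the ONNX Runtime and TVM outputs
result = out_onnx[0] - tvm_output
# Print, then count, the elements where ONNX Runtime exceeds TVM by more than 1e-4
print(result[np.where(result > 0.0001)])
print(len(result[np.where(result > 0.0001)]))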
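One caveat about the final check in the script: `result > 0.0001` only catches elements where the ONNX Runtime value is larger than the TVM value, so negative differences go uncounted. A minimal sketch of a symmetric check on plain NumPy arrays follows; `count_mismatches` is a hypothetical helper name, not part of TVM or ONNX Runtime, and the tolerance simply mirrors the 1e-4 used above.

```python
import numpy as np

def count_mismatches(reference, candidate, atol=1e-4):
    """Return (count, max_abs_err) for elements whose absolute
    difference exceeds atol. Hypothetical helper for illustration."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    abs_err = np.abs(reference - candidate)  # symmetric: sign is ignored
    mask = abs_err > atol
    return int(mask.sum()), float(abs_err.max())

# Toy data standing in for out_onnx[0] and tvm_output.
ref = np.array([1.0, 2.0, 3.0])
out = np.array([1.0, 2.5, 2.0])

n_bad, max_err = count_mismatches(ref, out)
one_sided = int(((ref - out) > 1e-4).sum())  # what the one-sided check counts
print(n_bad, max_err, one_sided)
```

In the script, the same call would be `count_mismatches(out_onnx[0], tvm_output)`.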

Please let me know if anyone knows how to make it work, or is it a bug? Thank you.

masahi · October 28, 2022, 7:26am

Where did you download the onnx model? If you did the export yourself, can you upload the model somewhere?

I’m aware that there are correctness issues when using our Vulkan backend with an NVIDIA driver. On AMD, the output is always correct.

angrycrab · October 28, 2022, 7:44am

@masahi Thanks for the reply. I just uploaded the model files.

masahi · October 28, 2022, 8:11am

okay I’ve got the output

[]
0

on AMD RX6600 XT. Tested on two different drivers (RADV and AMDVLK).

To check that our VK backend is at least functional on your NV card, you can try running https://github.com/apache/tvm/blob/main/apps/topi_recipe/gemm/cuda_gemm_square.py. This is a plain GEMM test, so if it doesn’t pass, then something is really off.

angrycrab · October 28, 2022, 8:34am

Thanks for the update. The gemm test passed.

Device cuda
average time cost of 10 runs = 1.35956 ms, 12636.3 GFLOPS.
Skip because opencl is not enabled
Skip because rocm is not enabled
Device nvptx
average time cost of 10 runs = 1.49608 ms, 11483.3 GFLOPS.
Device vulkan
[08:28:05] /home/tvm/src/runtime/profiling.cc:102: Warning: No timer implementation for vulkan, using default timer instead. It may be inaccurate or have extra overhead.
average time cost of 10 runs = 1.63795 ms, 10488.7 GFLOPS.

But I got

[0.40189907 0.17182952 0.93319166 ... 0.1776176  1.1928043  1.3329372 ]
428032

from my testing script. Do you have any idea how I should investigate further?

masahi · October 28, 2022, 9:21am

This is difficult. We don’t know if this is a TVM problem or a driver problem. If it is the latter, we cannot do anything.

To debug, I’d dump each intermediate output using both the vulkan and x86 backends, and look for where the results diverge. I heard you can do such a dump using debug_executor, but I haven’t tried it.
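Once the intermediate outputs have been dumped from both backends, locating the divergence is a linear scan. The sketch below is only an illustration: it assumes the dumps have already been collected as ordered lists of `(name, array)` pairs in the same op order, and it does not use any debug_executor API. `first_divergence` and the toy layer names are hypothetical.

```python
import numpy as np

def first_divergence(ref_outputs, test_outputs, rtol=1e-4, atol=1e-4):
    """Return (index, name, max_abs_err) for the first op whose outputs
    differ between the two dumps, or None if every op matches.
    Both arguments are ordered lists of (name, array) pairs."""
    for i, ((name, ref), (_, test)) in enumerate(zip(ref_outputs, test_outputs)):
        ref = np.asarray(ref)
        test = np.asarray(test)
        if ref.shape != test.shape:
            return i, name, float("inf")  # shape mismatch counts as divergence
        if not np.allclose(ref, test, rtol=rtol, atol=atol):
            return i, name, float(np.max(np.abs(ref - test)))
    return None

# Toy dumps: the second op is where the "vulkan" values go wrong.
cpu_dump = [("dense", np.array([1.0, 2.0])),
            ("relu", np.array([1.0, 2.0])),
            ("sum", np.array([3.0]))]
vk_dump = [("dense", np.array([1.0, 2.0])),
           ("relu", np.array([1.0, 2.5])),
           ("sum", np.array([3.5]))]

hit = first_divergence(cpu_dump, vk_dump)
print(hit)
```

Reporting only the first mismatch is deliberate: every op downstream of a bad one will usually differ too, so later mismatches say little about the cause.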

angrycrab · October 31, 2022, 4:32am

Thank you so much for the information. I’ll see what I can do.

masahi · November 3, 2022, 7:45am

I just remembered this PR https://github.com/apache/tvm/pull/12646 which might be helpful in accuracy debugging. We can binary-search the problematic op.
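The binary search itself can be sketched independently of TVM. The code below is not the API from that PR; it assumes a user-supplied predicate `prefix_matches(k)` (hypothetical) that builds and runs only the first `k` ops on both backends and returns True when their outputs agree. It also assumes that once the outputs diverge they stay diverged for every longer prefix, and that the empty prefix (`k = 0`) trivially matches.

```python
def find_first_bad_op(num_ops, prefix_matches):
    """Binary-search the index of the first op whose inclusion makes the
    two backends disagree. Returns None if the full graph matches.

    prefix_matches(k) -> bool is a hypothetical user-supplied check that
    compares both backends on the first k ops of the graph."""
    if prefix_matches(num_ops):
        return None
    lo, hi = 0, num_ops  # invariant: prefix of length lo matches, length hi does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_matches(mid):
            lo = mid
        else:
            hi = mid
    return hi - 1  # the op added when going from prefix lo to prefix hi

# Toy stand-in: pretend op 5 of an 8-op graph is the faulty one, so any
# prefix that includes it (length >= 6) fails to match.
calls = []
def fake_prefix_matches(k):
    calls.append(k)
    return k <= 5

bad = find_first_bad_op(8, fake_prefix_matches)
print(bad, calls)
```

Each call to the predicate implies a compile and run on both backends, so needing roughly log2(num_ops) calls instead of one per op is the point of searching this way.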