Add Evaluators to Debug Executor

Hi,

For internal testing, I would like to add additional evaluators to the debug executor. Currently, the RunIndividual function can measure the execution time of individual layers using the TimeEvaluator.

However, I would like to add the capability to measure power consumption, clock speeds, etc. for CUDA GPUs using NVIDIA’s libraries (e.g. NVML).

But as I am going through the code, I was not able to find the implementation of the RPCTimeEvaluator for the C++ runtime. I found the implementation for the C runtime by @areusch, but I am not sure whether it is used across runtimes/executors.

Thanks in advance :slight_smile:

Hi @max1996, it would be great to add some ability to track additional metrics. I think @tkonolige was working on some perf stuff and may know more about what’s available and what’s not.

The RPCTimeEvaluator for C++ is here.

@max1996 You will want the profile routine on the debug executor (tvm/debug_executor.py at main · apache/tvm · GitHub). With this PR (https://github.com/apache/tvm/pull/7983), you will be able to measure performance counters for the CUDA code.
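
For reference, a minimal sketch of how that profile call could be wired up (lib, image, and the "data" input name are placeholders for whatever your own build produces):

import tvm
from tvm.contrib.debugger import debug_executor

# Assuming `lib` is the result of relay.build(...) and `image` is the input tensor.
dev = tvm.cuda(0)
m = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev)
m.set_input("data", tvm.nd.array(image.astype("float32")))
report = m.profile()  # per-operator report, analogous to RunIndividual's timings
print(report)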

Thank you,

That seems to be exactly what I am looking for.

Will it work with the RPC infrastructure as well?

Yes, I believe it does.

Hey @tkonolige,

Can you give me a hint on how to build your PR?

I built PAPI for my targets before pulling it and set the USE_PAPI flag in config.cmake to the path of my papi .pc file, but CMake is not able to find the PAPI module.

Thanks in advance :slight_smile:

EDIT: OK, I found the problem: I gave the path to the papi.pc file, but it only works with the path to the directory containing it. However, during the make process the papi.h file cannot be found.

Can you provide more information, like what the error is and where it is occurring?

Hi @tkonolige, during the build process the papi.h header file, which is part of the PAPI installation, could not be found. I added include_directories(${USE_PAPI}/../../include) as a workaround, which seems to be working fine for now.

EDIT: I am sorry, but I have some problems getting all the metrics during the profile step. I set the environment variable to give me performance information via NVIDIA’s NVML when using a CUDA target with

os.environ["TVM_PAPI_GPU_METRICS"] = "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:graphics_clock;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:sm_clock;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:memory_clock;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:allocated_memory;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:pstate;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:power;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:temperature;"

However, when I look at the output of tvm.runtime.GraphModule.profile, it only shows the duration in us.

I tested the NVML capabilities of my PAPI installation before and they worked as expected, and TVM has been compiled with the USE_PAPI flag as well.
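
A small sketch of one way to double-check that flag from Python, assuming the TVM build records the USE_PAPI option in its library info (this may depend on the TVM version):

from tvm import support

# Dump the cmake options baked into libtvm; USE_PAPI should not be "OFF"
# if the PAPI integration was compiled in.
print(support.libinfo().get("USE_PAPI"))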

I’ve updated my branch, so an environment variable is no longer used to set the metrics. Here is an example of how to do it now: incubator-tvm/test_runtime_profiling.py at profiler_papi · tkonolige/incubator-tvm · GitHub
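
Roughly, the collector-based call looks like the following sketch (the device, the NVML metric name, and the module/input variables are placeholders; the linked test shows the exact usage):

import tvm
from tvm.runtime import profiling

# Placeholder NVML metric; use whatever papi_native_avail reports for your GPU.
dev = tvm.cuda(0)
metrics = ["nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:power"]
collector = profiling.PAPIMetricCollector({dev: metrics})
report = module.profile(data=input_data, collectors=[collector])
print(report)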

Hi @tkonolige, I tried it again with the most recent version, but it seems like I am still doing something wrong. I am using the debug executor and otherwise mostly follow this guide (of course on a machine with NVIDIA GPUs).

After module.run() I added:

metrics = ["nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:graphics_clock",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:sm_clock",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:memory_clock",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:allocated_memory",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:pstate",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:power",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:temperature"
    ]

    data = tvm.nd.array(image.astype("float32"))
    module.set_input("data", tvm.nd.array(image.astype("float32")))
    test_data = module.profile(collectors[tvm.runtime.profiling.PAPIMetricCollector({ctx: metrics})])

But the returned data does not contain anything but the execution time of the layers.

I haven’t tried the nvml PAPI component yet. Could you just try cuda::event::elapsed_cycles_sm:device=0? Also, can you try running locally instead of over RPC?

I was using it with a LocalSession, but removed the RPC context entirely to check if something changes, and am now getting an error message instead of just the execution time:

Traceback (most recent call last):
  File "/home/max/collector/model_loader_mxnet_vision.py", line 321, in <module>
    test_data = debug_g_mod.profile(collectors=[tvm.runtime.profiling.PAPIMetricCollector({dev: metrics})])
  File "/home/max/papi_tvm/tvm/python/tvm/contrib/debugger/debug_executor.py", line 292, in profile
    return self._profile(collectors)
  File "/home/max/papi_tvm/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  4: TVMFuncCall
  3: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::runtime::profiling::Report (tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)>::AssignTypedLambda<tvm::runtime::GraphExecutorDebug::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)#5}>(tvm::runtime::GraphExecutorDebug::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)#5})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  2: tvm::runtime::GraphExecutorDebug::Profile(tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)
  1: tvm::runtime::profiling::Profiler::Profiler(std::vector<DLDevice, std::allocator<DLDevice> >, std::vector<tvm::runtime::profiling::MetricCollector, std::allocator<tvm::runtime::profiling::MetricCollector> >)
  0: tvm::runtime::profiling::PAPIMetricCollectorNode::Init(tvm::runtime::Array<tvm::runtime::profiling::DeviceWrapper, void>)
  File "/home/max/papi_tvm/tvm/src/runtime/contrib/papi/papi.cc", line 199
PAPIError: -7 Event does not exist: cuda::event::elapsed_cycles_sm:device=0.

I suspect that I made a mistake while compiling TVM for PAPI, as these events are listed when executing ./papi_native_avail.

To compile your PR, I added USE_PAPI to config.cmake and set it to the path of the PAPI installation directory. In addition, I included the PAPI include directory to access its header files, as it would not compile otherwise.

The PAPI code is now in main. Could you try it from there? If it still doesn’t work, can you run python3 -m pytest tests/python/unittest/test_runtime_profiling.py and verify that it works?

Thank you,

I retried it with the current master branch and set USE_PAPI to the path of the directory containing the package config files, but I am now unable to compile it, as papi.h cannot be found. So I added the PAPI include directory manually again.

But even with these changes the tests are all failing.

test_papi[cuda] fails with PAPIError: in function PAPI_start(event_set) -14 Unknown error code

and

test_papi[llvm] fails with PAPIError: in function PAPI_set_opt(PAPI_INHERIT, &opt) -15 Permission level does not permit operation

EDIT: To check whether it might be a problem with my local PAPI installation, I replaced it with the precompiled package available from Ubuntu’s package manager; however, I am still running into the same problems (cannot find header file & tests failing).

EDIT #2: I reinstalled PAPI to the default location, which resolved the problem of the missing header file; however, the tests are still failing with the same error message.

I believe the PAPI package on Ubuntu is too old. I would use commit 65833547f68ca3f52535c672e768b7f70427a44a from the git repo.

The permission error is caused by you not having enough permission to collect hardware counters. Use sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid' to fix this. Note that enabling this may worsen the security of your device. You probably have a similar issue with CUDA. Try sudo modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0. If that doesn’t work, you can try adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf.

I switched to the branch stable-6.0, but will also try the commit you suggested.

The tests are still failing: the CUDA test still fails with “unknown error”, while the LLVM test fails with “Event does not exist: PAPI_FP_OPS”. According to papi_avail this seems to be true, as its output is “PAPI_FP_OPS 0x80000066 No No Floating point operations”.

However, if I try to access cuda or nvml information through the TVM debug runtime profiling, it claims that these events do not exist, despite them being listed when calling papi_native_avail. (The full error message: PAPIError: -7 Event does not exist: cuda::event::elapsed_cycles_sm:device=0.)

I am not sure if it is just a problem with my PAPI installation, or what I have done wrong here…

Have you run the cuda tests in the papi directory? Do those work?

If you mean the tests in the PAPI components folder “cuda”, yes, those seem to be working as expected. (The HelloWorld & multiGPU tests either output a result or report that they have passed, while the *.cu files seem to have syntax errors.)

I tested everything with PAPI master, the commit you recommended and stable-6.0.

The NVML tests are working as well. I am not sure if it might be a permission problem, and I do not know what else to try.

EDIT: I used gdb to look into the function that fails (papi.cc, PAPIMetricCollectorNode::Init) and it seems to fail at:

int e = PAPI_add_named_event(event_set, metric.c_str());

e is set to -7 (“Event does not exist”) afterwards; however, the same metric/event is tested in the CUDA test cases of PAPI and works just fine.

I tested it again with a CPU target & a different metric, and it worked. The CPU I am using (Haswell Refresh) does not support the FP metric used in the test case for the PAPI integration, as it has been disabled due to some hardware issues.
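
A minimal sketch of such a CPU-side call, assuming the same module and input placeholders as above and using the standard PAPI_TOT_CYC preset as a stand-in metric:

import tvm
from tvm.runtime import profiling

# PAPI_TOT_CYC (total cycles) is a common PAPI preset; substitute whatever
# papi_avail lists as available on your CPU.
collector = profiling.PAPIMetricCollector({tvm.cpu(): ["PAPI_TOT_CYC"]})
report = module.profile(data=input_data, collectors=[collector])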

I am able to access some CUDA events and metrics if I run my Python script as root; rapl and nvml, however, are still not working.

Oh, I have a guess at why nvml support does not work. Right now, the profiler automatically sets the PAPI component depending on the device type. Try removing this line (tvm/papi.cc at main · apache/tvm · GitHub) and see if things work.
