Add Evaluators to Debug Executor

Hi,

For internal testing, I would like to add additional evaluators to the debug executor. Currently, the RunIndividual function can measure the execution time of individual layers using the TimeEvaluator.

However, I would like to add the capability to measure power consumption, clock speeds, etc. for CUDA GPUs using NVIDIA’s libraries (e.g. NVML).

But as I am going through the code, I was not able to find the implementation of the RPCTimeEvaluator for the C++ runtime. I found the implementation for the C runtime by @areusch, but I am not sure whether it is used across runtimes/executors.

Thanks in advance :slight_smile:

Hi @max1996, it would be great to add some ability to track additional metrics. I think @tkonolige was working on some perf stuff and may know more about what’s available and what’s not.

The RPCTimeEvaluator for C++ is here.

@max1996 You will want the profile routine on the debug executor (tvm/debug_executor.py at main · apache/tvm · GitHub). With this PR (https://github.com/apache/tvm/pull/7983), you will be able to measure performance counters for the CUDA code.
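
For reference, a minimal sketch of how that profile call could be wired up (lib, image, and the "data" input name are placeholders for whatever your own build produces):

import tvm
from tvm.contrib.debugger import debug_executor

# Assuming `lib` is the result of relay.build(...) and `image` is the input tensor.
dev = tvm.cuda(0)
m = debug_executor.create(lib.get_graph_json(), lib.get_lib(), dev)
m.set_input("data", tvm.nd.array(image.astype("float32")))
report = m.profile()  # per-operator report, analogous to RunIndividual's timings
print(report)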

Thank you,

That seems to be exactly what I am looking for.

Will it work with the RPC infrastructure as well?

Yes, I believe it does.

Hey @tkonolige,

Can you give me a hint on how to build your PR?

I built PAPI for my targets before pulling it and set the USE_PAPI flag in config.cmake to the path of my papi .pc file, but CMake is not able to find the PAPI module.

Thanks in advance :slight_smile:

EDIT: OK, I found the problem: I gave the path to the papi.pc file, but it only works with the path to the directory containing it. However, during the make process the papi.h file cannot be found.

Can you provide more information, like what the error is and where it is occurring?

Hi @tkonolige, during the build process the papi.h header file, which is part of the PAPI installation, could not be found. I added include_directories(${USE_PAPI}/../../include) as a workaround, which seems to be working fine for now.

EDIT: I am sorry, but I have some problems getting all the metrics during the profile step. I set the environment variable to give me performance information via NVIDIA’s NVML when using a CUDA target with

os.environ["TVM_PAPI_GPU_METRICS"] = "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:graphics_clock;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:sm_clock;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:memory_clock;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:allocated_memory;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:pstate;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:power;nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:temperature;"

However, when I look at the output of tvm.runtime.GraphModule.profile, it only shows the duration in us.

I tested the NVML capabilities of my PAPI installation before and they worked as expected, and TVM has been compiled with the USE_PAPI flag as well.
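
A small sketch of one way to double-check that flag from Python, assuming the TVM build records the USE_PAPI option in its library info (this may depend on the TVM version):

from tvm import support

# Dump the cmake options baked into libtvm; USE_PAPI should not be "OFF"
# if the PAPI integration was compiled in.
print(support.libinfo().get("USE_PAPI"))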

I’ve updated my branch, so an environment variable is no longer used to set the metrics. Here is an example of how to do it now: incubator-tvm/test_runtime_profiling.py at profiler_papi · tkonolige/incubator-tvm · GitHub
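
Roughly, the collector-based call looks like the following sketch (the device, the NVML metric name, and the module/input variables are placeholders; the linked test shows the exact usage):

import tvm
from tvm.runtime import profiling

# Placeholder NVML metric; use whatever papi_native_avail reports for your GPU.
dev = tvm.cuda(0)
metrics = ["nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:power"]
collector = profiling.PAPIMetricCollector({dev: metrics})
report = module.profile(data=input_data, collectors=[collector])
print(report)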

Hi @tkonolige, I tried it again with the most recent version, but it seems like I am still doing something wrong. I am using the debug executor and otherwise mostly follow this guide (of course on a machine with NVIDIA GPUs).

After module.run() I added:

metrics = ["nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:graphics_clock",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:sm_clock",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:memory_clock",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:allocated_memory",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:pstate",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:power",
                "nvml:::NVIDIA_GeForce_GTX_980_Ti:device_0:temperature"
    ]

    data = tvm.nd.array(image.astype("float32"))
    module.set_input("data", tvm.nd.array(image.astype("float32")))
    test_data = module.profile(collectors[tvm.runtime.profiling.PAPIMetricCollector({ctx: metrics})])

But the returned data does not contain anything but the execution time of the layers.

I haven’t tried the nvml PAPI component yet. Could you just try cuda::event::elapsed_cycles_sm:device=0? Also, can you try running locally instead of over RPC?

I was using it with a LocalSession, but removed the RPC context entirely to check if something changes, and am now getting an error message instead of just the execution time:

Traceback (most recent call last):
  File "/home/max/collector/model_loader_mxnet_vision.py", line 321, in <module>
    test_data = debug_g_mod.profile(collectors=[tvm.runtime.profiling.PAPIMetricCollector({dev: metrics})])
  File "/home/max/papi_tvm/tvm/python/tvm/contrib/debugger/debug_executor.py", line 292, in profile
    return self._profile(collectors)
  File "/home/max/papi_tvm/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 237, in __call__
    raise get_last_ffi_error()
tvm._ffi.base.TVMError: Traceback (most recent call last):
  4: TVMFuncCall
  3: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::TypedPackedFunc<tvm::runtime::profiling::Report (tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)>::AssignTypedLambda<tvm::runtime::GraphExecutorDebug::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)#5}>(tvm::runtime::GraphExecutorDebug::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)#5})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  2: tvm::runtime::GraphExecutorDebug::Profile(tvm::runtime::Array<tvm::runtime::profiling::MetricCollector, void>)
  1: tvm::runtime::profiling::Profiler::Profiler(std::vector<DLDevice, std::allocator<DLDevice> >, std::vector<tvm::runtime::profiling::MetricCollector, std::allocator<tvm::runtime::profiling::MetricCollector> >)
  0: tvm::runtime::profiling::PAPIMetricCollectorNode::Init(tvm::runtime::Array<tvm::runtime::profiling::DeviceWrapper, void>)
  File "/home/max/papi_tvm/tvm/src/runtime/contrib/papi/papi.cc", line 199
PAPIError: -7 Event does not exist: cuda::event::elapsed_cycles_sm:device=0.

I suspect that I made a mistake while compiling TVM for PAPI, as these events are listed when executing ./papi_native_avail.

To compile your PR, I added USE_PAPI to config.cmake and set it to the path of the PAPI installation directory. In addition, I included the PAPI include directory to access its header files, as it would not compile otherwise.

The PAPI code is now in main. Could you try it from there? If it still doesn’t work, can you run python3 -m pytest tests/python/unittest/test_runtime_profiling.py and verify that it works?

Thank you,

I retried it with the current master branch and set USE_PAPI to the path of the directory containing the package config files, but I am now unable to compile it, as papi.h cannot be found. So I added the PAPI include directory manually again.

But even with these changes the tests are all failing.

test_papi[cuda] fails with PAPIError: in function PAPI_start(event_set) -14 Unknown error code

and

test_papi[llvm] fails with PAPIError: in function PAPI_set_opt(PAPI_INHERIT, &opt) -15 Permission level does not permit operation

EDIT: To check whether it might be a problem with my local PAPI installation, I replaced it with the precompiled package available from Ubuntu’s package manager; however, I am still running into the same problems (cannot find header file & tests failing).

EDIT #2: I reinstalled PAPI to the default location, which resolved the problem of the missing header file; however, the tests are still failing with the same error message.

I believe the PAPI package on Ubuntu is too old. I would use commit 65833547f68ca3f52535c672e768b7f70427a44a from the git repo.

The permission error is caused by you not having enough permission to collect hardware counters. Use sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid' to fix this. Note that enabling this may worsen the security of your device. You probably have a similar issue with CUDA. Try sudo modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0. If that doesn’t work, you can try adding options nvidia "NVreg_RestrictProfilingToAdminUsers=0" to /etc/modprobe.d/nvidia-kernel-common.conf.

I switched to the branch stable-6.0, but will also try the commit you suggested.

The tests are still failing: the CUDA test still fails with “unknown error”, while the LLVM test fails with “Event does not exist: PAPI_FP_OPS”. According to papi_avail this seems to be true, as its output is “PAPI_FP_OPS 0x80000066 No No Floating point operations”.

However, if I try to access cuda or nvml information through the TVM debug runtime profiling, it claims that these events do not exist, despite them being listed when calling papi_native_avail. (The full error message: PAPIError: -7 Event does not exist: cuda::event::elapsed_cycles_sm:device=0.)

I am not sure if it is just a problem with my PAPI installation, or what I have done wrong here…

Have you run the cuda tests in the papi directory? Do those work?

If you mean the tests in the PAPI components folder “cuda”, yes, those seem to be working as expected. (The HelloWorld & multiGPU tests either output a result or report that they have passed, while the *.cu files seem to have syntax errors.)

I tested everything with PAPI master, the commit you recommended and stable-6.0.

The NVML tests are working as well. I am not sure if it might be a permission problem, and I do not know what else to try.

EDIT: I used gdb to look into the function that fails (papi.cc, PAPIMetricCollectorNode::Init) and it seems to fail at:

int e = PAPI_add_named_event(event_set, metric.c_str());

e is set to -7 (“Event does not exist”) afterwards; however, the same metric/event is tested in the CUDA test cases of PAPI and works just fine.

I tested it again with a CPU target & a different metric, and it worked. The CPU I am using (Haswell Refresh) does not support the FP metric used in the test case for the PAPI integration, as it has been disabled due to some hardware issues.
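
A minimal sketch of such a CPU-side call, assuming the same module and input placeholders as above and using the standard PAPI_TOT_CYC preset as a stand-in metric:

import tvm
from tvm.runtime import profiling

# PAPI_TOT_CYC (total cycles) is a common PAPI preset; substitute whatever
# papi_avail lists as available on your CPU.
collector = profiling.PAPIMetricCollector({tvm.cpu(): ["PAPI_TOT_CYC"]})
report = module.profile(data=input_data, collectors=[collector])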

I am able to access some CUDA events and metrics if I run my Python script as root; rapl and nvml, however, are still not working.

Oh, I have a guess at why nvml support does not work. Right now, the profiler automatically sets the PAPI component depending on the device type. Try removing this line (tvm/papi.cc at main · apache/tvm · GitHub) and see if things work.
