Dear community,
I’m using kl_divergence calibration to quantize a fairly big in-house network. I’ve implemented a mechanism to feed it pickled input frames, which I generate from the reference implementation. Since the network inputs are quite large, the resulting (binary-encoded) pickle files grow to around 14 MB per frame. Currently I’m feeding around 157 frames (around 2.2 GB in total), and the quantizer fails with the following error:
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (5) /home/buecs/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f969a57db25]
[bt] (4) /home/buecs/tvm/build/libtvm.so(+0x402c34) [0x7f9699d55c34]
[bt] (3) /home/buecs/tvm/build/libtvm.so(+0x402aa7) [0x7f9699d55aa7]
[bt] (2) /home/buecs/tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule const&, tvm::transform::PassContext const&) const+0x389) [0x7f9699d557d9]
[bt] (1) /home/buecs/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule const&, tvm::transform::PassContext const&) const+0x10f) [0x7f9699d549af]
[bt] (0) /home/buecs/tvm/build/libtvm.so(+0xc25f8b) [0x7f969a578f8b]
File "/home/buecs/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 78, in cfun
rv = local_pyfunc(*pyargs)
File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 191, in wrapped_func
input_scale_func = _kl_scale(mod, dataset)
File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 102, in _kl_scale
scales += list(pool.map(_find_scale_by_kl, samples))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
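From what I can tell, the root cause is that Python’s multiprocessing packs the length of each pickled task into a signed 32-bit header before sending it to a worker process, so a single payload above ~2 GiB cannot be transferred. A minimal illustration (the byte count below is just a rough estimate of my total frame data):
import struct

# multiprocessing.connection packs the pickled payload length as a signed
# 32-bit int ("!i"), so anything above 2**31 - 1 bytes (~2 GiB) fails.
payload_size = 157 * 14 * 1024 * 1024  # rough estimate: 157 frames x ~14 MB each
try:
    struct.pack("!i", payload_size)
except struct.error as err:
    print(err)  # 'i' format requires -2147483648 <= number <= 2147483647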
I tried playing with the calibrate_chunk_by parameter, but so far none of the values I tried removes this error.
Has anyone encountered a similar error before? If so, how could I solve or mitigate it? An expert opinion from, for example, @vinx13 would be much appreciated!
Hi @vinx13, thank you very much for your suggestion. Avoiding multiprocessing indeed mitigates the problem of sending overly large data chunks from the parent to a child process. Of course the calibration is slower, but that was expected.
However, I’m now hitting the system’s memory limit (64 GB RAM + 64 GB swap!) with only ~250 pickle calibration frames.
When looking into the calibration mechanism in TVM, the flow is as follows (please correct me if I’m wrong):
1. All (pickle) calibration frames are loaded at once
2. The scales are calculated per frame
Would it be possible to modify this flow so that it requires less peak memory, something like this:
1. Load the 1st (pickle) calibration frame
2. Calculate its scales
3. Load the 2nd (pickle) calibration frame
4. Calculate its scales
…
I guess this would require modifying this loop accordingly.
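In pseudocode, the streaming behaviour I have in mind would look something like this (the file handling and the compute_scales callback are purely illustrative, not actual TVM API):
import pickle

def streamed_scales(frame_paths, compute_scales):
    """Process one pickled calibration frame at a time to keep peak memory low.

    frame_paths: paths to the per-frame pickle files (illustrative).
    compute_scales: hypothetical callback that computes the scales for one frame.
    """
    scales = []
    for path in frame_paths:
        with open(path, "rb") as f:
            frame = pickle.load(f)            # load a single calibration frame
        scales.append(compute_scales(frame))  # calculate scales for this frame
        del frame                             # release it before loading the next one
    return scales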
Thank you very much in advance & Best regards,
Robert
I think what you are trying to do is exactly what the chunk_by parameter is for. Try a number smaller than 250. If you calibrate on CUDA, the overhead from interleaved feature generation and scale calculation should be negligible.
Try a smaller number. It seems your input is so big that RAM is already saturated with 100 inputs.
I suggest starting with chunk_by = the number of cores and enabling multiprocessing. This should give maximal parallelism while keeping memory usage low. Then gradually increase it while observing how memory usage grows.
There is not much disadvantage to using a small chunk_by, other than more recompute. chunk_by = 1000 doesn’t make anything faster and is a bad idea. The overhead from the extra recompute can be hidden if you use CUDA for calibration (because a single inference is so cheap). Note that even if your final target is x86, you can still use the CUDA target for calibration.
Thank you @masahi for the great advice! With calibrate_chunk_by = 16 (the number of cores) the peak memory demand dropped significantly (~30 GB, no swap) and calibration became much faster (since there is no more swapping).
Hm… Actually, after the kl_divergence calibration is executed, I get the following error with low calibrate_chunk_by values (<= 45):
ValueError: Traceback (most recent call last):
[bt] (4) /home/buecs/tvm/build/libtvm.so(TVMFuncCall+0x65) [0x7f356df85245]
[bt] (3) /home/buecs/tvm/build/libtvm.so(+0x446d74) [0x7f356d699d74]
[bt] (2) /home/buecs/tvm/build/libtvm.so(tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x30d) [0x7f356d69890d]
[bt] (1) /home/buecs/tvm/build/libtvm.so(tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const+0x118) [0x7f356d699938]
[bt] (0) /home/buecs/tvm/build/libtvm.so(+0xd2de6b) [0x7f356df80e6b]
File "/home/buecs/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 78, in cfun
rv = local_pyfunc(*pyargs)
File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 190, in wrapped_func
input_scale_func = _kl_scale(mod, dataset)
File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 99, in _kl_scale
for samples in collect_stats(mod, dataset, chunk_by):
File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 92, in collect_stats
yield [np.concatenate(output).reshape(-1) for output in outputs]
File "/home/buecs/tvm/python/tvm/relay/quantize/_calibrate.py", line 92, in <listcomp>
yield [np.concatenate(output).reshape(-1) for output in outputs]
File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate
Dear @masahi, dear @vinx13,
After further debugging, I believe that either I have found a bug in TVM, or I’m using the quantizer in a way that is not supported (but also not prevented). The problem can be reproduced with the off-the-shelf TVM tutorial! If you make the following change to the tutorial:
def quantize(mod, params, data_aware):
    if data_aware:
-       with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max'):
+       with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max', calibrate_chunk_by=46):
            mod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())
    else:
        with relay.quantize.qconfig(calibrate_mode='global_scale', global_scale=8.0):
            mod = relay.quantize.quantize(mod, params)
    return mod
You can use any other value below 47 to reproduce the problem. Why 47? Because this is the value of num_outputs in the following loop:
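For reference, the loop I mean looks roughly like this (paraphrased from collect_stats() in _calibrate.py; exact details may differ between TVM versions):
chunk_by = num_outputs if chunk_by == -1 else chunk_by
for i in range(0, num_outputs, chunk_by):
    outputs = [[] for _ in range(min(chunk_by, num_outputs - i))]
    for batch in dataset:  # the calibration dataset is iterated once per chunk
        runtime.set_input(**batch)
        runtime.run()
        for j in range(i, min(i + chunk_by, num_outputs)):
            outputs[j - i].append(runtime.get_output(j).asnumpy())
    yield [np.concatenate(output).reshape(-1) for output in outputs]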
Could anyone please comment on this issue @vinx13, @masahi?
Thank you very much in advance &
Best regards,
Robert
Not sure if this is a bug. You are responsible for setting the right calibrate_chunk_by. Setting calibrate_chunk_by=46 when num_outputs is 47 doesn’t make sense. Please read the code to understand the purpose of this parameter.
Hi @masahi, thank you for your reply. I do understand, but something is not right here: integrating exactly this test setup into the stock tutorial code leads to the same error:
def quantize(mod, params, data_aware):
    if data_aware:
-       with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max'):
+       num_cpu = multiprocessing.cpu_count()
+       with relay.quantize.qconfig(calibrate_mode='kl_divergence', weight_scale='max', calibrate_chunk_by=num_cpu):
            mod = relay.quantize.quantize(mod, params, dataset=calibrate_dataset())
    else:
        with relay.quantize.qconfig(calibrate_mode='global_scale', global_scale=8.0):
            mod = relay.quantize.quantize(mod, params)
    return mod
Just to give you the value range: in my case num_cpu=10 while num_outputs=47. Would you be so kind as to try it out in your environment? It would be great if someone could reproduce the issue; I’d really appreciate it!
Dear @masahi, dear @vinx13, I was hoping you could try out the change in my previous comment to reproduce the issue. Please give it a try if you find a bit of free time.
Thank you very much in advance!
Best regards,
Robert
OK, I took a look at your problem. I got the error you saw too, but this is not a bug in the calibration code; it is due to the calibration dataset used in the tutorial.
Since the dataset is defined as a generator, you can only consume it once. But if you use the calibrate_chunk_by param, we need to run multiple passes over the calibration dataset, so the dataset should be a list or another data structure that can be traversed multiple times.
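To illustrate with plain Python (nothing TVM-specific): a generator yields nothing on the second pass, which is exactly what makes np.concatenate complain about an empty sequence.
gen = (x for x in range(3))
print(list(gen))  # [0, 1, 2]
print(list(gen))  # [] -- exhausted; np.concatenate([]) then raises
                  # "need at least one array to concatenate"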
If you replace the calibrate_dataset() function in the tutorial with the one below, it should work.
def calibrate_dataset():
    val_data, batch_fn = get_val_data()
    val_data.reset()
    calib_data = []
    for i, batch in enumerate(val_data):
        data, _ = batch_fn(batch)
        calib_data.append({'data': data})
        if i > calibration_samples:
            break
    return calib_data