[Relay][training] Resource usage problem during training process

I tried to implement a loop that simulates a training process, referring to the gradient of the dense operator from https://discuss.tvm.ai/t/training-in-relay/2712 (thanks for the nice work there). I built a simple network and ran it:

import numpy as np

import tvm
from tvm import relay
from tvm.relay.transform import gradient
from tvm.relay.testing import run_infer_type


def test_loop():
    iters = 200000
    x_shape = (1000, 1000)
    w_shape = (1000, 1000)

    x_data = np.random.rand(*x_shape).astype("float32")
    w_data = np.random.rand(*w_shape).astype("float32")

    x = relay.var("x", shape=x_shape)
    w = relay.var("w", shape=w_shape)

    # A small network: dense -> add -> relu -> log
    z = relay.nn.dense(x, w)
    z = relay.add(z, w)
    z = relay.nn.relu(z)
    z = relay.log(z)

    fwd_func = run_infer_type(relay.Function([x, w], z))
    # Transform the forward function into one that also computes gradients
    bwd_func = run_infer_type(gradient(fwd_func))

    intrp = relay.create_executor(ctx=tvm.context('llvm', 0), target='llvm')
    evaluator = intrp.evaluate(bwd_func)

    # Repeatedly evaluate the backward function with the same inputs
    for i in range(iters):
        res = evaluator(**{'x': x_data, 'w': w_data})
        print('i == {}'.format(i))

if __name__ == "__main__":
    test_loop()

I found two problems related to resource usage (a small measurement sketch follows this list):
(1) Memory utilization climbs steadily from under 1% to over 90% over the course of the run (total memory is 256 GB).
(2) CPU utilization in system (kernel) mode is higher than in user mode.
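
To put numbers on both observations, here is a minimal measurement sketch using only the Python standard library (Linux-only; the log_usage helper is my own name, not part of TVM). It can be called in place of the plain print inside the loop above:

import os
import resource

def log_usage(i):
    # Peak resident set size of this process; on Linux ru_maxrss is in KB
    rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Cumulative CPU seconds of this process (TVM worker threads included)
    t = os.times()
    print('i == {}: maxrss = {:.1f} MB, user = {:.1f}s, sys = {:.1f}s'.format(
        i, rss_kb / 1024.0, t.user, t.system))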

For problem (1), I assume there is a memory leak somewhere in the process; for problem (2), I wonder if someone could tell me the cause. Thank you very much ^.^
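
My only guess so far for problem (2) is TVM's runtime thread pool: on a many-core machine, contention among worker threads might show up as kernel-mode time. One thing I plan to try is capping the thread count via the TVM_NUM_THREADS environment variable, which TVM's threading backend reads at startup (the value 4 below is just an arbitrary example):

import os
# Must be set before the TVM runtime starts its thread pool
os.environ['TVM_NUM_THREADS'] = '4'

import tvm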
I'm just starting to get familiar with the TVM stack, so I apologize if I'm missing something obvious. Looking forward to your reply.