I ran some tests on a BERT model and found that seq_len varies from 64 to 512 when using padding. Performance can be affected a lot by changes in sequence length. The model has only one input, input_ids [batch_size, seq_length].
So I compiled several libraries, one per sequence length, but then device memory grows to a multiple of the original model's needs. Is there any method to handle this?
I don't need full dynamic-shape support, just a few fixed cases: one compiled library that can run several seq-length cases (128, 256, 384).
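To clarify the use case, here is a minimal sketch (the helper name and pad token are my own, hypothetical choices) of what I mean: at inference time each batch would be padded up to the nearest supported bucket, so only those three shapes ever reach the compiled library.

```python
import numpy as np

# Supported seq-length buckets the single compiled library should cover.
BUCKETS = (128, 256, 384)

def pad_to_bucket(input_ids, pad_id=0):
    """Pad a [batch_size, seq_length] batch up to the nearest bucket."""
    seq_len = input_ids.shape[1]
    # Pick the smallest bucket that fits the actual sequence length.
    bucket = next(b for b in BUCKETS if b >= seq_len)
    return np.pad(input_ids, ((0, 0), (0, bucket - seq_len)),
                  constant_values=pad_id)

batch = np.ones((2, 200), dtype=np.int64)  # actual seq_len = 200
padded = pad_to_bucket(batch)
print(padded.shape)  # (2, 256)
```

So the runtime would only ever see three static shapes, rather than every length from 64 to 512.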
If several engines have to be compiled, is there any mechanism to share the weights that have the same contents?