I was wondering about the shape inference logic of the dense op. Usually a flatten op precedes the dense op in a network, so the input of the dense op is 2D. If there is no explicit flatten op before the dense op and the input data `X` is 4D, I would expect it to be flattened implicitly before applying the multiplication `XW^T`. But in the 4D case, that expectation does not seem to match what is documented and implemented:
- **data**: `(x1, x2, ..., xn, input_dim)`
- **weight**: `(units, input_dim)`
- **bias**: `(units,)`
- **out**: `(x1, x2, ..., xn, units)`
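
To make sure I'm reading that rule literally, here is a small NumPy sketch (my own illustration with toy shapes, not the op's actual code) of the last-axis interpretation, where the weight is applied only along `input_dim` and all leading axes are kept:

```python
import numpy as np

# Hypothetical toy shapes, only to illustrate how I read the documented rule.
x = np.random.rand(32, 3, 224, 224)   # data: (x1, ..., xn, input_dim), input_dim = 224
w = np.random.rand(10, 224)           # weight: (units, input_dim), units = 10
b = np.random.rand(10)                # bias: (units,)

# Multiply along the last axis only; every leading axis is preserved.
out = np.einsum('...k,uk->...u', x, w) + b
print(out.shape)                      # (32, 3, 224, 10)
```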
Per the documentation and implementation, an input of shape `(32, 3, 224, 224)` with `units=10` would produce an output of shape `(32, 3, 224, 10)`, which does not seem correct.
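
For comparison, this is the implicit-flatten behavior I expected (again only a NumPy sketch, with a hypothetical weight sized to the flattened input):

```python
import numpy as np

# Collapse all non-batch axes first, then apply X_flat @ W^T.
x = np.random.rand(32, 3, 224, 224)
x_flat = x.reshape(x.shape[0], -1)          # (32, 3*224*224) = (32, 150528)
w = np.random.rand(10, x_flat.shape[1])     # weight: (units, 3*224*224)
b = np.random.rand(10)                      # bias: (units,)

out = x_flat @ w.T + b
print(out.shape)                            # (32, 10)
```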
Am I misinterpreting something here?