The Neural Matthew Effect: Low Effective Degrees of Freedom in Training
I use the term Neural Matthew Effect for a recurring pattern in deep network training: a small number of directions, modules, or connections carry most of the learning signal. In parameter space, this appears as low-rank updates; in gradient space, as concentration along dominant directions; at the functional level, as the strengthening of existing circuits; and at the structural level, as increasingly uneven interactions among neurons. Throughout this article, low rank means low effective rank, not strictly low algebraic rank. For neural network weights, gradients, and the Hessian, strict rank is often unstable: an arbitrarily small perturbation can make a matrix full rank, yet that does not make all directions equally important. The more useful question is not “how many singular values are nonzero?” but “how many directions contain most of the spectral mass?” ...
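To make the contrast between strict rank and effective rank concrete, here is a minimal sketch in NumPy. It assumes one possible working definition of effective rank: the smallest number of singular directions whose squared singular values account for a chosen fraction of the total. The helper name `effective_rank`, the 90% threshold, the use of squared singular values as "spectral mass", and the synthetic matrices are all illustrative choices, not definitions taken from this article.

```python
import numpy as np


def effective_rank(matrix, mass=0.9):
    """Smallest k such that the top-k squared singular values hold `mass` of the total.

    Assumption: "spectral mass" is measured by squared singular values and the
    threshold is a free parameter. Other definitions (for example entropy-based
    ones) are equally reasonable.
    """
    s = np.linalg.svd(matrix, compute_uv=False)  # singular values, descending
    energy = s ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    k = int(np.searchsorted(cumulative, mass)) + 1
    return min(k, len(s))  # guard against floating-point round-off at mass ~ 1


rng = np.random.default_rng(0)
n, true_rank = 64, 3

# A matrix built from 3 directions, then the same matrix plus tiny dense noise.
low_rank = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, n))
perturbed = low_rank + 1e-6 * rng.standard_normal((n, n))

strict_before = np.linalg.matrix_rank(low_rank)
strict_after = np.linalg.matrix_rank(perturbed)
eff_before = effective_rank(low_rank)
eff_after = effective_rank(perturbed)

print("strict rank:   ", strict_before, "->", strict_after)
print("effective rank:", eff_before, "->", eff_after)
```

In this sketch the tiny perturbation is enough to change the strict rank reported by `np.linalg.matrix_rank`, while the count of directions holding most of the spectral mass stays where it was. The threshold changes the number reported but not the qualitative point.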