- tags: Convolutional neural networks, Neural network training
- source: (Ye et al. 2020)

## Summary

This paper introduces so-called *Network Deconvolution*, advertised as a way to remove pixel-wise and channel-wise correlation in deep neural networks.

The authors base their new operator on the optimal configuration for \(L_2\) linear regression, where gradient descent converges in a single step if and only if:

\[ \frac{1}{N}X^t X = I \] where \(X\) is the feature matrix and \(N\) the number of samples.
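A short derivation may help show why this condition gives one-step convergence. It is only a sketch under assumptions not stated in the summary: the loss is scaled as \(\frac{1}{2N}\), the learning rate is 1, and \(y\), \(w_0\), \(w_1\) are names introduced here for the targets, the initial weights and the weights after one step.

\[ L(w) = \frac{1}{2N}\lVert Xw - y \rVert^2, \qquad \nabla L(w) = \frac{1}{N}X^t (Xw - y) \]

\[ w_1 = w_0 - \nabla L(w_0) = \left(I - \frac{1}{N}X^t X\right) w_0 + \frac{1}{N}X^t y \]

When \(\frac{1}{N}X^t X = I\), the first term vanishes for any \(w_0\), and \(w_1 = \frac{1}{N}X^t y = (X^t X)^{-1} X^t y\), which is the least-squares solution.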

This means that input features should be normalized and uncorrelated for gradient descent to converge the fastest. This can be achieved either by correcting the gradient with the inverse of the Hessian matrix (a Newton step), or by transforming the input features so that they are normalized and decorrelated.
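The two options above can be compared numerically on a toy regression problem. This is a minimal NumPy sketch, not code from the paper: the data, the mixing matrix `mix`, the unit learning rate and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
mix = np.array([[1.0, 0.8], [0.0, 1.0]])   # assumed mixing matrix, creates correlation
X = rng.normal(size=(N, 2)) @ mix          # correlated input features
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=N)

def grad(X, y, w):
    # Gradient of the loss (1 / 2N) * ||X w - y||^2
    return X.T @ (X @ w - y) / len(X)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares reference solution
w0 = np.zeros(2)

# Plain gradient step with unit learning rate on the raw features
w_gd = w0 - grad(X, y, w0)

# Option 1: correct the gradient with the inverse Hessian (Newton step)
H = X.T @ X / N
w_newton = w0 - np.linalg.solve(H, grad(X, y, w0))

# Option 2: decorrelate the features with H^(-1/2), then take a plain gradient step
vals, vecs = np.linalg.eigh(H)
W = (vecs * vals ** -0.5) @ vecs.T
v = w0 - grad(X @ W, y, w0)    # weights expressed in the whitened coordinates
w_white = W @ v                # mapped back to the original coordinates

print("plain step error:   ", np.abs(w_gd - w_star).max())
print("Newton step error:  ", np.abs(w_newton - w_star).max())
print("whitened step error:", np.abs(w_white - w_star).max())
```

Both corrected variants reach the least-squares solution after one step, while the plain step on correlated features does not. Network Deconvolution follows the second route, acting on the features rather than on the gradient.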

The authors introduce an algorithm to construct this deconvolution operator, \(D \approx (Cov + \epsilon \cdot I) ^{-\frac{1}{2}}\), where \(Cov = \frac{1}{N}(X-\mu)^t (X-\mu)\) is the covariance matrix of the features and \(\mu\) is their mean.
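As an illustration of the formula for \(D\), the sketch below computes it exactly with an eigendecomposition. This is an assumption made for clarity and is not necessarily how the paper computes the inverse square root. `deconv_operator` is a hypothetical helper name, and \(X\) is treated as a plain `(N, d)` feature matrix.

```python
import numpy as np

def deconv_operator(X, eps=1e-5):
    """Return (mu, D) with D = (Cov + eps * I)^(-1/2) for an (N, d) matrix X."""
    mu = X.mean(axis=0, keepdims=True)
    Xc = X - mu
    cov = Xc.T @ Xc / X.shape[0]
    # cov + eps * I is symmetric positive definite, so eigh applies
    vals, vecs = np.linalg.eigh(cov + eps * np.eye(X.shape[1]))
    D = (vecs * vals ** -0.5) @ vecs.T
    return mu, D

rng = np.random.default_rng(0)
# Toy correlated features with a non-zero mean (assumed data)
mix = np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X = rng.normal(size=(2000, 3)) @ mix + 3.0

mu, D = deconv_operator(X)
Z = (X - mu) @ D                 # centered, decorrelated features
cov_Z = Z.T @ Z / X.shape[0]     # close to the identity matrix
print(np.round(cov_Z, 3))
```

The transformed features `Z` have zero mean and a covariance close to the identity, which is the condition stated earlier for fast convergence. The \(\epsilon\) term keeps the inverse square root well defined when the covariance is nearly singular.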

In practice, the deconvolution operator is approximated from a subsample of the data to speed up computation. A running average of \(D\) is maintained during training; after training it is frozen and used for evaluation.
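The train/eval behavior described above might look roughly like the following sketch. Everything here is an assumption made for illustration: the class name `RunningDeconv`, the exponential moving average with `momentum`, the random row subsampling via `n_samples`, and the eigendecomposition-based inverse square root are not taken from the paper.

```python
import numpy as np

class RunningDeconv:
    """Hypothetical helper: subsampled estimate of D plus running averages."""

    def __init__(self, dim, momentum=0.1, eps=1e-5, n_samples=256, seed=0):
        self.mu = np.zeros(dim)      # running mean
        self.D = np.eye(dim)         # running deconvolution operator
        self.momentum = momentum
        self.eps = eps
        self.n_samples = n_samples
        self.rng = np.random.default_rng(seed)

    def _estimate(self, X):
        # Estimate mu and D from a random subset of rows to reduce cost
        k = min(self.n_samples, len(X))
        Xs = X[self.rng.choice(len(X), size=k, replace=False)]
        mu = Xs.mean(axis=0)
        Xc = Xs - mu
        cov = Xc.T @ Xc / k
        vals, vecs = np.linalg.eigh(cov + self.eps * np.eye(X.shape[1]))
        return mu, (vecs * vals ** -0.5) @ vecs.T

    def __call__(self, X, training=True):
        if training:
            mu, D = self._estimate(X)
            m = self.momentum
            self.mu = (1 - m) * self.mu + m * mu
            self.D = (1 - m) * self.D + m * D
            return (X - mu) @ D
        # Evaluation: use the frozen running averages, no update
        return (X - self.mu) @ self.D

layer = RunningDeconv(dim=3)
data_rng = np.random.default_rng(1)
for _ in range(5):
    layer(data_rng.normal(size=(512, 3)), training=True)
X_eval = data_rng.normal(size=(10, 3))
out = layer(X_eval, training=False)
```

Calling the layer with `training=False` leaves `mu` and `D` untouched, so repeated evaluation on the same input is deterministic.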

Deconvolution is presented as a unification of commonly used normalization techniques such as channel-wise decorrelation or BatchNorm.

The authors report a fairly consistent improvement over BatchNorm on image classification tasks. The improvement is small, however (less than 1% top-5 accuracy on ImageNet).

## Comments

This paper presents what seems like a generalization of techniques for accelerating deep learning training, rooted in reasonably sound theoretical ideas. These ideas are derived for the linear version of the problem, however, and such results often do not carry over to deep networks. Reviews are positive overall, and the paper may set a new standard for these normalization techniques, although it lacks comparisons with recent work other than BatchNorm.

These acceleration techniques seem to only have empirical foundations for deep learning as of today, and may well be rendered useless by some new training algorithm someday.