Revisiting Weight Initialization of Deep Neural Networks

Maciej Skorski (University of Luxembourg)*; Martin Theobald (University of Luxembourg); Alessandro Temperoni (University of Luxembourg)


The proper initialization of weights is crucial for the effective training and fast convergence of deep neural networks (DNNs). Prior work in this area has mostly focused on balancing the variance of the weights per layer in order to stabilize (i) the input data propagated forwards through the network and (ii) the loss gradients propagated backwards. This prevalent heuristic, however, is agnostic of dependencies among gradients across the various layers and captures only first-order optimization effects. In this paper, we develop an approximate chain rule for Hessians which, along with the principle of controlling the norm of the layers' Hessians, can be used to explain, generalize, and improve upon existing initialization schemes. We demonstrate both theoretically and empirically the accuracy of our approximation, and we provide evidence that the Hessian norm is a useful diagnostic tool that helps to rigorously derive and evaluate initialization schemes. Our experiments cover a variety of DNN architectures, including (i) shallow networks such as a simplified GoogLeNet, and (ii) deep networks such as EfficientNet and ResNet, evaluated on a range of image datasets, including FashionMNIST, CIFAR-10, SVHN, and Flowers.
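To make the variance-balancing heuristic referenced above concrete, the following is a minimal sketch of a Xavier/Glorot-style initializer, which picks the per-layer weight variance as 2/(fan_in + fan_out) to keep both forward activations and backward gradients stable; the function name and NumPy implementation are illustrative and not taken from the paper.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_in, fan_out) weight matrix uniformly on [-limit, limit].

    For U(-a, a) the variance is a^2 / 3, so choosing
    a = sqrt(6 / (fan_in + fan_out)) yields the target variance
    2 / (fan_in + fan_out), balancing the scale of forward signals
    and backward gradients across the layer.
    """
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: a 784 -> 256 fully connected layer.
W = glorot_uniform(784, 256)
print(W.shape)
print(W.var())  # close to 2 / (784 + 256) ~= 0.00192
```

The Hessian-norm criterion proposed in the paper goes beyond this first-order variance balancing, but schemes of the above form are the baselines it generalizes.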