Don’t Peek: Deep Learning without looking … at test data

What is the purpose of a theory? To explain why something works. Sure. But what good is a theory (i.e. VC theory) that is totally useless in practice? A good theory makes predictions.

Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks. Most notably, we observe that in all pre-trained models, the layer weight matrices display near-Universal power law behavior. That is, we can compute their eigenvalues and fit the empirical spectral density (ESD) to a power law form.

For a given N\times M weight matrix \mathbf{W}, we form the correlation matrix \mathbf{X}

\mathbf{X} = \frac{1}{N}\mathbf{W}^{T}\mathbf{W}


and then compute the M eigenvalues \lambda of \mathbf{X}

\mathbf{X}\mathbf{v} = \lambda\mathbf{v}

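The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the results in this post: the 1/N normalization, the function name `esd_eigenvalues`, the random test matrix, and the bin count are all assumptions chosen for the example.

```python
import numpy as np

def esd_eigenvalues(W):
    """Eigenvalues of X = W^T W / N for an N x M weight matrix W.

    Assumes the 1/N normalization; the function name is hypothetical.
    """
    W = np.asarray(W, dtype=np.float64)
    N, M = W.shape
    X = W.T @ W / N               # M x M symmetric correlation matrix
    return np.linalg.eigvalsh(X)  # M real eigenvalues, ascending order

# A random Gaussian matrix stands in for a real layer weight matrix here.
rng = np.random.default_rng(0)
W = rng.normal(size=(300, 100))
evals = esd_eigenvalues(W)

# The ESD is the normalized histogram of the eigenvalues.
density, bin_edges = np.histogram(evals, bins=50, density=True)
```

For a real model, W would be a layer's weight tensor converted to a NumPy array. A purely random matrix like the one above is not expected to show the heavy-tailed shape discussed in this post.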

We call the histogram of eigenvalues \rho_{emp}(\lambda) the Empirical Spectral Density (ESD). It can nearly always be fit to a power law

\rho_{emp}(\lambda) \sim \lambda^{-\alpha}

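One common way to estimate a power law exponent is the continuous maximum-likelihood estimator, \alpha = 1 + n / \sum_i \ln(\lambda_i / \lambda_{min}), applied to the tail \lambda_i \ge \lambda_{min}. The sketch below assumes the cutoff `xmin` is supplied by the caller; how to choose it is a separate problem, and this is not necessarily the fitting procedure used for the results in this post. The function name `fit_alpha` is hypothetical.

```python
import numpy as np

def fit_alpha(evals, xmin):
    """Continuous power-law MLE for the exponent alpha on the tail >= xmin.

    xmin is an assumed, caller-supplied cutoff (not estimated here).
    """
    tail = np.asarray(evals, dtype=np.float64)
    tail = tail[tail >= xmin]
    if tail.size < 2:
        raise ValueError("need at least two eigenvalues above xmin")
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

# Sanity check on synthetic data drawn from a known power law
# (inverse-transform sampling with alpha = 3, xmin = 1).
rng = np.random.default_rng(0)
alpha_true, xmin = 3.0, 1.0
u = rng.random(20000)
samples = xmin * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))
alpha_hat = fit_alpha(samples, xmin)  # should land close to alpha_true
```

On real ESDs the estimate can be sensitive to the choice of `xmin`, so it is worth checking the fit visually on a log-log plot.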

We call the Power Law Universal because 80-90% of the exponents \alpha lie in the range

2 \le \alpha \le 4


For fully connected layers, we just take \mathbf{W} as is. For Conv2D layers with shape (N,M,i,j), we consider each of the i\times j 2D feature maps, every one of shape N\times M, as a separate matrix. For any large, modern, pretrained DNN, this can give a large number of eigenvalues. The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but the results are very easy to reproduce with this notebook.

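The Conv2D slicing described above can be sketched as follows. The (N, M, i, j) layout matches the (out_channels, in_channels, kernel_height, kernel_width) shape of a PyTorch Conv2d weight tensor; the function name `conv2d_slices` and the dummy kernel are assumptions made for illustration.

```python
import numpy as np

def conv2d_slices(kernel):
    """Split an (N, M, i, j) Conv2D kernel into i*j matrices of shape (N, M)."""
    kernel = np.asarray(kernel)
    if kernel.ndim != 4:
        raise ValueError("expected a 4D kernel of shape (N, M, i, j)")
    N, M, kh, kw = kernel.shape
    # One N x M matrix per spatial position (a, b) of the kernel.
    return [kernel[:, :, a, b] for a in range(kh) for b in range(kw)]

# Dummy 3x3 kernel with 64 output and 32 input channels.
kernel = np.arange(64 * 32 * 3 * 3, dtype=np.float64).reshape(64, 32, 3, 3)
slices = conv2d_slices(kernel)  # 9 matrices, each 64 x 32
```

Each matrix in `slices` can then be treated exactly like a fully connected weight matrix when computing its ESD.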

As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4. Compared to the FC layers, however, the Conv2D layers show more exponents \alpha<2 . We will discuss the details and these results in a future paper. And while Universality is very theoretically interesting, a more practical question is:

Are power law exponents correlated with better generalization accuracies? … YES they are!

We can see this by looking at 2 or more versions of several pretrained models, available in PyTorch, including

  • Inception V3 vs V4
  • SqueezeNet V1.0 vs V1.1
  • The DenseNet models
  • The ResNext101 models
  • The sequence of (larger) Resnet models, including Resnet18, 34, 50, 101, & 152
  • 2 other  ResNet implementations, CaffeResnet101 and FbResnet152

To compare these model versions, we can simply compute the average power law exponent Avg(\alpha) , averaged across all FC weight matrices and Conv2D feature maps. This is similar to considering the product norm, which has been used to test VC-like bounds for small NNs. In nearly every case, smaller Avg(\alpha) is correlated with better test accuracy (i.e. generalization performance).

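As a rough sketch of this comparison, suppose the per-layer exponents have already been fitted for each model version. The helper below averages them and sorts the versions by Avg(\alpha), smallest first. The function name, the variant names, and the exponent values are placeholders invented for illustration; they are not measurements from any real model.

```python
from statistics import mean

def rank_by_avg_alpha(layer_alphas):
    """Return (name, Avg(alpha)) pairs sorted by ascending average exponent."""
    averages = {name: mean(alphas) for name, alphas in layer_alphas.items()}
    return sorted(averages.items(), key=lambda item: item[1])

# Placeholder exponents, invented for illustration only (not measured values).
candidates = {
    "variant_a": [2.1, 2.8, 3.4, 3.0],
    "variant_b": [2.0, 2.5, 3.1, 2.6],
}
ranking = rank_by_avg_alpha(candidates)
best_name, best_avg = ranking[0]  # the variant with the smallest Avg(alpha)
```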
The only significant caveats are:

  1. The VGG models behave very differently, showing exactly the reverse trend!
  2. The smaller ResNet models (ResNet10, 18, …) also show the reverse trend.

Predicting the test accuracy is a complicated task, and IMHO simple theories, with loose bounds, are unlikely to be useful in practice. Still, I think we are on the right track.

Let's first look at the DenseNet models.


Here, we see that as Test Accuracy increases, the average power law exponent generally decreases. And this is across 4 different models.

The Inception models show similar behavior: InceptionV3 has lower Test Accuracy than InceptionV4, and, likewise, the InceptionV3 Avg(\alpha) is larger than that of InceptionV4.

Now consider the Resnet models, which are increasing in size and have more architectural differences between them:

Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents. The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a slightly larger Avg(\alpha) than FbResnet152, but they are very close. Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller Avg(\alpha) across a wide range of architectures.

These results are easily reproduced with this notebook.

This is an amazing result!

You can think of the power law exponent as a kind of information metric: the smaller \alpha , the more information is in this layer weight matrix.

Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe that by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better, without peeking at the test data.

In addition to DenseNet, Inception, ResNext, SqueezeNet, and the (larger) ResNet models, even more positive results are available here on ~40 more DNNs across ~10 more architectures, including MeNet, ShuffleNet, DPN, PreResNet, DenseNet, SE-Resnet, SqueezeNet, MobileNet, MobileNetV2, and FDMobileNet.

I hope this is useful to you in training your own Deep Neural Networks. And I hope to get feedback from you on how useful it is in practice.
