Why Deep Learning Works: Self Regularization in Neural Networks
Presented Thursday, December 13, 2018
The slides are available on my slideshare.
The supporting tool, WeightWatcher, can be installed using:
pip install weightwatcher
DON’T PEEK: DEEP LEARNING WITHOUT LOOKING … AT TEST DATA
The idea… suppose we want to compare 2 or more deep neural networks (DNNs). Maybe we are comparing different architectures, or different hyperparameter settings for the same architecture.
Can we determine which DNN will generalize best–without peeking at the test data?
Theory actually suggests–yes we can!
We just need to measure the average log norm of the layer weight matrices, $\langle\log\|\mathbf{W}\|_{F}\rangle$,
where $\|\mathbf{W}\|_{F}$ is the Frobenius norm.
The Frobenius norm is just the square root of the sum of the squares of the matrix elements. For example, it is easily computed in numpy as
np.linalg.norm(W,ord='fro')
where ‘fro’ is the default norm.
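As a concrete sketch, the metric can be computed for a list of layer weight matrices with a few lines of numpy. The helper name `avg_log_norm` is our own, not part of any package, and the factor of 2 reflects the convention of reporting the log of the squared Frobenius norm:

```python
import numpy as np

def avg_log_norm(weight_matrices):
    """Average log (squared) Frobenius norm over a list of layer weight matrices.

    Hypothetical helper; the factor of 2 gives log ||W||_F^2 = 2 log ||W||_F.
    """
    logs = [2.0 * np.log10(np.linalg.norm(W, ord='fro')) for W in weight_matrices]
    return float(np.mean(logs))

# Example: a single 2x2 matrix of ones has Frobenius norm 2
result = avg_log_norm([np.ones((2, 2))])  # 2 * log10(2)
```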
It turns out that $\langle\log\|\mathbf{W}\|_{F}\rangle$ is amazingly correlated with the test accuracy of a DNN. How do we know? We can plot $\langle\log\|\mathbf{W}\|_{F}\rangle$ vs the reported test accuracy for the pretrained DNNs available in PyTorch. First, we look at the VGG models:
The plot shows the 4 VGG and VGG_BN models. Notice we do not need the ImageNet data to compute this; we simply compute the average log Norm and plot it against the (reported Top 5) Test Accuracy. For example, the orange dots show results for the pre-trained VGG13 and VGG13_BN ImageNet models. For each pair of models, the larger the Test Accuracy, the smaller $\langle\log\|\mathbf{W}\|_{F}\rangle$. Moreover, the correlation is nearly linear across the entire class of VGG models. We see similar behavior for the ResNet models…
Across 4 of the 5 pretrained ResNet models, with very different sizes, a smaller $\langle\log\|\mathbf{W}\|_{F}\rangle$ generally implies a better Test Accuracy.
It is not perfect–ResNet50 is an outlier–but it works amazingly well across numerous pretrained models, both in PyTorch and elsewhere (such as the OSMR sandbox). See the Appendix for more plots. What is more, notice that
the log Norm metric is completely Unsupervised
Recall that we have not peeked at the test data–or the labels. We simply computed $\langle\log\|\mathbf{W}\|_{F}\rangle$ for the pretrained models directly from their weight files, and then compared this to the reported test accuracy.
Imagine being able to fine tune a neural network without needing test data. Many times we barely have enough training data for fine tuning, and there is a huge risk of over-training. Every time you peek at the test data, you risk leaking information into the model, causing it to overtrain. It is my hope this simple but powerful idea will help avoid this and advance the field.
A recent paper by Google X and MIT shows that there is A Surprising Linear Relationship [that] Predicts Test Performance in Deep Networks. The idea is to compute a VC-like data dependent complexity metric, $\mathcal{C}$, based on the Product Norm of the layer weight matrices:

$\mathcal{C}\sim\|\mathbf{W}_{1}\|\|\mathbf{W}_{2}\|\cdots\|\mathbf{W}_{L}\|$
Usually we just take $\|\mathbf{W}\|$ as the Frobenius norm (but any p-norm may do).
If we take the log of both sides, we get the sum

$\log\mathcal{C}\sim\log\|\mathbf{W}_{1}\|+\log\|\mathbf{W}_{2}\|+\cdots+\log\|\mathbf{W}_{L}\|$
So here we just form the average log Frobenius Norm, $\langle\log\|\mathbf{W}\|_{F}\rangle$, as a measure of DNN complexity, as suggested by current ML theory.
And it seems to work remarkably well in practice.
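The product-to-sum step above is just the usual log identity; here is a tiny numerical check, with random matrices standing in for real layer weights:

```python
import numpy as np

np.random.seed(0)
layers = [np.random.normal(size=(50, 50)) for _ in range(3)]
norms = [np.linalg.norm(W, ord='fro') for W in layers]

log_of_product = np.log10(np.prod(norms))   # log of the product norm
sum_of_logs = np.sum(np.log10(norms))       # sum of the per-layer log norms

assert np.isclose(log_of_product, sum_of_logs)
```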
We can also understand this through our Theory of Heavy Tailed Implicit Self-Regularization in Deep Neural Networks.
The theory shows that each layer weight matrix of a (well trained) DNN resembles a random heavy tailed matrix, and we can associate with it a power law exponent $\alpha$:

$\rho(\lambda)\sim\lambda^{-\alpha}$
The exponent $\alpha$ characterizes how well the layer weight matrix represents the correlations in the training data. Smaller $\alpha$ is better.
Smaller exponents correspond to more implicit regularization, and, presumably, better generalization (if the DNN is not overtrained). This suggests that the average power law exponent would make a good overall unsupervised complexity metric for a DNN–and this is exactly what the last blog post showed.
The average power law metric is a weighted average,

$\hat{\alpha}=\sum_{l}b_{l}\alpha_{l}$

where the layer weight factor $b_{l}$ should depend on the scale of $\mathbf{W}_{l}$. In other words, ‘larger’ weight matrices (in some sense) should contribute more to the weighted average.
Smaller $\hat{\alpha}$ usually implies better generalization.
For heavy tailed matrices, we can work out a relation between the log Norm of $\mathbf{W}$ and the power law exponent $\alpha$:

$2\log\|\mathbf{W}\|_{F}\approx\alpha\log\lambda_{max}$

where we note that $\|\mathbf{W}\|_{F}^{2}=\sum_{i}\lambda_{i}$.
So the weight factor $b_{l}$ is simply the log of the maximum eigenvalue $\lambda_{max}$ associated with $\mathbf{W}_{l}$.
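A sketch of this weighted average follows. The helper name is our own, and the choice to normalize by the sum of the weights is our assumption (the paper may use a different normalization):

```python
import numpy as np

def weighted_alpha(alphas, lambda_maxes):
    """Weighted average of per-layer power law exponents.

    Each layer's exponent alpha_l is weighted by b_l = log10(lambda_max_l),
    so 'larger' layers contribute more. Hypothetical helper, not a library API.
    """
    b = np.log10(np.asarray(lambda_maxes, dtype=float))
    return float(np.sum(b * np.asarray(alphas, dtype=float)) / np.sum(b))

# Two layers: the layer with the larger lambda_max pulls the average toward its alpha
result = weighted_alpha([2.0, 4.0], [10.0, 100.0])
```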
In the paper we will show the math; below we present numerical results to convince the reader.
This also explains why Spectral Norm Regularization Improves the Generalizability of Deep Learning. A smaller $\lambda_{max}$ gives a smaller power law contribution, and, also, a smaller log Norm. We can now relate these 2 complexity metrics:

$\hat{\alpha}=\sum_{l}\alpha_{l}\log\lambda_{max,l}\approx\sum_{l}2\log\|\mathbf{W}_{l}\|_{F}$
We argue here that we can approximate the average Power Law metric $\hat{\alpha}$ by simply computing the average log Norm of the DNN layer weight matrices. And using this, we can actually predict the trends in generalization accuracy–without needing a test data set!
The Power Law metric is consistent with the recent theoretical results, but our approach and the intent is different:
But the biggest difference is that we apply our Unsupervised metric to large, production quality DNNs.
We believe this result will have broad applications in hyper-parameter tuning of DNNs. Because we do not need to peek at the test data, it may prevent information from leaking from the test set into the model, thereby helping to prevent overtraining and making fine-tuned DNNs more robust.
We have built a python package for Jupyter Notebooks that does this for you–WeightWatcher. It works on Keras and PyTorch. We will release it shortly.
Please stay tuned! And please subscribe if this is useful to you.
We use the OSMR Sandbox to compute the average log Norm for a wide variety of DNN models, using PyTorch, and compare to the reported Top 1 Errors. This notebook reproduces the results.
All the ResNet Models
DenseNet
SqueezeNet
DPN
In the plot below, we generate a number of heavy tailed matrices, and fit their ESD to a power law. Then we compare the ratio $2\log\|\mathbf{W}\|_{F}/\log\lambda_{max}$ to the fitted power law exponent $\alpha$.
The code for this is:
import numpy as np
import powerlaw

N, M, mu = 100, 100, 2.0
W = np.random.pareto(a=mu, size=(N, M))

# log of the squared Frobenius norm
normW = np.linalg.norm(W)
logNorm2 = 2.0 * np.log10(normW)

# eigenvalues of the correlation matrix X = W^T W / N
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
l_max, l_min = np.max(evals), np.min(evals)

# fit the ESD to a power law
fit = powerlaw.Fit(evals)
alpha = fit.alpha

# ratio to compare against the fitted exponent alpha
ratio = logNorm2 / np.log10(l_max)
Below are results for a variety of heavy tailed random matrices:
The plot shows the relation between the ratios $2\log\|\mathbf{W}\|_{F}/\log\lambda_{max}$ and the empirical power law exponents $\alpha$. The most striking feature is the near-linear relation between the two.
In our next paper, we will drill into these details and explain further how this relation arises and the implications for Why Deep Learning Works.
My recent talk at the French Tech Hub Startup Accelerator
Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks. Most notably, we observe that in all pre-trained models, the layer weight matrices display near Universal power law behavior. That is, we can compute their eigenvalues, and fit the empirical spectral density (ESD) to a power law form:

$\rho(\lambda)\sim\lambda^{-\alpha}$
For a given weight matrix $\mathbf{W}$, we form the correlation matrix $\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$,
and then compute the M eigenvalues $\lambda_{i}$ of $\mathbf{X}$.
We call the histogram of eigenvalues the Empirical Spectral Density (ESD). It can nearly always be fit to a power law, $\rho(\lambda)\sim\lambda^{-\alpha}$.
We call the Power Law Universal because 80-90% of the exponents lie in the range $2\leq\alpha\leq 4$.
For fully connected layers, we just take $\mathbf{W}$ as is. For Conv2D layers with shape $(N,M,i,j)$, we consider all $i\times j$ 2D feature maps of shape $N\times M$. For any large, modern, pretrained DNN, this can give a large number of eigenvalues. The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but the results are very easy to reproduce with this notebook.
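A minimal sketch of the Conv2D extraction, assuming the $(N,M,i,j)$ index ordering above. (Real PyTorch Conv2D weights are stored as (out_channels, in_channels, kernel_h, kernel_w), so in practice the axes may need permuting; the helper name is our own.)

```python
import numpy as np

def conv2d_slices(W4):
    """Split a 4-index Conv2D weight tensor of shape (N, M, i, j)
    into i*j rectangular 2D matrices of shape (N, M)."""
    N, M, i, j = W4.shape
    return [W4[:, :, a, b] for a in range(i) for b in range(j)]

# e.g. a (64, 3, 3, 3) conv tensor yields 9 matrices of shape (64, 3)
slices = conv2d_slices(np.random.normal(size=(64, 3, 3, 3)))
```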
As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4. Although, compared to the FC layers, for the Conv2D layers we do see more exponents $\alpha>4$. We will discuss these results in detail in a future paper. And while Universality is very theoretically interesting, a more practical question is:
Are power law exponents correlated with better generalization accuracies? … YES they are!
We can see this by looking at 2 or more versions of several pretrained models available in PyTorch, including DenseNet, Inception, and ResNet.
To compare these model versions, we can simply compute the average power law exponent $\langle\alpha\rangle$, averaged across all FC weight matrices and Conv2D feature maps. This is similar in spirit to the product norm, which has been used to test VC-like bounds for small NNs. In nearly every case, a smaller $\langle\alpha\rangle$ is correlated with better test accuracy (i.e. generalization performance).
The only significant caveats are:
Predicting the test accuracy is a complicated task, and IMHO simple theories, with loose bounds, are unlikely to be useful in practice. Still, I think we are on the right track.
Let’s first look at the DenseNet models.
Here, we see that as Test Accuracy increases, the average power law exponent $\langle\alpha\rangle$ generally decreases. And this is across 4 different models.
The Inception models show similar behavior: InceptionV3 has a smaller Test Accuracy than InceptionV4, and, likewise, $\langle\alpha\rangle$ for InceptionV3 is larger than for InceptionV4.
Now consider the Resnet models, which are increasing in size and have more architectural differences between them:
Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents $\langle\alpha\rangle$. The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a slightly larger $\langle\alpha\rangle$ than FbResnet152, but they are very close. Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller $\langle\alpha\rangle$ across a wide range of architectures.
These results are easily reproduced with this notebook.
This is an amazing result!
You can think of the power law exponent $\alpha$ as a kind of information metric–the smaller $\alpha$, the more information is in the layer weight matrix.
Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better–without peeking at the test data.
In addition to DenseNet, Inception, ResNext, SqueezeNet, and the (larger) ResNet models, even more positive results are available here on ~40 more DNNs across ~10 more architectures, including MeNet, ShuffleNet, DPN, PreResNet, SE-Resnet, MobileNet, MobileNetV2, and FDMobileNet.
I hope it is useful to you in training your own Deep Neural Networks. And I hope to get feedback from you as to how useful this is in practice.
One broad question we can ask is:
How is information concentrated in Deep Neural Networks (DNNs)?
To get a handle on this, we can run ‘experiments’ on the pre-trained DNNs available in PyTorch.
In a previous post, we formed the Singular Value Decomposition (SVD) of the weight matrices of the linear, or fully connected (FC), layers. And we saw that nearly all the FC Layers display Power Law behavior. In fact, this behavior is Universal across both ImageNet and NLP models.
But this is only part of the story. Here, we ask a related question: do the weight matrices of well trained DNNs lose Rank?
Let’s say $\mathbf{W}$ is an $N\times M$ matrix. We can form the Singular Value Decomposition (SVD):

$\mathbf{W}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$

The Matrix Rank, or Hard Rank, is simply the number of non-zero singular values $\sigma_{i}$,
which expresses the decrease from the Full Rank M.
Notice the Hard Rank of the rectangular matrix $\mathbf{W}$ is at most M, the dimension of the square correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$.
In python, this can be computed using
rank = numpy.linalg.matrix_rank(W)
Of course, being a numerical method, we really mean the number of singular values above some tolerance, and we can get different results depending on the tolerance we use.
See the numpy documentation on matrix_rank for details.
Here, we will compute the rank ourselves, and use an extremely loose bound, considering any singular value above a tiny numerical tolerance to be non-zero. As we shall see, DNNs are so good at concentrating information that it will not matter.
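A sketch of the do-it-yourself rank count, with the loose tolerance made explicit (the helper name and the tolerance value are our own choices):

```python
import numpy as np

def hard_rank(W, tol=1e-10):
    """Number of singular values above a (very loose) tolerance."""
    svals = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(svals > tol))

r_full = hard_rank(np.eye(3))        # 3: full rank
r_ones = hard_rank(np.ones((3, 3)))  # 1: rank collapse
```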
If all the singular values $\sigma_{i}$ are non-zero, we say $\mathbf{W}$ is Full Rank. If one or more $\sigma_{i}=0$, then we say $\mathbf{W}$ is Singular. It has lost expressiveness, and the model has undergone Rank Collapse.
When a model undergoes Rank Collapse, it traditionally needs to be regularized. Say we are solving a simple linear system of equations / linear regression, $\mathbf{A}\mathbf{x}=\mathbf{b}$.
The simple solution is to use a little linear algebra to get the optimal values for the unknown $\mathbf{x}$:

$\mathbf{x}=(\mathbf{A}^{T}\mathbf{A})^{-1}\mathbf{A}^{T}\mathbf{b}$

But when $\mathbf{A}^{T}\mathbf{A}$ is Singular, we can not form the matrix inverse. To fix this, we simply add some small constant $\epsilon$ to the diagonal of $\mathbf{A}^{T}\mathbf{A}$,
so that all the singular values will now be greater than zero, and we can form a generalized pseudo-inverse, called the Moore-Penrose Inverse:

$\mathbf{x}=(\mathbf{A}^{T}\mathbf{A}+\epsilon\mathbf{I})^{-1}\mathbf{A}^{T}\mathbf{b}$
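A numerical sketch of this regularized solve (the helper name and the default size of the small diagonal constant are our own choices):

```python
import numpy as np

def tikhonov_solve(A, b, eps=1e-6):
    """Solve the regularized normal equations (A^T A + eps*I) x = A^T b."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + eps * np.eye(n), A.T @ b)

# A is singular: (A^T A)^-1 does not exist, but the regularized solve works.
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([2.0, 2.0])
x = tikhonov_solve(A, b)   # close to [1, 1]
```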
This procedure is also called Tikhonov Regularization. The constant $\epsilon$, or Regularizer, sets the Noise Scale for the model. The information in $\mathbf{A}$ is concentrated in the singular vectors associated with the larger singular values $\sigma_{i}$, and the noise is left over in those associated with the smaller singular values.
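This signal/noise split can be made concrete with a hard SVD truncation: keep the top-k singular components, discard the rest. A sketch (`truncate_svd` is our own name):

```python
import numpy as np

def truncate_svd(W, k):
    """Keep the top-k singular components of W (the 'signal');
    discard the rest (the 'noise')."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# A rank-1 matrix is perfectly reconstructed from its top singular component.
W = np.outer([1.0, 2.0], [3.0, 4.0])
W1 = truncate_svd(W, 1)
```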
In cases where is Singular, regularization is absolutely necessary. But even when it is not singular, Regularization can be useful in traditional machine learning. (Indeed, VC theory tells us that Regularization is a first class concept)
But we know that Understanding deep learning requires rethinking generalization. Which leads to the question:
Do the weight matrices of well trained DNNs undergo Rank Collapse ?
Answer: They DO NOT — as we now see:
We can easily examine the numerous pre-trained models available in PyTorch. We simply need to get the layer weight matrices and compute the SVD. We then compute the minimum singular value $\sigma_{min}$ of each, and histogram these minimums across different models.
for im, m in enumerate(model.modules()):
    if isinstance(m, torch.nn.Linear):
        W = np.array(m.weight.data.clone().cpu())
        M, N = np.min(W.shape), np.max(W.shape)
        _, svals, _ = np.linalg.svd(W)
        minsval = np.min(svals)
        ...
We do this here for numerous models trained on ImageNet and available in PyTorch, such as AlexNet, VGG16, VGG19, ResNet, DenseNet201, etc.–as shown in this Jupyter Notebook.
We also examine the NLP models available in AllenNLP. This is a little bit trickier; we have to install AllenNLP from source, then create an analyze.py command class, and rebuild AllenNLP. Then, to analyze, say, the AllenNLP pre-trained NER model, we run
allennlp analyze https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.04.26.tar.gz
This prints out the ranks (and other information, like power law fits), and then plots the results. The code for all this is here.
Notice that many of the AllenNLP models include Attention matrices, which can be quite large and very rectangular (i.e. $N\gg M$), as compared to the smaller (and less rectangular) weight matrices used in the ImageNet models.
Note: We restrict our analysis to rectangular layer weight matrices with an aspect ratio $Q=N/M>1$, and really larger than 1.1. This is because Marchenko Pastur (MP) Random Matrix Theory (RMT) applies cleanly only when $Q>1$. We will review this in a future blog.
For the ImageNet models, most fully connected (FC) weight matrices have a large minimum singular value $\sigma_{min}$. Only 6 of the 24 matrices looked at have a $\sigma_{min}$ near zero–and we have not carefully tested the numerical threshold–we are just eyeballing it here.
For the AllenNLP models, none of the FC matrices show any evidence of Rank Collapse. All of the singular values for every linear weight matrix are non-zero.
It is conjectured that fully optimized DNNs–those with the best generalization accuracy–will not show Rank Collapse in any of their linear weight matrices.
If you are training your own model and you see Rank Collapse, you are probably over-regularizing.
It is, in fact, very easy to induce Rank Collapse. We can do this in a Mini version of AlexNet, coded in Keras 2, and available here.
To induce rank collapse in our FC weight matrices, we can add large weight norm constraints to the FC1 linear layer, using the kernel_regularizer:
...
model.add(Dense(384,
                kernel_initializer='glorot_normal',
                bias_initializer=Constant(0.1),
                activation='relu',
                kernel_regularizer=l2(...)))
...
We train this smaller MiniAlexnet model on CIFAR10 for 20 epochs, save the final weight matrix, and plot a histogram of the eigenvalues of the weight correlation matrix $\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$.
We call this histogram $\rho(\lambda)$ the Empirical Spectral Density (ESD). Recall that the eigenvalues are simply the squares of the singular values, $\lambda_{i}=\sigma_{i}^{2}$ (up to the $1/N$ normalization).
Here is what happens to the ESD when we turn up the amount of L2 Regularization from 0.0003 to 0.0005. The minimum eigenvalue $\lambda_{min}$ decreases from 0.0414 to 0.008.
As we increase the weight norm constraints, the minimum eigenvalue $\lambda_{min}$ approaches zero.
Note that adding too much regularization causes nearly all of the eigenvalues/singular values to collapse to zero–as well as the norm of the matrix.
We conjecture that when a DNN layer has zero singular/eigenvalues, it is because there is too much regularization on the layer.
And that…
Fully optimized Deep Neural Networks do not have Rank Collapse
We believe this is a unique property of DNNs, and related to how Regularization works in these models. We will discuss this and more in an upcoming paper:
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
by Charles H. Martin (Calculation Consulting) and Michael W. Mahoney (UC Berkeley).
And presented at UC Berkeley this Monday at the Simons Institute
and see our long form paper
Please stay tuned and subscribe to this blog for more updates
Here we are just looking at the distribution of $\sigma_{min}$ to estimate the rank loss. We could be more precise.
In the numpy.linalg.matrix_rank() function, “By default, we identify singular values less than S.max() * max(M.shape) * eps as indicating rank deficiency” when using the SVD.
But there is some ambiguity here as well, since Numerical Recipes uses a different default. I will leave it up to the reader to select the best rank loss metric and explore further. And I would be very interested in your findings.
We have computed the minimum singular value $\sigma_{min}$ for all the Conv2D layers in the ImageNet models deployed with PyTorch. This covers ~7500 layers across ~40 different models.
Very generously, we can say there is rank collapse when $\sigma_{min}$ is very small. Only 10%-13% of the layers show any form of rank collapse, using this simple heuristic, as easily seen on a log histogram.
In a previous post, we saw that the Fully Connected (FC) layers of the most common pre-trained Deep Learning models display power law behavior. Specifically, for each FC weight matrix $\mathbf{W}$, we compute the eigenvalues $\lambda_{i}$ of the correlation matrix $\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$.
For every FC matrix, the eigenvalue frequencies, or Empirical Spectral Density (ESD), can be fit to a power law

$\rho(\lambda)\sim\lambda^{-\alpha}$

where the exponents nearly all lie in the range $2\leq\alpha\leq 4$.
Remarkably, the FC matrices all lie within the Universality Class of Fat Tailed Random Matrices!
We define a random matrix by specifying a matrix of size $N\times M$, and drawing the matrix elements from a random distribution. We can choose a Gaussian (normal) distribution,
or a heavy tailed distribution, such as a Pareto.
In either case, Random Matrix Theory tells us what the asymptotic form of the ESD should look like. But first, let’s see what model works best.
First, let’s look at the ESD for AlexNet for layer FC3, and zoomed in:
Recall that AlexNet FC3 fits a power law with an exponent $\alpha$ in the Fat-Tailed range, so we also plot the ESD on a log-log scale.
Notice that the distribution is linear in the central region, and the long tail cuts off sharply. This is typical of the ESDs for the fully connected (FC) layers of all the pretrained models we have looked at so far. We now ask…
What kind of Random Matrix would make a good model for this ESD ?
We first generate a few Gaussian Random matrices (mean 0, variance 1), for different aspect ratios Q, and plot the histogram of their eigenvalues.
import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 500
Q = N / M
W = np.random.normal(0, 1, size=(N, M))

# X = W^T W / N has shape M x M
X = (1 / N) * np.dot(W.T, W)
evals = np.linalg.eigvals(X)

plt.hist(evals, bins=100, density=True)
Notice that the shape of the ESD depends only on Q, and is tightly bounded; there is, in fact, effectively no tail at all to the distributions (except, perhaps, misleadingly for Q=1)
We can generate a heavy, or fat-tailed, random matrix just as easily using the numpy Pareto function:
W=np.random.pareto(mu,size=(N,M))
Heavy Tailed Random matrices have very different ESDs. They have very long tails–so long, in fact, that it is better to plot them on a log-log histogram.
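A sketch of generating such an ESD and binning it on a log scale (the specific sizes, exponent, and seed here are arbitrary choices of ours):

```python
import numpy as np

np.random.seed(0)
N, M, mu = 1000, 500, 1.0
W = np.random.pareto(a=mu, size=(N, M))

# M x M correlation matrix and its (real) eigenvalues
X = np.dot(W.T, W) / N
evals = np.linalg.eigvalsh(X)

# bin the eigenvalues on a log10 scale to see the long tail
log_evals = np.log10(evals[evals > 1e-12])
counts, edges = np.histogram(log_evals, bins=100)
```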
Do any of these look like a plausible model for the ESDs of the weight matrices of a big DNN, like AlexNet ?
Let’s overlay the ESD of a fat-tailed $\mathbf{W}$ with the actual empirical ESD from AlexNet for layer FC3.
We see a pretty good match to a Fat-tailed random matrix.
Turns out, there is something very special about exponents being in the range 2-4.
Random Matrix Theory predicts the shape of the ESD, in the asymptotic limit, for several kinds of Random Matrices, called Universality Classes. The 3 different ranges of the tail parameter $\mu$ each represent a different Universality Class:
In particular, if we draw the elements of $\mathbf{W}$ from any heavy tailed / power law distribution, the empirical (i.e. finite size) eigenvalue density is likewise a power law (PL), either globally, or at least locally.
What is more, the predicted ESDs have different, characteristic global and local shapes, for specific ranges of $\mu$. And the amazing thing is that
the ESDs of the fully connected (FC) layers of pretrained DNNs all resemble the ESDs of the Fat-Tailed Universality Classes of Random Matrix Theory
But this is a little tricky to show, because we need to relate the empirically fit exponents $\alpha$ to the theoretical exponents $\mu$. We now look at the relation between them.
RMT tells us that, for $\mu<2$, the ESD takes the limiting form

$\rho(\lambda)\sim\lambda^{-(\frac{\mu}{2}+1)}$, where $\alpha=\frac{\mu}{2}+1$.

And this works pretty well in practice for the Heavy Tailed Universality Class, for $\mu<2$. But for any finite matrix, as soon as $\mu>2$, the finite size effects kick in, and we can not naively apply the infinite limit result.
RMT not only tells us about the shape of the ESD; it also makes statements about the statistics of the edge and/or tails–the fluctuations in the maximum eigenvalue $\lambda_{max}$. Specifically:
For standard, Gaussian RMT, $\lambda_{max}$ (near the bulk edge) is governed by the famous Tracy Widom Law. And, more generally, RMT is governed by Tao’s Four Moment Theorem.
But for $\mu<4$, the tail fluctuations follow Frechet statistics, and the maximum eigenvalue $\lambda_{max}$ has Power Law finite size effects.
In particular, the effects of M and Q kick in as soon as $\mu>2$. If we undersample the tail (small Q, large M), the power law will look weaker, and we will overestimate $\alpha$ in our fits.
And, for us, this affects how we estimate $\mu$ from $\alpha$ and assign the Universality Class.
Here, we generate ESDs for 3 different Pareto Heavy tailed random matrices, with fixed M (left) or N (right), but different Q. We fit each ESD to a Power Law. We then plot the fitted $\alpha$ against $\mu$.
The red lines are predicted by Heavy Tailed RMT (MP) theory, which works well for Heavy Tailed ESDs with $\mu<2$. For Fat Tails, with $2<\mu<4$, the finite size effects are difficult to interpret. The main take-away is…
We can identify finite size matrices $\mathbf{W}$ that behave like the Fat Tailed Universality Class of RMT ($2<\mu<4$) with Power Law fits, even with exponents $\alpha$ ranging up to 4 (and even up to 5-6).
It is amazing that Deep Neural Networks display this Universality in their weight matrices, and this suggests some deeper reason for Why Deep Learning Works.
In statistical physics, if a system displays a Power Law, this can be evidence that it is operating near a critical point. It is known that real, spiking neurons display this behavior, called Self Organized Criticality.
It appears that Deep Neural Networks may be operating under similar principles, and in future work, we will examine this relation in more detail.
The code for this post is in this github repo on ImplicitSelfRegularization
In pretrained, production quality DNNs, the weight matrices for the Fully Connected (FC ) layers display Fat Tailed Power Law behavior.
Deep Neural Networks (DNNs) minimize the Energy function defined by their architecture. We define the layer weight matrices $\mathbf{W}_{l}$, biases $\mathbf{b}_{l}$, and activation functions $h_{l}(\cdot)$, giving

$E_{DNN}=h_{L}(\mathbf{W}_{L}\times h_{L-1}(\mathbf{W}_{L-1}\times\cdots+\mathbf{b}_{L-1})+\mathbf{b}_{L})$

We train the DNN on a labeled data set $(d,y)$, giving the optimization problem

$\min_{\mathbf{W}_{l},\mathbf{b}_{l}}\sum_{i}\mathcal{L}(y_{i},E_{DNN}(d_{i}))$
We call this the Energy Landscape because the DNN optimization problem is only parameterized by the weights and biases. Of course, in any real DNN problem, we do have other adjustable parameters, such as the amount of Dropout, the learning rate, the batch size, etc. But these regularization effects simply change the global Energy Landscape.
The Energy Landscape function changes on each epoch–and we do care about how. In fact, I have argued that it must form an Energy Funnel:
But here, for now, we only look at the final result. Once a DNN is trained, what is left are the weights (and biases). We can reuse the weights (and biases) of a pre-trained DNN to build new DNNs with transfer learning. And if we train a bunch of DNNs, we want to know which one is better?
But, practically, we would really like to identify a very good DNN without peeking at the test data, since every time we peek, and retrain, we risk overtraining our DNN.
I now show we can at least start to do this by looking at the weight matrices themselves. So let us look at the weights of some pre-trained DNNs.
Pytorch comes with several pretrained models, such as AlexNet. To start, we just examine the weight matrices of the Linear / Fully Connected (FC) layers.
pretrained_model = models.alexnet(pretrained=True)
for module in pretrained_model.modules():
    if isinstance(module, nn.Linear):
        ...
The Linear layers have the simplest weight matrices $\mathbf{W}$; they are 2-dimensional tensors, or just rectangular matrices.
Let $\mathbf{W}$ be an $M\times N$ matrix, where $M\leq N$. We can get the matrix from the pretrained model using:
W = np.array(module.weight.data.clone().cpu())
M, N = np.min(W.shape), np.max(W.shape)
How is information concentrated in $\mathbf{W}$? For any rectangular matrix, we can form the Singular Value Decomposition (SVD):

$\mathbf{W}=\mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T}$
which is readily computed in scikit-learn. We will use the faster TruncatedSVD method, and compute the singular values $\sigma_{i}$:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=M-1, n_iter=7, random_state=10)
svd.fit(W)
svals = svd.singular_values_
(Technically, we do miss the smallest singular value doing this, but that’s ok. It won’t matter here, and we can always use the pure svd method to be exact.)
We can, alternatively, form the eigenvalues $\lambda_{i}$ of the correlation matrix

$\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$
The eigenvalues are just the square of the singular values.
Notice here we normalize them by N.
evals = (1/N)*svals*svals
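A quick numerical check that the two routes agree. (We use an $M\times N$ matrix with $M\leq N$, and form the $M\times M$ version of the correlation matrix, $\mathbf{X}=\frac{1}{N}\mathbf{W}\mathbf{W}^{T}$, which shares its nonzero eigenvalues with $\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$; the shapes and seed are arbitrary.)

```python
import numpy as np

np.random.seed(0)
M, N = 10, 20
W = np.random.normal(size=(M, N))

# route 1: singular values of W, squared and normalized by N
svals = np.linalg.svd(W, compute_uv=False)
evals_from_svd = np.sort(svals**2 / N)

# route 2: eigenvalues of the (M x M) correlation matrix
X = np.dot(W, W.T) / N
evals = np.sort(np.linalg.eigvalsh(X))

assert np.allclose(evals, evals_from_svd)
```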
We now form the Empirical Spectral Density (ESD), which is, formally,

$\rho(\lambda)=\frac{1}{M}\sum_{i=1}^{M}\delta(\lambda-\lambda_{i})$

This notation just means: compute a histogram of the eigenvalues.
import matplotlib.pyplot as plt

plt.hist(evals, bins=100, density=True)
We could also compute the spectral density using a Kernel Density Estimator (KDE); we save this for a future post.
We now look at the ESD of AlexNet.
Here, we examine just FC3, the last Linear layer, connecting the model to the labels. The other linear layers, FC1 and FC2, look similar. Below is a histogram of the ESD. Notice it is very sharply peaked and has a long tail, extending out past 40.
We can get a better view of the heavy tailed behavior by zooming in.
The red curve is a fit of the ESD to the Marchenko Pastur (MP) Random Matrix Theory (RMT) result–it is not a very good fit. This means the ESD does not resemble that of a Gaussian Random matrix. Instead, it looks heavy tailed. Which leads to the question…
(Yes it does, as we shall see…)
Physicists love to claim they have discovered data that follows a power law:

$\rho(\lambda)\sim\lambda^{-\alpha}$
But this is harder to do than it seems. And statisticians love to point this out. Don’t be fooled–we physicists knew this; Sornette’s book has a whole chapter on it. Still, we have to use best practices.
The first thing to do: plot the data on a log-log histogram, and check that this plot is linear–at least in some region. Let’s look at our ESD for AlexNet FC3:
Yes, it is linear–in the central region, for eigenvalue frequencies between roughly ~1 and ~100–and that is most of the distribution.
Why is it not linear everywhere? Because the data is finite size–there are min and max cutoffs. In the infinite limit, a power law diverges at $\lambda=0$, and the tail extends indefinitely as $\lambda\to\infty$. In any finite size data set, there will be a $\lambda_{min}$ and a $\lambda_{max}$.
Second, fit the data to a power law, with $\lambda_{min}$ and $\lambda_{max}$ in mind. The most widely available and accepted method is the Maximum Likelihood Estimator (MLE), developed by Clauset et al., and available in the python powerlaw package.
import powerlaw

fit = powerlaw.Fit(evals, xmax=np.max(evals))
alpha, D = fit.alpha, fit.D
The D value is a quality metric, the KS (Kolmogorov-Smirnov) distance. There are other options as well. The smaller D, the better. The table below shows typical values of good fits.
The powerlaw package also makes some great plots. Below is a log-log plot generated for our fit of FC3, for the central region of the ESD. The filled lines represent our fits, and the dotted lines are the actual power law PDF (blue) and CCDF (red). The filled lines look like straight lines and overlap the dotted lines–so this fit looks pretty good.
Is this enough ? Not yet…
We still need to know: do we have enough data to get a good estimate for $\alpha$? What are our error bars? And what kind of systematic errors might we get?
We can calibrate the estimator by generating some modest size (N=1000) random power law datasets using the numpy Pareto function, with known exponents $\alpha$,
and then fitting these with the PowerLaw package. We get the following curve
The green line is a perfect estimate. The Powerlaw package overestimates small $\alpha$ and underestimates large $\alpha$. Fortunately, most of our fits lie in the good range.
A good fit is not enough. We also should ensure that no other obvious model is a better fit. The powerlaw package lets us test our fit against other common (long tailed) choices, namely the lognormal, exponential, and truncated power law distributions.
For example, to check if our data is better fit by a log normal distribution, we run
R, p = fit.distribution_compare('powerlaw', 'lognormal', normalized_ratio=True)
giving R and the p-value. If R>0 and p <= 0.05, then we can conclude that a power law is the better model.
Note that sometimes, for larger exponents, the best model may be a truncated power law (TPL). This happens because our data sets are pretty small, and the tails of our ESDs fall off pretty fast. A TPL is just a power law (PL) with an exponentially decaying tail, and it may seem to be a better fit for small, fixed size data sets. Also, the TPL has 2 parameters, whereas the PL has only 1, so it is not unexpected that the TPL would fit the data better.
Below we fit the linear layers of the many pretrained models in PyTorch. All fit a power law (PL) or truncated power law (TPL).
The table below lists all the fits, with the Best Fit and the KS statistic D.
Notice that all the power law exponents lie in the range 2-4, except for a couple outliers.
This is pretty fortunate, since our PL estimator only works well around this range. This is also pretty remarkable because it suggests there is…
There is a deep connection between this range of exponents and some (relatively) recent results in Random Matrix Theory (RMT). Indeed, this seems to suggest that Deep Learning systems display the kind of Universality seen in Self Organized systems, like real spiking neurons. We will examine this in a future post.
Training DNNs is hard. There are many tricks one has to use. And you have to monitor the training process carefully.
Here we suggest that the ESDs of the Linear / FC layers of a well trained DNN will display power law behavior. And the exponents will be between 2 and 4.
A counter example is Inception V3. Both FC layers 226 and 302 have unusually large exponents. Looking closer at Layer 222, the ESD is not a power law at all, but rather a bi-modal, heavy tailed distribution.
We conjecture that as good as Inception V3 is, perhaps it could be further optimized. It would be interesting to see if anyone can show this.
The commonly accepted method uses the Maximum Likelihood Estimator (MLE).
The conditional probability of the data, given the exponent $\alpha$, is defined by evaluating the power law density $p(x)=\frac{\alpha-1}{x_{min}}\left(\frac{x}{x_{min}}\right)^{-\alpha}$ on a data set (of size n):

$P(x|\alpha)=\prod_{i=1}^{n}\frac{\alpha-1}{x_{min}}\left(\frac{x_{i}}{x_{min}}\right)^{-\alpha}$

The log likelihood is

$\mathcal{L}=\ln P(x|\alpha)$

which is easily reduced to

$\mathcal{L}=n\ln(\alpha-1)-n\ln x_{min}-\alpha\sum_{i=1}^{n}\ln\frac{x_{i}}{x_{min}}$

The maximum likelihood occurs when $\frac{\partial\mathcal{L}}{\partial\alpha}=0$.
The MLE estimate, for a given $x_{min}$, is:

$\hat{\alpha}=1+n\left[\sum_{i=1}^{n}\ln\frac{x_{i}}{x_{min}}\right]^{-1}$
We can either input $x_{min}$, or search for the best possible estimate (as explained in the Clauset et al. paper).
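The estimator above is a one-liner; here is a sketch, with a sanity check on synthetic Pareto data. (Recall that numpy's pareto with parameter $\mu$, shifted by 1, has density $\sim x^{-(\mu+1)}$ on $[1,\infty)$, so the true exponent is $\mu+1$; the helper name is our own.)

```python
import numpy as np

def mle_alpha(x, xmin):
    """Clauset et al. continuous MLE: alpha = 1 + n / sum(ln(x_i / xmin))."""
    x = np.asarray(x, dtype=float)
    x = x[x >= xmin]
    return 1.0 + len(x) / np.sum(np.log(x / xmin))

np.random.seed(42)
mu = 2.0
x = 1.0 + np.random.pareto(a=mu, size=100_000)  # density ~ x^-(mu+1) on [1, inf)
alpha_hat = mle_alpha(x, xmin=1.0)              # should be close to mu + 1 = 3
```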
Notice, however, that this estimator does not explicitly take $x_{max}$ into account. And for this reason it seems to be limited in its application. A better method, such as one developed recently by Thurner (but not yet coded in python), may prove more robust for a larger range of exponents.
Similar results arise in the Linear and Attention layers in NLP models. Below we see the power law fits for 85 Linear layers in the 6 pretrained models from AllenNLP.
80% of the linear layers have power law exponents in the range 2-4.
We can naively extend these results to Conv2D layers by extracting the 2D weight matrices directly from the 4-index Conv2D tensors, giving several rectangular 2D matrices per layer. Doing this, and repeating the fits, we find that 90% of the Conv2D matrices fit a power law with exponents in a similar range.
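A minimal sketch of this unfolding, assuming the PyTorch shape convention (out_channels, in_channels, kh, kw); the function name is hypothetical.

```python
import numpy as np

def conv2d_to_matrices(T):
    """Unfold a 4-index Conv2D weight tensor into 2D matrices.

    Assumes the PyTorch convention (out_channels, in_channels, kh, kw);
    returns kh * kw rectangular matrices of shape (out_channels, in_channels),
    one per spatial position of the kernel.  Each can then be fit separately.
    """
    out_c, in_c, kh, kw = T.shape
    return [T[:, :, i, j] for i in range(kh) for j in range(kw)]

# e.g. a (64, 3, 3, 3) conv kernel yields nine 64 x 3 matrices
mats = conv2d_to_matrices(np.zeros((64, 3, 3, 3)))
```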
A few of the Conv2D layers show very high exponents. What could this mean?
It will be very interesting to dig into these results and determine if the Power Law exponents display Universality across layers, architectures, and data sets.
Even more pretrained models are now available in PyTorch…45 models and counting. I have analyzed as many as possible, giving 7500 layer weight matrices (including the Conv2D slices!)
The results are conclusive: ~80% of the exponents display Universality. An amazing find. With such great results, soon we will release an open source package so you can analyze your own models.
Code for this study, and future work, is available in this repo.
Research talks are available on my youtube channel. Please Subscribe
An early talk describing details in this paper
Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly-contradictory aspects of deep neural networks (DNNs). We apply RMT to several well known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small, toy models.
We show that the DNN training process itself implicitly implements a form of self-regularization, associated with entropy collapse / the information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization,
whereas large, modern deep networks display a new kind of heavy tailed self-regularization.
We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training.
Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomena unique to DNNs.
We argue that this heavy tailed self-regularization has both practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN Energy landscape / optimization problem.
Last year, ICLR 2017, a very interesting paper came out of Google claiming that
Understanding Deep Learning Requires Rethinking Generalization
which basically asks: why doesn’t VC theory apply to Deep Learning ?
In response, Michael Mahoney, of UC Berkeley, and I, published a response that says
Rethinking generalization requires revisiting old ideas [from] statistical mechanics …
In this post, I discuss part of our argument, looking at the basic ideas of Statistical Learning Theory (SLT) and how they actually compare to what we do in practice. In a future post, I will describe the arguments from Statistical Mechanics (Stat Mech), and why they provide a better, albeit more complicated, theory.
Our paper is a formal discussion of my Quora post on the subject
Of course, it would be nice to have the original code so the results could be reproduced–no such luck. The code is owned by Google–hence the name ‘Google paper’
And Chiyuan Zhang, one of the authors, has kindly put together a PyTorch github repo–but it is old PyTorch and does not run. There is, however, a nice Keras package that does the trick.
In the traditional theory of supervised machine learning (i.e. VC, PAC), we try to understand what are the sufficient conditions for an algorithm to learn. To that end, we seek some mathematical guarantees that we can learn, even in the worst possible cases.
What is the worst possible scenario ? That we learn patterns in random noise, of course.
“In theory, theory and practice are the same. In practice, they are different.”
The Google paper analyzes the Rademacher complexity of a neural network. What is this and why is it relevant ? The Rademacher complexity measures how much a model fits random noise in the data. Let me provide a practical way of thinking about this:
Imagine someone gives you training examples, labeled data $(X, y)$, and asks you to build a simple model. How can you know the data is not corrupted ?
If you can build a simple model that is pretty accurate on a hold out set, you may think you are good. Are you sure ? A very simple test is to just randomize all the labels, and retrain the model. The hold out results should be no better than random, or something is very wrong.
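Here is a toy numpy version of this sanity check, using a nearest-centroid classifier on synthetic blob data (my own construction, not the Google paper’s setup): on the real labels the hold out accuracy is nearly perfect, while on randomized labels it drops to chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class data: two well separated Gaussian blobs
n = 1000
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def holdout_acc(X_tr, y_tr, X_te, y_te):
    # Nearest-centroid classifier: predict the class whose training mean is closest
    c0, c1 = X_tr[y_tr == 0].mean(axis=0), X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_te - c1, axis=1) < np.linalg.norm(X_te - c0, axis=1)).astype(int)
    return float((pred == y_te).mean())

idx = rng.permutation(n)
tr, te = idx[:n // 2], idx[n // 2:]

acc_real = holdout_acc(X[tr], y[tr], X[te], y[te])   # nearly perfect

y_rand = rng.permutation(y)                          # randomize all the labels
acc_rand = holdout_acc(X[tr], y_rand[tr], X[te], y_rand[te])  # ~ chance
```

If `acc_rand` came out well above chance, something would be very wrong–the model would be fitting noise.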
Our practical example is a variant of this:
By regularizing the model, we expect we can decrease the model capacity, and avoid overtraining. So what can happen ?
Notice that in both cases, the difference between the training error and generalization error stayed nearly fixed, or at least bounded.
It appears that some deep nets can fit any data set, no matter what the labels–even with standard regularization. But computational learning theories, like VC and PAC theory, suggest that there should be some regularization scheme that decreases the capacity of the model and prevents overtraining.
So how the heck do Deep Nets generalize so incredibly well ?! This is the puzzle.
To get a handle on this, in this post, I review the salient ideas of VC/PAC like theories.
The other day, a friend asked me, “how much data do I need for my AI app ?”
My answer: “it’s complicated !”
For any learning algorithm (ALG), like an SVM, Random Forest, or Deep Net, ideally, we would like to know the number of training examples necessary to learn a good model.
ML Theorists sometimes call this number the Sample Complexity. In Statistical Mechanics, it is called the loading parameter (or just the load) $\alpha = m/N$.
Statistical learning theory (SLT), i.e. VC theory, provides a way to get the sample complexity. Moreover, it also tells us how bad a machine learning model might be, in the worst possible case. It states that the difference between the test and training error should be characterized simply by the VC dimension $d_{VC}$–which measures the effective capacity of the model–and the number of training examples $m$:

$E_{test} - E_{train} \le \sqrt{\dfrac{d_{VC}\left(\ln\frac{2m}{d_{VC}} + 1\right) + \ln\frac{4}{\delta}}{m}}$

(holding with probability $1-\delta$).
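To get a feel for how loose these worst case bounds are, we can simply evaluate a standard form of the VC bound numerically. This is a sketch; the exact constants vary between statements of the theorem.

```python
import math

def vc_gap(d_vc, m, delta=0.05):
    """One standard form of the VC generalization bound on |test - train| error:
    sqrt( (d_vc * (ln(2m/d_vc) + 1) + ln(4/delta)) / m ).
    Exact constants differ between textbook statements of the theorem."""
    return math.sqrt((d_vc * (math.log(2.0 * m / d_vc) + 1.0)
                      + math.log(4.0 / delta)) / m)

gap_plenty = vc_gap(d_vc=100, m=10_000)   # ~0.25, even with 100x more data than capacity
gap_scarce = vc_gap(d_vc=100, m=100)      # > 1: the bound is vacuous
```

Even with 100 times more data than effective dimensions, the guaranteed gap is ~25% error–which is why these bounds are of little practical use.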
Statistics is about how frequencies converge to probabilities. That is, if we sample m training examples from a distribution $\mathcal{D}$, we seek a guarantee that the error $\epsilon$ can be made arbitrarily small, with a very high probability $1-\delta$.
We call our model a class of functions $\mathcal{F}$. We associate the empirical generalization error with the true model error (or risk) $R(f)$, which would be obtained in the proper infinite limit, and the training error with the empirical risk $\hat{R}(f)$. Of course, these differ; we want to know how bad the difference could possibly be, by bounding it.
Also, in our paper we use a slightly different notation, but here we will follow the more familiar one.
I follow the MIT OCW course on VC theory, as well as chapter 10 of Engel and Van den Broeck, Statistical Mechanics of Learning (2001).
In most problems, machine learning or otherwise, we want to predict an outcome. Usually a good estimate is the expected value. We use concentration bounds to give us some guarantee on the accuracy. That is, how close the expected value is to the actual outcome.
SLT , however, does not try to predict an expected or typical value. Instead, it uses the concentration bounds to explain how regularization provides a control over the capacity of a model, thereby reducing the generalization error. The amazing result of SLT is that capacity control emerges as a first class concept, arising even in the most abstract formulation of the bound–no data, no distributions, and not even a specific model.
In contrast, Stat Mech seeks to predict the average, typical values for a very specific model, and for the entire learning curve. That is, all practical values of the adjustable knobs of the model, the load , the Temperature, etc., in the Thermodynamic limit. Stat Mech does not provide concentration bounds or other guarantees, but it can provide both a better qualitative understanding of the learning process, and even good quantitative results for sufficiently large systems.
(In principle, the 2 methods are not totally incompatible, as one could examine the behavior of a specific bound in something like an effective Thermodynamic limit. See Engel and Van den Broeck, chapter 10.)
For now, let’s see what the bounding theorems say:
If we are estimating a single, fixed function $f$, which does not depend on the data, then we just need to know the number of training examples $m$.
The Law of Large Numbers tells us that the difference $|R(f) - \hat{R}(f)|$ then goes to zero in the limit $m \to \infty$.
This is a direct result of Hoeffding’s inequality, which is a well known bound on the confidence of the empirical frequencies converging to a probability.
However, if our function choices $f$ depend on the data, as they do in machine learning, we can not use this bound. The problem: we can overtrain. Formally, we might find a function that optimizes the training error but performs very poorly on test data, causing the bound to diverge–as in the pathological case above.
So we consider a Uniform bound over all functions $f \in \mathcal{F}$, meaning we consider the maximum (supremum) deviation.
We also now need to know the size of the function class, $N = |\mathcal{F}|$. For a finite size function class, and a general (unrealizable) target, we have

$\Pr\left[\sup_{f\in\mathcal{F}}\left(R(f)-\hat{R}(f)\right) > \epsilon\right] \le N e^{-2m\epsilon^2}$
There is a 2-sided version of this theorem also, but the salient result is that the new factor of N appears because we need all N bounds to hold simultaneously. We consider the maximum deviation, so we are bounded by the worst possible case–and not even typical worst case scenarios, but the absolute worst–causing the bound to be very loose. Still, at least the deviations are bounded–in the finite case.
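We can plug numbers into these bounds to see what the union bound over N functions actually costs. Inverting the 2-sided version, $2Ne^{-2m\epsilon^2} \le \delta$, gives a minimum sample size $m \ge (\ln N + \ln(2/\delta))/(2\epsilon^2)$; a small sketch:

```python
import math

def min_samples(eps, delta, N=1):
    """Smallest m with 2 * N * exp(-2 m eps^2) <= delta,
    i.e. m >= (ln N + ln(2/delta)) / (2 eps^2)
    (two-sided Hoeffding plus a union bound over N functions)."""
    return math.ceil((math.log(N) + math.log(2.0 / delta)) / (2.0 * eps ** 2))

m_single = min_samples(0.05, 0.05)            # one fixed function: 738 examples
m_class = min_samples(0.05, 0.05, N=10**6)    # a class of a million functions: 3501
```

Note the extra cost is only logarithmic in N–the real trouble starts when the class is infinite, below.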
What happens when the function class is infinite, as for Kernel methods and Neural Networks? How do we treat the two limits ?
In VC theory, we fix m, and define a distribution-independent measure, the VC growth function $G(m)$, which tells us how many ways our training data can be classified (or shattered) by functions in $\mathcal{F}$:

$G(m) = \max_{x_1,\dots,x_m} \left|\mathcal{F}(x_1,\dots,x_m)\right|$,

where $\mathcal{F}(x_1,\dots,x_m) = \{(f(x_1),\dots,f(x_m)) : f \in \mathcal{F}\}$ is the set of all ways the data can be classified by the functions $f$.
Using this, we can bound even infinite size classes $\mathcal{F}$, with a finite growth function.
WLOG, we only consider binary classification. We can treat more general models using the Rademacher complexity instead of the VC dimension–which is why the Google paper talked so much about it.
For any $\delta > 0$, and for any random draw of the data, we have, with probability at least $1-\delta$, for all $f \in \mathcal{F}$:

$R(f) \le \hat{R}(f) + \sqrt{\dfrac{2\ln G(m)}{m}} + \sqrt{\dfrac{\ln(1/\delta)}{2m}}$
So the simpler the function class, the smaller the true error (Risk) should be.
In fact, for a finite class $\mathcal{F}$, this gives a tighter bound than the union bound above, because $G(m) \le N$.
Note that we usually see the VC bounds in terms of the VC dimension…but VC theory provides us more than bounds; it tells us why we need regularization.
We define the VC Dimension $d_{VC}$, which is simply the largest number of training examples m for which the growth function is still exponential: $G(m) = 2^m$.
The VC dim, and the growth function, measure the effective size of the class . By effective, we mean not just the number of functions, but the geometric projection of the class onto finite samples.
If we plot the growth function, we find it has 2 regimes:
$m \le d_{VC}$ : exponential growth, $G(m) = 2^m$
$m > d_{VC}$ : polynomial growth
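These two regimes are easy to see numerically from Sauer’s lemma, $G(m) \le \sum_{i=0}^{d_{VC}} \binom{m}{i}$, which is tight in the shattering regime; a small sketch:

```python
from math import comb

def growth_bound(m, d_vc):
    # Sauer's lemma: G(m) <= sum_{i=0}^{d_vc} C(m, i)
    return sum(comb(m, i) for i in range(d_vc + 1))

d_vc = 3
# Exponential regime: for m <= d_vc the bound equals 2^m (the data can be shattered)
exp_regime = [growth_bound(m, d_vc) for m in range(d_vc + 1)]   # [1, 2, 4, 8]
# Polynomial regime: for m > d_vc, growth is only O(m^d_vc), far below 2^m
g10 = growth_bound(10, d_vc)   # 176, versus 2^10 = 1024
```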
This leads to formal bounds on infinite size function classes (by Vapnik, Sauer, etc) based on the VC dim (as we mention in our paper)
Let $\mathcal{F}$ be a class of functions with finite VC dim $d_{VC}$. Then for all $m \ge d_{VC}$:

$G(m) \le \left(\dfrac{em}{d_{VC}}\right)^{d_{VC}}$

If $\mathcal{F}$ has VC dim $d_{VC}$, then for all $f \in \mathcal{F}$, with probability at least $1-\delta$:

$R(f) \le \hat{R}(f) + \sqrt{\dfrac{2\,d_{VC}\ln\frac{em}{d_{VC}}}{m}} + \sqrt{\dfrac{\ln(1/\delta)}{2m}}$
Since $d_{VC}$ bounds the risk of not generalizing, when we have more training examples than the VC dim ($m \gg d_{VC}$), the bound grows much more slowly. So to generalize better, decrease $d_{VC}$, the effective capacity of $\mathcal{F}$.
Regularization–reducing the model complexity–should lead to better generalization.
So why does regularization not seem to work as well for Deep Learning ? (At least, that is what the Google paper suggests!)
Note: Perfectly solvable, or realizable, problems may have a tighter bound, but we ignore this special case for the present discussion.
Before we dive into Statistical Mechanics, I first mention an old paper by Vapnik, Levin and LeCun, Measuring the VC Dimension of a Learning Machine (1994).
It is well known that the VC bounds are so loose as to be of no practical use. However, it is possible to measure an effective VC dimension–for linear classifiers. Just measure the maximal difference in the error while increasing the size of the data, and fit it to a reasonable function. In fact, this effective VC dim appears to be universal in many cases. But…[paraphrasing the last paragraph of the conclusion]…
“The extension of this work to multilayer networks faces [many] difficulties..the existing learning algorithms can not be viewed as minimizing the empirical risk over the entire set of functions implementable by the network…[because] it is likely…the search will be confined to a subset of [these] functions…The capacity of this set can be much lower than the capacity of the whole set…[and] may change with the number of observations. This may require a theory that considers the notion of a non-constant capacity with an ‘active’ subset of functions”
So even Vapnik himself suspected, way back in 1994, that his own theory did not directly apply to Neural Networks!
And this is confirmed in recent work looking at the empirical capacity of RNNs.
And the recent Google paper says things are even weirder. So what can we do ?
We argue that the whole idea of looking at worst-case-bounds is at odds with what we actually do in practice, because we effectively consider a different limit than just fixing m and letting N grow (or vice versa).
Very rarely would we just add more data (m) to a Deep network. Instead, we usually increase the size of the net (N) as well, because we know we can then capture more detailed features / information from the data. So, in practice, we increase m and N simultaneously.
In Statistical Mechanics, we also consider the joint limit $m, N \to \infty$…but with the ratio m/N fixed.
The 2 ideas are not completely incompatible, however. In fact, Engel and Van den Broeck give a nice example of applying the VC Bounds, in a Thermodynamic Limit, to a model problem.
In contrast to VC/PAC theories, which seek to bound the worst case of a model, Statistical Mechanics tries to describe the typical behavior of a model exactly. But typical does not just mean the most probable. We require that atypical cases be made arbitrarily small — in the Thermodynamic limit.
This works because the probability distributions of the relevant thermodynamic quantities, such as the most likely average Energy, become sharply peaked around their maximal values.
Many results are well known in the Statistical Mechanics of Learning. The analysis is significantly more complicated but the results lead to a much richer structure that explains many phenomena in deep learning.
In particular, it is known that many bounds from statistics become either trivial or do not apply to non-smooth probability distributions, or when the variables take on discrete values. With neural networks, non-trivial behavior arises because of discontinuities (in the activation functions), leading to phase transitions (which arise in the thermodynamic limit).
For a typical neural network, we can identify 3 phases of the system, controlled by the load parameter $\alpha = m/N$, the amount of training data m relative to the number of adjustable network parameters N (and ignoring other knobs).
The view from SLT contrasts with that of some researchers, who argue that deep nets simply memorize their data, and that generalization arises because the training data is so large it covers nearly every possible case. This seems very naive to me, but maybe ?
Generally speaking, memorization is akin to prototype learning, where only a single example is needed to describe each class of data. This arises in certain simple text classification problems, which can then be solved using Convex NMF (see my earlier blog post).
In SLT, over-training is a completely different phase of the system, characterized by a kind of pathological non-convexity–an infinite number of (degenerate) local minima, separated by infinitely high barriers. This is the so-called (mean field) Spin Glass phase.
SLT Overtraining / Spin Glass Phase has an infinite number of minima
So why would this correspond to random labellings of the data ? Imagine we have a binary classifier with m training examples, and we randomize the labels. This gives up to $2^m$ new possible labellings:
Different Randomized Labellings of the Original Training Data
We argue in our paper that this decreases the effective load $\alpha$. If the number of randomized labels is very small, then $\alpha$ will not change much, and we stay in the Generalization phase. But if it is of order N (say 10%), then $\alpha$ may decrease enough to push us into the Overtraining phase.
Each Randomized Labelling corresponds to a different Local Minima
Moreover, we now have new, possibly unsatisfiable, classification problems. So the solutions will be nearly degenerate. But many of these will be difficult to learn, because many of the labels are wrong, so the solutions could have high Energy barriers — i.e. they are difficult to find, and hard to get out of.
So we postulate by randomizing a large fraction of our labels, we push the system into the Overtraining / Spin Glass phase–and this is why traditional VC style regularization can not work–it can not bring us out of this phase. At least, that’s the theory.
In my next paper / blog post, I will describe how to examine these ideas in practice. We develop a method to detect when phase transitions arise in the training of real world neural networks. And I will show how we can observe what I postulated 3 years ago–the Spin Glass of Minimal Frustration–and how this changes the simple picture from the Spin Glass / Statistical Mechanics of Learning. Stay tuned !
It would be dishonest to present statistical mechanics as a general theory of learning that resolves all problems untreatable by VC theory. One problem in particular is the theoretical analysis of online learning.
Of course, statistical mechanics only rigorously applies in the limit $N \to \infty$. Statistics requires infinite sampling as well, but does provide results for finite sample sizes. But even in the infinite case, there are issues.
While online learning (ala SGD) is certainly amenable to a stat mech approach, there is a fundamental problem with saddle points that arise in the Thermodynamic limit. (See section 9.8, and also chapter 13, of Engel and Van den Broeck.)
It has been argued that an online algorithm may get trapped in saddle points which result from symmetries in the architecture of the network, or in the underlying problem itself. In these cases, the saddle points may act as fixed points along an unstable manifold, causing a type of symmetry-induced paralysis. And while any finite size system may escape such saddles, in the limit $N \to \infty$, the dynamics can get trapped in an unstable manifold, and learning fails.
This is typically addressed in the context of symmetry-induced replica symmetry breaking in multilayer perceptrons, which would need to be the topic of another post.
Computational Learning Theory (i.e. VC theory) frames a learning algorithm or model as an infinite series of growing, distribution-independent hypothesis spaces $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$, of increasing VC dimension, such that $d_1 \le d_2 \le \cdots$
This lets us consider typical case behavior, when the limit converges (the so-called self-averaging case). And we can even treat typical worst case behaviors. And since we are not limited to the overly loose bounds, the analysis matches reality much more closely.
And when we do this for a spin glass model of deep nets, we see much richer behavior in the learning curve. And, we argue, this agrees more with what we see in practice.
Let’s formalize this.
Let us call the class of hypotheses, or function class, $\mathcal{H}$. Notationally, sometimes we see $\mathcal{F}$ instead. Here, we distinguish between our abstract, theoretical hypotheses and the actual functions the algorithm sees.
A hypothesis represents, say, the set of all possible weights for a neural network. Or, more completely, the set of all weights, and learning rates, and dropout rates, and even initial weight distributions. It is a function $h: \mathcal{X} \to \mathcal{Y}$. This may be confusing because, conceptually, we will want hypotheses with random labels–but we don’t actually do this. Instead, we measure the Rademacher complexity, which is a mathematical construct that lets us measure all possible label randomizations for a given hypothesis $h$.
We also fix the function class $\mathcal{H}$, so that the learning process is, conceptually, a fixed point iteration over this fixed set, and not some infinite set. This is critical because the No Free Lunch Theorem says that the worst-case sample complexity of an infinite size model is infinite.
For a given distribution $\mathcal{D}$, define the expected risk of a hypothesis $h$ as the expected Loss over the actual training data,
and the optimal risk as the best possible (infimum) error we could encounter over all possible hypotheses–like in our practical case (2) above.
As above, we identify the training error with the data dependent expected risk, and the generalization error with the maximum risk we could ever expect.
Recall we want the minimum number n of training examples necessary to learn. What is necessary ?
Suppose we pick a set $S$ of n labeled pairs, and define a hypothesis in $\mathcal{H}$ by applying our ALG to this set: $h_S = ALG(S)$. And assume that we drew our set from some distribution $\mathcal{D}$, so $S \sim \mathcal{D}^n$.
Then, for all $\epsilon > 0$ and $\delta \in (0,1)$, we seek a positive integer $n(\epsilon, \delta, \mathcal{D})$ such that, for all larger sample sizes, with probability at least $1-\delta$,

$R(h_S) - \inf_{h \in \mathcal{H}} R(h) \le \epsilon$

Notice that $n(\epsilon, \delta, \mathcal{D})$ explicitly depends on the distribution, accuracy, and confidence.
As in the Central Limit Theorem, we consider the limit where the number of training examples $n \to \infty$.
But the No Free Lunch Theorem says that unless we restrict the hypothesis / function space $\mathcal{H}$, there always exist “bad” distributions for which the sample complexity is arbitrarily large.
Of course, in practice, we know that for very large data sets, and very high capacity models, we need to regularize our models to avoid overtraining. At least we thought we knew this.
Moreover, by restricting the complexity of $\mathcal{H}$, we expect to produce more uniformly consistent results. And this is what we did above.
We don’t necessarily need all the machinery of microscopic Statistical Mechanics and Spin Glasses (ala Engel and Van den Broeck) to develop SLT-like learning bounds. There is a classic paper, Retarded Learning: Rigorous Results from Statistical Mechanics, which shows how to get similar results using a variational approach.
Here are the associated slides
If you enjoyed this presentation, let me invite you to subscribe to my YouTube channel