https://twimlai.com/meetups/implicit-self-regularization-in-deep-neural-networks/
]]>My Collaborator did a great job giving a talk on our research at the local San Francisco Bay ACM Meetup
Michael W. Mahoney UC Berkeley
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a “size scale” separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that—all else being equal—DNN optimization with larger batch sizes leads to less-well implicitly-regularized models, and it provides an explanation for the generalization gap phenomena. Joint work with Charles Martin of Calculation Consulting, Inc.
Bio: https://www.stat.berkeley.edu/~mmahoney/
Michael W. Mahoney is at UC Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics. He has worked and taught at Yale University in the Math department, Yahoo Research, and Stanford University in the Math department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI). He was on the National Research Council’s Committee on the Analysis of Massive Data. He co-organized the Simons Institute’s fall 2013 program on the Theoretical Foundations of Big Data Analysis, and he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. He is currently the lead PI for the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. He holds several patents for work done at Yahoo Research and as Lead Data Scientist for Vieu Labs, Inc., a startup re-imagining consumer video for billions of users.
More information is available at https://www.stat.berkeley.edu/~mmahoney/
Long version of the paper (upon which the talk is based): https://arxiv.org/abs/1810.01075
http://www.meetup.com/SF-Bay-ACM/
http://www.sfbayacm.org/
]]>Why Deep Learning Works: Self Regularization in Neural Networks
Presented Thursday, December 13, 2018
The slides are available on my slideshare.
The supporting tool, WeightWatcher, can be installed using:
pip install weightwatcher
]]>DON’T PEEK: DEEP LEARNING WITHOUT LOOKING … AT TEST DATA
The idea…suppose we want to compare 2 or more deep neural networks (DNNs). Maybe we are
Can we determine which DNN will generalize best–without peeking at the test data?
Theory actually suggests–yes we can!
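We just need to measure the average log norm of the layer weight matrices
$\langle \log \Vert W \Vert \rangle = \frac{1}{N_L} \sum_{l=1}^{N_L} \log \Vert W_l \Vert_F$
where $\Vert W_l \Vert_F$ is the Frobenius norm of the weight matrix of layer $l$, and $N_L$ is the number of layers.
The Frobenius norm is just the square root of the sum of the squares of the matrix elements. For example, it is easily computed in numpy as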
np.linalg.norm(W,ord='fro')
where ‘fro’ is the default norm.
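As a minimal sketch of the metric described above, the average log norm can be computed from a plain list of 2D weight arrays. The helper name `average_log_norm`, the random stand-in layers, and the choice of a base-10 logarithm are assumptions made for illustration; they are not taken from the WeightWatcher tool.

```python
import numpy as np

def average_log_norm(weight_matrices):
    """Average of the log10 Frobenius norms over a list of 2D arrays."""
    log_norms = [np.log10(np.linalg.norm(W, ord='fro')) for W in weight_matrices]
    return float(np.mean(log_norms))

# Sanity check of the Frobenius norm itself: sqrt(3^2 + 4^2) = 5
W_small = np.array([[3.0, 4.0], [0.0, 0.0]])
frob = np.linalg.norm(W_small, ord='fro')

# Hypothetical stand-ins for the layer weight matrices of a trained model
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]
avg_log_norm = average_log_norm(layers)
print(frob, avg_log_norm)
```

For a real model, the list would hold the weight matrices extracted from the saved weight file, so no training or test data is touched.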
It turns out that the average log norm $\langle \log \Vert W \Vert \rangle$ is amazingly well correlated with the test accuracy of a DNN. How do we know? We can plot $\langle \log \Vert W \Vert \rangle$ vs the reported test accuracy for the pretrained DNNs available in PyTorch. First, we look at the VGG models:
The plot shows the 4 VGG and VGG_BN models. Notice we do not need the ImageNet data to compute this; we simply compute the average log Norm and plot it against the (reported Top 5) Test Accuracy. For example, the orange dots show results for the pre-trained VGG13 and VGG13_BN ImageNet models. For each pair of models, the larger the Test Accuracy, the smaller $\langle \log \Vert W \Vert \rangle$. Moreover, the correlation is nearly linear across the entire class of VGG models. We see similar behavior for …
Across 4/5 pretrained ResNet models, with very different sizes, a smaller $\langle \log \Vert W \Vert \rangle$ generally implies a better Test Accuracy.
It is not perfect–ResNet 50 is an outlier–but it works amazingly well across numerous pretrained models, both in pyTorch and elsewhere (such as the OSMR sandbox). See the Appendix for more plots. What is more, notice that
the log Norm metric is completely Unsupervised
Recall that we have not peeked at the test data–or the labels. We simply computed the average log norm $\langle \log \Vert W \Vert \rangle$ for the pretrained models directly from their weight files, and then compared this to the reported test accuracy.
Imagine being able to fine tune a neural network without needing test data. Many times we barely have enough training data for fine tuning, and there is a huge risk of over-training. Every time you peek at the test data, you risk leaking information into the model, causing it to overtrain. It is my hope this simple but powerful idea will help avoid this and advance the field forward.
A recent paper by Google X and MIT shows that there is A Surprising Linear Relationship [that] Predicts Test Performance in Deep Networks. The idea is to compute a VC-like, data dependent complexity metric $\mathcal{C}$ based on the Product Norm of the weight matrices:
$\mathcal{C} \sim \Vert W_1 \Vert \times \Vert W_2 \Vert \times \cdots \times \Vert W_{N_L} \Vert$
Usually we just take $\Vert W \Vert$ as the Frobenius norm (but any p-norm may do).
If we take the log of both sides, we get the sum
$\log \mathcal{C} \sim \log \Vert W_1 \Vert + \log \Vert W_2 \Vert + \cdots + \log \Vert W_{N_L} \Vert$
So here we just form the average log Frobenius Norm as a measure of DNN complexity, as suggested by current ML theory
And it seems to work remarkably well in practice.
We can also understand this through our Theory of Heavy Tailed Implicit Self-Regularization in Deep Neural Networks.
The theory shows that each layer weight matrix $W_l$ of a (well trained) DNN resembles a random heavy tailed matrix, and we can associate with it a power law exponent $\alpha_l$
The exponent $\alpha_l$ characterizes how well the layer weight matrix represents the correlations in the training data. Smaller is better.
Smaller exponents correspond to more implicit regularization, and, presumably, better generalization (if the DNN is not overtrained). This suggests that the average power law exponent would make a good overall unsupervised complexity metric for a DNN–and this is exactly what the last blog post showed.
The average power law metric is a weighted average,
$\hat{\alpha} = \frac{1}{N_L} \sum_{l} b_l \, \alpha_l$
where the layer weight factor $b_l$ should depend on the scale of $W_l$. In other words, ‘larger’ weight matrices (in some sense) should contribute more to the weighted average.
Smaller $\hat{\alpha}$ usually implies better generalization
For heavy tailed matrices, we can work out a relation between the log Norm of $W$ and the power law exponent $\alpha$:
$\log \Vert W \Vert_F^2 \approx \alpha \log \lambda_{max}$
where we note that
So the weight factor is simply the log of the maximum eigenvalue, $\log \lambda_{max}$, associated with the layer weight matrix
In the paper we will show the math; below we present numerical results to convince the reader.
This also explains why Spectral Norm Regularization Improv[e]s the Generalizability of Deep Learning. A smaller $\lambda_{max}$ gives a smaller power law contribution, and, also, a smaller log Norm. We can now relate these 2 complexity metrics:
We argue here that we can approximate the average Power Law metric by simply computing the average log Norm of the DNN layer weight matrices. And using this, we can actually predict the trends in generalization accuracy — without needing a test data set!
The Power Law metric is consistent with the recent theoretical results, but our approach and the intent is different:
But the biggest difference is that we apply our Unsupervised metric to large, production quality DNNs.
We believe this result will have large applications in hyper-parameter fine tuning of DNNs. Because we do not need to peek at the test data, it may prevent information from leaking from the test set into the model, thereby helping to prevent overtraining and making fine-tuned DNNs more robust.
We have built a python package for Jupyter Notebooks that does this for you–the weight watcher. It works on Keras and PyTorch. We will release it shortly.
Please stay tuned! And please subscribe if this is useful to you.
We use the OSMR Sandbox to compute the average log Norm for a wide variety of DNN models, using pyTorch, and compare to the reported Top 1 Errors. This notebook reproduces the results.
All the ResNet Models
DenseNet
SqueezeNet
DPN
In the plot below, we generate a number of heavy tailed matrices, and fit their ESD to a power law. Then we compare
The code for this is:
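import numpy as np
import powerlaw

N, M, mu = 100, 100, 2.0
W = np.random.pareto(a=mu, size=(N, M))
# log10 of the squared Frobenius norm of W
normW = np.linalg.norm(W)
logNorm2 = 2.0 * np.log10(normW)
# eigenvalues of the symmetric correlation matrix X (eigvalsh keeps them real)
X = np.dot(W.T, W) / N
evals = np.linalg.eigvalsh(X)
l_max, l_min = np.max(evals), np.min(evals)
# fit the ESD to a power law, then compare alpha with the ratio
fit = powerlaw.Fit(evals)
alpha = fit.alpha
ratio = logNorm2 / np.log10(l_max)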
Below are results for a variety of heavy tailed random matrices:
The plot shows the relation between the ratios and the empirical power law exponents . There are three striking features; the linear relation
In our next paper, we will drill into these details and explain further how this relation arises and the implications for Why Deep Learning Works.
My recent talk at the French Tech Hub Startup Accelerator
]]>Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks. Most notably, we observe that in all pre-trained models, the layer weight matrices display near Universal power law behavior. That is, we can compute their eigenvalues, and fit the empirical spectral density (ESD) to a power law form:
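For a given $N \times M$ weight matrix $W$, we form the correlation matrix
$X = \frac{1}{N} W^T W$
and then compute the M eigenvalues $\lambda_i$ of $X$
We call the histogram of eigenvalues the Empirical Spectral Density (ESD), $\rho(\lambda)$. It can nearly always be fit to a power law
$\rho(\lambda) \sim \lambda^{-\alpha}$
We call the Power Law Universal because 80-90% of the exponents lie in the range
$2 \lesssim \alpha \lesssim 4$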
For fully connected layers, we just take as is. For Conv2D layers with shape we consider all 2D feature maps of shape . For any large, modern, pretrained DNN, this can give a large number of eigenvalues. The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but the results are very easy to reproduce with this notebook.
As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4. Although compared to the FC layers, for the Conv2D layers, we do see more exponents . We will discuss the details and these results in a future paper. And while Universality is very theoretically interesting, a more practical question is
Are power law exponents correlated with better generalization accuracies ? … YES they are!
We can see this by looking at 2 or more versions of several pretrained models, available in pytorch, including
To compare these model versions, we can simply compute the average power law exponent $\langle \alpha \rangle$, averaged across all FC weight matrices and Conv2D feature maps. This is similar to considering the product norm, which has been used to test VC-like bounds for small NNs. In nearly every case, smaller $\langle \alpha \rangle$ is correlated with better test accuracy (i.e. generalization performance).
The only significant caveats are:
Predicting the test accuracy is a complicated task, and IMHO simple theories, with loose bounds, are unlikely to be useful in practice. Still, I think we are on the right track.
Let's first look at the DenseNet models
Here, we see that as Test Accuracy increases, the average power law exponent generally decreases. And this is across 4 different models.
The Inception models show similar behavior: InceptionV3 has smaller Test Accuracy than InceptionV4, and, likewise, the InceptionV3 $\langle \alpha \rangle$ is larger than that of InceptionV4.
Now consider the Resnet models, which are increasing in size and have more architectural differences between them:
Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents. The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a larger $\langle \alpha \rangle$ than FbResnet152, but they are very close. Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller $\langle \alpha \rangle$ across a wide range of architectures.
These results are easily reproduced with this notebook.
This is an amazing result !
You can think of the power law exponent $\alpha$ as a kind of information metric–the smaller $\alpha$, the more information is in this layer weight matrix.
Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better–without peeking at the test data.
In addition to DenseNet, Inception, ResNext, SqueezeNet, and the (larger) ResNet models, even more positive results are available here on ~40 more DNNs across ~10 more architectures, including MeNet, ShuffleNet, DPN, PreResNet, DenseNet, SE-Resnet, SqueezeNet, MobileNet, MobileNetV2, and FDMobileNet.
I hope it is useful to you in training your own Deep Neural Networks. And I hope to get feedback from you on how useful this is in practice.
]]>
One broad question we can ask is:
How is information concentrated in Deep Neural Network (DNNs)?
To get a handle on this, we can run ‘experiments’ on the pre-trained DNNs available in pyTorch.
In a previous post, we formed the Singular Value Decomposition (SVD) of the weight matrices of the linear, or fully connected (FC) layers. And we saw that nearly all the FC Layers display Power Law behavior. And, in fact, this behavior is Universal across both ImageNet and NLP models.
But this is only part of the story. Here, we ask a related question–do the weight matrices of well trained DNNs lose Rank?
Let's say $W$ is an $N \times M$ matrix. We can form the Singular Value Decomposition (SVD):
$W = U \Sigma V^T$
The Matrix Rank , or Hard Rank, is simply the number of non-zero singular values
which express the decrease in Full Rank M.
Notice the Hard Rank of the rectangular matrix is the dimension of the square correlation matrix .
In python, this can be computed using
rank = numpy.linalg.matrix_rank(W)
Of course, being a numerical method, we really mean the number of singular values above some tolerance …and we can get different results depending on if we use
See the numpy documentation on matrix_rank for details.
Here, we will compute the rank ourselves, and use an extremely loose bound, and consider any . As we shall see, DNNs are so good at concentrating information that it will not matter
If all the singular values are non-zero, we say $W$ is Full Rank. If one or more $\sigma_i = 0$, then we say $W$ is Singular. It has lost expressiveness, and the model has undergone Rank collapse.
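A minimal sketch of counting the Hard Rank by hand is shown here. The function name `hard_rank` and the tolerance `tol=1e-6` are hypothetical choices for illustration, since the threshold used in the post is not given; the random matrices stand in for real layer weights.

```python
import numpy as np

def hard_rank(W, tol=1e-6):
    """Count singular values above tol (tol is a hypothetical threshold)."""
    svals = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(svals > tol)), float(svals.min())

rng = np.random.default_rng(1)
W_full = rng.normal(size=(100, 20))   # generic random matrix: full rank
W_low = W_full.copy()
W_low[:, -1] = W_low[:, 0]            # duplicate a column: one rank is lost

rank_full, min_full = hard_rank(W_full)
rank_low, min_low = hard_rank(W_low)
print(rank_full, min_full)
print(rank_low, min_low)
```

The second matrix has a minimum singular value that is zero up to floating point error, which is the signature of Rank collapse discussed above.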
When a model undergoes Rank Collapse, it traditionally needs to be regularized. Say we are solving a simple linear system of equations / linear regression
The simple solution is to use a little linear algebra to get the optimal values for the unknown
But when is Singular, we can not form the matrix inverse. To fix this, we simply add some small constant to diagonal of
So that all the singular values will now be greater than zero, and we can form a generalized pseudo-inverse, called the Moore-Penrose Inverse
This procedure is also called Tikhonov Regularization. The constant, or Regularizer, sets the Noise Scale for the model. The information in is concentrated in the singular vectors associated with larger singular values , and the noise is left over in the those associated with smaller singular values :
In cases where is Singular, regularization is absolutely necessary. But even when it is not singular, Regularization can be useful in traditional machine learning. (Indeed, VC theory tells us that Regularization is a first class concept)
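The regularization step described above can be sketched on a small least squares problem. The names `A`, `y`, and `delta` are assumptions made for this example (the symbols used in the original equations are not recoverable here), and the data is synthetic. The sketch also shows that, for a small constant, the Tikhonov solution is close to the one given by the Moore-Penrose pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
A[:, 4] = A[:, 3]                 # duplicated column, so A^T A is singular
y = rng.normal(size=50)

rank_A = int(np.linalg.matrix_rank(A))   # 4 rather than 5
gram = A.T @ A

# Tikhonov regularization: add a small constant to the diagonal of A^T A
delta = 1e-3                      # hypothetical regularizer value
x_reg = np.linalg.solve(gram + delta * np.eye(5), A.T @ y)

# Minimum-norm least squares solution from the Moore-Penrose pseudo-inverse
x_pinv = np.linalg.pinv(A) @ y
print(rank_A, np.max(np.abs(x_reg - x_pinv)))
```

Without the `delta` term the normal equations have no unique solution; with it, every eigenvalue of the shifted matrix is at least `delta`, so the linear solve is well defined.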
But we know that Understanding deep learning requires rethinking generalization. Which leads to the question ?
Do the weight matrices of well trained DNNs undergo Rank Collapse ?
Answer: They DO NOT — as we now see:
We can easily examine the numerous pre-trained models available in PyTorch. We simply need to get the layer weight matrices and compute the SVD. We then compute the minimum singular value and compute a histogram of the minimums across different models.
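import numpy as np
import torch

for im, m in enumerate(model.modules()):
    if isinstance(m, torch.nn.Linear):
        # copy the layer weights into a numpy array
        W = np.array(m.weight.data.clone().cpu())
        M, N = np.min(W.shape), np.max(W.shape)
        # the smallest singular value signals rank collapse
        _, svals, _ = np.linalg.svd(W)
        minsval = np.min(svals)
        ...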
We do this here for numerous models trained on ImageNet and available in pyTorch, such as AlexNet, VGG16, VGG19, ResNet, DenseNet201, etc.– as shown in this Jupyter Notebook.
We also examine the NLP models available in AllenNLP. This is a little bit trickier; we have to install AllenNLP from source, then create an analyze.py command class, and rebuild AllenNLP. Then, to analyze, say, the AllenNLP pre-trained NER model, we run
allennlp analyze https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.04.26.tar.gz
This prints out the ranks (and other information, like power law fits), and then plots the results. The code for all this is here.
Notice that many of the AllenNLP models include Attention matrices, which can be quite large and very rectangular (i.e. = ), as compared to the smaller (and less rectangular) weight matrices used in the ImageNet models (i.e. ),.
Note: We restrict our analysis to rectangular layer weight matrices with an aspect ratio $Q = N/M > 1$, and really larger than 1.1. This is because the Marchenko Pastur (MP) Random Matrix Theory (RMT) tells us that only when. We will review this in a future blog.
For the ImageNet models, most fully connected (FC) weight matrices have a large minimum singular value . Only 6 of the 24 matrices looked at have –and we have not carefully tested the numerical threshold–we are just eyeballing it here.
For the AllenNLP models, none of the FC matrices show any evidence of Rank Collapse. All of the singular values for every linear weight matrix are non-zero.
It is conjectured that fully optimized DNNs–those with the best generalization accuracy–will not show Rank Collapse in any of their linear weight matrices.
If you are training your own model and you see Rank Collapse, you are probably over-regularizing.
It is, in fact, very easy to induce Rank Collapse. We can do this in a Mini version of AlexNet, coded in Keras 2, and available here.
To induce rank collapse in our FC weight matrices, we can add large weight norm constraints to the FC1 linear layer, using the kernel_regularizer=… argument
...
model.add(Dense(384, kernel_initializer='glorot_normal',
                bias_initializer=Constant(0.1), activation='relu',
                kernel_regularizer=l2(...)))
...
We train this smaller MiniAlexnet model on CIFAR10 for 20 epochs, save the final weight matrix $W$, and plot a histogram of the eigenvalues $\lambda_i$ of the weight correlation matrix
$X = \frac{1}{N} W^T W$.
We call $\rho(\lambda)$ the Empirical Spectral Density (ESD). Recall that the eigenvalues are simply the square of the singular values (normalized by N)
$\lambda_i = \frac{1}{N} \sigma_i^2$.
Here is what happens to $\rho(\lambda)$ when we turn up the amount of L2 Regularization from 0.0003 to 0.0005. The minimum eigenvalue $\lambda_{min}$ decreases from 0.0414 to 0.008.
As we increase the weight norm constraints, the minimum eigenvalue approaches zero
Note that adding too much regularization causes nearly all of the eigenvalues/singular values to collapse to zero–as well as the norm of the matrix.
We conjecture that DNNs have zero singular/eigenvalues because there is too much regularization on the layer.
And that…
Fully optimized Deep Neural Networks do not have Rank Collapse
We believe this is a unique property of DNNs, and related to how Regularization works in these models. We will discuss this and more in an upcoming paper
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
by Charles H. Martin (Calculation Consulting) and Michael W. Mahoney (UC Berkeley).
And presented at UC Berkeley this Monday at the Simons Institute
and see our long form paper
Please stay tuned and subscribe to this blog for more updates
Here we are just looking at the distribution to estimate the rank loss. We could be more precise..
In the numpy.linalg.matrix_rank() function, “By default, we identify singular values less than S.max() * max(M.shape) * eps as indicating rank deficiency” when using SVD
But there is some ambiguity here as well, since there is a different default from Numerical Recipes. I will leave it up to the reader to select the best rank loss metric and explore further. And I would be very interested in your findings.
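As an illustration of the default numpy tolerance quoted above, the following sketch computes the rank by hand and compares it with `numpy.linalg.matrix_rank`. The test matrix is synthetic, with one column built as a linear combination of two others.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(50, 8))
W[:, 7] = W[:, 0] + W[:, 1]       # linear dependence: the rank drops to 7

# Singular values, then the documented default threshold
S = np.linalg.svd(W, compute_uv=False)
tol = S.max() * max(W.shape) * np.finfo(W.dtype).eps
manual_rank = int(np.sum(S > tol))

numpy_rank = int(np.linalg.matrix_rank(W))
rank_loss = min(W.shape) - manual_rank
print(manual_rank, numpy_rank, rank_loss)
```

Swapping in a different `tol` is the simplest way to explore how sensitive the rank loss estimate is to the choice of threshold.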
We have computed the minimum singular value for all the Conv2D layers in the ImageNet models deployed with pyTorch. This covers nearly ~7500 layers across ~40 different models.
Very generously, we can say there is rank collapse with . Only 10%-13% of the layers show any form of rank collapse, using this simple heuristic, as easily seen on a log histogram.
In a previous post, we saw that the Fully Connected (FC) layers of the most common pre-trained Deep Learning models display power law behavior. Specifically, for each FC weight matrix $W$, we compute the eigenvalues $\lambda_i$ of the correlation matrix
$X = \frac{1}{N} W^T W$
For every FC matrix, the eigenvalue frequencies, or Empirical Spectral Density (ESD), can be fit to a power law
$\rho(\lambda) \sim \lambda^{-\alpha}$
where the exponents all lie in
$2 \lesssim \alpha \lesssim 4$
Remarkably, the FC matrices all lie within the Universality Class of Fat Tailed Random Matrices!
We define a random matrix by defining a matrix $W$ of size $N \times M$, and drawing the matrix elements $W_{ij}$ from a random distribution. We can choose a Gaussian (normal) distribution
or a heavy tailed (power law) distribution.
In either case, Random Matrix Theory tells us what the asymptotic form of ESD should look like. But first, let’s see what model works best.
First, let's look at the ESD for AlexNet for layer FC3, and zoomed in:
Recall that AlexNet FC3 fits a power law with exponent $\alpha$, so we also plot the ESD on a log-log scale
Notice that the distribution is linear in the central region, and the long tail cuts off sharply. This is typical of the ESDs for the fully connected (FC) layers of the all the pretrained models we have looked at so far. We now ask…
What kind of Random Matrix would make a good model for this ESD ?
We first generate a few Gaussian Random matrices (mean 0, variance 1), for different aspect ratios Q, and plot the histogram of their eigenvalues.
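import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 500
Q = N / M
# W has shape N x M, so that X = W^T W / N has shape M x M
W = np.random.normal(0, 1, size=(N, M))
X = (1 / N) * np.dot(W.T, W)
# X is symmetric, so eigvalsh returns real eigenvalues
evals = np.linalg.eigvalsh(X)
plt.hist(evals, bins=100, density=True)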
Notice that the shape of the ESD depends only on Q, and is tightly bounded; there is, in fact, effectively no tail at all to the distributions (except, perhaps, misleadingly for Q=1)
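To make "tightly bounded" concrete, here is a small sketch that compares the eigenvalues of a Gaussian random matrix with the standard Marchenko Pastur bulk edges, $\lambda_{\pm} = \sigma^2 (1 \pm 1/\sqrt{Q})^2$. The variable names `lam_minus` and `lam_plus` and the fixed random seed are choices made for this example.

```python
import numpy as np

N, M = 1000, 500
Q = N / M
sigma2 = 1.0                      # variance of the matrix elements

rng = np.random.default_rng(3)
W = rng.normal(0.0, np.sqrt(sigma2), size=(N, M))
X = W.T @ W / N                   # M x M correlation matrix
evals = np.linalg.eigvalsh(X)

# Marchenko Pastur bulk edges for aspect ratio Q >= 1
lam_minus = sigma2 * (1.0 - 1.0 / np.sqrt(Q)) ** 2
lam_plus = sigma2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2
print(evals.min(), lam_minus)
print(evals.max(), lam_plus)
```

For a finite matrix the extreme eigenvalues only approximate the edges, but they stay close to them, which is why the Gaussian ESD shows no tail.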
We can generate a heavy, or fat-tailed, random matrix as easily using the numpy Pareto function
W=np.random.pareto(mu,size=(N,M))
Heavy Tailed Random matrices have very different ESDs. They have very long tails–so long, in fact, that it is better to plot them on a log log Histogram
Do any of these look like a plausible model for the ESDs of the weight matrices of a big DNN, like AlexNet ?
Let's overlay the ESD of a fat-tailed W with the actual empirical ESD from AlexNet for layer FC3
We see a pretty good match to a Fat-tailed random matrix with .
Turns out, there is something very special about being in the range 2-4.
Random Matrix Theory predicts the shape of the ESD , in the asymptotic limit, for several kinds of Random Matrix, called Universality Classes. The 3 different values of each represent a different Universality Class:
In particular, if we draw from any heavy tailed / power law distribution, the empirical (i.e. finite size) eigenvalue density is likewise a power law (PL), either globally, or at least locally.
What is more, the predicted ESDs have different, characteristic global and local shapes, for specific ranges of . And the amazing thing is that
the ESDs of the fully connected (FC) layers of pretrained DNNs all resemble the ESDs of the Fat-Tailed Universality Classes of Random Matrix Theory
But this is a little tricky to show, because we need to show that we fit to the theoretical . We now look at the
RMT tells us that, for , the ESD takes the limiting for
, where
And this works pretty well in practice for the Heavy Tailed Universality Class, for . But for any finite matrix, as soon as , the finite size effects kick in, and we can not naively apply the infinite limit result.
RMT not only tells us about the shape of the ESD; it makes statements about the statistics of the edge and/or tails — the fluctuations in the maximum eigenvalue . Specifically, we have
For standard, Gaussian RMT, the (near the bulk edge) is governed by the famous Tracy Widom. And for , RMT is governed by the Tau Four Moment Theorem.
But for , the tail fluctuations follow Frechet statistics, and the maximum eigenvalue has Power Law finite size effects
In particular, the effects of M and Q kick in as soon as . If we underestimate , (small Q, large M), the power law will look weaker, and we will overestimate in our fits.
And, for us, this affects how we estimate from and assign the Universality Class
Here, we generate ESDs for 3 different Pareto Heavy tailed random matrices, with fixed M (left) or N (right), but different Q. We fit each ESD to a Power Law. We then plot , as fit, to .
The red lines are predicted by Heavy Tailed RMT (MP) theory, which works well for Heavy Tailed ESDs with . For Fat Tails, with , the finite size effects are difficult to interpret. The main take-away is…
We can identify finite size matrices W that behave like the Fat Tailed Universality Class of RMT () with Power Law fits, even with exponents , ranging up to 4 (and even up to 5-6).
It is amazing that Deep Neural Networks display this Universality in their weight matrices, and this suggests some deeper reason for Why Deep Learning Works.
In statistical physics, if a system displays Power Laws, this can be evidence that it is operating near a critical point. It is known that real, spiking neurons display this behavior, called Self Organized Criticality
It appears that Deep Neural Networks may be operating under similar principles, and in future work, we will examine this relation in more detail.
The code for this post is in this github repo on ImplicitSelfRegularization
]]>
In pretrained, production quality DNNs, the weight matrices for the Fully Connected (FC ) layers display Fat Tailed Power Law behavior.
Deep Neural Networks (DNNs) minimize the Energy function defined by their architecture. We define the layer weight matrices $W_l$, biases $b_l$, and activation functions $h_l$, giving
$E_{DNN} = h_L(W_L \, h_{L-1}(W_{L-1} \, h_{L-2}(\cdots) + b_{L-1}) + b_L)$
We train the DNN on a labeled data set (d,y), giving the optimization problem
We call this the Energy Landscape because the DNN optimization problem is only parameterized by the weights and biases. Of course, in any real DNN problem, we do have other adjustable parameters, such as the amount of Dropout, the learning rate, the batch size, etc. But these regularization effects simply change the global
The Energy Landscape function changes on each epoch–and we do care about how. In fact, I have argued that it must form an Energy Funnel:
But here, for now, we only look at the final result. Once a DNN is trained, what is left are the weights (and biases). We can reuse the weights (and biases) of a pre-trained DNN to build new DNNs with transfer learning. And if we train a bunch of DNNs, we want to know which one is better ?
But, practically, we would really like to identify a very good DNN without peeking at the test data, since every time we peek, and retrain, we risk overtraining our DNN.
I now show we can at least start to do this by looking at the weight matrices themselves. So let us look at the weights of some pre-trained DNNs.
Pytorch comes with several pretrained models, such as AlexNet. To start, we just examine the weight matrices of the Linear / Fully Connected (FC) layers.
from torch import nn
from torchvision import models

pretrained_model = models.alexnet(pretrained=True)
for module in pretrained_model.modules():
    if isinstance(module, nn.Linear):
        ...
The Linear layers have the simplest weight matrices ; they are 2-dimensional tensors, or just rectangular matrices.
Let $W$ be an $N \times M$ matrix, where $N \ge M$. We can get the matrix from the pretrained model using:
W = np.array(module.weight.data.clone().cpu())
M, N = np.min(W.shape), np.max(W.shape)
How is information concentrated in $W$? For any rectangular matrix, we can form the Singular Value Decomposition (SVD)
$W = U \Sigma V^T$
which is readily computed in scikit learn. We will use the faster TruncatedSVD method, and compute the singular values $\sigma_i$:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=M-1, n_iter=7, random_state=10)
svd.fit(W)
svals = svd.singular_values_
(Technically, we do miss the smallest singular value doing this, but that’s ok. It won’t matter here, and we can always use the pure svd method to be exact)
We can, alternatively, form the eigenvalues of the correlation matrix
$X = \frac{1}{N} W^T W$
The eigenvalues are just the square of the singular values.
Notice here we normalize them by N.
evals = (1/N)*svals*svals
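The relation between the two routes can be checked numerically. This is a small sketch on a synthetic Gaussian matrix, using the full numpy SVD rather than TruncatedSVD so that all M values are compared; the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 200, 50
W = rng.normal(size=(N, M))

# Route 1: squared singular values of W, normalized by N
svals = np.linalg.svd(W, compute_uv=False)
evals_from_svd = np.sort(svals ** 2 / N)

# Route 2: eigenvalues of the correlation matrix X = W^T W / N
X = W.T @ W / N
evals_direct = np.sort(np.linalg.eigvalsh(X))

print(np.max(np.abs(evals_from_svd - evals_direct)))
```

The two sets agree to floating point precision, so either route can be used to build the ESD.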
We now form the Empirical Spectral Density (ESD), which is, formally
$\rho(\lambda) = \frac{1}{M} \sum_{i=1}^{M} \delta(\lambda - \lambda_i)$
This notation just means compute a histogram of the eigenvalues
import matplotlib.pyplot as plt

plt.hist(evals, bins=100, density=True)
We could also compute the spectral density using a Kernel Density Estimator (KDE); we save this for a future post.
We now look at the ESD of
Here, we examine just FC3, the last Linear layer, connecting the model to the labels. The other linear layers, FC1 and FC2, look similar. Below is a histogram for ESD. Notice it is very sharply peaked and has long tail, extending out past 40.
We can get a better view of the heavy tailed behavior by zooming in.
The red curve is a fit of the ESD to the Marchenko Pastur (MP) Random Matrix Theory (RMT) result; it is not a very good fit. This means the ESD does not resemble that of a Gaussian Random matrix. Instead, it looks heavy tailed. Which leads to the question…
(Yes do, as we shall see…)
Physicists love to claim they have discovered data that follows a power law:
But this is harder to do than it seems. And statisticians love to point this out. Don’t be fooled–we physicists knew this; Sornette’s book has a whole chapter on it. Still, we have to use best practices.
The first thing to do: plot the data on a log-log histogram, and check that this plot is linear–at least in some region. Let’s look at our ESD for AlexNet FC3:
Yes, it is linear–in the central region, for eigenvalue frequencies between roughly ~1 and ~100–and that is most of the distribution.
Why is not linear everywhere? Because it is finite size–there are min and max cutoffs. In the infinite limit, a powerlaw diverges at , and the tail extends indefinitely as . In any finite size data set, there will be an and .
Second, fit the data to a power law, with and in mind. The most widely available and accepted method the Maximum Likelihood Estimator (MLE), develop by Clauset et. al., and available in the python powerlaw package.
import numpy as np
import powerlaw
fit = powerlaw.Fit(evals, xmax=np.max(evals))
alpha, D = fit.alpha, fit.D
The D value is a quality metric, the KS distance. There are other options as well. The smaller D, the better. The table below shows typical values of good fits.
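To make the KS distance concrete, here is a minimal sketch of how it can be computed for a continuous power law with a lower cutoff. The function name `powerlaw_ks_distance` is hypothetical, the data is synthetic, and the powerlaw package's own computation of D may differ in detail.

```python
import numpy as np

def powerlaw_ks_distance(data, alpha, xmin):
    """Max distance between the empirical CDF and the power law CDF above xmin."""
    x = np.sort(np.asarray(data, dtype=float))
    x = x[x >= xmin]
    n = len(x)
    model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)  # CDF of p(x) ~ x^-alpha, x >= xmin
    ecdf_hi = np.arange(1, n + 1) / n              # empirical CDF just after each point
    ecdf_lo = np.arange(0, n) / n                  # empirical CDF just before each point
    return max(np.max(ecdf_hi - model_cdf), np.max(model_cdf - ecdf_lo))

# Synthetic samples with true exponent alpha = 3 and xmin = 1
rng = np.random.default_rng(1)
x = rng.pareto(2.0, size=5000) + 1.0

D_good = powerlaw_ks_distance(x, alpha=3.0, xmin=1.0)  # correct exponent
D_bad = powerlaw_ks_distance(x, alpha=2.0, xmin=1.0)   # wrong exponent
```

With the correct exponent D is small; with the wrong exponent it is noticeably larger, which is what makes it usable as a quality metric.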
The powerlaw package also makes some great plots. Below is a log-log plot generated for our fit of FC3, for the central region of the ESD. The filled lines represent our fits, and the dotted lines are the actual power law PDF (blue) and CCDF (red). The filled lines look like straight lines and overlap the dotted lines–so this fit looks pretty good.
Is this enough? Not yet…
We still need to know: do we have enough data to get a good estimate for $\alpha$, what are our error bars, and what kind of systematic errors might we get?
We can calibrate the estimator by generating some modest size (N=1000) random power law datasets using the numpy Pareto function, where
and then fitting these with the PowerLaw package. We get the following curve
The green line is a perfect estimate. The powerlaw package overestimates small $\alpha$ and underestimates large $\alpha$. Fortunately, most of our fits lie in the good range.
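A calibration run of this kind can be sketched as below. Two assumptions are worth stating loudly. First, numpy's Pareto sampler with shape `a`, shifted by 1, gives samples with density proportional to x^-(a+1) for x >= 1, so the true exponent is `a + 1`. Second, the sketch uses the closed-form MLE with the cutoff fixed at 1 as a simple stand-in for `powerlaw.Fit`, so it will not reproduce the systematic biases described above, which can come from the package's search for the cutoff. The name `mle_alpha` is hypothetical.

```python
import numpy as np

def mle_alpha(data, xmin=1.0):
    """Closed-form continuous power law MLE for a known lower cutoff xmin."""
    x = np.asarray(data, dtype=float)
    x = x[x >= xmin]
    return 1.0 + len(x) / np.sum(np.log(x / xmin))

rng = np.random.default_rng(0)
true_alphas = [2.0, 3.0, 4.0]
estimates = []
for alpha in true_alphas:
    # pareto(a) + 1 has density ~ x^-(a+1) for x >= 1, so use a = alpha - 1
    samples = rng.pareto(alpha - 1.0, size=1000) + 1.0
    estimates.append(mle_alpha(samples, xmin=1.0))
```

Plotting `estimates` against `true_alphas` gives a calibration curve; replacing `mle_alpha` with a call to `powerlaw.Fit` is how one would probe the package's own behavior.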
A good fit is not enough. We also should ensure that no other obvious model is a better fit. The powerlaw package lets us test our fit against other common (long tailed) choices, namely
For example, to check if our data is better fit by a log normal distribution, we run
R, p = fit.distribution_compare('powerlaw', 'lognormal', normalized_ratio=True)
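and examine R and the p-value. R is the log likelihood ratio of the first distribution to the second, so it is positive when the data is more likely under the power law. If R>0 and p <= 0.05, then we can conclude that a power law is a better model.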
Note that sometimes, for , the best model may be a truncated power law (TPL). This happens because our data sets are pretty small, and the tails of our ESDs fall off pretty fast. A TPL is just a power law (PL) with an exponentially decaying tail, and it may seem to be a better fit for small, fixed-size data sets. Also, the TPL has 2 parameters, whereas the PL has only 1, so it is not unexpected that the TPL would fit the data better.
Below we fit the linear layers of the many pretrained models in PyTorch. All fit a power law (PL) or truncated power law (TPL).
The table below lists all the fits, with the Best Fit and the KS statistic D.
Notice that all the power law exponents lie in the range 2-4, except for a couple outliers.
This is pretty fortunate since our PL estimator only works well around this range. This is also pretty remarkable because it suggests there is…
There is a deep connection between this range of exponents and some (relatively) recent results in Random Matrix Theory (RMT). Indeed, this seems to suggest that Deep Learning systems display the kind of Universality seen in Self Organized systems, like real spiking neurons. We will examine this in a future post.
Training DNNs is hard. There are many tricks one has to use. And you have to monitor the training process carefully.
Here we suggest that the ESDs of the Linear / FC layers of a well trained DNN will display power law behavior. And the exponents will be between 2 and 4.
A counterexample is Inception V3. Both FC layers, 226 and 302, have unusually large exponents. Looking closer at Layer 222, the ESD is not a power law at all, but rather a bimodal heavy tailed distribution.
We conjecture that as good as Inception V3 is, perhaps it could be further optimized. It would be interesting to see if anyone can show this.
The commonly accepted method uses Maximum Likelihood Estimator (MLE). This
The conditional probability is defined by evaluating on a data set (of size n)
The log likelihood is
Which is easily reduced to
The maximum likelihood occurs when
The MLE estimate, for a given $x_{min}$, is:
We can either input $x_{min}$, or search for the best possible estimate (as explained in the Clauset et al. paper).
Notice, however, that this estimator does not explicitly take $x_{max}$ into account. And for this reason it seems to be very limited in its application. A better method, such as the one developed recently by Thurner (but not yet coded in python), may prove more robust for a larger range of exponents.
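For readers who want the steps written out, the estimator discussed above can be derived as follows. This is a reconstruction of the standard continuous power law MLE from Clauset et al., assuming a lower cutoff $x_{min}$ and data points $x_i \ge x_{min}$, $i=1,\dots,n$; it is not a copy of this post's original equations, and the symbols are chosen here for illustration.

```latex
% Normalized power law density with lower cutoff x_min (requires alpha > 1)
p(x) = \frac{\alpha-1}{x_{min}}\left(\frac{x}{x_{min}}\right)^{-\alpha},
\qquad x \ge x_{min}

% Likelihood of the data set of size n
p(x \mid \alpha) = \prod_{i=1}^{n}
  \frac{\alpha-1}{x_{min}}\left(\frac{x_i}{x_{min}}\right)^{-\alpha}

% Log likelihood
\mathcal{L} = \sum_{i=1}^{n}\left[\ln(\alpha-1) - \ln x_{min}
  - \alpha\ln\frac{x_i}{x_{min}}\right]
  = n\ln(\alpha-1) - n\ln x_{min}
  - \alpha\sum_{i=1}^{n}\ln\frac{x_i}{x_{min}}

% Stationarity condition
\frac{\partial\mathcal{L}}{\partial\alpha}
  = \frac{n}{\alpha-1} - \sum_{i=1}^{n}\ln\frac{x_i}{x_{min}} = 0

% MLE estimate for a given x_min
\hat{\alpha} = 1 + n\left[\sum_{i=1}^{n}\ln\frac{x_i}{x_{min}}\right]^{-1}
```

Only $x_{min}$ appears in the final expression, which is why the estimator has no explicit dependence on $x_{max}$.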
Similar results arise in the Linear and Attention layers in NLP models. Below we see the power law fits for 85 Linear layers in the 6 pretrained models from AllenNLP.
80% of the linear layers have
We can naively extend these results to Conv2D layers by slicing each layer's 4-index weight tensor into several rectangular 2D matrices. Doing this, and repeating the fits, we find that 90% of the Conv2D matrices fit a power law with .
A few of the Conv2D layers show very high exponents. What could this mean?
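One way to do this slicing is sketched below. It assumes the PyTorch Conv2D weight layout (out_channels, in_channels, kernel_height, kernel_width) and takes one 2D matrix per kernel position; both the layout and the slicing choice are assumptions for illustration, the helper name `conv2d_to_matrices` is hypothetical, and the random tensor stands in for real layer weights.

```python
import numpy as np

def conv2d_to_matrices(weights):
    """Split a 4-index Conv2D weight tensor into one 2D matrix per kernel position.

    Assumes the layout (out_channels, in_channels, kernel_height, kernel_width).
    """
    out_ch, in_ch, kh, kw = weights.shape
    return [weights[:, :, i, j] for i in range(kh) for j in range(kw)]

# Hypothetical 3x3 convolution with 32 input and 64 output channels
rng = np.random.default_rng(0)
conv_weights = rng.normal(size=(64, 32, 3, 3))

matrices = conv2d_to_matrices(conv_weights)  # 9 matrices, each 64 x 32
```

Each resulting matrix can then be analyzed like a Linear layer: compute its singular values, form the ESD, and fit the tail.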
It will be very interesting to dig into these results and determine if the Power Law exponents display Universality across layers, architectures, and data sets.
Even more pretrained models are now available in pyTorch…45 models and counting. I have analyzed as many as possible, giving 7500 layer weight matrices (including the Conv2D slices!)
The results are conclusive: ~80% of the exponents display Universality. An amazing find. With such great results, we will soon release an open source package so you can analyze your own models.
Code for this study, and future work, is available in this repo.
Research talks are available on my youtube channel. Please Subscribe
]]>
An early talk describing details in this paper
Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly-contradictory aspects of deep neural networks (DNNs). We apply RMT to several well known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small, toy models.
We show that the DNN training process itself implicitly implements a form of self-regularization associated with the entropy collapse / information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization
whereas large, modern deep networks display a new kind of heavy tailed self-regularization.
We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training.
Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomena unique to DNNs.
We argue that this heavy tailed self-regularization has both practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN Energy landscape / optimization problem.