The WeightWatcher tool is an open-source Python package that can be used to predict the test accuracy of a series of similar Deep Neural Networks (DNNs), without peeking at the test data.
WeightWatcher is based on research done in collaboration with UC Berkeley on the foundations of Deep Learning. We built this tool to help you analyze and debug your Deep Neural Networks.
It is easy to install and run; the tool will analyze your model and return both summary statistics and detailed metrics for each layer:
The WeightWatcher github page lists various papers and online presentations given at UC Berkeley and Stanford, and various conferences like ICML, KDD, etc. There, and on this blog, you can find examples of how to use it.
This post describes how to select the metric for your models, and why.
You can use WeightWatcher to model a series of DNNs of either increasing size, or with different hyperparameters. But you need different metrics, alpha vs. alpha_weighted, for different cases:

But why do we need 2 different alpha metrics? To understand this, we first need to understand the theory behind the Spectral Norm.
Traditional machine learning theory suggests that the test performance of a Deep Neural Network is correlated with the average log Spectral Norm. That is, the test error should be bounded by the average Spectral Norm, so the smaller the norm, the smaller the test error.
The Spectral Norm of a matrix $\mathbf{W}$ is just the (square root of the) maximum eigenvalue of its correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$.

We denote the (squared) Spectral Norm as:

$$\lambda_{max}:=\Vert\mathbf{W}\Vert^{2}_{2}$$

Note: in earlier papers we (and others) also use the unsquared form $\Vert\mathbf{W}\Vert_{2}=\sqrt{\lambda_{max}}$.
WeightWatcher computes the log Spectral Norm for each layer, and defines:

$$\langle\log\Vert\mathbf{W}\Vert^{2}_{2}\rangle:=\frac{1}{L}\sum_{l=1}^{L}\log\lambda^{max}_{l}$$

which we compute by averaging over all $L$ layer weight matrices.
We compute the eigenvalues, or the Empirical Spectral Density (ESD), of each layer by running SVD directly on the layer weight matrix or, for Conv2D layers, on some matrix slice (see the Appendix).
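As a minimal numpy sketch (the function names here are illustrative, not the tool's API; the actual tool also handles Conv2D slicing and normalization), the per-layer computation looks like:

```python
import numpy as np

def layer_esd(W):
    """Empirical Spectral Density: eigenvalues of X = W^T W,
    i.e. the squared singular values of the layer weight matrix W."""
    sv = np.linalg.svd(W, compute_uv=False)
    return sv * sv

def log_spectral_norm(W):
    """log10 of the (squared) Spectral Norm, i.e. log10(lambda_max)."""
    return np.log10(np.max(layer_esd(W)))

# hypothetical 2-layer model: average the per-layer log Spectral Norms
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)), rng.standard_normal((128, 64))]
avg_log_spectral_norm = np.mean([log_spectral_norm(W) for W in layers])
```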
It has been suggested by Yoshida and Miyato that the Spectral Norm would make a good regularizer for DNNs. The basic idea is that the test data should look enough like the training data so that if we can say something about how the DNN performs on perturbed training data, that will also say something about the test performance.
Here is the text from the paper; let me explain how to interpret this in practical terms.
We imagine the test data must look like the training data $\mathbf{x}_{train}$ with some small perturbation $\epsilon$. Let us write this as: $\mathbf{x}_{test}=\mathbf{x}_{train}+\epsilon$.
As we train a DNN, we run several epochs of BackProp, which amounts to multiplying by a weight matrix at each layer, applying an activation function, and repeating until we get a label.
To get an estimate, or bound, on the test accuracy, we can then imagine applying the matrix multiply to the perturbed training point $\mathbf{x}+\epsilon$.

So if we can say something about how the DNN should perform on a perturbed training point $\mathbf{x}+\epsilon$, we can say something about the test output.
What can we say?

When we apply an activation function, like a ReLU, it acts pointwise on the data vector, like an affine transformation. So the ReLU + weight matrix multiply is a bounded linear operator (at least piecewise), and therefore it can be bounded by its Spectral Radius.

So we say that we want to learn a DNN model such that, at each layer, the action of the layer matrix-vector multiply is bounded by the Spectral Norm of its layer weight matrix. This should, in theory, give good test performance. By applying a Spectral Norm regularizer, we think we can make a DNN that is more robust to small changes in its input. That is, we can make it perform better on random perturbations of the training data, and, therefore, presumably, better on the test data.
From Bounds to a Regularizer
When we develop a mathematical bound, our first instinct is to develop a numerical regularizer. That is, when solving our optimization problem (minimizing the DNN Energy function), we want to prevent the solution from blowing up. Having a mathematically rigorous bound helps here, since it seems to bound the BackProp optimization step on every iteration:
Notice that since the regularizer appears in the optimization problem, it must be differentiable (either directly, or using some trick).
A regularizer must also be easy to implement. For example, we could also bound the Jacobian, but this is very expensive to compute, and it is difficult to apply even a norm bound of this kind on every step of BackProp. It might also seem that the Spectral Norm is hard to compute because one needs to run SVD, but there is a simple trick: one can approximate the maximum eigenvalue of $\mathbf{X}$ using the Power Method, by running it for, say, a few steps, and then simply add this to the SGD update step. There are many examples of this on github, and it has been applied, in particular, to GANs, where it has been shown to work very well in large scale studies.
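The Power Method trick can be sketched in a few lines of numpy (a simplified illustration of the idea, not the regularizer as implemented in any particular framework):

```python
import numpy as np

def power_method_sqnorm(W, n_iter=10, seed=0):
    """Approximate the maximum eigenvalue of X = W^T W (the squared
    Spectral Norm) by repeatedly applying X to a random start vector."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = W.T @ (W @ v)      # apply X = W^T W
        v /= np.linalg.norm(v)
    # Rayleigh quotient: estimate of lambda_max
    return v @ (W.T @ (W @ v))

W = np.diag([3.0, 1.0, 0.5])
approx = power_method_sqnorm(W, n_iter=10)
exact = np.max(np.linalg.svd(W, compute_uv=False)) ** 2  # = 9.0
```

A few iterations are usually enough in practice, since the estimate converges geometrically in the eigenvalue gap.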
Also, since this expression is linear in $\mathbf{W}$, we can readily plug it into TensorFlow/Keras or PyTorch and use autograd to compute the derivatives. It is available as a TensorFlow Addon, and is part of the core PyTorch 1.7 package.
Spectral Norm Regularization has not been widely used (outside say GANs) because it only works well for very deep networks. See, however, this adaption for smaller DNNs called Mean Spectral Normalization.
What do we want from a theory of learning? With WeightWatcher, we have never sought a rigorous bound. That's not the goal of our theory. We do not seek a bound, because a bound describes the worst-case behavior; we seek to understand the average-case behavior. (However, what we can do is repair the Spectral Norm as a metric, as shown below.)
With the average-case, we hope to be able to predict the generalization error of a DNN (without peeking at the test data). And we mean this in a very practical sense, applying to very large, production-quality models, both when training and when fine-tuning them.
So what’s the difference between having a bound and analyzing average-case behavior ?
It's not at all obvious that we can expect a mathematical bound to be correlated with trends in the test accuracy of real-world DNNs. It turns out the Spectral Norm works pretty well, at least across pretrained DNNs of increasing depth.
Here is an example, showing how the Spectral Norm performs on the VGG series
We see that the average log Spectral norm correlates quite well with the test accuracy of the DNN architecture series of the pretrained VGG models. This is remarkable, since we do not have access to the training or the test data (or other information).
We have used WeightWatcher to analyze hundreds of pretrained models, of increasing depths, and using different data sets. Generally speaking, the average log Spectral Norm correlates well with the test accuracies of many different DNN series and for different data sets.
But not always. And that’s the rub.
Oddly, while the average log Spectral Norm is correlated with test error when changing the depth of a DNN model, it turns out to be anti-correlated with test error when varying the optimization hyperparameters. This is a classic example of Simpson's paradox.
We have noted this, and it has also been pointed out by Bengio and co-workers. Indeed, an entire contest was recently set up to study this issue: the 2020 NeurIPS Predicting Generalization challenge.
Below we can see the paradox by looking at predictions for ~100 small, pre-trained VGG-like models (provided by the contest). We use WeightWatcher (version ww0.4) to compute the average log_spectral_norm, and compare it to the reported test accuracies for the contest task2_v1 set of baseline VGG-like models:
For more details, please see the contest website, and/or our contest post-mortem Jupyter Book and paper on the contest (coming soon).
Notice that the 2xx models have the best test accuracies and, correspondingly, as a group, the smallest average log_spectral_norm. Smaller error correlates with the smaller norm metric. Likewise, the 6xx models have the smallest test accuracies as a group, and also the largest average log_spectral_norm.
This is a classic example of Simpson’s Paradox.
However, also note that, for each model group (2xx, 10xx, 6xx, & 9xx), we can draw, roughly, a straight line showing that most of the test accuracies in that group are anti-correlated with the average log_spectral_norm. Now the regression is not always great, and there are outliers, but we think the general trends hold well enough for this level of discussion (and we will drill into the details in our next paper).
Here, we see a large trend across the similar models, trained on the same dataset, but with different depths. When looking closely at each model group, however, we see the reverse trend.
This makes the Spectral Norm difficult to use as a general purpose metric for predicting test accuracies.
WeightWatcher to the Rescue
Using WeightWatcher, however, we can repair the average log Spectral Norm metric by computing it as a weighted average, weighted by the WeightWatcher alpha metric.
Here is a similar plot on the same task2_v1 data, but this time reporting the WeightWatcher average power law metric alpha. Notice that alpha is well correlated with the test accuracy within each model group, as expected when changing model hyperparameters. Moreover, alpha is not correlated with the test accuracy more broadly across different depths, nor is it correlated with the average log Spectral Norm (not shown).
WeightWatcher alpha tells us how correlated a single DNN model is. And we can use alpha to correct the average log_spectral_norm by simply taking a weighted average (called alpha_weighted):

$$\hat{\alpha}:=\frac{1}{L}\sum_{l=1}^{L}\alpha_{l}\log\lambda^{max}_{l}$$
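Given per-layer PL exponents and maximum eigenvalues (the values below are hypothetical, purely for illustration), the weighted average is just:

```python
import numpy as np

# hypothetical per-layer PL exponents and maximum eigenvalues
alphas = np.array([2.5, 3.0, 3.5])
lambda_max = np.array([40.0, 25.0, 60.0])

log_spectral_norms = np.log10(lambda_max)
avg_log_spectral_norm = log_spectral_norms.mean()

# alpha_weighted: the alpha-weighted average of the log Spectral Norms
alpha_weighted = np.mean(alphas * log_spectral_norms)
```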
If we look closely, we can see in more detail how the weighted alpha corrects the average log_spectral_norm metric in the VGG architecture series. Below we use WeightWatcher to plot the different metrics vs. the Top 1 Test Accuracy for the many different pre-trained VGG models.
Consider the plot on the far right, and, specifically, the pink (BN-VGG-16) and red (VGG-19) dots, near test accuracy ~26. These are 2 models with both different depths (16 vs 19 layers) and different hyperparameter settings (BatchNorm or not). The two models have nearly the same accuracy, but a large variance between their average log_spectral_norm values. Now consider the far left plot, for average alpha, which shows that red has a smaller alpha than pink: the VGG-19 model is more strongly correlated than BN-VGG-16. The average alpha for the red dot is ~3.5, whereas for the pink dot it is ~3.85. When we combine these 2 metrics, on the middle plot (alpha_weighted), the 2 models now appear much closer together. So the alpha_weighted metric corrects the average log_spectral_norm, reducing the variance between similar models, and making it more suitable for treating models of different depths and different hyperparameter settings.
The open source WeightWatcher tool provides metrics for Deep Neural Networks that allow the user to predict (trends in) the test accuracies of Deep Neural Networks without needing the test data. The different (power law) metrics, alpha and alpha_weighted, apply to models with different hyperparameter settings and different depths, respectively. Here, we explain why.
The average alpha metric describes the amount of correlation contained in the DNN weight matrices. Smaller alpha correlates with better test accuracy for a single model with different hyperparameter settings. It is a unique metric, developed from the theory of strongly correlated systems in theoretical chemistry and physics.
The average weighted alpha metric is suited for treating a series of models with different depths, like the VGG series: VGG11, VGG13, VGG16, VGG19. It is a weighted average of the log Spectral Norm.
To explain why this works, we have reviewed the theory and application of Spectral Norm Regularization, and the use of the average log Spectral Norm as a metric for predicting DNN test accuracies.
While theory suggests that the average log Spectral Norm might be able to predict the generalization performance of different pre-trained DNNs, in practice, it is correlated with test error for models with different depths, and anti-correlated for models trained with different hyperparameters. This is a classic example of Simpson's Paradox.
We show that we can fix up the average log_spectral_norm (as provided in WeightWatcher) by using a weighted average, weighted by the WeightWatcher power-law layer alpha metric. And this is exactly the WeightWatcher metric alpha_weighted:
Try it yourself on your own DNN models.
pip install weightwatcher
And let me know how it goes.
Spectral Density of 2D Convolutional Layers
We can test this theory numerically using WeightWatcher. Notice, however, that while it is obvious how to define $\mathbf{X}$ for a Dense matrix, there is some ambiguity in doing this for a Conv2D layer in a way that makes this work as a useful metric.
WeightWatcher has a few methods for computing the SVD of a Conv2D layer, depending on the version. These methods give slightly different layer Spectral Norms, with ww0.4 being the best estimator so far. The ww2x=True option is included for backward compatibility with earlier papers.
The weightwatcher tool uses power law fits to model the eigenvalue density of weight matrices of any Deep Neural Network (DNN).
The average power-law exponent alpha is remarkably well correlated with the test accuracy when changing the number of layers and/or fine-tuning the hyperparameters. In our latest paper, we demonstrate this using a meta-analysis of hundreds of pre-trained models. This begs the question:
Why can we model the weight matrices of DNNs using power law fits ?
In theoretical chemistry and physics, we know that strongly correlated, complex systems frequently display power laws.
In many machine learning and deep learning models, the correlations also display heavy / power law tails. After all, the whole point of learning is to learn the correlations in the data. Be it a simple clustering algorithm, or a very fancy Deep Neural Network, we want to find the most strongly correlated parts to describe our data and make predictions.
For example, in strongly correlated systems, if you place an electron in a random potential, it will show a transition from delocalized to localized states, and the spectral density will display power law tails. This is called Anderson Localization, and Anderson won the Nobel Prize in Physics for this in 1977. In the early 90s, Cizeau and Bouchaud argued that a Wigner-Levy matrix will show a similar localization transition, and since then have modeled the correlation matrices in finance using their variant of heavy tailed random matrix theory (RMT). Even today this is still an area of active research in mathematics and in finance.
In my earlier days as a scientist, I worked on strongly correlated multi-reference ab initio methods for quantum chemistry. Here, the trick is to find the right correlated subspace to get a good low order description. I believe, in machine learning, the same issues arise. For this reason, I also model the correlation matrices in Deep Neural Networks using heavy tailed RMT.
Here I will show that the simplest machine learning model, Latent Semantic Analysis (LSA), shows a localization transition, and that this can be used to identify and characterize the heavy tail of the LSA correlation matrix.
Take Latent Semantic Analysis (LSA). How do we select the Latent Space? We need to select the top-K components of the TF-IDF Term-Document matrix. I believe this can be done by selecting those K eigenvalues that best fit a power law. Here is an example, using the scikit-learn 20newsgroups data:
We call this plot the Empirical Spectral Density (ESD). It is just a histogram, on a log scale, of the eigenvalues of the TF-IDF correlation matrix $\mathbf{X}=\mathbf{M}^{T}\mathbf{M}$, the square of the TF-IDF matrix $\mathbf{M}$. The eigenvalues $\lambda_{i}$ of $\mathbf{X}$ are the singular values $\sigma_{i}$ of $\mathbf{M}$, squared: $\lambda_{i}=\sigma_{i}^{2}$.
We fit the eigenvalues to a power law (PL) using the python powerlaw package, which implements a standard MLE estimator.
fit = powerlaw.Fit(eigenvalues)
The fit selects the optimal xmin using a brute-force search, and returns the best PL exponent alpha and the quality of the fit (the KS-distance D). The orange line displays the start of the power law tail, which contains the most strongly correlated eigenpairs.
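Under the hood, for a fixed xmin, the MLE for a continuous power law has a simple closed form (the Hill / Clauset-Shalizi-Newman estimator). A minimal numpy sketch, for illustration only (the powerlaw package additionally searches over xmin):

```python
import numpy as np

def pl_alpha_mle(data, xmin):
    """Closed-form MLE for the exponent of a continuous power law
    p(x) ~ x^(-alpha), fit to the tail x >= xmin."""
    tail = data[data >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# sample from a power law with alpha = 3 via inverse-transform sampling
rng = np.random.default_rng(0)
u = rng.uniform(size=50_000)
samples = (1.0 - u) ** (-1.0 / (3.0 - 1.0))  # xmin = 1, alpha = 3

alpha_hat = pl_alpha_mle(samples, xmin=1.0)  # close to 3.0
```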
We can evaluate the quality of the PL fit by comparing the ESD and the actual fit on a log-log plot.
The blue solid line is the ESD on a log-log scale (i.e. the PDF), and the blue dotted line is the PL fit. (The red solid line is the empirical CDF, and the red dotted line the fit.) The PDF (blue) shows a very good linear fit, except perhaps for the largest eigenvalues. Likewise, the CDF (red) shows a very good fit, up until the end of the tail. This is typical of power-law fits on real-world data, which are usually best described as a Truncated Power Law (TPL), with some noise in the very far tail (more on the noise in a future post). And the reported KS-distance D is quite small, which is exceptional.
We can get even more insight into the quality of the fit by examining how the PL method selected xmin, the start of the PL. Below, we plot the KS-distance for each possible choice of xmin:

The optimization landscape is convex, with a clear global minimum at the orange line. That is, there are 547 eigenpairs in the tail of the ESD displaying strong power-law behavior.

To form the Latent space, we select these largest 547 eigenpairs, to the right of the orange line, the start of the (truncated) power-law fit.
To identify the localization transition in LSA, we can plot localization ratios in the same way, where the localization ratio is defined as in our first paper:

def localization_ratio(v):
    return np.linalg.norm(v, ord=1) / np.linalg.norm(v, ord=np.inf)
We see that we get an elbow curve, and the eigenvalue cutoff appears just to the right of the 'elbow':

Other methods include looking at scree plots, or even just the sorted eigenvalues themselves.
Typically, in unsupervised learning, one selects the top-K clusters, eigenpairs, etc. by looking at some so-called ‘elbow curve’, and identifying the K at the inflection point. We can make these plots too. A classic way is to plot the explained variance per eigenpair:
We see that the power-law xmin, the orange line, occurs just to the right of the inflection point. So these two methods give similar results. No other method, however, provides a theoretically well-defined way of selecting the K components.
I suspect that in these strongly correlated systems, the power law behavior really kicks in right at / before these inflection points. So we can find the optimal low-rank approximation to these strongly correlated weight matrices by finding that subspace where the correlations follow a power-law / truncated power law distribution. Moreover, we can detect, and characterize, these correlations by both the power-law exponent alpha, and the quality of the fit D.
And AFAIK, this has never been suggested before.
We introduce weightwatcher (ww), a python tool for computing quality metrics of trained, and pretrained, Deep Neural Networks.
pip install weightwatcher
This blog describes how to use the tool in practice; see our most recent paper for even more details.
Here is an example with pretrained VGG11 from pytorch (ww works with keras models also):
import weightwatcher as ww
import torchvision.models as models

model = models.vgg11(pretrained=True)
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
summary = watcher.get_summary()
details = watcher.get_details()
WeightWatcher generates a dict that summarizes the empirical quality metrics for the model (showing the most useful metrics):

summary = {
    ...
    'alpha': 2.572493,
    'alpha_weighted': 3.418571,
    'lognorm': 1.252417,
    'logspectralnorm': 1.377540,
    'logpnorm': 3.878202,
    ...
}
The tool also generates a details pandas dataframe, with a layer-by-layer analysis (shown below)
The summary contains the Power Law exponent (alpha), as well as several log norm metrics, as explained in our papers, and below. Each value represents an empirical quality metric that can be used to gauge the gross effectiveness of the model, as compared to similar models.
(The main weightwatcher notebook demonstrates more features )
For example, lognorm is the average over all L layers of the log of the Frobenius norm of each layer weight matrix W:

lognorm: average log Frobenius Norm := $\langle\log\Vert\mathbf{W}\Vert_{F}\rangle$

where the individual layer log Frobenius norm, for say a Fully Connected (FC) layer, may be computed as
np.log10(np.linalg.norm(W))
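A minimal numpy sketch of this average, over a hypothetical list of layer weight matrices (the helper name is illustrative, not the tool's API):

```python
import numpy as np

def avg_log_frobenius_norm(weight_matrices):
    """lognorm: average of log10 of the Frobenius norm over all layers."""
    return np.mean([np.log10(np.linalg.norm(W)) for W in weight_matrices])

# hypothetical 2-layer model
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)), rng.standard_normal((128, 64))]
lognorm = avg_log_frobenius_norm(layers)
```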
We can use these metrics to compare models across a common architecture series, such as the VGG series, the ResNet series, etc. These can be applied to trained models, pretrained models, and/or even fine-tuned models.
Consider the series of models VGG11, VGG11_BN, ..., VGG19, VGG19_BN, available in pytorch. We can plot the various reported log norm metrics vs. the reported test accuracies.
For a series of similar, well-trained models, all of the empirical log norm metrics correlate well with the reported test accuracies! Moreover, the Weighted Alpha and Log Norm metrics work best.
Smaller is better.
We also run an ordinary linear regression (OLS), and report the root mean squared error (RMSE), for the VGG series and for several other CV models available in the pytorch torchvision.models package.
We have tested this on over 100 well-trained, pre-trained computer vision (CV) models, on multiple data sets (such as the ImageNet-1K subset of ImageNet). These trends hold for nearly every case of well-trained models.

Notice that the RMSE for ResNet trained on ImageNet-1K is larger than for ResNet trained on the full ImageNet, even though ResNet-1K has more models in the regression (19 vs 5). For the exact same model, the larger and better data set shows a better OLS fit!
We have several ideas where we hope this would be useful. These include:
We can learn even more about a model by looking at the empirical metrics layer by layer. The result is a dataframe that contains empirical quality metrics for each layer of the model. An example output, for VGG11, is:
The columns contain both metadata for each layer (id, type, shape, etc), and the values of the empirical quality metrics for that layer matrix.
These metrics depend on the spectral properties: the singular values of W, or, equivalently, the eigenvalues of the correlation matrix X of W.

WeightWatcher is unique in that it can measure the amount of correlation, or information, that a model contains, without peeking at the training or test data. Data correlation is measured by the Power Law (PL) exponent alpha.
WeightWatcher computes the eigenvalues (by SVD) for each layer weight matrix W, and fits the eigenvalue density (i.e. the histogram) to a truncated Power Law (PL), with PL exponent alpha.

In nearly every pretrained model we have examined, the Empirical Spectral Density can be fit to a truncated PL. And the PL exponent alpha usually lies in the range 2 to 6, where smaller is better.
Here is an example of the weightwatcher output for the second Fully Connected layer (FC2) in VGG11. These results can be reproduced using the WeightWatcher-VGG.ipynb notebook in the ww-trends-2020 github repo, using the options:
results = watcher.analyze(alphas=True, plot=True)
The plot below shows the ESD (Empirical Spectral Density) of the weight matrix W in layer FC2. Again, this is a (normalized) histogram of the eigenvalues of the correlation matrix X.
The FC2 matrix is square, 512×512, and has an aspect ratio of Q=N/M=1. The maximum eigenvalue is about 45, which is typical for many heavy tailed ESDs. And there is a large peak at 0, which is normal for Q=1. Because Q=1, the ESD might look heavy tailed, but this can be deceiving, because a random matrix with Q=1 would look similar. Still, as with nearly all well-trained DNNs, we expect the FC2 ESD to be well fit by a Power Law model, with an exponent in the Fat Tailed Universality class (alpha between 2 and 4), or at least, for a model that is not 'flawed' in some way, alpha below 6.
alpha: the PL exponent for W.

The smaller alpha is, for each layer, the more correlation that layer describes. Indeed, in the best performing models, all of the layer alphas approach 2.
To check that the ESD is really heavy tailed, we need to check the Power Law (PL) fit. This is done by inspecting the weightwatcher plots.
The plot on the right shows the output of the powerlaw package, which is used to do the PL fit of the ESD. The PL exponent is a typical value for (moderately, i.e. Fat) Heavy Tailed ESDs. Also, the KS distance D is small, which is good. We can also see this visually: the dots are the actual data, and the lines are the fits. If the lines are reasonably straight, and match the dots in the fitted range, the fit is good. And they are. This is a good PL fit.
As shown above, with ResNet vs ResNet-1K, the weightwatcher tool can help you decide if you have enough data, or if your model/architecture would benefit from more data. Indeed, poorly trained models, with very bad data sets, show strange behavior that you can detect using weightwatcher.

Here is an example of the infamous OpenAI GPT model, originally released as a poorly-trained model, so it would not be misused. It was too dangerous to release! We can compare this deficient GPT with the new and improved GPT2-small model, which has basically the same architecture, but has been trained as well as possible. Yes, they gave in and released it! (Both are in the popular huggingface package, and weightwatcher can read and analyze these models.) Below, we plot a histogram of the PL exponents alpha, as well as a histogram of the log Spectral Norms, for each layer in GPT (blue) and GPT2 (red).
These results can be reproduced using the WeightWatcher-OpenAI-GPT.ipynb notebook in the ww-trends-2020 github repo.
Notice that the poorly-trained GPT model has many unusually high values of alpha. Many are above 6, and some even range up to 10 or 12! This is typical of poorly trained and/or overparameterized models.

Notice that the new and improved GPT2-small no longer has the unusually high PL exponents, and, also, the peak of the histogram distribution is farther to the left (smaller).
Smaller alpha is always better.
If you have a poorly trained model, and you fix your model by adding more and better data, the alphas will generally settle down to below 6. Note: this cannot be seen in a total average, because the large values will throw the average off; to see this, make a histogram plot of alpha.
What about the log Spectral Norm? It seems to show inconsistent behavior. Above, we saw that smaller is better. But now it looks as if smaller is worse? What is going on with this, and the other empirical Norm metrics?
Now let’s take a deeper look at how to use the empirical log Norm metrics:
Unlike the PL exponent alpha, the empirical Norm metrics depend strongly on the scale of the weight matrix W. As such, they are highly sensitive to problems like Scale Collapse–and examining these metrics can tell us when something is potentially very wrong with our models.
First, what are we looking at? The empirical (log) Norm metrics reported are defined using the raw eigenvalues. We can compute the eigenvalues of X pretty easily (although in the code we actually compute the singular values of W, using the sklearn TruncatedSVD method):

M = np.min(W.shape)
svd = TruncatedSVD(n_components=M-1)
svd.fit(W)
sv = svd.singular_values_
eigen_values = sv * sv
Recall that the Frobenius norm (squared) of a matrix W is the sum of the eigenvalues of X, and the Spectral Norm (squared) is just the maximum eigenvalue of X. The weighted alpha and the log alpha- (or Schatten) Norm are computed after fitting the PL exponent alpha for the layer. In math, these are:

$$\Vert\mathbf{W}\Vert^{2}_{F}=\sum_{i}\lambda_{i},\qquad \Vert\mathbf{W}\Vert^{2}_{2}=\lambda_{max}$$

$$\hat{\alpha}:=\frac{1}{L}\sum_{l}\alpha_{l}\log\lambda^{max}_{l},\qquad \log\Vert\mathbf{W}\Vert^{\alpha}_{\alpha}:=\log\sum_{i}\lambda^{\alpha}_{i}$$
The weightwatcher code computes the necessary eigenvalues, does the PowerLaw (PL) fits, and reports these, and other, empirical quality metrics for you, both as averages (summary) and layer-by-layer (details). The details dataframe has many more metrics as well, but for now we will focus on these four.
Now, what can we do with them? We are going to look at 3 ways to identify potential problems in a DNN which cannot be seen by just looking at the test accuracy.
Using the weightwatcher details dataframe, we can plot the PL exponent alpha vs. the layer id to get what is called a Correlation Flow plot:
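A sketch of such a plot, assuming a details dataframe with layer_id and alpha columns (the column names and values below are illustrative; check your weightwatcher version for the exact schema):

```python
import pandas as pd

# hypothetical details dataframe, as returned by watcher.analyze()
details = pd.DataFrame({
    "layer_id": [2, 5, 8, 11, 14],
    "alpha":    [2.3, 2.8, 3.1, 3.6, 6.5],
})

# Correlation Flow: e.g. details.plot(x="layer_id", y="alpha", marker="o")
# Also useful: flag layers with unusually large PL exponents (alpha > 6)
flagged = details[details.alpha > 6.0]
```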
Let us do this, by comparing 3 common (pretrained) computer vision models: VGG, ResNet, and DenseNet.
These results can be reproduced using the following notebooks:
Recall that good models have average PL exponents in the Fat Tailed Universality class (roughly 2 to 4). Likewise, we find that, if we plot alpha vs layer_id, then good models also have stable alphas, in this range.
The VGG11 and VGG19 models have good alphas, all within the Fat Tailed Universality class, or smaller. And both the smaller and larger models show similar behavior. Also, notice that the last 3 FC layers in the VGG models all have smaller final alphas. So while the alphas are increasing as we move down the model, the final FC layers seem to capture and concentrate the information, leading to more correlated layer weight matrices at the end.
ResNet152 is an even better example of good Correlation Flow. It has a large number of alphas near 2, contiguously, for over 200 layers. Indeed, ResNet models have been trained with over 1000 layers; clearly the ResNet architecture supports a good flow of information.
Good Correlation Flow shows that the DNN architecture is learning the correlations in the data at every layer, and implies (*informally) that information is flowing smoothly through the network.
Good DNNs show good Correlation Flow
We also find that models in an architecture series (VGG, ResNet, DenseNet, etc) all have similar Correlation Flow patterns, when adjusting for the model depth.
Bad models, however, have alphas that increase with layer_id, or that behave erratically. This means that the information is not flowing well through the network, and the final layers are not fully correlated. For example, the older VGG models have alphas in a good range, but, as we go down the network, the alphas are systematically increasing. The final FC layers fix the problem, although, maybe, a few residual connections, like in ResNet, might improve these old models even more.
You might think adding a lot of residual connections would improve Correlation Flow, but too many connections is also bad. The DenseNet series is an example of an architecture with too many residual connections. Here, with both the pretrained DenseNet126 and DenseNet161, we see many large alphas, and, looking down the network layers, the alphas are scattered all over. The Correlation Flow is poor, even chaotic, and, we conjecture, less than optimal.
Curiously, the ResNet models show good flow internally, as shown when we zoom in, in (d) above. But the last few layers have unusually large alphas; we will discuss this phenomenon now.
Advice: If you are training or finetuning a DNN model for production use, use weightwatcher to plot the Correlation Flow. If you see alphas increasing with depth, behaving chaotically, or there are just a lot of alphas >> 6, revisit your architecture and training procedures.
When is a DNN over-parameterized, once trained on some data?

Easy: just look at the alphas. We have found that well-trained, or perhaps fully-trained, models should have alphas below 6. And the best CV models have most of their alphas just above 2.0. However, some models, such as the NLP OpenAI GPT2 and BERT models, have a wider range of alphas. And many models have several unusually large alphas, with alpha >> 6. What is going on? And how is it useful?
The current batch of NLP Transformer models are great examples. We suspect that many models, like BERT and GPT-xl, are over-parameterized, and that to fully use them in production, they need to be fine-tuned. Indeed, that is the whole point of these models; NLP transfer learning.
Let's take a look at the current crop of pretrained OpenAI GPT-2 models, provided by the huggingface package. We call this the "good-better-best" series.
These results can be reproduced using the WeightWatcher-OpenAI-GPT2.ipynb notebook.
For both the PL exponent (a) and our Log Alpha Norm (b), Smaller is Better. The latest and greatest OpenAI GPT2-xl model (in red) has both smaller alphas and smaller empirical log norm metrics, compared to the earlier GPT2-large (orange) and GPT2-medium (green) models.
But the GPT2-xl model also has more outlier alphas:
We have seen similar behavior in other NLP models, such as comparing OpenAI GPT to GPT2-small, and the original BERT, as compared to the Distilled BERT (as discussed in my recent Stanford Lecture). We suspect that when these large NLP Transformer models are fine-tuned or distilled, the alphas will get smaller, and performance will improve.
Advice: So when you fine-tune your models, monitor the alphas with weightwatcher
. If they do not decrease enough, add more data, and/or try to improve the training protocols.
But you also have to be careful not to break your model, as we have found that some distillation methods may do this.
Frequently one may finetune a model, for transfer learning, distillation, or just to add more data.
How can we know if we broke the model?
We have found that poorly trained models frequently exhibit Scale Collapse, in which one or more layers have unusually small Spectral and/or Frobenius Norms.
This can be seen in your models by plotting a histogram of the log_spectral_norm column of the details dataframe.
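A minimal sketch of such a check (assuming the details dataframe exposes a log_spectral_norm column, as in recent weightwatcher releases; the toy values below are made up):

```python
import pandas as pd

def flag_scale_collapse(details, n_decades=1.0):
    """Flag layers whose log Spectral Norm falls more than n_decades
    (powers of 10) below the median across layers -- possible Scale Collapse."""
    lsn = details["log_spectral_norm"]
    cutoff = lsn.median() - n_decades
    return details[lsn < cutoff]["layer_id"].tolist()

# toy stand-in for the weightwatcher details dataframe;
# layer 3 has a collapsed (unusually small) spectral norm
details = pd.DataFrame({
    "layer_id": [1, 2, 3, 4, 5],
    "log_spectral_norm": [1.1, 1.2, -2.5, 1.0, 1.3],
})
collapsed = flag_scale_collapse(details)
```

A median-based cutoff is used here (rather than mean and standard deviation) because the collapsed layer itself would otherwise inflate the spread.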
Recall earlier we noted the poorly-trained layers in the OpenAI GPT model. This is typical of many poorly-trained models. Because of this, log norm metrics cannot be reliably used to predict trends in accuracies on poorly-trained models.
However, we can use the empirical log Norm metrics to detect problems that can not be seen by simply looking at the training and test accuracies.
We have also observed this in some distilled models. Below we look at the ResNet20 model, before and after distillation using the Group Regularization method (as described in the Intel distiller package and provided in the model zoo). We plot the Spectral Norm (maximum eigenvalue) and PL exponent alpha vs. the layer_id (depth) for both the baseline (green) and finetuned /distiller (red) ResNet20 models.
These results can be reproduced by installing the distiller package, downloading the model zoo pretrained models, and running the WeightWatcher-Intel-Distiller-ResNet20.ipynb notebook in the distiller folder. (We do note that these are older results, and we used older versions of both distiller
and weightwatcher
, which used a different normalization on the Conv2D layers. Current results may differ although we expect to see similar trends.)
Notice that the baseline and finetuned ResNet20 have similar PL exponents (b) for all layers, but for several layers in (a), the Spectral Norm (maximum eigenvalue) collapses in value. That is, the Scale Collapses. This is bad, and characteristic of a poorly trained model like the original GPT.
Advice: if you finetune a model, use weightwatcher
to monitor the log Spectral Norms. If you see unusually small values, something is wrong.
Our latest paper is now on arXiv.
Please check out the github webpage for WeightWatcher and the associated papers and online talks at Stanford, UC Berkeley, and the wonderful podcasts that have invited us on to speak about the work.
If you want to get more involved, reach out to me directly at charles@calculationconsulting.com
And remember–if you need help at your company with AI, Deep Learning, and Machine Learning, please reach out to Calculation Consulting.
For the past year or two, we have talked a lot about how we can understand the properties of Deep Neural Networks by examining the spectral properties of the layer weight matrices W. Specifically, we can form the correlation matrix

X = (1/N) WᵀW,

and compute its eigenvalues λ.
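In numpy, this computation might look like the following (with a random matrix standing in for a real layer weight matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 400, 100
W = rng.standard_normal((N, M))   # stand-in for a layer weight matrix

# correlation matrix X = W^T W / N  (an M x M symmetric matrix)
X = W.T @ W / N

# eigenvalues of X; X is symmetric, so eigvalsh is appropriate
evals = np.linalg.eigvalsh(X)

# the (normalized) histogram of evals is the Empirical Spectral Density (ESD)
hist, edges = np.histogram(evals, bins=50, density=True)
```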
By plotting the histogram of the eigenvalues (i.e. the spectral density), we can monitor the training process and gain insight into the implicit regularization and convergence properties of the DNN. Indeed, we have identified 5+1 Phases of Training.
Each of these phases roughly corresponds to a Universality class from Random Matrix Theory (RMT). And as we shall see below, we can use RMT to develop a new theory of learning.
First, however, we note that for nearly every pretrained DNN we have examined (over 450 in all), the phase appears to lie somewhere between Bulk-Decay and Heavy-Tailed.
Moreover, for nearly all DNNs, the spectral density can be fit to a truncated power law, with exponents frequently lying in the Fat Tailed range [2, 4], and the maximum eigenvalue λ_max no larger than, say, 100.
Most importantly, in 80-90% of the DNN architectures studied, on average, smaller exponents correspond to smaller test errors.
Our empirical results suggest that the power law exponent can be used as (part of) a practical capacity metric. This led us to propose the weighted alpha metric for DNNs:
where we compute the exponent and maximum eigenvalue for each layer weight matrix (and Conv2D feature maps), and then form the total DNN capacity as a simple weighted average of the exponents. Amazingly, this metric correlates very well with the reported test accuracy of pretrained DNNs (such as the VGG models, the ResNet models, etc)
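A sketch of this kind of weighted average (assuming the per-layer exponents and maximum eigenvalues have already been fit; the weighting by log lambda_max follows the weight factor discussed later in this work, and the per-layer numbers are hypothetical):

```python
import numpy as np

def weighted_alpha(alphas, lambda_maxes):
    """Weighted average of per-layer PL exponents, with each layer
    weighted by the log of its maximum eigenvalue."""
    alphas = np.asarray(alphas, dtype=float)
    weights = np.log10(np.asarray(lambda_maxes, dtype=float))
    return np.sum(alphas * weights) / len(alphas)

# hypothetical per-layer fits: (alpha, lambda_max) for a 3-layer model
alpha_hat = weighted_alpha([2.5, 3.0, 4.0], [10.0, 100.0, 50.0])
```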
We have even built an open source, python command line tool–weightwatcher–so that other researchers can both reproduce and leverage our results
pip install weightwatcher
And we have a Slack Channel for those who want to ask questions, dig deeper, and/or contribute to the work. Email me, or ping me on LinkedIn, to join our vibrant group.
All of this leads to a very basic question:
To answer this, we will go back to the foundations of the theory of learning, from the physics perspective, and rebuild the theory using both our experimental observations, some older results from Theoretical Physics, and (fairly) recent results in Random Matrix Theory.
Here, I am going to sketch out the ideas we are currently researching to develop a new theory of generalization for Deep Neural Networks. We have a lot of work to do, but I think we have made enough progress to present these ideas, informally, to flesh out the basics.
What do we seek? A practical theory that can be used to predict the generalization accuracy of a DNN solely by looking at the trained weight matrices, without looking at the test data.
Why? Do you test a bridge by driving cars over it until it collapses? Of course not! So why do we build DNNs and only rely on brute force testing? Surely we can do better.
What is the approach? We start with the classic Perceptron Student-Teacher model from the Statistical Mechanics of the 1990s. The setup is similar, but the motivations are a bit different. We discussed this model earlier, in Remembering Generalization in DNNs, and in our paper Understanding Deep Learning Requires Rethinking Generalization.
Here, let us review the mathematical setup in some detail:
We start with the simple model presented in chapter 2 of Engel and Van den Broeck, interpreted in a modern context.
Here, we want to do something a little different, and use the formalism of Statistical Mechanics both to compute the average generalization error and to interpret the global convergence properties of DNNs in this light, giving us more insight and providing a new theory of Why Deep Learning Works (as proposed in 2015).
Suppose we have some trained or pretrained DNN (i.e. like VGG19). We want to compute the average / typical error that our Teacher DNN could make, just by examining the layer weight matrices. Without peeking at the data.
Conjecture 1: We assume all layers are statistically independent, so that the average generalization capacity (i.e. 1.0-error) is just the product of the contributions from each layer weight matrix.
Example: The Product Norm is a Capacity measure for DNNs from traditional ML theory.
The Norm may be the Frobenius Norm, the Spectral Norm, or even their ratio, the Stable Rank.
This independence assumption is probably not a great approximation but it gets us closer to a realistic theory. Indeed, even traditional ML theory recognizes this, and may use Path Norm to correct for this. For now, this will suffice.
Caveat 1: If we take the logarithm of each side, we can write the log Capacity as the sum of the layer contributions. More generally, we will express the log Capacity as a weighted average of some (as yet unspecified) log norm of the weight matrix.
We now set up the classic Student-Teacher model for a Perceptron–with a slight twist. That is, from now on, we assume our models have 1 layer, like a Perceptron.
Let’s call our trained or pretrained DNN the Teacher T. The Teacher maps data to labels. Of course, there could be many Teachers which map the same data to the same labels. For our specific purposes here, we just fix the Teacher T. We imagine that the learning process is for us to learn all possible Student Perceptrons J that also map the data to the labels, in the same way as the Teacher.
But for a pretrained model, we have no data, and we have no labels. And that’s ok. Following Engel and Van den Broeck (and also Engel’s 2001 paper), consider the following Figure, which depicts the vector space representations of T and J.
To compute the average generalization error, we write the total error as the sum of all the errors over all possible Students J for a given Teacher T. And we model this error with the inverse (arc cosine) of the vector dot product between J and T:
For our purposes, if instead of N-dim vectors we let T and J be NxM weight matrices, then the dot product becomes the Solid Angle between them. (Note: the error term is no longer exact, since J and T are matrices, not vectors, but hopefully this detail won’t matter here since we are going to integrate this term out below. This remains to be worked out.)
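For the vector case, the error measure just described can be sketched numerically (J and T here are hypothetical random stand-ins):

```python
import numpy as np

def generalization_error(J, T):
    """Classic Student-Teacher error: the normalized angle
    eps = arccos( J.T / (|J||T|) ) / pi between vectors J and T."""
    cos_overlap = np.dot(J, T) / (np.linalg.norm(J) * np.linalg.norm(T))
    # clip guards against tiny floating point overshoot past +/-1
    return np.arccos(np.clip(cos_overlap, -1.0, 1.0)) / np.pi

rng = np.random.default_rng(0)
T = rng.standard_normal(100)            # a hypothetical Teacher vector
eps_same = generalization_error(T, T)   # a perfect Student makes no errors
J = rng.standard_normal(100)            # a random, untrained Student
eps_rand = generalization_error(J, T)   # nearly orthogonal: error near 1/2
```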
This formalism lets us use the machinery of Statistical Mechanics to write the total error as an integral over all possible Student vectors J, namely, the phase space volume of our model:
where the first delta function enforces the normalization condition, or spherical constraints, on the Student vectors J, and the second delta function is a kind of Energy potential.
The normalization can be subsumed into a general measure, as
which actually provides us a more general expression for the generalization error (where, we recall, the error term is not quite correct for matrices)
Now we will deviate from the classic Stat Mech approach of the 90s. In the original analysis, one wants to compute the phase space volume as a function of the macroscopic thermodynamic variables, such as the size of the training set, and study the learning behavior. We reviewed these classic results in our 2017 paper.
We note that, for the simple Perceptron, the Student and Teacher, J and T, are represented as N-dimensional vectors, and the interesting physics arises in the Ising Perceptron, when the elements are discrete:
Continuous Perceptron: Jᵢ ∈ ℝ (uninteresting behavior)
Ising Perceptron: Jᵢ ∈ {−1, +1} (phase transitions, requires Replica theory, …)
And in our early work, we propose how to interpret the expected phase behavior in light of experimental results (at Google) that seem to require Rethinking Generalization. Here, we want to reformulate the Student-Teacher model in light of our own recent experimental studies of the spectral properties of real-world DNN weight matrices from production quality, pretrained models.
Our Proposal: We let T and J be strongly correlated (NxM) real matrices, with truncated, Heavy Tailed ESDs. Specifically, we assume that we know the Teacher T weight matrices exactly, and seek all Student matrices J that have the same spectral properties as the Teacher.
We can think of the class of Student matrices J as all matrices that are close to T. What we really want is the best method for doing this that has been tested experimentally. Fortunately, Hinton and coworkers have recently revisited Similarity of Neural Network Representations, and found the best matrix similarity method is
Canonical Correlation Analysis (CCA):
Using this, we generalize the Student-Teacher vector-vector overlap, or dot product, to be the Solid Angle between the J and T matrices, and plug this directly into our expression for the phase space volume. (WLOG, we absorb the normalization N into the matrices.)
We now take the Laplace Transform of the phase space volume, which allows us to integrate over all possible errors that all possible Students might make:
Note: This is different from the general approach to Gibbs Learning at non-zero Temperature (see Engel and Van den Broeck, chapter 4). The Laplace Transform converts the delta function to an exponential, giving
Conjecture 2: We can write the layer matrix contribution to the total average generalization error as an integral over all possible (random) matrices J that resemble the actual (pre-)trained weight matrices T (as given above).
Notice this expression resembles a classical partition function from statistical field theory. Except instead of integrating over the vector-valued p and q variables, we have to integrate over a class of random matrices J. The new expression for the generalization error is like a weighted average over all possible errors (where the effective inverse Temperature is set by the scale of the empirical weight matrices |W|). This is the key observation, and it requires some modern techniques to evaluate.
These kinds of integrals traditionally appeared in Quantum Field Theory and String Theory, but also in the context of Random Matrix Theory applied to Levy Spin Glasses. And it is this early work on Heavy Tailed Random Matrices that has motivated our empirical work. Here, to complement and extend our studies, we lay out an (incomplete) overview of the Theory.
These integrals are called Harish Chandra–Itzykson–Zuber (HCIZ) integrals. A good introductory reference on both RMT and HCIZ integrals is the recent book “A First Course in Random Matrix Theory”, although we will base our analysis here on the results of the 2008 paper by Tanaka.
First, we need to re-arrange the algebra a little. We will call A the Student correlation matrix:
and let W and X be the original weight and correlation matrices for our pretrained DNN, as above (X = (1/N) WᵀW),
and then expand the CCA Similarity metric as
We can now express the log HCIZ integral, using Tanaka’s result, as an expectation value over all random Student correlation matrices A that resemble X.
And this can be expressed as a sum over Generating functions that depend only on the statistical properties of the random Student weight matrices A. Specifically,
where R(z) is the R-Transform from RMT.
The R-Transform is like an inverse Green’s function (i.e. a contour integral), and is also a cumulant generating function. As such, we can write R(z) as a series expansion

R(z) = Σₖ cₖ z^(k−1),

where the cₖ are Generalized Cumulants from RMT.
Now, since we expect the best Student matrices to resemble the Teacher matrices, we expect the Student correlation matrix A to have similar spectral properties as our actual correlation matrices X. And this is where we can use our classification of the 5+1 Phases of Training. Whatever phase X is in, we expect all the A to be in as well, and we therefore expect the R-Transform of A to have the same functional form as X.
That is, if our DNN weight matrix has a Heavy Tailed ESD
then we expect all of the students to likewise have a Heavy Tailed ESD, and with the same exponent (at least for now).
Quenched vs Annealed Averages
Formally, we just say we are averaging over all Students A. More technically, what we really want to do is fix some Student matrix (say, A = diag(X)), and then integrate over all possible Orthogonal transformations O of A (see section 6.2.3 of Potters and Bouchaud)
Then, we integrate over all possible A~diag(X) , which would account for fluctuations in the eigenvalues. We conceptually assume this is the same as integrating over all possible Students A, and then taking the log.
The LHS is called the Quenched Average, and the RHS is the Annealed. Technically, they are not the same, and in traditional Stat Mech theory, this makes a big difference. In fact, in the original Student-Teacher model, we would also average over all Teachers, chosen uniformly (to satisfy the spherical constraints)
Here, we are doing RMT a little differently, which may not be obvious until the end of the calculation. We do not assume a priori a model for the Student matrices. That is, instead of fixing A = diag(X), we will fit the ESD of X to a continuous (power law) distribution ρ(λ), and then effectively sample over all A as if we had drawn the eigenvalues of A from ρ(λ). (In fact, I suppose we could actually do this numerically instead of doing all this fancy math–but what fun is that?)
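For what it’s worth, the numerical sampling just mentioned is easy to sketch: given a fitted exponent alpha and lower cutoff lam_min (both assumed, not taken from any particular model), inverse-transform sampling draws Student eigenvalues from the power law directly:

```python
import numpy as np

def sample_powerlaw_eigs(alpha, lam_min, size, rng):
    """Inverse-transform sampling from rho(lam) ~ lam^(-alpha), lam >= lam_min.
    The CDF is 1 - (lam/lam_min)^(1-alpha), so inverting gives the line below."""
    u = rng.random(size)
    return lam_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))

# draw a hypothetical Student spectrum with assumed fit values
rng = np.random.default_rng(42)
eigs = sample_powerlaw_eigs(alpha=3.0, lam_min=1.0, size=1000, rng=rng)
```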
The point is, we want to find an expression for the HCIZ integral (i.e. the layer / matrix contribution to the Generalization Error) that only depends on observations of W, the weight matrix of the pretrained DNN (our Teacher network). The result only depends on the eigenvalues of X, and the R-transform of A, which is parameterized by statistical information from X.
In principle, I suppose we could measure the generalized cumulants of X, and assume we can plug these in for the cₖ. We will do something a little easier.
Let us consider 2 classes of matrices as models for X.
Gaussian (Wigner) Random Matrix: Random-Like Phase
The R-Transform for a Gaussian random matrix is well known: R(z) = σ²z.
Taking the integral and plugging this into the Generating function, we get
So when X is Random-Like, the layer / matrix contribution is like the Frobenius Norm (but squared), and thus the average Generalization Error is given by a Frobenius Product Norm (squared).
Levy Random Matrix: Very Heavy Tailed Phase, with α < 2
We don’t have results (yet) for exponents in the range 2 < α < 4, but, as we have argued previously, due to finite size effects, we expect that the Very Heavy Tailed matrices appearing in DNNs will more resemble Levy Random matrices than the Random-Like Phase. So for now, we will close one eye and extend the α < 2 results to the [2, 4] range.
The R-Transform for a Levy Random Matrix has been given by Burda:
Taking the integral and plugging this into the Generating function, we get
Towards our Heavy Tailed Capacity Metric
1. Let us pull the power law exponent α out of the Trace, effectively ignoring cross terms in the sum over eigenvalues.
2. We also assume we can replace the Trace of A with its largest eigenvalue λ_max, which is actually a good approximation for very heavy tailed Levy matrices, where λ_max dominates the spectrum.
This gives a simple expression for the HCIZ integral, i.e. the layer contribution to the generalization error.
Taking the logarithm of both sides gives our expression for the layer contribution: α log λ_max.
We have now derived our Heavy Tailed Capacity metric using a matrix generalization of the classic Student-Teacher model, with the help of some modern Random Matrix Theory.
QED
I hope this has convinced you that there is still a lot of very interesting theory to develop for AI / Deep Neural Networks, and that you will stay tuned for the published form of this work. And remember…
pip install weightwatcher
A big thanks to Michael Mahoney at UC Berkeley for collaborating with me on this work, and to Mirco Milletari’ (Microsoft), who has been extremely helpful. And to my good friend Matt Lee (formerly a managing director at BGI/Blackrock), for long discussions about theoretical physics, RMT, quant finance, etc., and for encouraging us to publish.
Podcast about this work:
Thanks to Miklos Toth for interviewing me to discuss this (Listen on SoundCloud):
https://twimlai.com/meetups/implicit-self-regularization-in-deep-neural-networks/
My Collaborator did a great job giving a talk on our research at the local San Francisco Bay ACM Meetup
Michael W. Mahoney UC Berkeley
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a “size scale” separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that—all else being equal—DNN optimization with larger batch sizes leads to less-well implicitly-regularized models, and it provides an explanation for the generalization gap phenomena. Joint work with Charles Martin of Calculation Consulting, Inc.
Bio: https://www.stat.berkeley.edu/~mmahoney/
Michael W. Mahoney is at UC Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics. He has worked and taught at Yale University in the Math department, Yahoo Research, and Stanford University in the Math department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), and he was on the National Research Council’s Committee on the Analysis of Massive Data. He co-organized the Simons Institute’s fall 2013 program on the Theoretical Foundations of Big Data Analysis, and he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. He is currently the lead PI for the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. He holds several patents for work done at Yahoo Research and as Lead Data Scientist for Vieu Labs, Inc., a startup re-imagining consumer video for billions of users.
More information is available at https://www.stat.berkeley.edu/~mmahoney/
Long version of the paper (upon which the talk is based): https://arxiv.org/abs/1810.01075
http://www.meetup.com/SF-Bay-ACM/
http://www.sfbayacm.org/
Why Deep Learning Works: Self Regularization in Neural Networks
Presented Thursday, December 13, 2018
The slides are available on my slideshare.
The supporting tool, WeightWatcher, can be installed using:
pip install weightwatcher
DON’T PEEK: DEEP LEARNING WITHOUT LOOKING … AT TEST DATA
The idea…suppose we want to compare 2 or more deep neural networks (DNNs). Maybe we are comparing different architectures, or tuning the hyper-parameters of one.
Can we determine which DNN will generalize best–without peeking at the test data?
Theory actually suggests–yes we can!
We just need to measure the average log norm of the layer weight matrices,

⟨log ‖W‖⟩ = (1/L) Σ_l log ‖W_l‖_F,

where ‖W‖_F is the Frobenius norm.
The Frobenius norm is the square root of the sum of the squares of the matrix elements. For example, it is easily computed in numpy as
np.linalg.norm(W,ord='fro')
where ‘fro’ specifies the Frobenius norm (the default for matrices).
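A small sketch of the average log Frobenius norm over a set of (hypothetical) layer weight matrices:

```python
import numpy as np

def avg_log_frobenius(weight_matrices):
    """Average log10 Frobenius norm over a list of layer weight matrices."""
    return np.mean([np.log10(np.linalg.norm(W, ord='fro'))
                    for W in weight_matrices])

# check the definition on a tiny matrix:
# the Frobenius norm is sqrt(3^2 + 4^2) = 5
W = np.array([[3.0, 0.0],
              [0.0, 4.0]])
avg = avg_log_frobenius([W])   # log10(5) for this single "layer"
```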
It turns out that this average log norm is amazingly correlated with the test accuracy of a DNN. How do we know? We can plot it vs the reported test accuracy for the pretrained DNNs available in PyTorch. First, we look at the VGG models:
The plot shows the 4 VGG and VGG_BN models. Notice we do not need the ImageNet data to compute this; we simply compute the average log Norm and plot it against the (reported Top 5) Test Accuracy. For example, the orange dots show results for the pre-trained VGG13 and VGG13_BN ImageNet models. For each pair of models, the larger the Test Accuracy, the smaller the average log Norm. Moreover, the correlation is nearly linear across the entire class of VGG models. We see similar behavior for the ResNet models.
Across 4 of 5 pretrained ResNet models, with very different sizes, a smaller average log Norm generally implies a better Test Accuracy.
It is not perfect–ResNet 50 is an outlier–but it works amazingly well across numerous pretrained models, both in pyTorch and elsewhere (such as the OSMR sandbox). See the Appendix for more plots. What is more, notice that
the log Norm metric is completely Unsupervised
Recall that we have not peeked at the test data–or the labels. We simply computed the average log Norm for the pretrained models directly from their weight files, and then compared this to the reported test accuracy.
Imagine being able to fine tune a neural network without needing test data. Many times we barely have enough training data for fine tuning, and there is a huge risk of over-training. Every time you peek at the test data, you risk leaking information into the model, causing it to overtrain. It is my hope this simple but powerful idea will help avoid this and advance the field forward.
A recent paper by Google X and MIT shows that there is A Surprising Linear Relationship [that] Predicts Test Performance in Deep Networks. The idea is to compute a VC-like data dependent complexity metric based on the Product Norm of the weight matrices:
Usually we just take the norm to be the Frobenius norm (but any p-norm may do)
If we take the log of both sides, we get the sum of the log norms over the layers.
So here we just form the average log Frobenius Norm as a measure of DNN complexity, as suggested by current ML theory.
And it seems to work remarkably well in practice.
We can also understand this through our Theory of Heavy Tailed Implicit Self-Regularization in Deep Neural Networks.
The theory shows that each layer weight matrix of a (well trained) DNN resembles a random heavy tailed matrix, and we can associate with it a power law exponent α.
The exponent α characterizes how well the layer weight matrix represents the correlations in the training data. Smaller is better.
Smaller exponents correspond to more implicit regularization, and, presumably, better generalization (if the DNN is not overtrained). This suggests that the average power law exponent would make a good overall unsupervised complexity metric for a DNN–and this is exactly what the last blog post showed.
The average power law metric is a weighted average,

α̂ = (1/L) Σ_l b_l α_l,

where the layer weight factor b_l should depend on the scale of W_l. In other words, ‘larger’ weight matrices (in some sense) should contribute more to the weighted average.
A smaller average exponent usually implies better generalization
For heavy tailed matrices, we can work out a relation between the log Norm of W and the power law exponent α:

2 log ‖W‖_F ≈ α log λ_max,

where we note that λ_max is the maximum eigenvalue of the correlation matrix X. So the weight factor b_l is simply the log of the maximum eigenvalue associated with W_l.
In the paper we will show the math; below we present numerical results to convince the reader.
This also explains why Spectral Norm Regularization Improv[es] the Generalizability of Deep Learning. A smaller λ_max gives a smaller power law contribution, and, also, a smaller log Norm. We can now relate these 2 complexity metrics:
We argue here that we can approximate the average Power Law metric by simply computing the average log Norm of the DNN layer weight matrices. And using this, we can actually predict the trends in generalization accuracy — without needing a test data set!
The Power Law metric is consistent with the recent theoretical results, but our approach and the intent is different:
But the biggest difference is that we apply our Unsupervised metric to large, production quality DNNs.
We believe this result will have large applications in hyper-parameter fine tuning of DNNs. Because we do not need to peek at the test data, it may prevent information from leaking from the test set into the model, thereby helping to prevent overtraining and making fine-tuned DNNs more robust.
We have built a python package for Jupyter Notebooks that does this for you–the WeightWatcher tool. It works on Keras and PyTorch. We will release it shortly.
Please stay tuned! And please subscribe if this is useful to you.
We use the OSMR Sandbox to compute the average log Norm for a wide variety of DNN models, using pyTorch, and compare to the reported Top 1 Errors. This notebook reproduces the results.
All the ResNet Models
DenseNet
SqueezeNet
DPN
In the plot below, we generate a number of heavy tailed matrices, and fit their ESDs to a power law. Then we compare the ratio 2 log ‖W‖_F / log λ_max to the fitted exponent α.
The code for this is:
import numpy as np
import powerlaw

# generate a heavy tailed (Pareto) random weight matrix W
N, M, mu = 100, 100, 2.0
W = np.random.pareto(a=mu, size=(N, M))

# (squared) log Frobenius norm of W
normW = np.linalg.norm(W)
logNorm2 = 2.0 * np.log10(normW)

# eigenvalues of the correlation matrix X = W^T W / N
X = np.dot(W.T, W) / N
evals = np.linalg.eigvals(X)
l_max, l_min = np.max(evals), np.min(evals)

# fit the ESD to a power law to get the exponent alpha
fit = powerlaw.Fit(evals)
alpha = fit.alpha

# compare the ratio to the fitted alpha
ratio = logNorm2 / np.log10(l_max)
Below are results for a variety of heavy tailed random matrices:
The plot shows the relation between the ratios and the empirical power law exponents α. There are three striking features, most notably the nearly linear relation between them.
In our next paper, we will drill into these details and explain further how this relation arises and the implications for Why Deep Learning Works.
My recent talk at the French Tech Hub Startup Accelerator
Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks. Most notably, we observe that in all pre-trained models, the layer weight matrices display near Universal power law behavior. That is, we can compute their eigenvalues, and fit the empirical spectral density (ESD) to a power law form:
For a given weight matrix W, we form the correlation matrix

X = (1/N) WᵀW,

and then compute the M eigenvalues λ of X. We call the histogram of eigenvalues the Empirical Spectral Density (ESD). It can nearly always be fit to a power law,

ρ(λ) ~ λ^(−α).
We call the Power Law Universal because 80-90% of the exponents lie in the range 2 ≤ α ≤ 4.
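Our notebooks use the powerlaw package for these fits; as a rough stand-in, the maximum likelihood (Hill) estimator gives the exponent directly, assuming a chosen lower cutoff lam_min (powerlaw selects this cutoff automatically, which we skip here):

```python
import numpy as np

def hill_alpha(evals, lam_min):
    """MLE (Hill) estimate of the exponent alpha for
    rho(lam) ~ lam^(-alpha), using eigenvalues above lam_min."""
    tail = evals[evals >= lam_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / lam_min))

# sanity check on synthetic power law data with known alpha = 3:
# for alpha = 3, lam_min = 1, the CDF inverts to lam = (1-u)^(-1/2)
rng = np.random.default_rng(0)
u = rng.random(5000)
evals = (1.0 - u) ** (-1.0 / 2.0)
alpha = hill_alpha(evals, lam_min=1.0)   # close to 3.0
```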
For fully connected layers, we just take the weight matrix W as is. For Conv2D layers with shape N×M×k×k, we consider all k² 2D feature maps of shape N×M. For any large, modern, pretrained DNN, this can give a large number of eigenvalues. The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but the results are very easy to reproduce with this notebook.
As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4. Although, compared to the FC layers, for the Conv2D layers we do see more exponents outside this range. We will discuss the details and these results in a future paper. And while Universality is very theoretically interesting, a more practical question is
Are power law exponents correlated with better generalization accuracies? … YES they are!
We can see this by looking at 2 or more versions of several pretrained models, available in pytorch, including
To compare these model versions, we can simply compute the average power law exponent, averaged across all FC weight matrices and Conv2D feature maps. This is similar to considering the product norm, which has been used to test VC-like bounds for small NNs. In nearly every case, a smaller average exponent is correlated with better test accuracy (i.e. generalization performance).
The only significant caveats are:
Predicting the test accuracy is a complicated task, and IMHO simple theories, with loose bounds, are unlikely to be useful in practice. Still, I think we are on the right track.
Let’s first look at the DenseNet models
Here, we see that as Test Accuracy increases, the average power law exponent generally decreases. And this is across 4 different models.
The Inception models show similar behavior: InceptionV3 has a smaller Test Accuracy than InceptionV4, and, likewise, the average exponent for InceptionV3 is larger than for InceptionV4.
Now consider the Resnet models, which are increasing in size and have more architectural differences between them:
Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents. The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a slightly larger average exponent than FbResnet152, but they are very close. Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller exponents across a wide range of architectures.
These results are easily reproduced with this notebook.
This is an amazing result!
You can think of the power law exponent as a kind of information metric–the smaller the exponent α, the more information is in the layer weight matrix.
Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better–without peeking at the test data.
In addition to DenseNet, Inception, ResNext, SqueezeNet, and the (larger) ResNet models, even more positive results are available here on ~40 more DNNs across ~10 more different architectures, including MeNet, ShuffleNet, DPN, PreResNet, DenseNet, SE-ResNet, SqueezeNet, MobileNet, MobileNetV2, and FDMobileNet.
I hope it is useful to you in training your own Deep Neural Networks. And I hope to get feedback from you as to see how useful this is in practice.