WeightWatcher is a work in progress, based on research into Why Deep Learning Works, and has been featured in venues like ICML, KDD, JMLR, and Nature Communications.

Before we start, let me just say that it is very difficult to develop general-purpose metrics that work for any arbitrary DNN, and, at the same time, to maintain a tool that is backward compatible with all of our work so that it remains 100% reproducible. I hope the tool is useful to you, and we rely upon your feedback (positive and negative) to improve it. So if it works for you, please let me know. If not, feel free to post an issue on the GitHub site. Thanks again for the interest!

Given this, let's look at the first set of metrics.

How is it even possible to predict the generalization accuracy of a DNN, or at least trends in it, without even needing the test data? The basic idea is simple: for each layer weight matrix, measure how non-random it is. After all, the more information the layer learns, the less random it should be.

What are some choices for such a layer capacity metric?

- Matrix Entropy:
- Distance from the initial weight matrix: (and variants of this)
- Divergence from the ESD of the randomized weight matrix:

We define the Matrix Entropy in terms of the eigenvalues $\lambda_i$ of the layer correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$:

$$S(\mathbf{W}) = -\frac{1}{\log(R)}\sum_{i} p_i \log p_i,\quad p_i = \frac{\lambda_i}{\sum_j \lambda_j},$$

where $R$ is the rank of $\mathbf{W}$.

So the matrix entropy is both a measure of layer randomness and a measure of correlation.
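As a sketch, the metric can be computed from the eigenvalue spectrum in a few lines of numpy (using the standard matrix-entropy definition; the tool's exact normalization may differ):

```python
import numpy as np

def matrix_entropy(W):
    """Matrix entropy of a layer weight matrix W, computed from the
    eigenvalue spectrum of the correlation matrix X = W^T W."""
    X = W.T @ W
    evals = np.linalg.eigvalsh(X)      # eigenvalues of X (real, >= 0)
    evals = np.clip(evals, 0, None)    # clip tiny negative numerical noise
    p = evals / evals.sum()            # normalize to a probability distribution
    p = p[p > 0]
    # normalized Shannon entropy of the spectrum: near 1 = maximally random
    return -(p * np.log(p)).sum() / np.log(len(evals))

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 20))
print(matrix_entropy(W))  # close to 1 for a random matrix
```

A rank-1 (maximally correlated) matrix gives an entropy near 0, while a random Gaussian matrix gives an entropy near 1, which is exactly the randomness-vs-correlation trade-off described above.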

In our JMLR paper, however, we show that while the Matrix Entropy does decrease during training, it is not particularly informative. Still, I mention it here for completeness. (And in the next post, I will discuss the Stable Rank; stay tuned)

The next metric to consider is the Frobenius norm of the difference between the layer weight matrix and its specific, initialized, random value $\mathbf{W}_{init}$. Note, however, that this metric requires that you actually have the initial layer matrices (which we did not for our Nature paper).

The weightwatcher tool supports this metric with the distance method:

```
import weightwatcher as ww
watcher = ww.WeightWatcher(model=your_model)
distance_from_init = watcher.distances(your_model, init_model)
```

where init_model is your model, with the original, actual, initial weight matrices.

In our most recent paper, we evaluated the **distance_from_init** method as a generalization metric in great detail. However, in order to cut the paper down for submission (which I really hate doing), we had to remove most of this discussion, and only a table in the appendix remained. I may redo this paper, and revert it to the long form, at some point. For now, I will just present some unpublished results from that study here.

These are the raw results for the task1 and task2 sets of models described in the paper. Briefly, we were given about 100 pretrained models, grouped into 2 tasks (corresponding to 2 different architectures), and then subgrouped again (by the number of layers in each set). Here, we see how the **distance_from_init** metric correlates with the given test accuracies–and it's pretty good most of the time. But it's not the best metric in general.

There are a few variants of this distance metric, depending on how one defines the distance. These include:

- Frobenius norm distance
- Cosine distance
- CKA distance

Currently, weightwatcher 0.5.5 only supports (1), but in the next minor release, we plan to include both (2) & (3).

The problem with this simple approach is that it is not going to be useful if your models are overfit, because the distance from init increases over time anyway–and this is exactly what we think is happening in the **task1** models from this NeurIPS contest. But it is a good sanity check on your models during training, and can be used with other metrics as a diagnostic indicator.

So the question becomes, can we somehow create a distance-from-random metric that compensates for overfitting? And that leads to…

With the new **rand_distance** metric, we mean something very different from **distance_from_init**. Above, we used the original instantiation of the weight matrix $\mathbf{W}_{init}$, and constructed an *element-wise distance* metric. The **rand_distance** metric, in contrast:

- Does not require the original, initial weight matrices
- Is not an element-wise metric, but, instead, is a **distributional metric**

So this metric is defined in terms of a distance between the distributions of the eigenvalues (i.e., the ESDs) of the layer matrix $\mathbf{W}$ and its random counterpart $\mathbf{W}_{rand}$.

For weightwatcher, we choose to use the Jensen-Shannon divergence for this:

```
rand_distance = jensen_shannon_distance(esd, random_esd)
```

where

```
import numpy as np
import scipy as sp
import scipy.stats

def jensen_shannon_distance(p, q):
    m = (p + q) / 2
    divergence = (sp.stats.entropy(p, m) + sp.stats.entropy(q, m)) / 2
    distance = np.sqrt(divergence)
    return distance
```

Moreover, there are 2 ways to construct the random layer matrix $\mathbf{W}_{rand}$:

- Take any (Gaussian) Random Matrix with the same aspect ratio as the layer weight matrix
- Permute (shuffle) the elements of $\mathbf{W}$

While at first glance these may seem the same, in practice they can be quite different. This is because, while every (Normal) Random Matrix has the same ESD (up to finite-size effects), namely the Marchenko-Pastur (MP) distribution, if the actual $\mathbf{W}$ contains any unusually large elements, then its ESD will behave like that of a Heavy-Tailed Random Matrix, and look very different from its random MP counterpart. For more details, see our JMLR paper.
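Here is a toy numpy sketch (not the tool's internal code) of why the shuffling construction works: planting a strong rank-1 correlation in a random matrix creates a large outlying eigenvalue, which disappears once the matrix elements are shuffled.

```python
import numpy as np

rng = np.random.default_rng(0)

def esd(W):
    """Empirical spectral density: eigenvalues of the correlation matrix X = W^T W."""
    return np.linalg.eigvalsh(W.T @ W)

def shuffled(W):
    """W_rand: the same elements as W, randomly permuted (correlations destroyed)."""
    flat = W.flatten()
    rng.shuffle(flat)
    return flat.reshape(W.shape)

# random base matrix plus a strong rank-1 "learned" correlation
u = rng.standard_normal(100); u /= np.linalg.norm(u)
v = rng.standard_normal(50);  v /= np.linalg.norm(v)
W = rng.standard_normal((100, 50)) + 100.0 * np.outer(u, v)

evals, rand_evals = esd(W), esd(shuffled(W))
# the correlated matrix has a much larger top eigenvalue than its shuffled version
print(evals.max() > 3 * rand_evals.max())  # True
```

For a purely random $\mathbf{W}$, by contrast, shuffling changes almost nothing, which is exactly why the divergence between the two ESDs measures how non-random the layer is.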

Indeed, when $\mathbf{W}$ contains unusually large elements, we call these **Correlation Traps**. Now, I have conjectured, in an earlier post, that such Correlation Traps may be an indication of a layer being overtrained. However, some research suggests the opposite: that, in fact, such large elements are needed for large, modern NLP models. The jury is still out on this; however, the weightwatcher tool can be used to resolve this question, since it can easily identify such Correlation Traps in every layer. I look forward to seeing the final conclusion.

The takeaway here is that, when a layer is well trained, we expect the ESD of the layer to be significantly different from the ESD of its randomized, shuffled form. Let's compare 2 cases:

*The rand_distance metric measures the divergence between the original and random ESDs*

In case (a), the **original ESD** (green) looks significantly different from its **randomized ESD** (red). In case (b), however, the **original** and **randomized** ESDs are much more similar. So we expect case (a) to be better trained than case (b).

*And rand_distance works, presumably, at least in some cases, when the layer is overtrained, as in (b)*.

To compute the **rand_distance** metric, simply specify the randomize=True option, and it will be available as a layer metric in the details dataframe:

```
import weightwatcher as ww
watcher = ww.WeightWatcher(model=your_model)
details = watcher.analyze(randomize=True)
avg_rand_distance = details.rand_distance.mean()
```

Finally, let's see how well the **rand_distance** metric actually works to predict trends in the test accuracy for a well-known set of pretrained models, the VGG series. Similar to the analysis in our Nature paper, we consider how well the **average rand_distance** metric is correlated with the reported top1 errors of the series of VGG models: VGG11, VGG13, VGG16, and VGG19.

Actually, this is pretty good, and comparable to the results for the weightwatcher power-law metric weighted-alpha. But for reasons I don't yet understand, it does not work so well for the VGG_BN models (VGG with BatchNormalization). Nevertheless, I am hoping it may be useful to you, and I would love to hear if it is working for you or not. To help get started, the above results can be reproduced with weightwatcher 0.5.5 using the WWVGG-TestRandDistance.ipynb Jupyter Notebook in the WeightWatcher GitHub repo.

I’ll end part 1 of this series of blog posts with a comparison between the new **rand_distance** metric and the weightwatcher **alpha** metric, for all the layers of VGG19.

(Note, I am not using the ww2x option for this analysis, but if you want to reproduce the Nature results, use ww2x=True. If you don't, you may get crazy-large alphas that are incorrect–in the next post, I will discuss the alpha power-law (PL) estimates in more detail.)

When the weightwatcher **alpha** < 5, this means that the layers are Heavy Tailed and therefore well trained, and, as expected, alpha is correlated with the **rand_distance** metric.

I hope this has been useful to you and that you will try out the weightwatcher tool:

pip install weightwatcher

Give it a try. And please give me feedback if it is useful. If you are interested in getting involved, or just learning more, ping me to join our Slack channel. And if you need help with AI, reach out. **#talkToChuck #theAIguy**

Let’s see how to do this. First, install the tool.

```
pip install weightwatcher
```

Second, pick a model and get a basic description of it

```
import weightwatcher as ww
watcher = ww.WeightWatcher(model=my_model)
details = watcher.analyze()
```

WeightWatcher produces a pandas dataframe, details, with layer metrics describing your model. In particular, the details dataframe contains layer names, ids, types, and warnings.

Here is an example: an analysis of the OpenAI GPT model (discussed in our recent Nature paper).

```
import transformers
from transformers import OpenAIGPTModel,GPT2Model
gpt_model = OpenAIGPTModel.from_pretrained('openai-gpt')
gpt_model.eval();
watcher = ww.WeightWatcher(model=gpt_model)
details = watcher.analyze()
```

The details dataframe now includes a wide range of information for each layer, including a specific warnings column:

```
details[details.warning!=""][['layer_id','name','warning']]
```

For comparison, below we show the same details dataframe, but for GPT2. GPT2 is the same model as GPT, but trained with more and better data. Being a much better model, GPT2 has far fewer warnings than GPT.

That’s all there is to it. WeightWatcher provides simple layer metrics for pre-trained Deep Neural Networks, with simple warnings indicating which layers are over-trained and which are under-trained.

Give it a try. We are looking for early adopters needing better, faster, and cheaper AI monitoring. If it is useful to you, please let me know.

- What is the best metric to evaluate your model?
- How can you be sure you trained it with enough data?
- And how can your customers be sure?

WeightWatcher can help.

`pip install weightwatcher`

WeightWatcher is an open-source, diagnostic tool for evaluating the performance of (pre)-trained and fine-tuned Deep Neural Networks. It is based on state-of-the-art research into *Why Deep Learning Works*. Recently, it has been featured in Nature:

Here, we show you how to use WeightWatcher to determine if your DNN model has been trained with enough data.

In the paper, we consider the example of GPT vs GPT2. GPT is an NLP Transformer model, developed by OpenAI, to generate fake text. When it was first developed, OpenAI released the GPT model, which had specifically been trained with a small data set, making it unusable for generating fake text. Later, they realized fake text is good business, and they released GPT2, which is just like GPT, but trained with enough data to make it useful.

We can apply WeightWatcher to GPT and GPT2 and compare the results; we will see that the WeightWatcher *log spectral norm* and *alpha (power law)* metrics can immediately tell us that something is wrong with the GPT model. This is shown in Figure 6 of the paper.

Here we will walk through exactly how to do this yourself for the WeightWatcher Power Law (PL) alpha metric, and explain how to interpret these plots.

It is recommended to run these calculations in a Jupyter notebook or Google Colab. (For reference, you can also view the actual notebook used to create the plots in the paper; however, it uses an older version of weightwatcher.)

For this post, we provide a working notebook in the WeightWatcher github repo.

WeightWatcher understands the basic Huggingface models. Indeed, WeightWatcher supports:

- TF2.0 / Keras
- pyTorch 1.x
- HuggingFace

and (soon):

- ONNX (in the current trunk)

Currently, we support Dense and Conv2D layers. Support for more layers is coming. For our NLP Transformer models, we only need support for the Dense layers.

First, we need the GPT and GPT2 pyTorch models. We will use the popular HuggingFace transformers package.

```
!pip install transformers
```

Second, we need to import pyTorch and weightwatcher

```
import torch
import weightwatcher as ww
```

We will also want the pandas and matplotlib libraries to help us interpret the weightwatcher metrics. In Jupyter notebooks, this looks like

```
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
```

We now import the transformers package and the 2 model classes

```
import transformers
from transformers import OpenAIGPTModel, GPT2Model
```

We have to get the 2 pretrained models, and run *model.eval()*:

```
gpt_model = OpenAIGPTModel.from_pretrained('openai-gpt')
gpt_model.eval();
gpt2_model = GPT2Model.from_pretrained('gpt2')
gpt2_model.eval();
```

To analyze our GPT models with WeightWatcher, simply create a watcher instance and run *watcher.analyze()*. This will return a pandas dataframe with the metrics for each layer:

```
watcher = ww.WeightWatcher(model=gpt_model)
gpt_details = watcher.analyze()
```

The details dataframe reports quality metrics that can be used to analyze the model performance–without needing access to test or training data. The most important metric is our Power Law metric, alpha. WeightWatcher reports alpha for every layer. The GPT model has nearly 50 layers, so it is convenient to examine all the layer alphas at once as a histogram (using the pandas API).

```
gpt_details.alpha.plot.hist(bins=100, color='red', alpha=0.5, density=True, label='gpt')
plt.xlabel(r"alpha $(\alpha)$ PL exponent")
plt.legend()
```

This plots the density of the alpha values for all layers in the GPT model.

From this histogram, we can immediately see 2 problems with the model:

- The peak alpha is higher than optimal for a well-trained model.
- There are several large-alpha outliers, indicating several poorly trained layers.
- There are no very small alphas; when alpha is too small, the layer may be overtrained.

So knowing nothing about GPT, and having never seen the test or training data, WeightWatcher tells us that this model should never go into production.

Now let’s look at GPT2, which has the same architecture, but is trained with more and better data. Again, we make a *watcher* instance with the model specified, and just run *watcher.analyze()*:

```
watcher = ww.WeightWatcher(model=gpt2_model)
gpt2_details = watcher.analyze()
```

Now let’s compare the Power Law alpha metrics for GPT and GPT2. We just create 2 histograms, 1 for each model, and overlay them.

```
gpt_details.alpha.plot.hist(bins=100, color='red', alpha=0.5, density=True, label='gpt')
gpt2_details.alpha.plot.hist(bins=100, color='green', density=True, label='gpt2')
plt.xlabel(r"alpha $(\alpha)$ PL exponent")
plt.legend()
```

The layer alphas for GPT are shown in **red**, and for GPT2 in **green**, and the histograms differ significantly. For GPT2, the peak alpha is smaller, and, more importantly, there are no large-alpha outliers. Smaller alphas are better, and the GPT2 model is much better than GPT because it is trained with significantly more and better data.

The only caveat here is if alpha is very small; in these cases, the layer is overtrained or overfit in some way. In GPT and GPT2, we have no alphas that are too small.

WeightWatcher has many features to help you evaluate your models. It can do things like

- Help you decide if you have trained it with enough data (as shown here)
- Detect potential layers that are overtrained (as shown in a previous blog)
- Be used to get early stopping criteria (when you can’t peek at the test data)
- Predict trends in the test accuracies across models and hyperparameters (see our Nature paper, and our most recent submission).

and many other things.

Please give it a try. And if it is useful to you, let me know.

**And if your company needs help with AI, reach out. I provide strategy consulting, mentorship, and hands-on development. #talkToChuck, #theAIguy**.

In the Figure above, fig (a) is well trained, whereas fig (b) may be over-trained. That orange spike on the far right is the tell-tale clue; it’s what we call a *Correlation Trap*.

Weightwatcher can detect the signatures of overtraining in specific layers of a pre/trained Deep Neural Network. In this post, we show how to use the weightwatcher tool to do this.

**WeightWatcher** (WW): is an open-source, diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It analyzes the weight matrices of a pre/trained DNN, layer-by-layer, to help you detect potential problems–problems that cannot be seen by just looking at the test accuracy or the training loss.

**Installation**:

```
pip install weightwatcher
```

**Usage:**

```
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze(plot=True, randomize=True)
```

For each layer, Weightwatcher plots the Empirical Spectral Density, or ESD. This is just a histogram of the eigenvalues of the layer correlation matrix **X = W^T W**.

```
import numpy as np
import matplotlib.pyplot as plt
...
X = np.dot(W.T, W)
evals, evecs = np.linalg.eig(X)
plt.hist(evals, bins=100, density=True)
...
```

By specifying the randomize option, WW randomizes the elements of the weight matrix **W**, and then computes its ESD. This randomized ESD is overlaid on the original ESD of **X**, and plotted on a log scale.

This is shown above. The original layer ESD is **green**; the randomized ESD is **red**. And the **orange line** depicts the largest eigenvalue of the randomized ESD.

If the layer is well trained, then when **W** is randomized, its ESD will look like that of a normally distributed random matrix. This is shown in Figure (a), above.

But if the layer is over-trained, then its weight matrix **W** may have some unusually large elements, where the correlations may be concentrated, or become *trapped*. In this case, the ESD may have 1 or more unusually large eigenvalues. This is shown in Figure (b) above, with the **orange line** extending to the far right of the bulk of the **red** ESD.

Notice also that in Figure (a), the **green** ESD is very Heavy Tailed, with the histogram extending out to $\log_{10}\lambda=2$, or a largest eigenvalue of nearly 100. But in Figure (b), the green ESD has a distinctly different shape and is smaller in scale than in Figure (a). In fact, in (b), the **green** (original) and **red** (randomized) layer ESDs look almost the same, except for a small shelf of larger **green** eigenvalues, extending out to, and concentrating around, the **orange line**.

**In cases like this, we can identify the orange line as a Correlation Trap.**

This indicates that something went wrong in training this layer, and the model did not capture the correlations in this layer in a way that will generalize well to other examples.

Using the Weight Watcher tool, you can detect this and other potential problems when training or fine-tuning your Deep Neural Networks.

You can learn more about it on the WeightWatcher github website.

The WeightWatcher tool is an open-source python package that can be used to predict the test accuracy of a series of similar Deep Neural Networks (DNNs) — without peeking at the test data.

WeightWatcher is based on research done in collaboration with UC Berkeley on the foundations of Deep Learning. We built this tool to help you analyze and debug your Deep Neural Networks.

It is easy to install and run; the tool will analyze your model and return both summary statistics and detailed metrics for each layer:

The WeightWatcher github page lists various papers and online presentations given at UC Berkeley and Stanford, and various conferences like ICML, KDD, etc. There, and on this blog, you can find examples of how to use it.

This post describes how to select the right metric for your models, and why.

You can use WeightWatcher to model a series of DNNs of either increasing size, or with different hyperparameters. But you need different metrics — alpha vs. weighted alpha — for different cases:

- **alpha**: for different hyper-parameter settings (batch size, weight decay, …) on the same model
- **weighted alpha**: for an architecture series like VGG11, VGG13, VGG16, VGG19

But why do we need 2 different alpha metrics? To understand this, we need to understand the Spectral Norm.

Traditional machine learning theory suggests that the test performance of a Deep Neural Network is correlated with the average log Spectral Norm. That is, the test error should be bounded by the average Spectral Norm, so the smaller the norm, the smaller the test error.

The Spectral Norm of a matrix $\mathbf{W}$ is just the (square root of the) maximum eigenvalue $\lambda_{max}$ of its correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$.

We denote the (squared) Spectral Norm as:

$$\Vert\mathbf{W}\Vert_{2}^{2}=\lambda_{max}$$

Note: in earlier papers, we (and others) also use other norm notations for this quantity.

WeightWatcher computes the log Spectral Norm for each layer, and defines:

$$\langle\log\Vert\mathbf{W}\Vert_{2}^{2}\rangle=\frac{1}{N}\sum_{l=1}^{N}\log\lambda_{max,l}$$

which we compute by averaging over all $N$ layer weight matrices.

We compute the eigenvalues, or the Empirical Spectral Density (ESD), of each layer by running SVD directly on the layer weight matrix or, for Conv2D layers, on matrix slices (see the Appendix).
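For a plain Dense layer, this computation can be sketched in a few lines of numpy (Conv2D layers need the slicing described in the Appendix):

```python
import numpy as np

def log_spectral_norm(W):
    """log10 of the squared Spectral Norm, i.e. log10(lambda_max of X = W^T W)."""
    sv_max = np.linalg.svd(W, compute_uv=False)[0]  # largest singular value
    return np.log10(sv_max ** 2)                    # = log10(lambda_max)

# average over a list of layer weight matrices
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 32)) for _ in range(4)]
avg_log_spectral_norm = np.mean([log_spectral_norm(W) for W in layers])
print(avg_log_spectral_norm)
```

Note that the squared singular value of $\mathbf{W}$ equals the eigenvalue of $\mathbf{X}$, so a single SVD per layer is all that's needed.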

It has been suggested by Yoshida and Miyato that the Spectral Norm would make a good regularizer for DNNs. The basic idea is that the test data should look enough like the training data so that if we can say something about how the DNN performs on *perturbed* training data, that will also say something about the test performance.

Here is the text from the paper; let me explain how to interpret it in practical terms.

We imagine the test data must look like the training data with some small perturbation $\boldsymbol{\epsilon}$. Let us write this as:

$$\mathbf{x}_{test}=\mathbf{x}_{train}+\boldsymbol{\epsilon}$$

As we train a DNN, we run several epochs of BackProp, which amounts to multiplying by a weight matrix $\mathbf{W}$ at each layer, applying an activation function, and repeating until we get a label.

To get an estimate, or bound, on the test accuracy, we can then imagine applying the matrix multiply to the perturbed training point:

$$\mathbf{W}(\mathbf{x}_{train}+\boldsymbol{\epsilon})$$

So if we can say something about how the DNN should perform on a perturbed training point $\mathbf{x}_{train}+\boldsymbol{\epsilon}$, we can say something about the test output.

*What can we say?*

When we apply an activation function like a ReLU, it acts pointwise on the data vector, so each layer acts like a (piecewise) affine transformation. The ReLU + weight-matrix multiply is therefore a bounded linear operator (at least piecewise), and so it can be bounded by its Spectral Radius.

So we say that we want to learn a DNN model such that, at each layer, the action of the layer matrix-vector multiply is bounded by the Spectral Norm of its layer weight matrix. This should, in theory, give good test performance. By applying a Spectral Norm regularizer, we think we can make a DNN that is more robust to small changes in its input. That is, we can make it perform better on random perturbations of the training data, and, therefore, presumably, better on the test data.
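A quick numerical sanity check of this reasoning (a toy sketch, not part of the tool): since the ReLU is 1-Lipschitz, the change in a layer's output under a small input perturbation is bounded by the Spectral Norm of its weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

W = rng.standard_normal((50, 100))
x = rng.standard_normal(100)            # a "training" point
eps = 0.01 * rng.standard_normal(100)   # a small perturbation

spectral_norm = np.linalg.svd(W, compute_uv=False)[0]  # largest singular value
out_change = np.linalg.norm(relu(W @ (x + eps)) - relu(W @ x))

# the change in the layer output is bounded by ||W||_2 * ||eps||
print(out_change <= spectral_norm * np.linalg.norm(eps))  # True
```

This is exactly the per-layer bound that a Spectral Norm regularizer tries to keep small.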

**From Bounds to a Regularizer**

When we develop a mathematical bound, our first instinct is to develop a numerical regularizer. That is, when solving our optimization problem–minimizing the DNN Energy function–we want to prevent the solution from blowing up. Having a *mathematically rigorous* bound helps here, since it seems to bound the BackProp optimization step on every iteration.

Notice that since the regularizer appears in the optimization problem, it must be differentiable (either directly, or using some trick).

A regularizer must also be easy to implement. For example, we could also bound the Jacobian, but this is very expensive to compute, and it is difficult to apply even a norm bound of this kind on every step of BackProp. It might also seem that the Spectral Norm is hard to compute, because one needs to run SVD, but there is a simple trick. One can approximate the maximum eigenvalue using the Power Method, by running it for a few steps, and then simply add the result to the SGD update step. There are many examples of this on github, and it has been applied, in particular, to GANs, where it has been shown to work very well in large-scale studies.
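A minimal sketch of that Power Method trick (a toy illustration; practical regularizers typically run only a step or two per SGD update and reuse the iterate):

```python
import numpy as np

def power_method_sigma_max(W, n_iters=100, seed=0):
    """Estimate the largest singular value of W by power iteration,
    without running a full SVD."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v                      # forward multiply
        u /= np.linalg.norm(u)
        v = W.T @ u                    # backward multiply
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)       # converges to sigma_max from below

rng = np.random.default_rng(1)
W = rng.standard_normal((100, 50))
est = power_method_sigma_max(W)
exact = np.linalg.svd(W, compute_uv=False)[0]
print(abs(est - exact) / exact < 0.01)  # True
```

Each iteration only costs two matrix-vector multiplies, which is why it is cheap enough to run inside a training loop.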

Also, since this expression is linear in $\mathbf{W}$, we can readily plug it into TensorFlow/Keras or PyTorch and use autograd to compute the derivatives. Spectral Norm regularization is available as a TensorFlow Addon, and is part of the core pyTorch 1.7 package.

Spectral Norm Regularization has not been widely used (outside say GANs) because it only works well for very deep networks. See, however, this adaption for smaller DNNs called Mean Spectral Normalization.

What do we want from a theory of learning? With WeightWatcher, we have never sought a rigorous bound. That's not the goal of our theory. We do not seek a bound because a bound describes the *worst-case behavior*; we seek to understand *the average-case behavior*. (However, what we can do is repair the Spectral Norm as a metric, as shown below.)

With the average case, we hope to be able to predict the generalization error of a DNN (without peeking at the test data). And we mean this in a very practical sense, applying to very large, production-quality models, both in training and in fine-tuning.

So what’s the difference between having a bound and analyzing average-case behavior ?

- With a mathematical bound, we can bound the error–for a single model. So this is like a prediction, and, as shown above, we can use it to develop better regularizers.

- With the average-case behavior, we want to predict trends across many different models, of both different depths (since deeper models usually perform better) and different hyperparameter settings (batch size, momentum, weight decay, etc.).

It’s not at all obvious that we can expect a mathematical bound to be correlated with trends in the test accuracy of real-world DNNs. It turns out, the Spectral Norm works pretty well–at least across pretrained DNNs of increasing depth.

Here is an example, showing how the Spectral Norm performs on the VGG series

We see that the average log Spectral norm correlates quite well with the test accuracy of the DNN architecture series of the pretrained VGG models. This is remarkable, since we do not have access to the training or the test data (or other information).

We have used WeightWatcher to analyze hundreds of pretrained models, of increasing depths, and using different data sets. Generally speaking, the average log Spectral Norm correlates well with the test accuracies of many different DNN series and for different data sets.

But not always. And that’s the rub.

Oddly, while the average log Spectral Norm is correlated with test error when changing the depth of a DNN model, it turns out to be *anti-correlated* with test error when varying the optimization hyper-parameters. This is a classic example of Simpson's paradox.
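Simpson's paradox is easy to reproduce on synthetic numbers (a toy illustration, not the contest data): within each group the metric is anti-correlated with accuracy, yet pooled across groups the correlation flips sign.

```python
import numpy as np

# two model groups: within each group, the metric is anti-correlated with accuracy,
# but the group with the larger metric also has the higher accuracies overall
metric_a = np.array([1.0, 1.2, 1.4, 1.6]); acc_a = np.array([70, 69, 68, 67])
metric_b = np.array([3.0, 3.2, 3.4, 3.6]); acc_b = np.array([90, 89, 88, 87])

within_a = np.corrcoef(metric_a, acc_a)[0, 1]
within_b = np.corrcoef(metric_b, acc_b)[0, 1]
pooled = np.corrcoef(np.concatenate([metric_a, metric_b]),
                     np.concatenate([acc_a, acc_b]))[0, 1]

print(within_a < 0, within_b < 0, pooled > 0)  # True True True
```

The pooled correlation is dominated by the between-group differences, which is exactly the structure seen in the contest data below.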

We have noted this, and it has also been pointed out by Bengio and co-workers. Indeed, an entire contest was recently set up to study this issue–the 2020 NeurIPS Predicting Generalization challenge.

Below we can see the paradox by looking at predictions for ~100 small, pre-trained VGG-like models (provided by the contest). We use WeightWatcher (version ww0.4) to compute the average `log_spectral_norm`, and compare it to the reported test accuracies for the contest set of baseline VGG-like models: **task2_v1**.

For more details, please see the contest website details, and/or our contest post-mortem Jupyter Book and paper on the contest (coming soon).

Notice that the `2xx` models have the best test accuracies and, correspondingly, as a group, the smallest average `log_spectral_norm`. Smaller error correlates with the smaller norm metric. Likewise, the `6xx` models have the smallest test accuracies as a group, and also the largest average `log_spectral_norm`.

*This is a classic example of Simpson’s Paradox.*

However, also note that, for each model group (`2xx`, `10xx`, `6xx`, and `9xx`), we can draw, roughly, a straight line showing that most of the test accuracies in that group are anti-correlated with the average `log_spectral_norm`. Now the regression is not always great, and there are outliers, but we think the general trends hold well enough for this level of discussion (and we will drill into the details in our next paper).


Here, we see a large trend across similar models, trained on the same dataset, but with different depths.

When looking closely at each model group, however, we see the reverse trend.

This makes the Spectral Norm difficult to use as a general purpose metric for predicting test accuracies.

**WeightWatcher to the Rescue**

Using WeightWatcher, however, we can repair the average log Spectral Norm metric by computing it as a weighted average–weighted by the WeightWatcher alpha metric.

Here is a similar plot on the same **task2_v1** data, but this time reporting the WeightWatcher average power-law metric alpha. Notice that alpha is well correlated within each model group, as expected when changing model hyperparameters. Moreover, alpha is not correlated with the Test Accuracy more broadly across different depths, nor is it correlated with the average log Spectral Norm (not shown).

WeightWatcher alpha tells us how correlated a single DNN model is. And we can use alpha to correct the average `log_spectral_norm` by simply taking a weighted average (called `alpha_weighted`):

$$\hat{\alpha}=\frac{1}{N}\sum_{l=1}^{N}\alpha_{l}\log\lambda_{max,l}$$
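In terms of the weightwatcher details dataframe, this weighted average can be sketched as follows (the dataframe here is hypothetical stand-in data; the real `alpha` and `log_spectral_norm` columns come from `watcher.analyze()`):

```python
import pandas as pd

# hypothetical per-layer results, standing in for watcher.analyze() output
details = pd.DataFrame({
    "alpha":             [2.5, 3.0, 3.5, 4.0],
    "log_spectral_norm": [0.8, 1.0, 1.2, 1.5],
})

# alpha_weighted: the log spectral norms, weighted by the layer alphas
alpha_weighted = (details.alpha * details.log_spectral_norm).mean()
print(round(alpha_weighted, 6))  # 3.8
```

Layers with larger alpha (less correlated) thus contribute more to the average, which is what corrects the raw norm metric.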

If we look closely, we can see in more detail how the weighted alpha corrects the average `log_spectral_norm` metric in the VGG architecture series. Below we use WeightWatcher to plot the different metrics vs. the Top 1 Test Accuracy for the many different pre-trained VGG models.

Consider the plot on the far right, and, specifically, the **pink (BN-VGG-16)** and **red (VGG-19)** dots, near test accuracy ~26. These are 2 models with both different depths (16 vs 19 layers) and different hyperparameter settings (BatchNorm or not). The two models have nearly the same accuracy, but a large variance between their average `log_spectral_norm`. Now consider the far left plot for average alpha, which shows that **red** has a smaller alpha than **pink**: the **VGG-19** model is more strongly correlated than **BN_VGG-16**. The average alpha for the **red dot is ~3.5**, whereas the **pink dot is ~3.85**. When we combine these 2 metrics, in the middle plot (alpha_weighted), the 2 models now appear much closer together. So the `alpha_weighted` metric corrects the average `log_spectral_norm`, reducing the variance between similar models, and making it more suitable for treating models of different depths and different hyperparameter settings.

The open-source WeightWatcher tool provides metrics for Deep Neural Networks that allow the user to predict (trends in) the test accuracies of Deep Neural Networks without needing the test data. The different (power-law) metrics, alpha and weighted alpha, apply to models with different hyperparameter settings and different depths, respectively. Here, we explain why.

The average alpha metric describes the amount of correlation contained in the DNN weight matrices. Smaller alpha correlates with better test accuracy for a single model with different hyperparameter settings. It is a unique metric, developed from the theory of strongly correlated systems in theoretical chemistry and physics.

The average weighted alpha metric is suited for treating a series of models with different depths, like the VGG series: VGG11, VGG13, VGG16, VGG19. It is a weighted average of the log Spectral Norm.

To explain why weighted alpha works, we have reviewed the theory and application of Spectral Norm Regularization, and the use of the average log Spectral Norm as a metric for predicting DNN test accuracies.

While theory suggests that the average log Spectral Norm might be able to predict the generalization performance of different pre-trained DNNs, in practice it is correlated with test error for models with different depths, and anti-correlated for models trained with different hyperparameters. This is a classic example of Simpson's Paradox.

We show that we can fix up the average `log_spectral_norm` (as provided in WeightWatcher) by using a weighted average, weighted by the WeightWatcher power-law layer alpha metric. And this is exactly the WeightWatcher metric `alpha_weighted`:

Try it yourself on your own DNN models.

`pip install weightwatcher `

And let me know how it goes.

**Spectral Density of 2D Convolutional Layers**

We can test this theory numerically using WeightWatcher. Notice, however, that while it is obvious how to define the weight matrix for a Dense layer, there is some ambiguity in doing this for a 2D Convolution, and in making it work as a useful metric.

WeightWatcher has 2 main methods for computing the SVD of a Conv2D layer, depending on the version. For a Conv2D layer, the options are:

- **new version (ww 0.4):** extract the 2D matrix slices from the 4D Conv2D tensor, combine these to define the final matrix, and run SVD on this. This gives 1 ESD per Conv2D layer.
- **old version (ww2x):** extract the 2D matrix slices and run SVD on each one separately. This gives one ESD per slice per Conv2D layer, and the layer metrics are then averaged over all of these slices.
- **future version (maybe):** run SVD on the linear operator that defines the Conv2D transform. This requires running SVD on the discrete FFT of each of the Conv2D input-output channels, and then combining the resulting eigenvalues into 1 very large ESD. This is quite slow, and the numerical results were not as good in early experiments.

All three methods give slightly different layer Spectral Norms, with ww 0.4 being the best estimator so far. The `ww2x=True` option is included for backward compatibility with earlier papers.
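As an illustration of the slice idea, here is a minimal numpy sketch (the exact tensor layout and stacking in WeightWatcher may differ; this is only to make the two options concrete):

```python
import numpy as np

# toy Conv2D weight tensor: (kernel_h, kernel_w, in_channels, out_channels)
rng = np.random.default_rng(0)
W4 = rng.standard_normal((3, 3, 8, 16))
k_h, k_w, N, M = W4.shape

# ww2x-style: one N x M slice per kernel element, one ESD per slice
slices = [W4[i, j] for i in range(k_h) for j in range(k_w)]
esds = [np.linalg.svd(Wij, compute_uv=False) ** 2 for Wij in slices]

# ww 0.4-style: stack the slices into one larger matrix, giving a single ESD
W2 = W4.reshape(k_h * k_w * N, M)
esd = np.linalg.svd(W2, compute_uv=False) ** 2
```

Here the ww2x route yields 9 small ESDs whose metrics must be averaged, while the stacked route yields one ESD with M = 16 eigenvalues.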

The weightwatcher tool uses power law fits to model the eigenvalue density of weight matrices of any Deep Neural Network (DNN).

The average power-law exponent is remarkably well correlated with the test accuracy when changing the number of layers and/or fine-tuning the hyperparameters. In our latest paper, we demonstrate this using a meta-analysis of hundreds of pre-trained models. This raises the question:

**Why can we model the weight matrices of DNNs using power law fits ?**

In theoretical chemistry and physics, we know that strongly correlated, complex systems frequently display power laws.

In many machine learning and deep learning models, the correlations also display heavy / power law tails. After all, the whole point of learning is to learn the correlations in the data. Be it a simple clustering algorithm, or a very fancy Deep Neural Network, we want to find the most strongly correlated parts to describe our data and make predictions.

For example, in strongly correlated systems, if you place an electron in a random potential, it will show a transition from delocalized to localized states, and the spectral density will display power law tails. This is called Anderson Localization, and Anderson won the Nobel Prize in Physics for this in 1977. In the early 90s, Cizeau and Bouchaud argued that a Wigner-Levy matrix will show a similar localization transition, and they have since modeled the correlation matrices in finance using their variant of heavy tailed random matrix theory (RMT). Even today this is still an area of active research in mathematics and in finance.

In my earlier days as a scientist, I worked on strongly correlated multi-reference *ab initio* methods for quantum chemistry. Here, the trick is to find the right correlated subspace to get a good low order description. I believe, in machine learning, the same issues arise. For this reason, I also model the correlation matrices in Deep Neural Networks using heavy tailed RMT.

Here I will show that the simplest machine learning model, Latent Semantic Analysis (LSA), shows a localization transition, and that this can be used to identify and characterize the heavy tail of the LSA correlation matrix.

Take Latent Semantic Analysis (LSA). How do we select the Latent Space? We need to select the top-K components of the TF-IDF Term-Document matrix. I believe this can be done by selecting those K eigenvalues that best fit a power law. Here is an example, using the scikit-learn 20newsgroups data:

We call this plot the Empirical Spectral Density (ESD). This is just a histogram, on a log scale, of the eigenvalues of the TF-IDF correlation matrix. The correlation matrix is the square of the TF-IDF matrix, and its eigenvalues are the singular values of the TF-IDF matrix, squared.
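This relationship is easy to check numerically; a minimal numpy sketch, with a random matrix standing in for the TF-IDF matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))   # stand-in for the TF-IDF matrix

# eigenvalues of the correlation matrix A^T A, sorted descending ...
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

# ... equal the singular values of A, squared
sv = np.linalg.svd(A, compute_uv=False)
```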

We fit the eigenvalues to a power law (PL) using the python powerlaw package, which implements a standard MLE estimator.

```
fit = powerlaw.Fit(eigenvalues)
```

The fit selects the optimal `xmin` using a brute-force search, and returns the best PL exponent alpha and the quality of the fit (the KS distance). The orange line displays the start of the power law tail, which contains the most strongly correlated eigenpairs.
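For intuition, the continuous power-law MLE for the exponent has a simple closed form (the Hill estimator); a self-contained sketch with a fixed `xmin` (the powerlaw package additionally searches over `xmin`):

```python
import numpy as np

def pl_alpha_mle(data, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(log(x / xmin))."""
    tail = data[data >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Pareto samples with true exponent alpha = 3 (density ~ x^-3 for x >= 1),
# drawn by inverse-CDF sampling
rng = np.random.default_rng(0)
samples = (1.0 - rng.random(100_000)) ** (-1.0 / 2.0)
alpha_hat = pl_alpha_mle(samples, xmin=1.0)   # close to 3.0
```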

We can evaluate the quality of the PL fit by comparing the ESD and the actual fit on log-log plot.

The **blue solid line** is the ESD on a log-log scale (the PDF), and the blue dotted line is the PL fit. (The **red solid line** is the empirical CDF, and the red dotted line the fit.) The PDF (blue) shows a very good linear fit, except perhaps for the largest eigenvalues. Likewise, the CDF (red) shows a very good fit, up until the end of the tail. This is typical of power-law fits on real-world data, and is usually best described as a Truncated Power Law (TPL), with some noise in the very far tail *(more on the noise in a future post)*. And the reported KS distance is quite small, which is exceptional.

We can get even more insight into the quality of the fit by examining how the PL method selected `xmin`, the start of the PL. Below, we plot the KS distance for each possible choice of `xmin`:

The optimization landscape is convex, with a clear global minimum at the orange line. That is, there are 547 eigenpairs in the tail of the ESD displaying strong power-law behavior.

*To form the Latent space, we select these largest 547 eigenpairs, to the right of the orange line, the start of the (truncated) power-law fit.*
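The brute-force `xmin` search can be sketched in a few lines (an illustration of the idea, not the powerlaw package internals): for each candidate `xmin`, fit alpha by MLE on the tail, and keep the `xmin` with the smallest KS distance.

```python
import numpy as np

def ks_distance(tail, xmin, alpha):
    """KS distance between the empirical tail CDF and the fitted PL CDF."""
    tail = np.sort(tail)
    emp_cdf = np.arange(1, len(tail) + 1) / len(tail)
    fit_cdf = 1.0 - (tail / xmin) ** (1.0 - alpha)
    return np.max(np.abs(emp_cdf - fit_cdf))

def best_xmin(data):
    """Scan candidate xmin values; return (D, xmin, alpha) with smallest D."""
    best = (np.inf, None, None)
    for xmin in np.unique(data)[:-10]:   # keep at least 10 points in the tail
        tail = data[data >= xmin]
        alpha = 1.0 + len(tail) / np.sum(np.log(tail / xmin))
        D = ks_distance(tail, xmin, alpha)
        if D < best[0]:
            best = (D, xmin, alpha)
    return best

# a non-PL bulk below 1.0, plus a PL tail (alpha = 3) starting at 1.0
rng = np.random.default_rng(0)
data = np.concatenate([rng.uniform(0.2, 1.0, 1000),
                       (1.0 - rng.random(1000)) ** (-0.5)])
D, xmin_hat, alpha_hat = best_xmin(data)   # xmin_hat lands near 1.0
```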

To identify the localization transition in LSA, we can plot the localization ratios of the eigenvectors in the same way, where the localization ratio is defined as in our first paper:

```
import numpy as np

def localization_ratio(v):
    # ratio of the L1 norm to the L-infinity norm of eigenvector v
    return np.linalg.norm(v, ord=1) / np.linalg.norm(v, ord=np.inf)
```

We see that we get an elbow curve, and the eigenvalue cutoff appears just to the right of the ‘elbow’:

Other methods include looking at scree plots, or even just the sorted eigenvalues themselves.

Typically, in unsupervised learning, one selects the top-K clusters, eigenpairs, etc. by looking at some so-called ‘elbow curve’, and identifying the K at the inflection point. We can make these plots too. A classic way is to plot the explained variance per eigenpair:

We see that the power-law `xmin`, the orange line, occurs just to the right of the inflection point. So these two methods give similar results. No other method, however, provides a theoretically well-defined way of selecting the K components.

I suspect that in these strongly correlated systems, the power law behavior really kicks in right at / before these inflection points. So we can find the optimal low-rank approximation to these strongly correlated weight matrices by finding that subspace where the correlations follow a power-law / truncated power-law distribution. Moreover, we can detect and characterize these correlations by both the power-law exponent alpha and the quality of the fit D.

*And AFAIK, this has never been suggested before.*

We introduce weightwatcher (ww), a python tool for computing quality metrics of trained, and pretrained, Deep Neural Networks.

pip install weightwatcher

This blog describes how to use the tool in practice; see our most recent paper for even more details.

Here is an example with pretrained VGG11 from pytorch (ww works with keras models also):

```
import weightwatcher as ww
import torchvision.models as models

model = models.vgg11(pretrained=True)
watcher = ww.WeightWatcher(model=model)
results = watcher.analyze()
summary = watcher.get_summary()
details = watcher.get_details()
```

WeightWatcher generates a dict that summarizes the empirical quality metrics for the model (with the most useful metrics)

```
summary := {
    ...
    alpha: 2.572493
    alpha_weighted: 3.418571
    lognorm: 1.252417
    logspectralnorm: 1.377540
    logpnorm: 3.878202
    ...
}
```

The tool also generates a **details** pandas dataframe, with a layer-by-layer analysis (shown below)

The summary contains the Power Law exponent (**alpha**), as well as several log norm metrics, as explained in our papers, and below. Each value represents an empirical quality metric that can be used to gauge the gross effectiveness of the model, as compared to similar models.

(The main weightwatcher notebook demonstrates more features )

For example, **lognorm** is the average, over all L layers, of the log of the Frobenius norm of each layer weight matrix W:

**lognorm**: average log Frobenius Norm := (1/L) Σ_l log ‖W_l‖_F

Where the individual layer log Frobenius norm, for, say, a Fully Connected (FC) layer, may be computed as

`np.log10(np.linalg.norm(W))`
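Putting these together, the summary `lognorm` can be sketched as a plain average over layers (a toy illustration with random weight matrices, not the full WeightWatcher pipeline):

```python
import numpy as np

def average_log_norm(weight_matrices):
    """Average log10 Frobenius norm over the layer weight matrices."""
    return float(np.mean([np.log10(np.linalg.norm(W))
                          for W in weight_matrices]))

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) for _ in range(5)]
lognorm = average_log_norm(layers)   # ~ log10(sqrt(64 * 64)) = log10(64)
```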

We can use these metrics to compare models across a common architecture series, such as the VGG series, the ResNet series, etc. These can be applied to trained models, pretrained models, and/or even fine-tuned models.

Consider the series of models VGG11, VGG11_BN, …, VGG19, VGG19_BN, available in pytorch. We can plot the various log norm metrics vs the reported test accuracies.

For a series of similar, well-trained models, all of the empirical log norm metrics correlate well with the reported test accuracies! Moreover, the Weighted Alpha and Log Norm metrics work best.

*Smaller is better*.

We also run an ordinary least squares (OLS) linear regression and report the root mean squared error (RMSE), for these and for several other CV models available in the pytorch torchvision.models package.

We have tested this on over 100 well-trained, pre-trained computer vision (CV) models, on multiple data sets (such as the ImageNet-1K subset of ImageNet). These trends hold for nearly every case of well-trained models.

Notice that the RMSE for ResNet, trained on ImageNet-1K, is larger than for ResNet trained on the full ImageNet, even though ResNet-1K has more models in the regression (19 vs 5). For the exact same model, the larger and better data set shows a better OLS fit!

We have several ideas where we hope this would be useful. These include:

- comparing different models trained using Auto-ML (in addition to standard cross-validation)
- judging the quality of NLP models for generating fake text (in addition to, say, the perplexity)
- evaluating different unsupervised clustering models, to determine which (presumably) gives the best clusters
- deciding if you have enough data, or need to add more, for your specific model or series of models.

We can learn even more about a model by looking at the empirical metrics, layer by layer. The **details** object is a dataframe that contains the empirical quality metrics for each layer of the model. An example output, for VGG11, is:

The columns contain both metadata for each layer (id, type, shape, etc), and the values of the empirical quality metrics for that layer matrix.

These metrics depend on the spectral properties of each layer weight matrix W–the singular values of W, or, equivalently, the eigenvalues of the correlation matrix of W.

WeightWatcher is unique in that it can measure the amount of correlation, or information, that a model contains–without peeking at the training or test data. Data correlation is measured by the Power Law (PL) exponents **alpha**.

WeightWatcher computes the eigenvalues (by SVD) for each layer weight matrix W, and fits the eigenvalue density (i.e. the histogram) to a truncated Power Law (PL), with PL exponent **alpha**.

In nearly every pretrained model we have examined, the Empirical Spectral Density can be fit to a truncated PL. And the PL exponent usually lies in the range 2 to 6, where smaller is better.

Here is an example of the output of `weightwatcher` for the second Fully Connected layer (FC2) in VGG11. These results can be reproduced using the WeightWatcher-VGG.ipynb notebook in the ww-trends-2020 github repo, using the options:

```
results = watcher.analyze(alphas=True, plot=True)
```

The plot below shows the ESD (Empirical Spectral Density) of the weight matrix W in layer FC2. Again, this is a (normalized) histogram of the eigenvalues of the correlation matrix of W.

The FC2 matrix is square, 512×512, and has an aspect ratio of Q=N/M=1. The maximum eigenvalue is about 45, which is typical for many heavy tailed ESDs. And there is a large peak at 0, which is normal for Q=1. Because Q=1, the ESD might look heavy tailed, but this can be deceiving, because a random matrix with Q=1 would look similar. Still, as with nearly all *well-trained* DNNs, we expect the FC2 ESD to be well fit by a Power Law model, with an exponent in the Fat Tailed Universality class (roughly 2 to 4), or at least, for a model that is not ‘flawed’ in some way, below 6.

**alpha**: the PL exponent fit for **W**.

The smaller **alpha** is, for each layer, the more correlation that layer describes. Indeed, in the best performing models, all of the layer **alphas** approach 2.

To check that the ESD is really heavy tailed, we need to check the Power Law (PL) fit. This is done by inspecting the weightwatcher plots.

The plot on the right shows the output of the powerlaw package, which is used to do the PL fit of the ESD. The fitted PL exponent is a typical value for (moderately, i.e. Fat) Heavy Tailed ESDs. Also, the KS distance is small, which is good. We can also see this visually. The dots are the actual data, and the lines are the fits. If the lines are reasonably straight, and match the dots over the fitted range, the fit is good. *And they are.* This is a good PL fit.

As shown above with ResNet vs ResNet-1K, the `weightwatcher` tool can help you decide if you have enough data, or if your model/architecture would benefit from more data. Indeed, poorly trained models, with very bad data sets, show strange behavior that you can detect using `weightwatcher`.

Here is an example with the infamous **OpenAI GPT** model, originally released as a poorly-trained model–so it would not be misused. It was deemed too dangerous to release fully trained. We can compare this deficient GPT with the new and improved **GPT2-small** model, which has basically the same architecture, but has been trained as well as possible. (Yes, they gave in and released it!) Both are in the popular `huggingface` package, and `weightwatcher` can read and analyze these models. Below, we plot a histogram of the PL exponents **alpha**, as well as a histogram of the log Spectral Norms, for each layer in GPT (blue) and GPT2 (red).

These results can be reproduced using the WeightWatcher-OpenAI-GPT.ipynb notebook in the ww-trends-2020 github repo.

Notice that the poorly-trained GPT model has many unusually high values of **alpha**. Many are above 6, and some even range up to 10 or 12! This is typical of poorly trained and/or overparameterized models.

Notice that the new and improved GPT2-small does not have the unusually high PL exponents any more, and, also, the peak of the histogram distribution is farther to the left (smaller).

**Smaller alpha is always better.**

If you have a poorly trained model, and you fix it by adding more and better data, the **alphas** will generally settle down to below 6. Note: this can not be seen in a total average, because the large values will throw the average off–to see this, make a histogram plot of **alpha**.

What about the log Spectral Norm ? It seems to show inconsistent behavior. Above, we saw that smaller is better. But now it looks as if smaller is worse ? What is going on with this…and the other empirical Norm metrics ?

Now let’s take a deeper look at how to use the empirical log Norm metrics:

Unlike the PL exponent **alpha**, the empirical Norm metrics depend strongly on the *scale of the weight matrix* **W**. As such, they are highly sensitive to problems like *Scale Collapse*–and examining these metrics can tell us when something is potentially very wrong with our models.

First, what are we looking at ? These empirical (log) Norm metrics reported are defined using the raw eigenvalues. We can compute the eigenvalues of **X** pretty easily (although actually in the code we compute the singular values of **W** using the sklearn TruncatedSVD method.)

```
import numpy as np
from sklearn.decomposition import TruncatedSVD

# compute the top M-1 singular values of W, then square them
# to get the eigenvalues of the correlation matrix X
M = np.min(W.shape)
svd = TruncatedSVD(n_components=M-1)
svd.fit(W)
sv = svd.singular_values_
eigen_values = sv * sv
```

Recall that the Frobenius norm (squared) of the matrix **W** is the sum of the eigenvalues of **X**. The Spectral Norm (squared) is just the maximum eigenvalue of **X**. The weighted alpha and the log (alpha-Schatten) Norm are computed after fitting the PL exponent **alpha** for the layer. In math, these are:

- **lognorm**: log10 ‖W‖_F = (1/2) log10 Σ λ_i
- **logspectralnorm**: log10 λ_max
- **alpha_weighted**: alpha × log10 λ_max
- **logpnorm**: log10 Σ λ_i^alpha
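Given a layer’s eigenvalues and its fitted PL exponent, these four per-layer metrics can be sketched directly (a minimal illustration only; the WeightWatcher internals differ):

```python
import numpy as np

def layer_metrics(eigen_values, alpha):
    """The four per-layer quality metrics, from the eigenvalues of X
    and the fitted PL exponent alpha."""
    lam = np.asarray(eigen_values, dtype=float)
    lam_max = lam.max()
    return {
        "lognorm": 0.5 * np.log10(lam.sum()),    # log10 ||W||_F
        "logspectralnorm": np.log10(lam_max),
        "alpha_weighted": alpha * np.log10(lam_max),
        "logpnorm": np.log10(np.sum(lam ** alpha)),
    }

m = layer_metrics([4.0, 1.0], alpha=2.0)
```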

The `weightwatcher` code computes the necessary eigenvalues, does the Power Law (PL) fits, and reports these, and other, empirical quality metrics for you, both as model averages (summary) and layer-by-layer (details). The details dataframe has many more metrics as well, but for now we will focus on these four.

Now, what can we do with them? We are going to look at 3 ways to identify potential problems in a DNN, which can not be seen by just looking at the test accuracy:

- **Correlation Flow**: comparing different architectures
- **Alpha Spikes**: identifying overparameterized models
- **Scale Collapse**: potential problems when distilling models

Using the `weightwatcher` details dataframe, we can plot the PL exponent **alpha** vs. the layer id to get what is called a *Correlation Flow* plot:

Let us do this, by comparing 3 common (pretrained) computer vision models: VGG, ResNet, and DenseNet.

These results can be reproduced using the following notebooks:

Recall that good models have average PL exponents in the range 2 to 4, the Fat Tailed Universality class. Likewise, we find that, if we plot **alpha** vs layer id, then good models also have stable **alphas**, in this range.

The VGG11 and VGG19 models have good **alphas**, all within the Fat Tailed Universality class, or smaller. And both the smaller and larger models show similar behavior. Also, notice that the last 3 FC layers in the VGG models all have smaller final **alphas**. So while the **alphas** increase as we move down the model, the final FC layers seem to capture and concentrate the information, leading to more correlated layer weight matrices at the end.

ResNet152 is an even better example of good Correlation Flow. It has a large number of **alphas** near 2, contiguously, for over 200 layers. Indeed, ResNet models have been trained with over 1000 layers; clearly the ResNet architecture supports a good flow of information.

Good *Correlation Flow* shows that the DNN architecture is learning the correlations in the data at every layer, and implies (*informally*) that information is flowing smoothly through the network.

**Good DNNs show good Correlation Flow**

We also find that models in an architecture series (VGG, ResNet, DenseNet, etc) all have similar Correlation Flow patterns, when adjusting for the model depth.

Bad models, however, have **alphas** that increase with layer id, or that behave erratically. This means that the information is not flowing well through the network, and the final layers are not fully correlated. For example, the older VGG models have **alphas** in a good range, *but*, as we go down the network, the **alphas** slowly increase.

You might think adding a lot of residual connections would improve Correlation Flow–but too many connections is also bad. The DenseNet series is an example of an architecture with too many residual connections. Here, with both the pretrained DenseNet121 and DenseNet161, we see many large **alphas**, and, looking down the network layers, the **alphas** are scattered all over. The Correlation Flow is poor, and even chaotic, and, we conjecture, less than optimal.

Curiously, the ResNet models show good flow internally, as shown when we zoom in, in (d) above. But the last few layers have unusually large **alphas**; we will discuss this phenomenon now.

**Advice**: If you are training or finetuning a DNN model for production use, use `weightwatcher` to plot the Correlation Flow. If you see alphas increasing with depth, behaving chaotically, or just a lot of alphas >> 6, revisit your architecture and training procedures.

*When is a DNN over-parameterized, once trained on some data?*

Easy…just look at the **alphas**. We have found that well-trained, or perhaps fully-trained, models should have **alphas** below 6. And the best CV models have most of their **alphas** just above 2.0. However, some models, such as the NLP OpenAI GPT2 and BERT models, have a wider range of **alphas**. And many models have several unusually large **alphas**, with alpha >> 6. What is going on? And how is it useful?

The current batch of NLP Transformer models are great examples. We suspect that many models, like BERT and GPT-xl, are over-parameterized, and that to fully use them in production, they need to be fine-tuned. Indeed, that is the whole point of these models; NLP transfer learning.

Let’s take a look at the current crop of pretrained OpenAI GPT-2 models, provided by the `huggingface` package. We call this the “good-better-best” series.

These results can be reproduced using the WeightWatcher-OpenAI-GPT2.ipynb notebook.

For both the PL exponent (a) and our Log Alpha Norm (b), *smaller is better*. The latest and greatest OpenAI GPT2-xl model (in red) has both smaller **alphas** and smaller empirical log norm metrics, compared to the earlier GPT2-large (orange) and GPT2-medium (green) models.

But the GPT2-xl model also has more outlier **alphas**:

We have seen similar behavior in other NLP models, such as comparing OpenAI GPT to GPT2-small, and the original BERT as compared to the Distilled BERT (as discussed in my recent Stanford Lecture). We suspect that when these large NLP Transformer models are fine-tuned or distilled, the **alphas** will get smaller, and performance will improve.

**Advice**: So when you fine-tune your models, monitor the **alphas** with `weightwatcher`. If they do not decrease enough, add more data, and/or try to improve the training protocols.

But you also have to be careful not to break your model, as we have found that some distillation methods may do this.

Frequently one may finetune a model, for transfer learning, distillation, or just to add more data.

*How can we know if we broke the model ?*

We have found that poorly trained models frequently exhibit Scale Collapse, in which 1 or more layers have unusually small Spectral and/or Frobenius Norms.

This can be seen in your models by plotting a histogram of the **logspectralnorm** column from the `details` dataframe.

Recall that earlier we noted the unusually large **alphas** in the poorly-trained OpenAI GPT model. This is typical of many poorly-trained models. Because of this, the log norm metrics can not be reliably used to predict trends in accuracies on poorly-trained models.

However, we can use the empirical log Norm metrics to detect problems that can not be seen by simply looking at the training and test accuracies.

We have also observed this in some distilled models. Below we look at the ResNet20 model, before and after distillation using the Group Regularization method (as described in the Intel distiller package and provided in the model zoo). We plot the Spectral Norm (maximum eigenvalue) and PL exponent alpha vs. the layer_id (depth) for both the baseline (green) and finetuned /distiller (red) ResNet20 models.

These results can be reproduced by installing the distiller package, downloading the model zoo pretrained models, and running the WeightWatcher-Intel-Distiller-ResNet20.ipynb notebook in the distiller folder. (We do note that these are older results, obtained with older versions of both `distiller` and `weightwatcher`, which used a different normalization on the Conv2D layers. Current results may differ, although we expect to see similar trends.)

Notice that the baseline and finetuned ResNet20 have similar PL exponents (b) for all layers, but for several layers in (a), the Spectral Norm (maximum eigenvalue) collapses in value. That is, the *scale collapses*. This is bad, and characteristic of a poorly trained model like the original GPT.

**Advice**: if you finetune a model, use `weightwatcher` to monitor the **log Spectral Norms**. If you see unusually small values, something is wrong.

Our latest paper is now on arXiv.

Please check out the github webpage for WeightWatcher and the associated papers and online talks at Stanford, UC Berkeley, and the wonderful podcasts that have invited us on to speak about the work.

If you want to get more involved, reach out to me directly at *charles@calculationconsulting.com*

And remember–if you need help at your company with AI, Deep learning, and Machine Learning, please reach out. Calculation Consulting

For the past year or two, we have talked a lot about how we can understand the properties of Deep Neural Networks by examining the spectral properties of the layer weight matrices W. Specifically, we can form the correlation matrix

X = (1/N) W^T W,

and compute its eigenvalues

X v_i = λ_i v_i.

By plotting the histogram of the eigenvalues (i.e. the spectral density), we can monitor the training process and gain insight into the implicit regularization and convergence properties of a DNN. Indeed, we have identified 5+1 Phases of Training.

Each of these phases roughly corresponds to a Universality class from Random Matrix Theory (RMT). And as we shall see below, we can use RMT to develop a new theory of learning.

First, however, we note that for nearly every pretrained DNN we have examined (over 450 in all), the phase appears to be somewhere between Bulk-Decay and Heavy-Tailed.

Moreover, for nearly all DNNs, the spectral density can be fit to a truncated power law, with exponents frequently lying in the Fat Tailed range [2, 4], and the maximum eigenvalue no larger than, say, 100.

Most importantly, in 80-90% of the DNN architectures studied, on average, smaller exponents correspond to smaller test errors.

Our empirical results suggest that the power law exponent can be used as (part of) a practical capacity metric. This led us to propose the weighted alpha capacity metric for DNNs:

alpha_hat = (1/L) Σ_l α_l log10 λ_max,l

where we compute the exponent α_l and maximum eigenvalue λ_max,l for each layer weight matrix (and Conv2D feature map), and then form the total DNN capacity as a simple weighted average of the exponents. Amazingly, this metric correlates very well with the reported test accuracy of pretrained DNNs (such as the VGG models, the ResNet models, etc.)
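This weighted average can be sketched directly from the per-layer fits (the per-layer values here are hypothetical, for illustration only):

```python
import numpy as np

def weighted_alpha(alphas, lambda_maxes):
    """alpha_weighted: average over layers of alpha_l * log10(lambda_max_l)."""
    a = np.asarray(alphas, dtype=float)
    lam = np.asarray(lambda_maxes, dtype=float)
    return float(np.mean(a * np.log10(lam)))

# hypothetical per-layer PL exponents and maximum eigenvalues
w_alpha = weighted_alpha([2.5, 3.0, 4.0], [10.0, 20.0, 40.0])
```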

We have even built an open source python command line tool–weightwatcher–so that other researchers can both reproduce and leverage our results:

`pip install weightwatcher`

And we have a Slack Channel for those who want to ask questions, dig deeper, and/or contribute to the work. Email me, or ping me on LinkedIn, to join our vibrant group.

All of this leads to a very basic question:

To answer this, we will go back to the foundations of the theory of learning, from the physics perspective, and rebuild the theory using both our experimental observations, some older results from Theoretical Physics, and (fairly) recent results in Random Matrix Theory.

Here, I am going to sketch out the ideas we are currently researching to develop a new theory of generalization for Deep Neural Networks. We have a lot of work to do, but I think we have made enough progress to present these ideas, informally, to flesh out the basics.

**What do we seek ? ** A practical theory that can be used to predict the generalization accuracy of a DNN solely by looking at the trained weight matrices, without looking at the test data.

**Why ? ** Do you test a bridge by driving cars over it until it collapses ? Of course not! So why do we build DNNs and only rely on brute force testing ? Surely we can do better.

**What is the approach ?** We start with the classic Perceptron Student-Teacher model from the Statistical Mechanics of the 1990s. The setup is similar, but the motivations are a bit different. We have discussed this model earlier, here: Remembering Generalization in DNNs, and in our paper on Rethinking Generalization.

Here, let us review the mathematical setup in some detail:

We start with the simple model presented in chapter 2 of Engel and Van den Broeck, interpreted in a modern context.

Here, we want to do something a little different, and use the formalism of Statistical Mechanics both to compute the average generalization error, and to interpret the global convergence properties of DNNs in light of it, giving us more insight into, and a new theory of, Why Deep Learning Works (as proposed in 2015).

Suppose we have some trained or pretrained DNN (i.e. like VGG19). We want to compute the average / typical error that our Teacher DNN could make, just by examining the layer weight matrices. *Without peeking at the data.*

**Conjecture 1**: *We assume all layers are statistically independent, so that the average generalization capacity (i.e. 1.0 − error) is just the product of the contributions from each layer weight matrix.*

**Example:** The Product Norm is a Capacity measure for DNNs from traditional ML theory.

The Norm may be the Frobenius Norm, the Spectral Norm, or even their ratio, the Stable Rank.

This independence assumption is probably not a great approximation but it gets us closer to a realistic theory. Indeed, even traditional ML theory recognizes this, and may use Path Norm to correct for this. For now, this will suffice.

**Caveat 1:** If we take the logarithm of each side, we can write the log Capacity as the sum of the layer contributions. More generally, we will express the log Capacity as a weighted average of some (as yet unspecified) log norm of the weight matrix.

We now set up the classic Student-Teacher model for a Perceptron–with a slight twist. That is, from now on, we assume our models have 1 layer, like a Perceptron.

Let’s call our trained or pretrained DNN the Teacher **T**. The Teacher maps data to labels. Of course, there could be many Teachers which map the same data to the same labels. For **our** specific purposes here, we just fix the Teacher **T**. We imagine that the learning process is for us to learn all possible Student Perceptrons **J** that also map the data to the labels, in the same way as the Teacher.

But for a pretrained model, we have no data, and we have no labels. And that’s ok. Following Engel and Van den Broeck (and also Engel’s 2001 paper), consider the following Figure, which depicts the vector space representations of **T** and **J**.

To compute the average generalization error, we write the total error as the sum of the errors over all possible Students **J** for a given Teacher **T**. And we model this error with the inverse cosine (arc cosine) of the normalized dot product, or overlap R, between **J** and **T**:

ε = (1/π) arccos(R), where R = (1/N) J·T

**For our purposes**, if, instead of N-dimensional vectors, we let **T** and **J** be N×M weight matrices, then the dot product becomes a Solid Angle. (Note: the error formula is no longer exact, since **J** and **T** are matrices, not vectors, but *hopefully* this detail won’t matter here, since we are going to integrate this term out below. This remains to be worked out.)
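For the vector (Perceptron) case, the overlap error is easy to compute numerically; a toy sketch, assuming the standard normalized overlap R = J·T / (‖J‖‖T‖):

```python
import numpy as np

def generalization_error(J, T):
    """Classic Student-Teacher error: eps = arccos(R) / pi,
    where R is the normalized overlap between Student J and Teacher T."""
    R = np.dot(J, T) / (np.linalg.norm(J) * np.linalg.norm(T))
    return np.arccos(np.clip(R, -1.0, 1.0)) / np.pi

N = 100
T = np.ones(N)                                   # fix a Teacher
eps_same = generalization_error(T, T)            # identical Student: 0 error
J = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])
eps_orth = generalization_error(J, T)            # orthogonal Student: 1/2
```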

This formalism lets us use the machinery of Statistical Mechanics to write the total error as an integral over all possible Student vectors **J**, namely, the phase space volume of our model:

where the first delta function enforces the normalization condition, or spherical constraint, on the Student vectors **J**, and the second delta function acts as a kind of energy potential.

The normalization can be subsumed into a general measure, as

which actually provides us with a more general expression for the generalization error (where we recall the arccos form is not quite correct for matrices).

Now we will deviate from the classic Stat Mech approach of the 90s. In the original analysis, one wants to compute the phase space volume as a function of the macroscopic thermodynamic variables, such as the size of the training set, and study the learning behavior. We have reviewed these classic results in our 2017 paper.

We note that, for the simple Perceptron, the Student and Teacher are represented as N-dimensional vectors, and the most interesting physics arises in the Ising Perceptron, when the elements are discrete:

**Continuous Perceptron**: (uninteresting behavior)

**Ising Perceptron**: (phase transitions, requires Replica theory, …)

And in our early work, we proposed how to interpret the expected phase behavior in light of experimental results (at Google) that seemed to require Rethinking Generalization. Here, we want to reformulate the Student-Teacher model in light of our own recent experimental studies of the spectral properties of real-world DNN weight matrices from production quality, pretrained models.

**Our Proposal:** We let **T** and **J** be strongly correlated (NxM) real matrices, with truncated, Heavy Tailed ESDs. Specifically, we assume that we know the Teacher **T** weight matrices exactly, and seek all Student matrices **J** that have the same spectral properties as the Teacher.

We can think of the class of Student matrices **J** as all matrices that are close to **T**. What we really want is the best method for measuring this closeness, one that has been tested experimentally. Fortunately, Hinton and coworkers have recently revisited the Similarity of Neural Network Representations, and found that the best matrix similarity method is

**Canonical Correlation Analysis (CCA):**

Using this, we generalize the Student-Teacher vector-vector overlap, or dot product, to the Solid Angle between the **J** and **T** matrices, and plug this directly into our expression for the phase space volume. (WLOG, we absorb the normalization N into the matrices.)
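The matrix generalization of the overlap can be sketched numerically. Below is a minimal sketch of a normalized trace overlap between two weight matrices; the helper `matrix_overlap` is illustrative only (it is not a weightwatcher API, and it is a simpler surrogate than full CCA):

```python
import numpy as np

def matrix_overlap(J, T):
    """Normalized trace overlap between two weight matrices.

    A simple stand-in for the Solid Angle between J and T:
    equals 1.0 when J is a positive scalar multiple of T, and is
    near 0 for unrelated random matrices. (Hypothetical helper,
    not part of weightwatcher.)
    """
    num = np.trace(J.T @ T)
    den = np.linalg.norm(J, "fro") * np.linalg.norm(T, "fro")
    return num / den

rng = np.random.default_rng(0)
T = rng.standard_normal((100, 50))

same = matrix_overlap(2.0 * T, T)                     # aligned Student
rand = matrix_overlap(rng.standard_normal((100, 50)), T)  # random Student
```

A Student that is just a rescaled Teacher gives overlap 1.0, while an independent random Student gives an overlap near zero, which is exactly the behavior the solid-angle picture needs.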

We now take the Laplace Transform of the phase space volume, which allows us to integrate over all possible errors that all possible Students might make:

Note: This is different from the general approach to Gibbs Learning at non-zero Temperature (see Engel and Van den Broeck, chapter 4). The Laplace Transform converts the delta function into an exponential, giving
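Schematically, the transform in question is just the elementary identity (with β the effective inverse Temperature and ε(**J**) the Student error):

```latex
\int d\epsilon \; e^{-\beta\epsilon}\,
\delta\bigl(\epsilon-\epsilon(\mathbf{J})\bigr)
\;=\; e^{-\beta\,\epsilon(\mathbf{J})}
```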

**Conjecture 2:** *We can write the layer matrix contribution to the total average generalization error as an integral over all possible (random) matrices **J** that resemble the actual (pre-)trained weight matrices.*

Notice that this expression resembles a classical partition function from statistical field theory, except that instead of integrating over the vector-valued **p** and **q** variables, we have to integrate over a class of random matrices **J**. The new expression for the generalization error is like a weighted average over all possible errors (where the effective inverse Temperature is set by the scale of the empirical weight matrices |**W**|). This is the key observation, and it requires some modern techniques to carry out.

These kinds of integrals traditionally appeared in Quantum Field Theory and String Theory, but also in the context of Random Matrix Theory applied to Levy Spin Glasses. And it is this early work on Heavy Tailed Random Matrices that motivated our empirical work. Here, to complement and extend our studies, we lay out an (incomplete) overview of the Theory.

These integrals are called Harish-Chandra–Itzykson–Zuber (*HCIZ*) integrals. A good introductory reference on both RMT and HCIZ integrals is the recent book “A First Course in Random Matrix Theory”, although we will base our analysis here on the results of the 2008 paper by Tanaka.

First, we need to re-arrange a little of the algebra. We will call **A** the Student correlation matrix:

and let **W**, **X** be the original weight and correlation matrices for our pretrained DNN, as above:


and then expand the CCA Similarity metric as

We can now express the log HCIZ integral, using Tanaka’s result, as an expectation value over all random Student correlation matrices **A** that *resemble* **X**.

And this can be expressed as a sum over Generating functions that depend only on the statistical properties of the random Student weight matrices **A**. Specifically,

where \(R_{\mathbf{A}}(z)\) is the R-Transform from RMT.

**The R Transform** is like an inverse Green’s function (i.e. a Contour Integral), and is also a cumulant generating function. As such, we can write it as a series expansion

where the coefficients are **Generalized Cumulants** from RMT.
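In free probability, this series takes the standard form (writing \(c_k\) for the generalized, or free, cumulants):

```latex
R_{\mathbf{A}}(z) \;=\; \sum_{k=1}^{\infty} c_k \, z^{\,k-1}
```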

Now, since we expect the best Student matrices to *resemble* the Teacher matrices, we expect the Student correlation matrix **A** to have similar spectral properties to our actual correlation matrix **X**. And this is where we can use our classification of the *5+1 Phases of Training*: whatever phase **X** lies in, we model **A** as being in the same phase.

That is, if our DNN weight matrix has a Heavy Tailed ESD

then we expect all of the Students to likewise have a Heavy Tailed ESD, with the same power-law exponent (at least for now).

**Quenched vs Annealed Averages**

Formally, we just say we are averaging over all Students **A**. More technically, what we really want to do is fix some Student matrix (say, **A** = diag(**X**)), and then integrate over all possible Orthogonal transformations **O** of **A** (see section 6.2.3 of Potters and Bouchaud).

Then, we integrate over all possible **A** ~ diag(**X**), which accounts for fluctuations in the eigenvalues. We conceptually assume this is the same as integrating over all possible Students **A**, and then taking the log.
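Schematically, the two orders of averaging (with Z the phase space volume) are:

```latex
\underbrace{\;\langle \ln Z \rangle_{\mathbf{A}}\;}_{\text{Quenched}}
\;\ne\;
\underbrace{\;\ln \langle Z \rangle_{\mathbf{A}}\;}_{\text{Annealed}}
```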

The LHS is called the Quenched Average, and the RHS the Annealed. Technically, they are not the same, and in traditional Stat Mech theory this makes a big difference. In fact, in the original Student-Teacher model, one would also average over all Teachers, chosen uniformly (to satisfy the spherical constraints).

Here, we are doing RMT a little differently, in a way that may not be obvious until the end of the calculation. We do not assume a priori a model for the Student matrices. That is, instead of fixing **A** = diag(**X**), we will fit the ESD of **X** to a *continuous* (power law) distribution, and then *effectively sample* over all **A** as if we had drawn the eigenvalues of **A** from that distribution. (In fact, I suppose we could actually do this numerically instead of doing all this fancy math, but what fun is that?)
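For the curious, the numerical version of this sampling idea can be sketched in a few lines of numpy: fit a power law to a (synthetic) ESD, then draw Student eigenvalues from the fitted distribution. The crude MLE (Hill-style) fit below is a stand-in for weightwatcher’s more careful power-law fitting, and the synthetic ESD is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic heavy tailed "Teacher" ESD with known exponent.
# (1 + pareto(a)) is a classical Pareto with density ~ x^{-(a+1)}, x >= 1,
# so a = alpha - 1 gives an ESD tail exponent of alpha.
alpha_true = 2.5
xmin = 1.0
evals = xmin * (1.0 + rng.pareto(alpha_true - 1.0, size=5000))

# Crude MLE (Hill estimator) of the power law exponent of the ESD.
alpha_hat = 1.0 + len(evals) / np.sum(np.log(evals / xmin))

# "Effectively sample" a Student ESD from the fitted distribution.
student_evals = xmin * (1.0 + rng.pareto(alpha_hat - 1.0, size=5000))
```

With 5000 eigenvalues, the fitted exponent lands close to the true value, and the Student ESD drawn from the fit shares the Teacher’s spectral statistics, which is exactly the assumption the integral makes.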

The point is, we want to find an expression for the HCIZ integral (i.e. the layer / matrix contribution to the Generalization Error) that depends only on observations of **W**, the weight matrix of the pretrained DNN (our Teacher network). The result depends only on the eigenvalues of **X** and the R-transform of **A**, which is parameterized by statistical information from **X**.

In principle, I suppose we could measure the generalized cumulants of **X** directly and plug these in. We will do something a little easier.

Let us consider 2 classes of matrices as models for **X**.

**Gaussian (Wigner) Random Matrix:** Random-Like Phase

The R-Transform for a Gaussian (Wigner) Random matrix is well known: \(R(z) = \sigma^2 z\).

Taking the integral and plugging this into the Generating function, we get

So when **X** is Random-Like, the layer / matrix contribution is like the (squared) Frobenius Norm, and thus the average Generalization Error is given by a (squared) Frobenius Product Norm.
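This Frobenius-norm connection is easy to sanity-check numerically: for the layer correlation matrix X = WᵀW, the sum of the eigenvalues (the trace) is exactly the squared Frobenius norm of W. A quick numpy check:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((200, 100)) / np.sqrt(200)  # Random-Like layer
X = W.T @ W                                         # layer correlation matrix
evals = np.linalg.eigvalsh(X)

trace_X = evals.sum()                       # sum of eigenvalues of X
frob_sq = np.linalg.norm(W, "fro") ** 2     # squared Frobenius norm of W
```

The two quantities agree to machine precision, so spectral sums over a Random-Like X reduce to Frobenius norms of W, as the text says.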

**Levy Random Matrix:** Very Heavy Tailed Phase

We don’t have results (yet) covering the full Very Heavy Tailed Phase, but, as we have argued previously, due to finite size effects, we expect that the Very Heavy Tailed matrices appearing in DNNs will more resemble Levy Random matrices than the Random-Like Phase. So for now, we will close one eye and extend the Levy results to the heavier-tailed exponents seen in real DNNs.

The R-Transform for a Levy Random Matrix has been given by Burda:

Taking the integral and plugging this into the Generating function, we get

**Towards our Heavy Tailed Capacity Metric**

1. Let us pull the power law exponent \(\alpha\) out of the Trace, effectively ignoring cross terms in the sum over eigenvalues.

2. We also assume we can replace the Trace with its largest eigenvalue \(\lambda_{max}\), which is actually a good approximation for very heavy tailed Levy matrices, where \(\lambda_{max}\) dominates the spectrum.

This gives a simple expression for the HCIZ integral, i.e. the layer contribution to the generalization error.

Taking the logarithm of both sides gives our expression, \(\alpha \log \lambda_{max}\).

We have now derived our Heavy Tailed Capacity metric using a matrix generalization of the classic Student-Teacher model, with the help of some modern Random Matrix Theory.
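The resulting metric can be computed per layer from the ESD alone. Here is a minimal numpy sketch; the crude quantile-based MLE fit is a simplified stand-in for weightwatcher’s power-law fitting, and the function name is illustrative, not a weightwatcher API:

```python
import numpy as np

def ht_capacity_metric(W, xmin_quantile=0.5):
    """Heavy Tailed Capacity sketch: alpha * log10(lambda_max).

    A simplified stand-in for weightwatcher's alpha-weighted metric:
    fit the ESD tail with a crude MLE and weight by log lambda_max.
    """
    evals = np.linalg.eigvalsh(W.T @ W)   # ESD of the correlation matrix
    lambda_max = evals[-1]                # eigvalsh returns ascending order
    xmin = np.quantile(evals, xmin_quantile)
    tail = evals[evals >= xmin]
    alpha = 1.0 + len(tail) / np.sum(np.log(tail / xmin))
    return alpha * np.log10(lambda_max)

rng = np.random.default_rng(1)
metric = ht_capacity_metric(rng.standard_normal((300, 100)))
```

In practice, weightwatcher computes this kind of layer metric for you across an entire model, but the sketch shows that the final quantity needs nothing more than the layer’s eigenvalues.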

**QED**

I hope this has convinced you that there is still a lot of very interesting theory to develop for AI / Deep Neural Networks, and that you will stay tuned for the published form of this work. And remember…

`pip install weightwatcher`

A big thanks to Michael Mahoney at UC Berkeley for collaborating with me on this work, and to Mirco Milletari’ (Microsoft), who has been extremely helpful. And to my good friend Matt Lee (Triaxiom Capital, LLC) for long discussions about theoretical physics, RMT, quant finance, etc., and for encouraging me to publish this.

And here’s the Empirical Evidence that this actually works:

https://www.nature.com/articles/s41467-021-24025-8

**Podcast about this work:**

Thanks to Miklos Toth for interviewing me to discuss this (Listen on SoundCloud):


My Collaborator did a great job giving a talk on our research at the local San Francisco Bay ACM Meetup

Michael W. Mahoney UC Berkeley

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a “size scale” separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that—all else being equal—DNN optimization with larger batch sizes leads to less-well implicitly-regularized models, and it provides an explanation for the generalization gap phenomena. Joint work with Charles Martin of Calculation Consulting, Inc.

Bio: https://www.stat.berkeley.edu/~mmahoney/

Michael W. Mahoney is at UC Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics. He has worked and taught at Yale University in the Math department, Yahoo Research, and Stanford University in the Math department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI). He was on the National Research Council’s Committee on the Analysis of Massive Data. He co-organized the Simons Institute’s fall 2013 program on the Theoretical Foundations of Big Data Analysis, and he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. He is currently the lead PI for the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. He holds several patents for work done at Yahoo Research and as Lead Data Scientist for Vieu Labs, Inc., a startup re-imagining consumer video for billions of users.

More information is available at https://www.stat.berkeley.edu/~mmahoney/

Long version of the paper (upon which the talk is based): https://arxiv.org/abs/1810.01075

http://www.meetup.com/SF-Bay-ACM/

http://www.sfbayacm.org/
