```
details = watcher.analyze(..., fix_fingers='clip_xmax', ...)
```

This will take slightly longer, and will yield more reliable alphas for your model layers, along with a new column, *num_fingers*, which reports the number of outliers found.

Note that other metrics, such as *alpha_weighted* (alpha-hat), will not be affected.

It is recommended that *fix_fingers* be added for all analyses moving forward; however, we have not yet completed a detailed study of its impact, so it is included as an advanced feature.

So why do this?

In our Nature Communications paper (2021), we looked at the layer alphas for the GPT2 models and, among other things, noticed several large alphas, greater than 8 and even 10! These are outliers because no alpha should be greater than 8 (which is the minimum alpha for a random matrix).

Until now, one just had to accept these as errant and/or remove them from the analysis. But now, I can explain where they come from and how to adjust the calculations when necessary.

Recall that the HTSR and SETOL theories describe layer quality by analyzing the histogram of the layer's eigenvalues; that is, we analyze the ESD (Empirical Spectral Density). The best-trained layers have a heavy-tailed ESD which can be well fit to a Power Law (PL).

Fingers (i.e., outliers) tend to appear when the layer ESD is really a Truncated Power Law (TPL), as opposed to a simple Power Law (PL). And that's OK; it is actually predicted by the HTSR theory for very large and/or high-quality layers. More precisely, we expect that the local statistics of the far tail will be Frechet.

So why not just fit a TPL? Weightwatcher does support this (i.e., `fit='TPL'`), but it is very slow, and, more importantly, we usually don't have enough eigenvalues in the tail of the ESD to get a good TPL fit; instead, there are one or a few very large eigenvalues that degrade the PL fit.

*We call these very large eigenvalues fingers because they look like fingers peeking out of the ESD tail.*

How can we account for these outliers or fingers? Just remove them!
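To illustrate the idea (a minimal numpy sketch, not the exact weightwatcher implementation), we can generate a synthetic power-law "ESD" with a few artificial fingers appended, and show that clipping everything above a chosen xmax before the MLE fit recovers a stable exponent. The sample sizes, finger values, and xmax threshold below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an ESD: 400 eigenvalues from a Pareto tail with
# true PL exponent alpha = 3 (density ~ x^(-3) for x >= 1)
evals = 1.0 + rng.pareto(2.0, size=400)

# A few "fingers": anomalously large eigenvalues that degrade the PL fit
fingers = np.array([1e6, 5e6, 1e7, 5e7, 1e8])
evals_with_fingers = np.concatenate([evals, fingers])

def pl_alpha_mle(x, xmin=1.0, xmax=None):
    """Clauset-style MLE for the PL exponent, optionally clipping at xmax."""
    x = x[x >= xmin]
    if xmax is not None:
        x = x[x <= xmax]           # drop everything above xmax (the fingers)
    return 1.0 + len(x) / np.sum(np.log(x / xmin))

alpha_raw  = pl_alpha_mle(evals_with_fingers)             # biased by the fingers
alpha_clip = pl_alpha_mle(evals_with_fingers, xmax=1e3)   # fingers clipped away
```

With the fingers clipped, the estimate returns to the true exponent; left in, they systematically bias the fit.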

If we remove the TPL fingers when they appear, and we are careful, we can very frequently recover a very good PL fit. (And this works better than using the `fit='TPL'` option.) Let's look at an example…

But before we do this, let’s discuss *why* they might appear.

The weightwatcher project is motivated by the theory of Self Organized Criticality (SOC), and the amazing fact that it has been successfully applied to understand the observed behavior of real-world (biological) spiking neurons. This is called the Critical Brain Hypothesis.

The weightwatcher theory posits that Deep Neural Networks (DNNs) exhibit the same signatures of criticality–namely power law behavior–as frequently seen in neuroscience experiments on real neurons.

Moreover, I believe that as LLMs (Large Language Models) approach true Self-Organized Criticality, we will see even more amazing properties emerge.


Let’s see how the new and improved *fix_fingers* works on GPT and GPT2.

First, we need to download the models (from HuggingFace):

```
from transformers import OpenAIGPTModel, GPT2Model
gpt_model = OpenAIGPTModel.from_pretrained('openai-gpt')
gpt_model.eval();
gpt2_model = GPT2Model.from_pretrained('gpt2')
gpt2_model.eval();
```

Now, let’s run weightwatcher

```
import weightwatcher as ww
watcher = ww.WeightWatcher()
gpt_details = watcher.analyze(model=gpt_model, fix_fingers='clip_xmax')
gpt2_details = watcher.analyze(model=gpt2_model, fix_fingers='clip_xmax')
```

Finally, we can generate a histogram of the layer alphas

```
import matplotlib.pyplot as plt

gpt_details.alpha.plot.hist(bins=100, color='red', alpha=0.5, density=True, label='gpt')
gpt2_details.alpha.plot.hist(bins=100, color='green', density=True, label='gpt2')
plt.legend()
plt.xlabel(r"alpha $(\alpha)$ PL exponent")
plt.title(r"GPT vs GPT2 layer alphas $(\alpha)$ w/Fixed Fingers")
```

Here are the results

We see that the GPT model still has several fingers (alphas >= 6) and, on average, larger alphas than GPT2. But there are fewer fingers than before.

We have introduced the new and improved option

```
details = watcher.analyze(..., fix_fingers='clip_xmax', ...)
```

which will provide more stable and smaller alphas for layers that are well-trained.

I encourage you to try it and see if it works better for you. And please let me know.

The weightwatcher tool has been developed by Calculation Consulting. We provide consulting to companies looking to implement Data Science, Machine Learning, and/or AI solutions. Reach out today to learn how to get started with your own AI project. Email: Info@CalculationConsulting.com

The latest release of the open-source weightwatcher tool includes several important advances, including:

- removing explicit dependence on tensorflow and torch on install
- the ability to process very large models, directly from their PyTorch `state_dict` files
- GPU-enabled SVD calculations
- Much faster and more stable power law calculations
- Lower memory footprint on GPU enabled machines
- an improved method for finding the weightwatcher shape-metric alpha, with the option
`fix_fingers='clip_xmax'` to remove structural outliers, called fingers
- a new landing page: https://weightwatcher.ai with lots of examples

To learn more now, join our Discord channel. (*Documentation to come.*)

**WeightWatcher is a one-of-a-kind must-have tool for anyone training, deploying, or monitoring Deep Neural Networks (DNNs).**

**WeightWatcher** (WW) is an open-source, diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is grounded in theoretical research into Why Deep Learning Works, based on our Theory of Heavy-Tailed Self-Regularization (HT-SR). It uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.

It can be used to:

- analyze pre-trained PyTorch and Keras DNN models (Conv2D and Dense layers)
- monitor models, and the model layers, to see if they are over-trained or over-parameterized
- predict test accuracies across different models, with or without training data
- detect potential problems when compressing or fine-tuning pretrained models
- label layers with warnings: over-trained; under-trained

It is based on theoretical research into Why Deep Learning Works, using the new Theory of Heavy-Tailed Self-Regularization (HT-SR), published in JMLR and Nature Communications.


The open-source weightwatcher tool has been featured in Nature Communications and has over 86K downloads. It can help you diagnose problems in your DNN models, layer-by-layer, without even needing access to the test or training data. But how can weightwatcher possibly do this?

In a previous post, I presented the current working theory for the weightwatcher project, which explains where the weightwatcher power-law shape metrics, alpha and alpha-hat, really come from. This post has been written up in a new paper, and an early draft is available upon request (email me).

We can write the weightwatcher metric as an approximate Free Energy/Likelihood, given as the log model Quality, where the model quality might be the model test accuracy, an average GLUE score, etc. The SETOL approach expresses this in terms of an HCIZ integral.

It is shown in the SETOL paper that when a layer is well trained, the weightwatcher alpha-hat metric is then the approximate log quality.

In order to formulate this, however, we need to invoke a change of measure from the distribution over the full space of all (Student) weight matrices to the space of all (Student) *correlation* matrices, where each point in the new space represents an arbitrary ‘Student’ correlation matrix.

What does this mean? And, more importantly, how can we be sure it is correct?

First, some more familiar notation. Given an arbitrary layer weight matrix **W**, of dimension N x M, we define the correlation matrix **X** = **W**^T **W**. In this general case, we assume we can perform the change of measure from layer weight matrices **W** to general *correlation* matrices **X**.
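As a quick sanity check on this notation (a generic numpy sketch, not weightwatcher code): the eigenvalues of the correlation matrix X = W^T W are exactly the squared singular values of W, which is why analyzing the ESD of X is equivalent to analyzing the singular value spectrum of the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 50))              # an N x M layer weight matrix
X = W.T @ W                                 # the M x M correlation matrix

evals = np.linalg.eigvalsh(X)               # eigenvalues of X (ascending order)
svals = np.linalg.svd(W, compute_uv=False)  # singular values of W (descending)
# the ESD of X is just the distribution of the squared singular values of W
```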

In words, this means that we need to change our concept of where the generalizing components live and how we effectively measure them.

For a very well-trained DNN layer, the correlations concentrate into a lower-rank space, effectively defined by the eigenvalues in the power-law (PL) tail of the ESD. By this, we mean that there is some effective operator that spans the same space as the tail of the ESD (Empirical Spectral Density) and has the same eigenvalues as those in the tail of the original layer correlation matrix.

This is consistent with the HTSR theory (published in JMLR), which states that for any well-trained weight matrix, the ESD of the layer not only forms a power law, but also that this tail contains the dominant eigencomponents that allow the model to generalize.

More on what precisely this effective operator is in another post; suffice it to say that right now, we just care about the eigenvalues. And we can test this theory by simply looking at the eigenvalues of our actual weight matrices.

In the rigorous formulation of the weightwatcher theory, to accomplish this, we require that the *Effective Correlation Space* be defined by a *Volume-Preserving Transformation*.

If we (crudely) write this transformation in terms of a Jacobian, one can see that, to change the measure from the uncorrelated to the correlated measure, the determinant of the Jacobian of this transformation should be the identity (i.e., equal to 1). This can be shown exactly (and is derived in the appendix of the new theory paper).

The key assumption of the weightwatcher theory is an empirically verifiable approximation:

- The eigenvalues in the tail of the ESD satisfy the determinant condition: their product equals 1.
- That is, the effective correlation matrix has determinant 1 (det X = 1).
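A minimal numerical sketch of how one might search for this condition (a hypothetical helper, not the actual weightwatcher `detX` implementation): scan candidate tail starts and pick the one whose tail eigenvalues have a log-determinant closest to zero, i.e., whose product is closest to 1.

```python
import numpy as np

def detX_xmin(evals):
    """Find the start of the tail where the product of tail eigenvalues ~ 1.

    Scans every candidate tail start and returns the eigenvalue whose tail
    best satisfies det(X_tail) = prod(tail) = 1, i.e. sum(log(tail)) = 0.
    """
    evals = np.sort(evals)                      # ascending order
    # tail_logdet[i] = sum of log(evals[i:]), the log-determinant of each tail
    tail_logdet = np.log(evals)[::-1].cumsum()[::-1]
    i = np.argmin(np.abs(tail_logdet))          # tail whose log-det is closest to 0
    return evals[i]
```

For example, for eigenvalues `[0.1, 0.2, 0.5, 2.0]`, the tail `[0.5, 2.0]` has product exactly 1, so the condition places the start of the tail at 0.5.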

This means we now have 2 independent methods to identify the power-law tail (indicated with a **red** or **purple** line):

- **MLE fit (red line)**: Apply the standard (Clauset MLE) PL estimator, and find the eigenvalue where the PL tail starts. This is the default weightwatcher approach.
- **det X = 1 (purple line)**: Search for the first eigenvalue that satisfies the det X = 1 condition.

When these 2 methods coincide, we can have great confidence that the theory applies well. At least I do.

This constraint can be evaluated empirically by plotting the ESD on a log-linear plot, and comparing the point where the PL fit says the tail of the ESD starts (**red line**) to where the det X = 1 constraint is best satisfied (**purple line**). When these 2 lines overlap, the theory works best. And when they don’t overlap, the layer is not fully optimized.

Remarkably, this is very frequently satisfied when the ESD is power law, with exponent alpha close to 2. And this, according to the HTSR theory, corresponds to optimal learning in that layer!

Here, we will provide some justification for these key assumptions, using the open-source weightwatcher tool. All of these results are 100% reproducible, and, moreover, you can test the assumptions on other models yourself.

As always, you can generate these results yourself on any model you like using weightwatcher. Simply use the following commands in your favorite notebook (Jupyter, Google Colab, etc.):

```
import weightwatcher as ww
watcher = ww.WeightWatcher()
details = watcher.analyze(model=your_model, plot=True, detX=True)
```

When the theory is working perfectly, or at least as prescribed, then

- the weightwatcher PL shape metric alpha will be about 2, and
- the detX constraint plot will show the **purple** and **red** lines very close together, if not overlapping.

In the theory paper, we show this for a simple example; however, it turns out this is very common even in many SOTA models!

ALBERT (A Lite BERT) is a SOTA LLM published by Google and the Toyota Technological Institute at Chicago; it is a lightweight variant of BERT. It even outperforms BERT, XLNet, and RoBERTa in some cases.

There are several ALBERT models (`base, large, xlarge, xxlarge`), and most layers have alpha near 2. Here’s a plot of the layer-averaged alphas for all 4 models, compared to their average quality (see this notebook for more details).

Notice that the layer-averaged alpha metric is roughly correlated with model quality. This shows that the general weightwatcher approach works on this set of models.

Let’s take a look at 2 random layers from the ALBERT albert-xlarge-v2 model, with different alphas: a middle layer and the last layer. Below, you can see that for the smaller alpha of 1.89, closer to 2.0, the **purple** and **red** lines overlap.

We have looked at the VGG series of models before, both in our JMLR paper and in our Nature Communications paper. Here, we look at a couple of Fully Connected (FC), or Dense/Linear, layers with alpha near 2.0. (Note: we did this analysis in Table 6 of the JMLR paper, but I have found that more recent versions of VGG give different results; here I report results from the current, default Keras VGG19 model.)

Here, despite the 2 plots above looking very different, both alphas are close to 2, and in both cases the **purple** and **red** lines overlap, showing that these ESDs exhibit the signatures of the required *Volume-Preserving Transformation* under the hood of the rigorous weightwatcher Statistical Mechanics-based Semi-Empirical Theory of Learning (SETOL).

To convince you I am not just cherry-picking examples, let me encourage you to download the open-source weightwatcher tool and try it yourself.

`pip install weightwatcher`

If you find it works, please let me know. And if something is wrong, I’d like to know that too.

WeightWatcher is an open-source, diagnostic tool for analyzing Deep Neural Networks (DNNs), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, and is described in a new theory: the weightwatcher Semi-Empirical Theory of Learning (SETOL). Here, we have shown a key feature of the new SETOL approach: that DNNs can be described with an *Effective Correlation Space* at each layer, and that this space is characterized by a *Volume-Preserving Transformation* that captures and concentrates the correlations in the data at each layer. And we have shown how you can test this yourself using the weightwatcher tool.

Most importantly, the empirical results show that when the layer is learning perfectly, the layer correlations can be fit to a heavy-tailed power law (PL) distribution, with PL exponent alpha of 2–exactly as prescribed by the earlier HTSR theory!

If you would like to read an early draft of the new paper, please reach out to me. I would greatly appreciate the feedback. Additionally, feel free to join the weightwatcher community Discord channel to discuss all things about the theory and how to use the tool to help you with the training and monitoring of your AI models.

**WeightWatcher is a one-of-a-kind must-have tool for anyone training, deploying, or monitoring Deep Neural Networks (DNNs).**


Most people just choose the most popular model, and this is usually BERT, or some BERT variant. BERT was created by Google, so it must be good.

But is BERT really the best choice for you?

How can you find out? You can search through the literature, read blogs, ask on Reddit, etc., and try to find a better model. This is time-consuming and imperfect. Fortunately, there is a better way.

The weightwatcher tool can tell you.

WeightWatcher is an open-source, data-free diagnostic tool that can estimate the quality of a DNN model like BERT, GPT, etc., without needing any data! (No training or test data; just the weights.) It has been featured in JMLR, at ICML and KDD, and even in Nature Communications.

Here’s an example using weightwatcher to compare 3 NLP models: **BERT**, **RoBERTa**, and **XLNet**.

The WeightWatcher Power-Law (PL) metric *alpha* is a DNN model quality metric; smaller is better. The plot above displays all the layer *alpha* values for the 3 models. It is immediately clear that the **XLNet layers** look much better than those of **BERT** or **RoBERTa**; the *alpha* values are smaller on average, and there are no *alpha*s larger than 5. In contrast, the **BERT** and **RoBERTa** alphas are much larger on average, and both models have too many large *alpha*s.

This is totally consistent with the published results: in the original XLNet paper, XLNet outperforms BERT on 20 different NLP tasks.

**Do it yourself:**

WeightWatcher will work with any HuggingFace Transformer (or CV) model.

Here is a Google Colab notebook that lets you reproduce this yourself

Give it a try. And if you need help with AI, ML, or just Data Science, please reach out. I provide strategy consulting, data science leadership, and hands-on, heads-down development. I will have availability in Q3 2022 for new projects. Reach out today. #talkToChuck #theAIguy

In this post, we will show how to use the open-source weightwatcher tool to answer this.

WeightWatcher is an open-source, data-free diagnostic tool for analyzing (pre-)trained DNNs. It is based on my personal research into Why Deep Learning Works, in collaboration with UC Berkeley, and uses ideas from the Statistical Mechanics of Learning (i.e., theoretical physics and chemistry).

`pip install weightwatcher`

WeightWatcher lets you inspect your layer weight matrices to see if they are converging properly. And in some cases, it can even tell you if a layer is over-trained. The idea is simple: if you are training a model and over-regularize one of the layers, you may observe the weightwatcher alpha metric drop below 2 (alpha < 2). This is predicted by our HTSR theory of learning (although we have not published this specific result yet). And it is unique; no other approach can do this.

To see how this works, we will look at a very specific, carefully-designed experiment where the theory is known to work exactly as advertised.

**BUT (and here’s the disclaimer)**

*Please be aware–training DNNs to State-of-the-Art (SOTA) is not easy, and applying the tool requires designing careful experiments that can isolate the problems you are trying to fix. It does not work in every case, and you may see unusual results that are difficult to interpret. In these cases, please feel free to reach out to me directly, and join our Slack channel to get help.*

**Having said that, let’s get started**

First, here’s the Google Colab notebook for reproducing this post; please try it yourself.

We consider a very simple DNN, a 3-layer MLP (Multi-Layer Perceptron), trained on MNIST.

To induce the overtraining, we will train this model using different batch sizes, with `batch_size in [1, 2, 4, 8, 16, 32]`.

**Why do we vary the batch sizes**? …and not a specific regularization hyper-parameter like Weight Decay or Dropout? The batch size acts like a very strong regularizer, which can induce the Heavy-Tails we see in SOTA models, even in this very small and generally poorly performing model. This is shown in Figure 25 of our JMLR paper describing our theory of Heavy-Tailed Self-Regularization (HT-SR), the theory behind weightwatcher.

Moreover, with extremely small batch sizes and a large number of epochs, we can even drive the model into a state of over-training, which is the goal here. So each model is trained for a very large number of epochs, until the training loss stabilizes, using a Keras EarlyStopping callback:

```
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3, verbose=0,
                                              min_delta=0.001, restore_best_weights=True)
```

**In your own models, the situation may be more complex**.

The weightwatcher metrics work best when applied to SOTA models, because this is when the layer weight matrices are most correlated and the Power Law fits work the best. It takes some work to design experiments on small models that can flush out these features, so we chose to use the batch size to induce this effect. But let me encourage you to try other approaches.

The key to using the HTSR theory is to carefully control the training so that when you adjust some other knob (e.g., Dropout, momentum, weight decay), the training and test error change smoothly and systematically. If, however, the training accuracy or loss is unstable, and you are jumping all over the loss landscape, then the HTSR theory is more difficult to apply. So, here,

**I follow the KISS mantra: “Keep It Super Simple!”**

To compare 2 or more models to each other, with different batch sizes, for the purposes here, we need to ensure they have been trained with the exact same initial conditions. To do this, we have to both set all the random seeds to a default value and tell the framework (here, Keras) to use deterministic options. This also, nicely, makes the experiments 100% reproducible.

```
%env CUBLAS_WORKSPACE_CONFIG=:4096:8
import os
import random
import numpy as np
import tensorflow as tf

def reset_random_seeds(seed_value=42):
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    tf.random.set_seed(seed_value)
    tf.keras.utils.set_random_seed(seed_value)
    np.random.seed(seed_value)
    random.seed(seed_value)
    tf.config.experimental.enable_op_determinism()
```

Every time we build the model, we will first run `reset_random_seeds()` to ensure that every run, with different batch sizes, regularization, etc., is started from the same spot and is reproducible.

This model has 3 layers: input, hidden, and output. Note that each layer is initialized in the same way (i.e., with GlorotNormal initialization, with the same seed). Also, here, to *keep it super simple*, no specific regularization is applied to the model (except for changing the batch size).

```
initializer = tf.keras.initializers.GlorotNormal(seed=1)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation='relu', kernel_initializer=initializer),
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer=initializer),
    tf.keras.layers.Dense(10, activation='softmax', kernel_initializer=initializer),
])
```

We can inspect the model using weightwatcher to see how the layers are labeled (layer_id), what kind of layer they are (DENSE, Conv2D, etc), and what their shapes are (N, M).

```
import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
watcher.describe()
```

In this experiment, we will analyze layer 1 (the hidden layer), and only layer 1. This layer is a DENSE layer with a single weight matrix of dimension 100×300. It will have 100 eigenvalues, which is a large enough size for weightwatcher to analyze. And for this super-simple experiment, this is the only layer that is trainable; all other layers are held fixed.

Again, we will train the same model, with the exact same initial conditions, in a deterministic way, while changing the batch size. For each fully trained model, we then compute the weightwatcher Power-Law capacity metric alpha for layer 1, and compare it to the model test accuracy for each run.

Notice first that, when decreasing the batch size, both the training accuracy and the test accuracy improve smoothly and systematically, and then drop off suddenly. For example, below, see that the test accuracy increases from 89.0% at batch size 32 to 89.4% at batch size 4, and then drops off suddenly to 88.5% at batch size 2. (The training accuracy behaves in a similar way when decreasing the batch size, as can be seen in the notebook.)

Likewise, the training loss varies smoothly, and the optimizer is not jumping all over the energy landscape. This indicates a clean experiment, amenable to analysis.

(Notice that we apply early stopping to the training loss, not the validation loss. That is because, in this experiment, we are trying to drive the model to a state of over-training by reducing the batch size, going past the perhaps more common early stopping criteria on the validation loss. Also, since we are changing the batch size, we want to ensure each model runs with enough epochs that the runs can be compared to each other.)

To compute the weightwatcher metrics, at the end of every training cycle, just run

```
results = watcher.analyze(layers=[1])
```

The `watcher.analyze()` method will generate a pandas DataFrame, with layer-by-layer metrics.

**What does alpha mean?** Alpha is a measure of how Heavy-Tailed the layer is. It can be found, crudely, by simply plotting a histogram of the eigenvalues of the layer correlation matrix, **X** = np.dot(W.T, W), on a log-log scale, and calculating the slope of this plot in the tail region.
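The crude slope-of-the-log-log-histogram estimate described above can be sketched in a few lines of numpy (synthetic data below; the real weightwatcher fit uses a proper MLE, not a linear fit):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic heavy-tailed "eigenvalues" with density ~ x^(-2), i.e. alpha = 2
evals = 1.0 + rng.pareto(1.0, size=5000)

# histogram on log-spaced bins, then fit a line to log(density) vs log(x)
hist, edges = np.histogram(evals, bins=np.logspace(0, 3, 30), density=True)
centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
mask = hist > 0                             # skip empty bins in the far tail
slope, _ = np.polyfit(np.log(centers[mask]), np.log(hist[mask]), 1)
alpha_est = -slope                          # the (crude) PL exponent estimate
```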

The smaller alpha is, the more Heavy-Tailed the layer matrix **X** is, and the better the layer performs for the model. *But only up to a point.* If the layer is too Heavy-Tailed, with alpha below 2 (for simple models), then it may be over-trained.

We can now plot the alpha vs the test accuracy for layer 1, and the result is quite amazing.

Notice 2 key things:

- as the test accuracy increases, the alpha metric decreases toward 2
- as soon as the test accuracy drops (with batch size = 1), alpha drops below 2

For simple models like this 3-layer MLP, the weightwatcher approach can, remarkably, detect which layer is over-trained! No other theory can do this.

For more complex models, with lots of parameters varying, the situation may be more complex.

Let me encourage you to try the weightwatcher tool for yourself, and join our Slack channel to discuss this and other aspects of training large models to SOTA.

The weightwatcher alpha metric is the exponent found when fitting the empirical spectral density (ESD), or histogram of the eigenvalues, to a Power-Law distribution. Moreover, when alpha is roughly between 2 and 4 (theoretically; practically, up to 6), as shown in our JMLR paper, we can use our HTSR theory to characterize the layer weight matrix as being Moderately Heavy-Tailed. See Table 1:

When a Power Law distribution is only Moderately Heavy-Tailed, this means that, in the limit, the variance may be unbounded, but the average (or mean) value is well defined. So, for Deep Learning, this implies that the model has learned a wide variety of correlations, but, on average, the correlations are reasonably bounded and, moreover, typical. Being typical, the layer weight matrix can be used to describe the information in both the training and the test data, as long as they come from the same data distribution.
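The moment conditions behind this statement can be written out directly. For a power-law density p(x) proportional to x^(-alpha) on [x_min, infinity), the k-th moment is finite only when alpha > k + 1:

```
E[X^k] = \int_{x_min}^{\infty} x^k p(x) dx
       \propto \int_{x_min}^{\infty} x^{k - \alpha} dx
       < \infty   iff   \alpha > k + 1
```

So the mean (k = 1) exists for alpha > 2, while the variance (k = 2) requires alpha > 3; for alpha between 2 and 3, the mean is well defined but the variance diverges, which is the Moderately Heavy-Tailed behavior described above.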

But when alpha is very small (alpha < 2), the layer weight matrix is Very Heavy-Tailed and **atypical**. That is, the distribution of the correlations does not have a well-defined average or mean value, and the individual elements of W may even themselves be unbounded (i.e., when you have a Correlation Trap). Therefore, this layer weight matrix cannot be used to describe any data except the training data.

**A Correlation Trap appears when the batch size = 1**

Seeing this in practice is not necessarily easy, and interpreting it is harder. As here, one may have to design a very careful experiment to flush this out. Still, we encourage you to try the tool, use it to identify and resolve such problems, and please give feedback.

And if you need help with AI, ML, or just Data Science, please reach out. I provide strategy consulting, data science leadership, and hands-on, heads-down development. I will have availability in Q3 2022 for new projects. Reach out today. #talkToChuck #theAIguy

WeightWatcher is a work in progress, based on research into Why Deep Learning Works, and has been featured in venues like ICML, KDD, JMLR, and Nature Communications.

Before we start, let me just say that it is very difficult to develop general-purpose metrics that can work for any arbitrary DNN, while also maintaining a tool that is backward compatible, so that all of our published work remains 100% reproducible. I hope the tool is useful to you, and we rely upon your feedback (positive and negative) to improve it. So if it works for you, please let me know. If not, feel free to post an issue on the GitHub site. Thanks again for the interest!

Given this, the first metrics we will look at are the layer capacity metrics.

How is it even possible to predict the generalization accuracy of a DNN without even needing the test data? Or at least predict trends? The basic idea is simple: for each layer weight matrix, measure how non-random it is. After all, the more information the layer learns, the less random it should be.

What are some choices for such a layer capacity metric?

- Matrix Entropy
- Distance from the initial weight matrix (and variants of this)
- Divergence from the ESD of the randomized weight matrix

We define the Matrix Entropy in terms of the eigenvalues of the layer correlation matrix **X**.

So the matrix entropy is both a measure of layer randomness and a measure of correlation.
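As an illustration (one common definition, not necessarily the exact formula weightwatcher uses), the matrix entropy can be computed as the Shannon entropy of the normalized eigenvalue spectrum of X = W^T W, scaled so that a fully random matrix is near 1 and a rank-one (perfectly correlated) matrix is 0:

```python
import numpy as np

def matrix_entropy(W):
    # eigenvalues of the layer correlation matrix X = W^T W
    evals = np.linalg.eigvalsh(W.T @ W)
    evals = np.clip(evals, 0.0, None)     # clip tiny negative numerical noise
    M = len(evals)
    p = evals / evals.sum()               # normalize to a probability vector
    p = p[p > 0]                          # drop zeros so p*log(p) is defined
    return -np.sum(p * np.log(p)) / np.log(M)   # scale to [0, 1]
```

A Gaussian random W gives a value near 1 (maximally random spectrum), while a rank-one W gives a value near 0 (all the "mass" on a single eigendirection).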

In our JMLR paper, however, we show that while the Matrix Entropy does decrease during training, it is not particularly informative. Still, I mention it here for completeness. (And in the next post, I will discuss the Stable Rank; stay tuned.)

The next metric to consider is the Frobenius norm of the difference between the layer weight matrix and its specific, initialized, random value. Note, however, that this metric requires that you actually have the initial layer matrices (which we did not have for our Nature paper).

The weightwatcher tool supports this metric with the `distances` method:

```
import weightwatcher as ww

watcher = ww.WeightWatcher(model=your_model)
distance_from_init = watcher.distances(your_model, init_model)
```

where `init_model` is your model with the original, actual, initial weight matrices.

In our most recent paper, we evaluated the **distance_from_init** method as a generalization metric in great detail; however, in order to cut the paper down for submission (which I really hate doing), we had to remove most of this discussion, and only a table in the appendix remained. I may redo this paper, and revert it to the long form, at some point. For now, I will just present some unpublished results from that study here.

These are the raw results for the task1 and task2 sets of models described in the paper. Briefly, we were given about 100 pretrained models, grouped into 2 tasks (corresponding to 2 different architectures), and then subgrouped again (by the number of layers in each set). Here, we see how the **distance_from_init** metric correlates with the given test accuracies, and it’s pretty good most of the time. But it’s not the best metric in general.

There are a few variants of this distance metric, depending on how one defines the distance. These include:

- Frobenius norm distance
- Cosine distance
- CKA distance

Currently, weightwatcher 0.5.5 only supports (1), but in the next minor release, we plan to include both (2) & (3).

The problem with this simple approach is that it is not useful if your models are overfit, because the distance from init increases over time anyway; and this is exactly what we think is happening in the **task1** models from this NeurIPS contest. But it is a good sanity check on your models during training, and can be used with other metrics as a diagnostic indicator.

So the question becomes: can we somehow create a distance-from-random metric that compensates for overfitting? And that leads to…

With the new **rand_distance** metric, we mean something very different from **distance_from_init**. Above, we used the original instantiation of the layer weight matrix, and constructed an *element-wise distance* metric. The **rand_distance** metric, in contrast:

- Does not require the original, initial weight matrices
- Is not an element-wise metric, but, instead, is a **distributional metric**

So this metric is defined in terms of a distance between the distributions of the eigenvalues (i.e., the ESDs) of the layer matrix and its random counterpart.

For weightwatcher, we choose to use the Jensen-Shannon divergence for this:

```
rand_distance = jensen_shannon_distance(esd, random_esd)
```

where

```
import numpy as np
import scipy.stats

def jensen_shannon_distance(p, q):
    m = (p + q) / 2
    divergence = (scipy.stats.entropy(p, m) + scipy.stats.entropy(q, m)) / 2
    distance = np.sqrt(divergence)
    return distance
```

Moreover, there are 2 ways to construct the random layer matrix $\mathbf{W}_{rand}$:

- Take any (Gaussian) Random Matrix with the same aspect ratio as the layer weight matrix
- Permute (shuffle) the elements of the layer weight matrix $\mathbf{W}$

While at first glance these may seem the same, in practice they can be quite different. This is because, while every (Normal) Random Matrix has the same ESD (up to finite size effects)–the Marchenko-Pastur (MP) distribution–if the actual $\mathbf{W}$ contains any unusually large elements, then its ESD will behave like that of a Heavy-Tailed Random Matrix, and look very different from its random MP counterpart. For more details, see our JMLR paper.
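To see the difference numerically, here is a small numpy sketch (not part of weightwatcher): a pure Gaussian matrix has its top eigenvalue near the MP bulk edge, while planting one unusually large element–which no permutation can remove–produces a large outlier eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Pure Gaussian random matrix: ESD of X = W^T W / N follows Marchenko-Pastur,
# and the largest eigenvalue sits near the MP bulk edge (~4 for a square matrix)
W_rand = rng.normal(size=(N, N))
lam_rand = np.linalg.eigvalsh(W_rand.T @ W_rand / N).max()

# Same matrix with one unusually large element (a "correlation trap"):
# permuting (shuffling) the elements cannot remove it, and it spikes the ESD
W_trap = W_rand.copy()
W_trap[0, 0] = 50.0
lam_trap = np.linalg.eigvalsh(W_trap.T @ W_trap / N).max()

print(lam_rand, lam_trap)  # the trapped matrix has a far larger top eigenvalue
```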

Indeed, when $\mathbf{W}$ contains unusually large elements, we call these **Correlation Traps**. Now, I have conjectured, in an earlier post, that such Correlation Traps may be an indication of a layer being overtrained. However, some research suggests the opposite: that, in fact, such large elements are needed for large, modern NLP models. The jury is still out on this; however, the weightwatcher tool can be used to resolve this question, since it can easily identify such Correlation Traps in every layer. I look forward to seeing the final conclusion.

The takeaway here is that, when a layer is well trained, we expect the ESD of the layer to be significantly different from the ESD of its randomized, shuffled form. Let’s compare 2 cases:

*The rand_distance metric measures the divergence between the original and random ESDs*

In case (a), the **original ESD** (green) looks significantly different from its **randomized ESD** (red). In case (b), however, the **original** and **randomized** ESDs are much more similar. So we expect case (a) to be more well trained than case (b).

*And rand_distance works, presumably, at least in some cases, when the layer is overtrained, as in (b)*.

To compute the **rand_distance** metric, simply specify the randomize=True option, and it will be available as a layer metric in the details dataframe.

```
import weightwatcher as ww
watcher = ww.WeightWatcher(model=your_model)
details = watcher.analyze(randomize=True)
avg_rand_distance = details.rand_distance.mean()
```

Finally, let’s see how well the **rand_distance** metric actually works to predict trends in the test accuracy for a well known set of pretrained models, the ALBERT series. Similar to the analysis in our Nature paper, we consider how well the **average rand_distance** metric is correlated with the reported top1 errors of the series of ALBERT models: base, large, xlarge, xxlarge.


I’ll end part 1 of this series of blog posts with a comparison between the new **rand_distance** metric and the weightwatcher **alpha** metric, for another well known model: VGG19.

When the weightwatcher **alpha** < 5, this means that the layers are Heavy-Tailed and therefore well trained; and, as expected, alpha is correlated with the **rand_distance** metric.

I hope this has been useful to you and that you will try out the weightwatcher tool:

```
pip install weightwatcher
```

Give it a try. And please give me feedback if it is useful. And, if interested in getting involved or just learning more, ping me to join our Slack channel. And if you need help with AI, reach out. **#talkToChuck #theAIguy**

Let’s see how to do this. First, install the tool.

```
pip install weightwatcher
```

Second, pick a model and get a basic description of it

```
import weightwatcher as ww
watcher = ww.WeightWatcher(model=my_model)
details = watcher.analyze()
```

WeightWatcher produces a pandas dataframe, details, with layer metrics describing your model. In particular, the details dataframe contains layer names, ids, types, and warnings.

Here is an example, an analysis of the OpenAI GPT model (discussed in our recent Nature paper)

```
import transformers
from transformers import OpenAIGPTModel,GPT2Model
gpt_model = OpenAIGPTModel.from_pretrained('openai-gpt')
gpt_model.eval();
watcher = ww.WeightWatcher(model=gpt_model)
details = watcher.analyze()
```

The details dataframe now includes a wide range of information for each layer, including a specific warnings column.

```
details[details.warning!=""][['layer_id','name','warning']]
```

For comparison, below we show the same details dataframe, but for GPT2. GPT2 is the same model as GPT, but trained with more and better data. Being a much better model, GPT2 has far fewer warnings than GPT.

That’s all there is to it. WeightWatcher provides simple layer metrics for pre-trained Deep Neural Networks, indicating simple warnings for which layers are over-trained and which layers are under-trained.

Give it a try. We are looking for early adopters needing better, faster, and cheaper AI monitoring. If it is useful to you, please let me know.

- What is the best metric to evaluate your model ?
- How can you be sure you trained it with enough data ?
- And how can your customers be sure ?

WeightWatcher can help.

`pip install weightwatcher `

WeightWatcher is an open-source, diagnostic tool for evaluating the performance of (pre)-trained and fine-tuned Deep Neural Networks. It is based on state-of-the-art research into *Why Deep Learning Works*. Recently, it has been featured in Nature:

Here, we show you how to use WeightWatcher to determine if your DNN model has been trained with enough data.

In the paper, we consider the example of GPT vs GPT2. GPT is an NLP Transformer model, developed by OpenAI, to generate fake text. When it was first developed, OpenAI released the GPT model, which had specifically been trained with a small data set, making it unusable to generate fake text. Later, they realized fake text is good business, and they released GPT2, which is just like GPT, but trained with enough data to make it useful.

We can apply WeightWatcher to GPT and GPT2 and compare the results; we will see that the WeightWatcher *log spectral norm* and *alpha (power law)* metrics can immediately tell us that something is wrong with the GPT model. This is shown in Figure 6 of the paper.

Here we will walk through exactly how to do this yourself for the WeightWatcher Power Law (PL) alpha metric , and explain how to interpret these plots.

It is recommended to run these calculations in a Jupyter notebook, or Google Colab. (For reference, you can also view the actual notebook used to create the plots in the paper; however, this uses an older version of weightwatcher.)

For this post, we provide a working notebook in the WeightWatcher github repo.

WeightWatcher understands the basic Huggingface models. Indeed, WeightWatcher supports:

- TF2.0 / Keras
- pyTorch 1.x
- HuggingFace

and (soon)

ONNX (in the current trunk)

Currently, we support Dense and Conv2D layers. Support for more layers is coming. For our NLP Transformer models, we only need support for the Dense layers.

First, we need the GPT and GPT2 pyTorch models. We will use the popular HuggingFace transformers package.

```
!pip install transformers
```

Second, we need to import pyTorch and weightwatcher

```
import torch
import weightwatcher as ww
```

We will also want the pandas and matplotlib libraries to help us interpret the weightwatcher metrics. In Jupyter notebooks, this looks like

```
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
```

We now import the transformers package and the 2 model classes

```
import transformers
from transformers import OpenAIGPTModel,GPT2Model
```

We have to get the 2 pretrained models, and run *model.eval()*

```
gpt_model = OpenAIGPTModel.from_pretrained('openai-gpt')
gpt_model.eval();
gpt2_model = GPT2Model.from_pretrained('gpt2')
gpt2_model.eval();
```

To analyze our GPT models with WeightWatcher, simply create a watcher instance, and run *watcher.analyze()*. This will return a pandas dataframe with the metrics for each layer.

```
watcher = ww.WeightWatcher(model=gpt_model)
gpt_details = watcher.analyze()
```

The details dataframe reports quality metrics that can be used to analyze the model performance–without needing access to test or training data. The most important metric is our Power Law alpha metric, which WeightWatcher reports for every layer. The GPT model has nearly 50 layers, so it is convenient to examine all the layer alphas at once as a histogram (using the pandas API).

```
gpt_details.alpha.plot.hist(bins=100, color='red', alpha=0.5, density=True, label='gpt')
plt.xlabel(r"alpha $(\alpha)$ PL exponent")
plt.legend()
```

This plots the density of the alpha values for all layers in the GPT model.

From this histogram, we can immediately see 2 problems with the model:

- The peak of the alpha distribution is higher than optimal for a well trained model.
- There are several large outlier alphas, indicating several poorly trained layers.
- (There are, however, no alphas that are too small; when alpha is too small, the layer may be overtrained.)

So knowing nothing about GPT, and having never seen the test or training data, WeightWatcher tells us that this model should never go into production.
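As a quick (hypothetical) follow-up, one could list the flagged layers with pandas; the threshold of 6 here is illustrative, not an official weightwatcher cutoff, and a toy stand-in dataframe is used so the snippet runs on its own:

```python
import pandas as pd

# Toy stand-in for the gpt_details dataframe returned by watcher.analyze()
gpt_details = pd.DataFrame({
    'layer_id': [2, 5, 8],
    'name': ['attn.c_attn', 'mlp.c_fc', 'mlp.c_proj'],
    'alpha': [3.1, 7.4, 9.2],
})

# Illustrative threshold: flag layers whose PL exponent alpha is unusually large
outliers = gpt_details[gpt_details.alpha > 6][['layer_id', 'name', 'alpha']]
print(outliers)  # layers 5 and 8 are flagged
```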

Now let’s look at GPT2, which has the same architecture, but trained with more and better data. Again, we make a *watcher* instance with the model specified, and just run *watcher.analyze()*.

```
watcher = ww.WeightWatcher(model=gpt2_model)
gpt2_details = watcher.analyze()
```

Now let’s compare the Power Law alpha metrics for GPT and GPT2. We just create 2 histograms, 1 for each model, and overlay them.

```
gpt_details.alpha.plot.hist(bins=100, color='red', alpha=0.5, density=True, label='gpt')
gpt2_details.alpha.plot.hist(bins=100, color='green', density=True, label='gpt2')
plt.xlabel(r"alpha $(\alpha)$ PL exponent")
plt.legend()
```

The layer alphas for GPT are shown in **red**, and for GPT2 in **green**, and the histograms differ significantly. For GPT2, the peak is smaller, and, more importantly, there are no outlier alphas. Smaller alphas are better, and the GPT2 model is much better than GPT because it is trained with significantly more and better data.

The only caveat here is if alpha is too small; in these cases, the layer is overtrained or overfit in some way. In GPT and GPT2, we have no alphas that are too small.

WeightWatcher has many features to help you evaluate your models. It can do things like

- Help you decide if you have trained it with enough data (as shown here)
- Detect potential layers that are overtrained (as shown in a previous blog)
- Be used to get early stopping criteria (when you can’t peek at the test data)
- Predict trends in the test accuracies across models and hyperparameters (see our Nature paper, and our most recent submission).

and many other things.

Please give it a try. And if it is useful to you, let me know.

**And if your company needs help with AI, reach out. I provide strategy consulting, mentorship, and hands-on development. #talkToChuck, #theAIguy**.

In the Figure above, fig (a) is well trained, whereas fig (b) may be over-trained. That orange spike on the far right is the tell-tale clue; it’s what we call a *Correlation Trap*.

Weightwatcher can detect the signatures of overtraining in specific layers of a pre/trained Deep Neural Networks. In this post, we show how to use the weightwatcher tool to do this.

**WeightWatcher** (WW): is an open-source, diagnostic tool for analyzing Deep Neural Networks (DNN), without needing access to training or even test data. It analyzes the weight matrices of a pre/trained DNN, layer-by-layer, to help you detect potential problems. Problems that can not be seen by just looking at the test accuracy or the training loss.

**Installation**:

```
pip install weightwatcher
```

**Usage:**

```
import weightwatcher as ww
import torchvision.models as models
model = models.vgg19_bn(pretrained=True)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze(plot=True, randomize=True)
```

For each layer, Weightwatcher plots the Empirical Spectral Density, or ESD. This is just a histogram of the eigenvalues of the layer correlation matrix **X=W^{T}W**.

```
import numpy as np
import matplotlib.pyplot as plt
...
X = np.dot(W, W.T)
evals, evecs = np.linalg.eig(X)
plt.hist(evals, bins=100, density=True)
...
```

By specifying the randomize option, WW randomizes the elements of the weight matrix **W**, and then computes its ESD. This randomized ESD is overlaid on the original ESD of **X**, and plotted on a log scale.

This is shown above, in the RHS (right hand side). The original layer ESD is **green**; the randomized ESD is **red**. And the **orange line** depicts the largest eigenvalue of the randomized ESD.

If the layer is a well trained matrix, then when **W** is randomized, its ESD will look like that of a normally distributed random matrix. This is shown in Figure (a), above.

But if the layer is over-trained, then its weight matrix **W** may have some unusually large elements, where the correlations may concentrate, or become *trapped*. In this case, the ESD may have 1 or more unusually large eigenvalues. This is shown in Figure (b) above, with the **orange line** extending to the far right of the bulk of the **red** ESD.

Notice also that in Figure (a), the **green** ESD is very Heavy-Tailed, with the histogram extending out to log10(λ)=2, or a largest eigenvalue of nearly 100. But in Figure (b), the green ESD has a distinctly different shape and is smaller in scale than in Figure (a). In fact, in (b), the **green** (original) and **red** (randomized) layer ESDs look almost the same, except for a small shelf of larger **green** eigenvalues, extending out to and concentrating around the **orange line**.

**In cases like this, we can identify the orange line as a Correlation Trap. **

This indicates that something went wrong in training this layer, and the model did not capture the correlations in this layer in a way that will generalize well to other examples.

Using the Weight Watcher tool, you can detect this and other potential problems when training or fine-tuning your Deep Neural Networks.

You can learn more about it on the WeightWatcher github website.

The WeightWatcher tool is an open-source python package that can be used to predict the test accuracy of a series of similar Deep Neural Networks (DNNs)–without peeking at the test data.

WeightWatcher is based on research done in collaboration with UC Berkeley on the foundations of Deep Learning. We built this tool to help you analyze and debug your Deep Neural Networks.

It is easy to install and run; the tool will analyze your model and return both summary statistics and detailed metrics for each layer:

The WeightWatcher github page lists various papers and online presentations given at UC Berkeley and Stanford, and various conferences like ICML, KDD, etc. There, and on this blog, you can find examples of how to use it.

This post describes how to select the metric for your models, and why.

You can use WeightWatcher to model a series of DNNs of either increasing size, or with different hyperparameters. But you need different metrics–alpha vs. weighted alpha–for different cases:

- **alpha**: for different hyper-parameter settings (batch size, weight decay, …) on the same model
- **weighted alpha**: for an architecture series like VGG11, VGG13, VGG16, VGG19

But why do we need 2 different alpha metrics? To understand this, we need to understand the Spectral Norm.

Traditional machine learning theory suggests that the test performance of a Deep Neural Network is correlated with the average log Spectral Norm. That is, the test error should be bounded by the average Spectral Norm, so the smaller norm, the smaller the test error.

The Spectral Norm of a matrix is just the (square root of the) maximum eigenvalue of its correlation matrix.

We denote the (squared) Spectral Norm as λ_max, the maximum eigenvalue of the correlation matrix **X=W^{T}W**.

Note: in earlier papers we (and others) also use:

WeightWatcher computes the log Spectral Norm for each layer, and defines the average log Spectral Norm, avg log λ_max = (1/N) Σ log10 λ_max, which we compute by averaging over all layer weight matrices.
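As a sketch of this definition (plain numpy, not the weightwatcher internals):

```python
import numpy as np

def log_spectral_norm(W):
    # log10 of the (squared) Spectral Norm, i.e. the max eigenvalue of W^T W
    lam_max = np.linalg.norm(W, ord=2) ** 2
    return np.log10(lam_max)

# Average over all layer weight matrices (random stand-ins here)
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 64)) for _ in range(4)]
avg_log_norm = np.mean([log_spectral_norm(W) for W in layers])
print(avg_log_norm)
```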

We compute the eigenvalues, or the Empirical Spectral Density (ESD), of each layer by running SVD directly on the layer or, for Conv2D layers, some matrix slice (see the Appendix).

It has been suggested by Yoshida and Miyato that the Spectral Norm would make a good regularizer for DNNs. The basic idea is that the test data should look enough like the training data so that if we can say something about how the DNN performs on *perturbed* training data, that will also say something about the test performance.

Here is the text from the paper; let me explain how to interpret this in practical terms.

We imagine the test data must look like the training data with some small perturbation ε. Let us write this as: x_test = x_train + ε.

As we train a DNN, we run several epochs of BackProp, which amounts to multiplying by a weight matrix at each layer, applying an activation function, and repeating until we get a label.

To get an estimate, or bound, on the test accuracy, we can then imagine applying the matrix multiply to the perturbed training point.

So if we can say something about how the DNN should perform on a perturbed training point, we can say something about the test output.

*What can we say ? *

When we apply an activation function, like a ReLU, it acts pointwise on the data vector, like an affine transformation. So the ReLU + weight matrix multiply is a bounded linear operator (at least piecewise), and therefore it can be bounded by its Spectral Radius.

So we say that we want to learn a DNN model such that, at each layer, the action of the layer matrix-vector multiply is bounded by the Spectral Norm of its layer weight matrix. This should, in theory, give good test performance. By applying a Spectral Norm regularizer, we think we can make a DNN that is more robust to small changes in its input. That is, we can make it perform better on random perturbations of the training data, and, therefore, presumably, better on the test data.
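This bound is easy to check numerically; here is a minimal sketch with numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(50, 50))      # a layer weight matrix
x = rng.normal(size=50)            # a "training" input
eps = 1e-3 * rng.normal(size=50)   # a small perturbation

# The change in the layer output is bounded by the Spectral Norm of W
output_change = np.linalg.norm(W @ (x + eps) - W @ x)
bound = np.linalg.norm(W, ord=2) * np.linalg.norm(eps)
print(output_change <= bound + 1e-12)  # True
```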

**From Bounds to a Regularizer**

When we develop a mathematical bound, our first instinct is to develop a numerical regularizer. That is, when solving our optimization problem–minimizing the DNN Energy function–we want to prevent the solution from blowing up. Having a *mathematically rigorous* bound helps here since it seems to bound the BackProp optimization step on every iteration.

Notice that since the regularizer appears in the optimization problem, it must be differentiable (either directly, or using some trick).

A regularizer must also be easy to implement. For example, we could also bound the Jacobian, but this is very expensive to compute, and it is difficult to apply even a norm bound of this kind on every step of BackProp. It might also seem that the Spectral Norm is hard to compute, because one needs to run SVD, but there is a simple trick: one can approximate the maximum eigenvalue using the Power Method, by running it for a few steps, and then simply add this to the SGD update step. There are many examples of this on github, and it has been applied, in particular, to GANs, where it has been shown to work very well in large scale studies.
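A minimal sketch of the Power Method trick (plain numpy; real implementations, as in pyTorch's spectral norm utilities, maintain the running vectors across SGD steps rather than restarting):

```python
import numpy as np

def spectral_norm_power_method(W, n_iters=20, seed=0):
    # Alternate multiplications by W and W^T to converge to the top
    # singular pair; returns an estimate of sigma_max(W)
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return np.linalg.norm(W @ v)

# The top singular value of diag(3, 1, 0.5) is 3
print(spectral_norm_power_method(np.diag([3.0, 1.0, 0.5])))  # ≈ 3.0
```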

Also, since this expression is linear in **W**, we can also readily plug this into TensorFlow/Keras or PyTorch and use autograd to compute the derivatives. It is available as a TensorFlow Addon, and is part of the core pyTorch 1.7 package.

Spectral Norm Regularization has not been widely used (outside say GANs) because it only works well for very deep networks. See, however, this adaption for smaller DNNs called Mean Spectral Normalization.

What do we want from a theory of learning ? With WeightWatcher, we have never sought a rigorous bound. That’s not the goal of our theory. We do not seek a bound because this describes the *worst-case behavior*; we seek to understand *the average-case behavior*. (However, what we can do is repair the Spectral Norm as a metric, as shown below.)

With the average-case, we hope to be able to predict the generalization error of a DNN (and without peeking at the test data). And we mean this in a very practical sense, applying to very large, production quality models, both in training and in fine-tuning them.

So what’s the difference between having a bound and analyzing average-case behavior ?

- With a mathematical bound, we can bound the error–for a single model. So this is like a prediction, and, as shown above, we can use this to develop better regularizers.

- With the average-case behavior, we want to predict trends across many different models, of both different depths (since deeper models usually perform better) and different hyperparameter settings (batch size, momentum, weight decay, etc.).

It’s not at all obvious that we can expect a mathematical bound to be correlated with trends in the test accuracy of real-world DNNs. It turns out, the Spectral Norm works pretty well–at least across pretrained DNNs of increasing depth.

Here is an example, showing how the Spectral Norm performs on the VGG series

(VGG11, VGG13, VGG16, VGG19, with/out Batchnorm, trained on Imagenet-1K)

We see that the average log Spectral norm correlates quite well with the test accuracy of the DNN architecture series of the pretrained VGG models. This is remarkable, since we do not have access to the training or the test data (or other information).

We have used WeightWatcher to analyze hundreds of pretrained models, of increasing depths, and using different data sets. Generally speaking, the average log Spectral Norm correlates well with the test accuracies of many different DNN series and for different data sets.

But not always. And that’s the rub.

Oddly, while the average log Spectral Norm is correlated with test error, when changing the depth of a DNN model, it turns out to be *anti-correlated *with test error when varying the optimization hyper-parameters. This is a classic example of Simpson’s paradox.

We have noted this, and it has also been pointed out by Bengio and co-workers. Indeed, an entire contest was recently set up to study this issue–the 2020 NeurIPS Predicting Generalization challenge.

Below we can see the paradox by looking at predictions for ~100 small, pre-trained VGG-like models (provided by the contest). We use WeightWatcher (version ww0.4) to compute the average `log_spectral_norm`, and compare it to the reported test accuracies for the contest set of baseline VGG-like models: **task2_v1**

For more details, please see the contest website details, and/or our contest post-mortem Jupyter Book and paper on the contest (coming soon).

Notice that the `2xx` models have the best test accuracies, and, correspondingly, as a group, the smallest average `log_spectral_norm`. Smaller error correlates with the smaller norm metric. Likewise, the `6xx` models have the smallest test accuracies as a group, and also the largest average `log_spectral_norm`.

*This is a classic example of Simpson’s Paradox.*

However, also note that, for each model group (`2xx, 10xx, 6xx, & 9xx`), we can draw, roughly, a straight line showing that most of the test accuracies in that group are anti-correlated with the average `log_spectral_norm`. Now, the regression is not always great, and there are outliers, but we think the general trends hold well enough for this level of discussion (and we will drill into the details in our next paper).


Here, we see a large trend across similar models, trained on the same dataset, but with different depths.

When looking closely at each model group, however, we see the reverse trend.

This makes the Spectral Norm difficult to use as a general purpose metric for predicting test accuracies.

**WeightWatcher to the Rescue**

Using WeightWatcher, however, we can repair the average log Spectral Norm metric by computing it as a weighted average–weighted by the WeightWatcher alpha metric.

Here is a similar plot on the same **task2_v1** data, but this time reporting the WeightWatcher average power law alpha metric. Notice that alpha is well correlated within each model group, as expected when changing model hyperparameters. Moreover, alpha is not correlated with the Test Accuracy more broadly across different depths, nor is it correlated with the average log Spectral Norm (not shown).

WeightWatcher alpha tells us how correlated a single DNN model is. And we can use alpha to correct the average `log_spectral_norm` by simply taking a weighted average (called `alpha_weighted`).
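As a rough sketch (assuming the details dataframe exposes per-layer `alpha` and `log_spectral_norm` columns, and taking the weighted average as the mean of the per-layer products–a simplification of the published definition):

```python
import pandas as pd

# Toy stand-in for the details dataframe from watcher.analyze()
details = pd.DataFrame({
    'alpha': [3.5, 4.0, 2.8],
    'log_spectral_norm': [1.2, 0.9, 1.5],
})

# alpha_weighted: the per-layer log spectral norm, weighted by the layer alpha,
# then averaged over layers
alpha_weighted = (details.alpha * details.log_spectral_norm).mean()
print(alpha_weighted)  # 4.0
```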

If we look closely, we can see in more detail how the weighted alpha corrects the average `log_spectral_norm` metric in the VGG architecture series. Below we use WeightWatcher to plot the different metrics vs. the Top 1 Test Accuracy for the many different pre-trained VGG models.

(VGG11, VGG13, VGG16, VGG19, with/out BatchNorm, trained on Imagenet-1K)

Consider the plot on the far right, and, specifically, the **pink (BN-VGG-16)** and **red (VGG-19)** dots, near test accuracy ~26. These are 2 models with both different depths (16 vs 19 layers) and different hyperparameter settings (BatchNorm or not). The two models have nearly the same accuracy, but a large variance between their average `log_spectral_norm`. Now consider the far left plot, for average alpha, which shows that **red** has a smaller alpha than **pink**: the **VGG-19** model is more strongly correlated than **BN_VGG-16**. The average alpha for the **red dot is ~3.5**, whereas the **pink dot is ~3.85**. When we combine these 2 metrics, on the middle plot (alpha_weighted), the 2 models now appear much closer together. So the `alpha_weighted` metric corrects the average `log_spectral_norm`, reducing the variance between similar models, and making it more suitable for treating models of different depth and different hyperparameter settings.

The open source WeightWatcher tool provides metrics for Deep Neural Networks that allow the user to predict (trends in) the test accuracies of Deep Neural Networks without needing the test data. The different (power law) metrics, alpha and weighted alpha, apply to models with different hyperparameter settings and different depths, resp. Here, we explain why.

The average alpha metric describes the amount of correlation contained in the DNN weight matrices. Smaller alpha correlates with better test accuracy for a single model with different hyperparameter settings. It is a unique metric, developed from the theory of strongly correlated systems from theoretical chemistry and physics.

The average weighted alpha metric is suited for treating a series of models with different depths, like the VGG series: VGG11, VGG13, VGG16, VGG19. It is a weighted average of the log Spectral Norm.

To explain why weighted alpha works, we have reviewed the theory and application of Spectral Norm Regularization, and the use of the average log Spectral Norm as a metric for predicting DNN test accuracies.

While theory suggests that the average log Spectral Norm might be able to predict the generalization performance of different pre-trained DNNs, in practice, it is correlated with test error for models with different depths, and anti-correlated for models trained with different hyperparameters. This is a classic example of Simpson’s Paradox.

We show that we can fix up the average `log_spectral_norm` (as provided in WeightWatcher) by using a weighted average, weighted by the WeightWatcher power-law layer alpha metric. And this is exactly the WeightWatcher metric `alpha_weighted`.

Try it yourself on your own DNN models.

`pip install weightwatcher `

And let me know how it goes.

**Spectral Density of 2D Convolutional Layers**

We can test this theory numerically using WeightWatcher. Notice, however, that while it is obvious how to define the ESD for a Dense matrix, there is some ambiguity in doing this for a Conv2D layer, and in making this work as a useful metric.

WeightWatcher has 2 methods for computing the SVD of a Conv2D layer, depending on the version. For a Conv2D layer, the options are

- **new version ww0.4**: extract matrix slices, combine these to define the final matrix, and run SVD on this. This gives 1 ESD per Conv2D layer.
- **old version ww2x**: extract the individual matrix slices, and run SVD on each. This gives several ESDs per Conv2D layer, and the layer metrics are then averaged over all of these slices.
- **future version (maybe)**: run SVD on the linear operator that defines the Conv2D transform. This requires running SVD on the discrete FFT on each of the Conv2D input-output channels, and then combining the resulting eigenvalues into 1 very large ESD. This is quite slow, and the numerical results were not as good in early experiments.
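As a sketch of the ww2x-style slicing (assuming a kernel laid out as (k, k, N, M); the actual layout depends on the framework):

```python
import numpy as np

def conv2d_esds_ww2x(W_conv):
    # ww2x-style sketch: for a Conv2D kernel of (hypothetical) shape (k, k, N, M),
    # extract the k*k slices of shape (N, M) and compute one ESD per slice;
    # the eigenvalues of X = W^T W are the squared singular values of W
    k1, k2, N, M = W_conv.shape
    esds = []
    for i in range(k1):
        for j in range(k2):
            W = W_conv[i, j, :, :]
            esds.append(np.linalg.svd(W, compute_uv=False) ** 2)
    return esds

rng = np.random.default_rng(0)
W_conv = rng.normal(size=(3, 3, 8, 16))
esds = conv2d_esds_ww2x(W_conv)
print(len(esds))  # 9 ESDs for a 3x3 kernel
```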

All three methods give slightly different layer Spectral Norms, with ww0.4 being the best estimator so far. The `ww2x=True` option is included for backward compatibility with earlier papers.