calculated | content

Describing Double Descent with WeightWatcher

Charles H Martin, PhD — Fri, 01 Mar 2024 08:37:45 +0000

Double Descent (DD) is something that has surprised statisticians, computer scientists, and deep learning practitioners, but it was known in the physics literature in the 80s. And while DD can seem complicated in deep learning models, the original model is actually very easy to understand, and reproduce, with just a few lines of Python.

IMHO, DD is a great way to understand how and when Deep Neural Networks might overfit their data, and, moreover, where they achieve optimal performance. And you can do this and more with the open-source weightwatcher tool:

https://weightwatcher.ai

The original 1989 DD physics experiment

The original DD experiment from 1989 is easy to set up and run using modern python.

Here’s a notebook that you can run on Google Colab to reproduce all these results, and with a lot more details. (a lot)

One simply trains a Linear Regression to predict the labels of a dataset with binary labels.

In this case, the data instances are drawn from the vertices of an N-dimensional hypercube and the labels are +1 or -1, depending on the sum of the ‘1s’ in the instance data vector.

This gives the linear equation

$y = \mathbf{w}^{T}\mathbf{x}$

with a data vector ($\mathbf{x}$), a label ($y \in \{-1, +1\}$), and a weight vector ($\mathbf{w}$). Of course, the goal is to learn the N-dimensional weight vector ($\mathbf{w}$) given P training data instances.

We can generate the data vectors (x) and their labels (y) using this code (from the notebook):

def generate_mdp_dataset_numpy(P, N):
    X = np.random.choice([-1, 1], size=(P, N))
    Y = generate_majority_labels(X)
    return X, Y
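
The helper generate_majority_labels lives in the notebook and is not shown here. A minimal sketch of what it might look like, assuming the label is simply the majority sign of each row and that ties (possible when N is even) are sent to -1, is below. The tie rule and the optional rng argument are assumptions for illustration, not the notebook's code.

```python
import numpy as np

def generate_majority_labels(X):
    """Label each row +1 if it has more +1 entries than -1 entries, else -1.

    Assumption: ties (row sum exactly 0) are labeled -1; the notebook may
    break ties differently.
    """
    return np.where(X.sum(axis=1) > 0, 1, -1)

def generate_mdp_dataset_numpy(P, N, rng=None):
    """Draw P random vertices of the N-dimensional {-1, +1} hypercube."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.choice([-1, 1], size=(P, N))
    return X, generate_majority_labels(X)

X, Y = generate_mdp_dataset_numpy(P=5, N=7, rng=np.random.default_rng(0))
print(X.shape, Y.shape)
```

Using an odd N, as in this small example, avoids ties altogether.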

Let’s run the experiment 100 times and plot the average results. Let’s pick $N=100$ features, and let $P=\alpha N$ patterns, where $\alpha = P/N$ is called the load. We vary $\alpha$ and compute the test accuracy (measured on some random sample).

To just get the results, we can use the scikit-learn LinearRegression module. Again, we run Linear (not Logistic) Regression to predict the binary labels $y$. Here’s the notebook code we use for this.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

def run_LR_experiment(alpha=1.0, verbose=True, N=100):

  # The load alpha fixes the number of training patterns P
  P = int(alpha * N)

  X_train, Y_train = generate_mdp_dataset_numpy(P,N)
  X_test, Y_test = generate_mdp_dataset_numpy(P,N)

  # Train the linear regression model (no intercept, so y = w.x)
  regressor = LinearRegression(fit_intercept=False)
  regressor.fit(X_train, Y_train)

  # Predict Y values for the training and test sets
  Y_train_value = regressor.predict(X_train)
  Y_test_value = regressor.predict(X_test)

  # Convert predictions to binary class labels: 1 if Y > 0 else -1
  Y_train_pred = np.where(Y_train_value > 0, 1, -1)
  Y_test_pred = np.where(Y_test_value > 0, 1, -1)

  # Compute accuracy for both training and test sets
  train_accuracy = accuracy_score(Y_train, Y_train_pred)
  test_accuracy = accuracy_score(Y_test, Y_test_pred)

  if verbose:
    print(f"alpha={alpha}: train={train_accuracy:.3f}, test={test_accuracy:.3f}")

  return train_accuracy, test_accuracy

Doing this, we can reproduce the original DD curve. On the LHS, I reproduced the experiment from the 1989 paper that discovered it. On the RHS, I present theoretical curves (from Opper 1995; ask me for a copy) using both statistical mechanics (StatMech, red) and statistical learning theory (SLT, blue). Notice that it is the StatMech approach that matches experiment. But we don’t need StatMech to understand this…

N: Number of features / parameters

P: Number of training examples (i.e. patterns)

alpha: the ‘load’ parameter (note: this is not the weightwatcher alpha)

There are 2 regimes of interest:

under-parameterized: P > N. We have more training data than features. Here, we can always make the model better by just adding more data. Way more data. That’s easy.

over-parameterized: N > P: We have more features, and therefore more adjustable weights, than training data. Of course, in this simple model, the test accuracy is no better than say 65%. But it does work. And this is the regime most interesting to LLMs and other Deep Neural Networks.

The over-parameterized regime

When $N > P$, we call this the overparameterized regime. And this is the most interesting regime for LLMs and other DNNs, which have a huge number of parameters. Also, this is where traditional statistics falls down.

The traditional stats approach fails to describe the case $P = N$, where the test error unexpectedly explodes. But, in fact, this is quite easy to describe, and is even predicted, by the weightwatcher theory.

Moreover, when $P = 0.5N$, the test error is minimized. Of course, it is still pretty high, but that’s OK since, as we will see, this too is quite easy to describe, and is even predicted, by the weightwatcher theory. But first, we need the PseudoInverse solution.

The PseudoInverse solution

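Given a data pair $(\mathbf{x}_{\mu}, y_{\mu})$, we can write the Linear Regression problem as (where, here, the target variable is just a binary label $y_{\mu} = \pm 1$)

$y_{\mu} = \mathbf{w}^{T}\mathbf{x}_{\mu}$

We want to find the weight vector $\mathbf{w}$. Stacking the $P$ data vectors as the rows of the data matrix $\mathbf{X}$, and the labels into the vector of labels $\mathbf{y}$, we can now write the Linear Regression problem in matrix form as

$\mathbf{y} = \mathbf{X}\mathbf{w}$

We can write the solutions in terms of the Moore-Penrose PseudoInverse. First, let’s flip things around a little

$\mathbf{X}\mathbf{w} = \mathbf{y}$

Now, multiply by $\mathbf{X}^{T}$ on both sides of this relation

$\mathbf{X}^{T}\mathbf{X}\mathbf{w} = \mathbf{X}^{T}\mathbf{y}$

We now invert the data covariance matrix $\mathbf{X}^{T}\mathbf{X}$

$\mathbf{w} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$

We now identify the Moore-Penrose PseudoInverse operator

$\mathbf{X}^{+} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$

The optimal weight vector is given by

$\mathbf{w} = \mathbf{X}^{+}\mathbf{y}$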
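
As a quick numerical sanity check of the PseudoInverse solution, here is a small NumPy sketch. It assumes the data matrix has one pattern per row (shape P x N), as in the code above. When P > N, the normal-equations solution and np.linalg.pinv agree; when N > P, the matrix X^T X is singular, and pinv instead returns the minimum-norm weights that fit the training labels exactly. The sizes used are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N = 200, 20                          # under-parameterized: P > N
X = rng.choice([-1.0, 1.0], size=(P, N))
y = np.where(X.sum(axis=1) > 0, 1.0, -1.0)

# Normal equations: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
# Moore-Penrose PseudoInverse: w = X^+ y
w_pinv = np.linalg.pinv(X) @ y
same_solution = np.allclose(w_normal, w_pinv)
print("normal equations match pinv:", same_solution)

# Over-parameterized case (N > P): keep only the first 10 patterns.
X_small, y_small = X[:10], y[:10]
w_min_norm = np.linalg.pinv(X_small) @ y_small
interpolates = np.allclose(X_small @ w_min_norm, y_small)
print("over-parameterized fit interpolates the training labels:", interpolates)
```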

As part of our initial analysis, we will compute the eigenvalues $\lambda$ of the covariance matrix $\mathbf{X}^{T}\mathbf{X}$, as described in the paper,

as well as the distribution of the inverse eigenvalues $\lambda^{-1}$. It is these inverse eigenvalues we will analyze with weightwatcher.
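
For readers who want to look at these spectra directly, here is a minimal NumPy sketch. It assumes the covariance matrix is X^T X with no extra normalization (the notebook may scale it differently), and it discards the numerically zero eigenvalues before inverting, since in the over-parameterized case at least N - P of them vanish. The tolerance is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
P = N // 2                               # load alpha = P / N = 0.5
X = rng.choice([-1.0, 1.0], size=(P, N))

# Eigenvalues of the symmetric N x N matrix X^T X.
evals = np.linalg.eigvalsh(X.T @ X)

# Keep only eigenvalues that are non-zero up to round-off, then invert.
tol = 1e-8 * evals.max()
nonzero = evals[evals > tol]
inv_evals = 1.0 / nonzero

print("non-zero eigenvalues:", len(nonzero))
print("inverse eigenvalue range:", inv_evals.min(), inv_evals.max())
```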

The weightwatcher layer quality metric (alpha)

To apply weightwatcher to the PseudoInverse problem, we need to first place the data matrix into a SingleLayerModel (see the notebook), and then use the newly added ‘inverse’ option, which analyzes the inverse covariance matrix of the layer

import weightwatcher as ww

# SingleLayerModel is a small wrapper class defined in the notebook
model = SingleLayerModel(weights=X_train)
watcher = ww.WeightWatcher(model=model)
details = watcher.analyze(inverse=True, plot=True, detX=True)

We can now plot the test accuracy as a function of the weightwatcher layer Power Law (PL) quality metric alpha ($\alpha_{PL}$). This is shown on the LHS.

From the left side plot, we immediately see that when

WW PL-alpha = 2.0, test error is minimized
WW PL-alpha = 1.5, test error is maximized, and the model is severely overfit
WW PL-alpha > 2.0, test error is sub-optimal as well

Ideal Learning

On the RHS, we drill into the case where $P = 0.5N$, where the test error is minimized. This plot shows the Empirical Spectral Density (ESD) of the (inverse) eigenvalues $\lambda^{-1}$ on a log-log scale.

The weightwatcher $\alpha_{PL} = 2.0$ exactly, which means the ESD can be fit perfectly to a power law distribution with exponent 2.0. This is also exactly where the weightwatcher HTSR theory (as in our JMLR paper) predicts the model would have the best out-of-sample performance!
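
To make the power law exponent concrete, here is a minimal sketch of the standard continuous maximum-likelihood estimator for the tail exponent (the Clauset-style estimator, with xmin given rather than searched for). This is an illustration of the idea, not the weightwatcher implementation; the function name and the synthetic samples are assumptions.

```python
import numpy as np

def fit_alpha(evals, xmin):
    """Continuous power-law MLE for the tail evals >= xmin."""
    tail = np.asarray(evals, dtype=float)
    tail = tail[tail >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Synthetic check: draw from p(x) ~ x^(-2) for x >= 1 by inverse transform.
rng = np.random.default_rng(0)
alpha_true, xmin = 2.0, 1.0
u = rng.random(100_000)
samples = xmin * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))

alpha_hat = fit_alpha(samples, xmin)
print("estimated alpha:", alpha_hat)
```

On real ESDs the hard part is choosing xmin, which is why the quality of the fit also has to be checked.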

The WeightWatcher Volume Preserving Transformation

In addition to the Power Law (PL) metric alpha ($\alpha_{PL}$), the weightwatcher StatMech theory (discussed in some detail at NeurIPS2023) also states that when the layer is perfectly converged for its data, the PL tail will form an Effective Correlation Space that satisfies a Volume-Preserving Transformation.

You can check this yourself by adding the detX=True option to the analyze method

details = watcher.analyze(inverse=True, plot=True, detX=True)

and examining the plot on the lower RHS. You can tell your layer is well-trained when the red and purple lines are very close together:

Remarkably, in the case $P=0.5N$, where the test error is minimized, not only is the weightwatcher PL $\alpha_{PL}=2.0$, but the PL tail also satisfies the theoretically predicted volume preserving transformation!
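
One way to read the volume preserving condition is that the product of the eigenvalues in the tail (the determinant restricted to that subspace) is 1. The sketch below finds how many of the largest eigenvalues can be kept before that product drops below 1. It is only an illustration of the idea under stated assumptions: the eigenvalues are rescaled to have mean 1 first, the function name is hypothetical, and the tool's own normalization and stopping rule may differ.

```python
import numpy as np

def detx_tail_size(evals):
    """Number of largest eigenvalues whose running product stays >= 1.

    Assumption: eigenvalues are strictly positive and are rescaled to
    have mean 1 before the product is taken.
    """
    evals = np.sort(np.asarray(evals, dtype=float))[::-1]   # largest first
    evals = evals / evals.mean()
    log_det = np.cumsum(np.log(evals))       # log of the running product
    below = np.nonzero(log_det < 0.0)[0]
    return len(evals) if below.size == 0 else int(below[0])

evals = np.array([8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625])
k = detx_tail_size(evals)
print("tail size satisfying the detX condition:", k)
```

Comparing this tail size with the number of eigenvalues above the fitted power law xmin is, loosely, what the red and purple lines in the plot let you do by eye.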

But more importantly, you can use this additional weightwatcher metric to check the quality of your NN layer to see if it is properly converged, or if the layer is overfit.

Signatures of over-fitting

When the weightwatcher $\alpha_{PL} < 2$, as happens when $P = N$, this indicates that the layer ESD is Very Heavy Tailed (VHT), and the underlying weight matrix is atypical. In this case, we can infer that the model is potentially overfit. Again, this is exactly what we see. In other words,

The weightwatcher theory describes the DD curve as predicted

(in the small alpha, overparameterized regime).

Now this is not explained in the old JMLR paper, but I do discuss it at some length in my recent invited talk at NeurIPS2023 (rerecorded). Also see the NeurIPS page for our Workshop on Heavy Tails ML. (with both my original talk and several others)

Why does weightwatcher work so well here? It turns out that, for the case $P = N$, the ESD of the data covariance matrix is the same as the ESD of a Marchenko-Pastur random matrix with aspect ratio 1.

As shown on the LHS below, the ESD has many zero and near-zero eigenvalues $\lambda \approx 0$, which was conjectured (back in 1989) to prevent the model from describing out-of-sample examples.

As shown on the RHS (above), when we compute the ESD of the inverse covariance matrix, this ESD can be fit perfectly to a Power Law (PL) distribution, with alpha exponent $\alpha_{PL}=1.5$ (which is the lower limit of the Clauset estimator weightwatcher uses). This means the inverse covariance matrix is highly atypical, and is dominated by the largest (inverse) eigencomponents. Because of this, this matrix is not a typical sample or view on the training data, and it cannot describe anything but the training data. In other words, the model is severely overfit.

Also, as shown above, if you add the detX=True option, you can verify that your layer is overfit if the purple line is to the right of the red line. We will discuss this in detail in a future post.

The weightwatcher alpha and the HTSR theory

How are these results related to the basis of weightwatcher–from our JMLR paper on the theory of Heavy Tailed Self-Regularization?

The HTSR theory is a phenomenology that classifies a NN layer weight matrix into a specific Universality Class from Random Matrix Theory (RMT), based on the weightwatcher fit of the Power Law exponent $\alpha$.

As explained in the paper, when the PL exponent of the ESD is $2 \le \alpha \le 6$, we can associate the layer with the Heavy-Tailed (HT) Universality Class. When $\alpha = 2$, the HTSR theory predicts that the layer is optimally converged for its training data. And when $\alpha < 2$, the layer ESD is in the Very-Heavy-Tailed (VHT) Universality Class. And as explained in my invited talk at NeurIPS2023, this is a signature that the layer may be overfit, just like in the Double Descent example described here.
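
As a rough rule of thumb, the classification can be written down as a small helper. The thresholds used here (2 and 6) are taken from values quoted in these posts and should be read as assumptions for illustration; the function name and label strings are hypothetical, and the paper should be consulted for the precise class boundaries.

```python
def htsr_class(alpha):
    """Map a fitted PL exponent to a qualitative label (illustrative thresholds)."""
    if alpha < 2.0:
        return "very heavy-tailed (possibly overfit)"
    if alpha <= 6.0:
        return "heavy-tailed (well fit)"
    return "weakly heavy-tailed (possibly underfit)"

for a in (1.5, 2.0, 4.8, 40.0):
    print(a, "->", htsr_class(a))
```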

Applying weightwatcher to real-world LLMs and DNNs

What can you do with weightwatcher? Ask yourself…

Do you really know if you are using enough data to fine-tune your LLM?
Are you worried that you are overfitting the model to your training data?
Are you having trouble evaluating your LLM?

You can use the tool to find and fix these kinds of problems in your LLMs and NNs that no other tool can diagnose or discover.

WeightWatcher is a one-of-a-kind must-have tool for anyone training, deploying, or monitoring Deep Neural Networks (DNNs).

Learning is an inverse problem. And weightwatcher provides a view into the correlations in the training data, akin to looking at the inverse correlation matrix at different levels of granularity. And it works even if your data is not random (like this simple model), but correlated, as in a real world problem.

You can use weightwatcher to look for potential signatures of overfitting, the kind of signatures seen in the classic Double Descent experiment. Even without test data! Even without training data!

You just need to ‘watch’ the model weights.

I developed weightwatcher to help my clients who are training and fine-tuning their own LLMs (and DNN AI models). If it’s useful to you, I’d love to hear about it. And if you need help training or fine-tuning your own models, please reach out. My hashtags are: #talkToChuck #theAIguy.

In the meantime, remember–Statistical Learning Theory (SLT) cannot describe Double Descent in the over-parameterized regime–but StatMech can. And so can weightwatcher!

Thanks to my friend Patrick Tangphao for this meme (haha)

SVDSmoothing LLM Layers with WeightWatcher

Charles H Martin, PhD — Tue, 13 Feb 2024 07:07:11 +0000

Recently, Microsoft Research published the LASER method, “Layer-Selective Rank Reduction”, in this recent, very popular paper:

The Truth is in There: Improving Reasoning in Language Models
with Layer-Selective Rank Reduction

And it got a lot of press (the Verge) because it hints that it may be possible to improve the truthfulness of LLMs with a simple mathematical transformation.

The thing is, the weightwatcher tool has had a similar feature for some time, called SVDSmoothing. And like the name sounds, you can apply TruncatedSVD to the layers of an AI model, like an LLM, to improve performance.

Let’s take a deep look at how you can apply this yourself, and why it works.

First, if you haven’t done so already, install weightwatcher

pip install weightwatcher

For our example on Google Colab, we will also need the accelerate package

!pip install accelerate

WeightWatcher can run on a GPU, multi-core CPU, or vanilla CPU. To check things are working correctly, look for any warning message(s) after importing weightwatcher, i.e.

import weightwatcher as ww

Here, we see a warning that the Google Colab GPU is not available, so weightwatcher defaults to the CPU, which is going to be slower but the SVD calculations will be more accurate.

WARNING:weightwatcher:PyTorch is available but CUDA is not. Defaulting to SciPy for SVD
WARNING:weightwatcher:Import error , reetting to svd accurate methods

Requirements

weightwatcher version 0.7.4.7 or higher
pytorch or keras frameworks (onnx support is available but not well tested)
your LLM must be loaded into memory (the SVDSmoothing option does not support safetensors or related ‘lazy’ formats yet)
the model must have only Dense / MLP layers (no LSTM, although Conv2D layers are ok)

Here is a link to a Google Colab notebook with the code discussed below

An Example: TinyLLaMA

For our example, we will use the TinyLLaMA LLM. Use git to download the model files to a local folder

!git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-955k-token-2T/

We will also need a folder to store our smoothed model in for testing later.

We copy the folder, and remove the model files in it:

import os

tinyLLaMA_folder = "TinyLlama-1.1B-intermediate-step-955k-token-2T"
smoothed_model_folder = "smoothed_TinyLLaMA"
smoothed_model_filename = os.path.join(smoothed_model_folder, "pytorch_model.bin")

!cp -r $tinyLLaMA_folder $smoothed_model_folder
!rm $smoothed_model_filename

Running SVDSmoothing with WeightWatcher

We first need to load the model into memory. (for now; later a version can be made that supports the safetensors format, reducing the memory footprint considerably)

import torch

tinyLLaMA_filename = os.path.join(tinyLLaMA_folder, "pytorch_model.bin")
tinyLLaMA = torch.load(tinyLLaMA_filename)

Getting Started: Describe a model

Before we run this, however, let’s first check that weightwatcher is working properly by running watcher.describe()

import weightwatcher as ww

# your_llm is the in-memory model, e.g. the tinyLLaMA object loaded above
watcher = ww.WeightWatcher(model=your_llm)
details = watcher.describe()

The watcher produces the details, a dataframe with various layer information:

Notice that in an LLM, each individual layer is a DENSE layer, even the ones making up the transformer layers. In order to select only the MLP layers, we need to identify them (by name, and then number).

Selecting Specific Layers

The LASER paper recommends applying (what we call) SVDSmoothing to specific layers of an LLM, such as the MLP/DENSE layers towards the end of the model (closer to the labels).

We can select specific layers by id (number), by type (i.e. DENSE), or by name. To select the MLP layers, let’s just get the list of TinyLLaMA layers that have the term ‘mlp’ in them

import pandas as pd

# Assuming 'details' is your DataFrame
D = details[details['name'].astype(str).str.contains('mlp')]

# Now, extract 'layer_id' column as a list of ids
mlp_layer_ids = list(D['layer_id'].to_numpy())

Now that we have the layers listed out, we need to select the method for choosing our low rank (TruncatedSVD) approximation. We then specify layers=…

smoothed_model = watcher.SVDSmoothing(layers=mlp_layer_ids, ...)

Selecting a Low-Rank Approximation

The weightwatcher tool offers several different automated methods for selecting the SVD subspace for the low rank approximation. This subspace can be selected as the eigencomponents associated with the method=… (and the optional percent=…) option(s):

smoothed_model = watcher.SVDSmoothing(method='svd', percent=0.2, ...)

where the method option can be

method = 'detX' (default) | 'rmt' | 'svd' | 'alpha_min'

The default method, 'detX', selects the top eigenvalues satisfying the detX condition, i.e. a volume preserving transformation.

method='detX': default

This is approximately the same as using method='svd' with percent=0.8 (80% of the rank is retained)

method='svd', percent=P: the top P percent of the largest eigenvalues

Other options are:

method='alpha_min': the eigenvalues in the fitted Power Law tail
method='rmt': the large spikes, or eigenvalues larger than the MP (Marchenko Pastur) bulk region predicted by RMT (Random Matrix Theory); currently broken, needs to be fixed

and in the future, weightwatcher will also offer method='entropy', which will select the SVD subspace based on the entropy of the eigenvectors.

For now, let’s keep it simple and pretty conservative and use the default (which picks the top eigencomponents associated with the ~60-80% largest eigenvalues of each individual layer weight matrix)
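
To show what a low-rank (TruncatedSVD) approximation does to a single weight matrix, here is a plain NumPy sketch of the 'svd' style of selection: keep the top fraction of singular components and rebuild the matrix. The function name svd_smooth and the random test matrix are assumptions for illustration; this is not the weightwatcher code path.

```python
import numpy as np

def svd_smooth(W, percent=0.8):
    """Keep the top `percent` fraction of the singular components of W."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(percent * len(S)))
    # Scale the first k left singular vectors by their singular values,
    # then project back with the first k right singular vectors.
    return (U[:, :k] * S[:k]) @ Vt[:k, :], k

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
W_smooth, k = svd_smooth(W, percent=0.2)

rank = np.linalg.matrix_rank(W_smooth)
rel_err = np.linalg.norm(W - W_smooth) / np.linalg.norm(W)
print("components kept:", k, "rank:", rank, "relative error:", rel_err)
```

A random matrix like this one loses a lot of information under truncation; the claim of the post is that a well-trained layer concentrates its useful correlations in the top components, so much less is lost.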

Generating the Smoothed Model

watcher = ww.WeightWatcher(model=tinyLLaMA)
smoothed_model = watcher.SVDSmoothing(layers=mlp_layer_ids)

Testing the Smoothed Model

Now we save the smoothed model to the folder we created above

torch.save(smoothed_model, smoothed_model_filename)

To test the models, we will generate some fake text from a prompt and compare. Just a sanity check.

We need a tokenizer; we will share the tokenizer between models

from transformers import AutoTokenizer, pipeline
import torch

# Initialize the tokenizer from the local directory
tokenizer = AutoTokenizer.from_pretrained(tinyLLaMA_folder)

We then specify the model folder to create a text generation pipeline. For the original TinyLlama, we use

# Manually set the device you want to use (e.g., 'cuda' for GPU or 'cpu' for CPU),
# or automatically use the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize the pipeline and specify the local model directory
text_generation_pipeline = pipeline(
    "text-generation",
    model=tinyLLaMA_folder,
    tokenizer=tokenizer,
    device=device,  # Use the manually specified device
    framework="pt",  # Specify the framework 'pt' for PyTorch
)

and for the smoothed model, we use

smoothed_generation_pipeline = pipeline(
    "text-generation",
    model=smoothed_model_folder,
    tokenizer=tokenizer,
    device=device,  # Use the manually specified device
    framework="pt",  # Specify the framework 'pt' for PyTorch
)

Now we can generate some text and compare the results of the original and the smoothed model. Let’s just do a sanity check:

# Use the pipeline for text generation (as an example)
generated_text = text_generation_pipeline("Who was the first US president ?", max_length=20)
print(generated_text)

The original response is

“Who was the first US president ?\nThe first US president was George Washington.\n…”

generated_text = smoothed_generation_pipeline("Who was the first US president ?", max_length=20)
print(generated_text)

The smoothed result is:

“Who was the first US president ?\nThe first US president was George Washington.\n…”

and matches the correct result above.

Let me encourage you to try this, with different settings for SVDSmoothing, and decide for yourself what is a good result.

Why does SVDSmoothing Work?

I have discussed this in great detail in my invited talk at NeurIPS2023 in our Workshop on Heavy Tails in ML. To view all the workshop videos, go here.

Briefly, the weightwatcher theory postulates that for any very well-trained DNN, the correlations in the layer concentrate into the eigencomponents associated with the tail of the layer ESD. And that this tends to happen when the weightwatcher layer quality metric alpha is 2.0, and, simultaneously, when the detX / volume preserving condition holds. We call this subspace the Effective Correlation Space (ECS).

For example, if we train a small, 3-layer MLP on the MNIST data set, we can see that when the FC1 layer alpha -> 2, then we can then replace the FC1 weight matrix W with its ‘Smoothed’ form, and, subsequently, reproduce the original test error (orange) exactly!

Moreover, we cannot reproduce the training error (blue); the ‘Smoothed’ training error is always somewhat larger than the original training error, and larger than zero.

This is very easy to reproduce and I will share some Jupyter Notebooks to anyone who wants to try this.

This experiment shows that the parts of W that contribute to the generalization ability concentrate into the Effective Correlation Space (ECS) defined by the tail of the ESD.

When running SVDSmoothing, if you use the method=detX default option, weightwatcher will attempt to define the ECS automatically for you, but if alpha > 2, the Effective Correlation Space (ECS) will be larger than necessary, just to be safe.

If you run SVDSmoothing yourself, please join our Community Discord and share your learnings with us.

The weightwatcher tool has been developed by Calculation Consulting. We provide consulting to companies looking to implement Data Science, Machine Learning, and/or AI solutions. Reach out today to learn how to get started with your own AI project. Email: Info@CalculationConsulting.com

Evaluating LLMs with WeightWatcher Part III: The Magic of Mistral, a Story of Dragon Kings

Charles H Martin, PhD — Tue, 30 Jan 2024 06:48:27 +0000

Recently, the Mistral models have taken the LLM world by storm. The Mistral Mixture of Experts (MOE) 8x7b model outperforms other models in its weight class such as LLaMA 2 70B and GPT 3.5. Here’s a quick review of its performance on different LLM benchmarks:

And even the smaller Mistral 7b model seems to be “punching well above its weight class, taking on LLM giants”, and emerging as the Best (small) OpenSource LLM Yet. Can we understand why?

In this post, I conjecture why Mistral 7b works so well, based on an analysis with the open-source weightwatcher tool, and drawing upon Sornette’s theory of Dragon Kings.

Here’s a Google Colab notebook with the code discussed below

A) Analyzing Mistral 7b with weightwatcher

Let’s run Mistral-7b through its paces with the weightwatcher tool. This time, however, we are going to compare the standard results with one of the advanced features of the tool, the fix_fingers option (described in this earlier blog post).

Here are the steps we take:

A.1) Download Mistral-7B-v0.1 base model to a local repo:

!git clone $base_model_html

Notice that this repo contains the base model as both safetensors files and pytorch_model.bin files; we can remove the latter and just keep the 2 safetensors files.

A.2) Run weightwatcher with fix_fingers='clip_xmax'

import weightwatcher as ww
watcher = ww.WeightWatcher()
details = watcher.analyze(model="Mistral-7B-v0.1", fix_fingers='clip_xmax')

The resulting details dataframe will contain 2 columns

raw_alpha: the estimated Power Law (PL) exponent alpha, without applying fix_fingers
alpha: the ‘fixed’ alpha, adjusted for fingers (i.e. possible Dragon Kings)

A.3) Compare the 2 alphas

We analyze the weightwatcher power law quality metric alpha by making a histogram of the raw (default) alpha, and the (fixed) alpha.

We see that the raw_alpha and the ‘fixed’ alpha look very different. The raw_alpha values have a very wide distribution, with many layers having raw_alpha >> 6, and 1 even as high as 40! And the average <raw_alpha>=~5.7, which is just at the edge of the safe range (we want the average layer in [2, 4]).

The ‘fixed’ alphas, however, are very different. The distribution is much sharper, with very few ‘fixed’ alpha > 6, and the average <alpha>=~4.8. Under the lens of the weightwatcher HTSR theory (described in our seminal paper), this model looks much better.

A.4) Compare to other base models

Here are the same plots for the LLaMA-7b and Falcon-7b base models:

Notice that the raw_alpha and the ‘fixed’ alphas look almost identical. And the layer averaged alphas are also almost identical. So it is not necessary to use the slower fix_fingers option when running weightwatcher on these models.

So what’s going on ?

B) Fingers and Power Laws

So-called ‘fingers’ are large positive outliers that arise when computing the Empirical Spectral Density (ESD) of a layer weight matrix $\mathbf{W}$. That is, they are usually large eigenvalues $\lambda$ of the correlation matrix $\mathbf{X} = \mathbf{W}^{T}\mathbf{W}$, so large that they live outside the normal and expected range of the ESD when fit to a Power Law, i.e. $\rho(\lambda) \sim \lambda^{-\alpha}$. The best way to see this is with a plot.

B.1) Plot a layer ESD with weightwatcher

Let’s pick a layer with a prominent ‘finger’ and plot the ESD.

watcher.analyze(layers=[139], plot=True)

When we add plot=True, weightwatcher will generate several plots for each layer, such as the 2 plots above. The plot on the left is a histogram of eigenvalues for layer 139 (the ESD) on a linear-linear scale, depicting a typical Heavy-Tailed (HT) ESD. The one on the right is the layer ESD, this time on a log-log scale, along with the PL exponent ($\alpha$) and the quality of the PL fit ($D_{KS}$). The PL tail starts at $\lambda_{min}$ (vertical red line), and the dashed red line depicts the best theoretical PL fit.

The finger is in the lower right side of the log-log plot, and looks like a small ‘shelf’ at the far end of the ESD. The finger consists of 1 or more spuriously large eigenvalues $\lambda > \lambda^{max}_{fixed}$. If we include these large $\lambda$ in the PL fit, the result is the dashed red line, and the alpha is unusually large, i.e. $\alpha_{raw} \gg \alpha_{fixed}$. Using the fix_fingers option, weightwatcher will remove up to 10 fingers until the ‘fixed’ alpha is stable.
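
The idea of removing fingers until alpha is stable can be sketched in a few lines of NumPy. This is a loose illustration under stated assumptions, not the fix_fingers implementation: the power law fit scans xmin and keeps the fit with the smallest KS distance, the largest eigenvalues are dropped one at a time, and the loop stops once alpha changes by less than a tolerance. The function names, the tolerance, and the synthetic eigenvalues with two planted outliers are all hypothetical.

```python
import numpy as np

def best_pl_fit(evals, min_tail=10):
    """Scan xmin over the data; return (alpha, xmin, D_KS) with the smallest D_KS."""
    x = np.sort(np.asarray(evals, dtype=float))
    best = None
    for i in range(len(x) - min_tail):
        tail = x[i:]
        log_sum = np.sum(np.log(tail / tail[0]))
        if log_sum <= 0:
            continue
        n = len(tail)
        alpha = 1.0 + n / log_sum                       # continuous MLE
        fit_cdf = 1.0 - (tail / tail[0]) ** (1.0 - alpha)
        d_ks = np.max(np.abs(np.arange(n) / n - fit_cdf))
        if best is None or d_ks < best[2]:
            best = (alpha, tail[0], d_ks)
    return best

def clip_fingers(evals, max_fingers=10, tol=0.1):
    """Drop the largest eigenvalues one by one until alpha stops moving."""
    x = np.sort(np.asarray(evals, dtype=float))
    alpha = best_pl_fit(x)[0]
    for n_removed in range(1, max_fingers + 1):
        new_alpha = best_pl_fit(x[:-n_removed])[0]
        if abs(new_alpha - alpha) < tol:
            return alpha, n_removed - 1      # previous fit was already stable
        alpha = new_alpha
    return alpha, max_fingers

rng = np.random.default_rng(0)
bulk = (1.0 - rng.random(500)) ** (-1.0 / 1.5)    # power law tail, alpha = 2.5
evals = np.concatenate([bulk, [1e4, 2e4]])        # two planted outliers
raw_alpha = best_pl_fit(evals)[0]
fixed_alpha, n_removed = clip_fingers(evals)
print("raw alpha:", raw_alpha, "fixed alpha:", fixed_alpha, "removed:", n_removed)
```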

B.2) Stability analysis of fix_fingers option

How can we tell if the ‘fixed’ alpha is stable ? There is a plot for this too.

The plot on the left helps us evaluate the stability of the weightwatcher power law (PL) fits. It plots the choice of the start of the PL tail (red line) against the quality of the PL fit (Dks, the Kolmogorov–Smirnov distance between the data and the fit).

Good fits show a single, easily-found global minimum. Bad or spurious fits have lots of close or degenerate fits.

In this plot for the Layer 139 PL fit, we see a relatively convex envelope near the choice of the tail start for the fixed fit, and lots of spurious fits that would have occurred if we had included the finger in the data. (That is, there are lots of spurious fits toward the right side, indicating a smaller PL tail, i.e. the purple-dashed-line above)
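
The stability curve itself is easy to compute by hand. The sketch below, assuming a continuous power law and one common convention for the empirical CDF, fits alpha for every candidate tail start and records the KS distance, which is the curve one would plot. The function name ks_scan and the synthetic samples are illustrative assumptions, not weightwatcher internals.

```python
import numpy as np

def ks_scan(evals, min_tail=10):
    """For each candidate xmin, fit alpha by MLE and record the KS distance."""
    x = np.sort(np.asarray(evals, dtype=float))
    rows = []
    for i in range(len(x) - min_tail):
        tail = x[i:]
        xmin = tail[0]
        log_sum = np.sum(np.log(tail / xmin))
        if log_sum <= 0:
            continue
        n = len(tail)
        alpha = 1.0 + n / log_sum
        emp_cdf = np.arange(n) / n                      # empirical CDF of the tail
        fit_cdf = 1.0 - (tail / xmin) ** (1.0 - alpha)  # fitted power law CDF
        rows.append((xmin, alpha, float(np.max(np.abs(emp_cdf - fit_cdf)))))
    return rows

rng = np.random.default_rng(0)
samples = (1.0 - rng.random(2000)) ** (-1.0 / 1.5)   # power law tail, alpha = 2.5
scan = ks_scan(samples, min_tail=50)

# The preferred fit is the global minimum of D_KS over all candidate xmin.
xmin, alpha, d_ks = min(scan, key=lambda row: row[2])
print("best xmin:", xmin, "alpha:", alpha, "D_KS:", d_ks)
```

Plotting the third column of scan against the first gives the stability curve; many near-equal minima would be the warning sign described above.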

B.3) Comparing alpha to other weightwatcher metrics

The alpha metric is one of a number of weightwatcher layer quality metrics. It is the most interesting because it lets us identify the Universality class of the layer. That is, is the layer well fit ($2 \le \alpha \le 6$), overfit ($\alpha < 2$), or underfit ($\alpha > 6$)? But alpha is also the hardest to estimate. Fortunately, we can get a rough check on our estimated alpha by comparing it to 1 or more other weightwatcher metrics.

A good metric to compare against is the rand_distance metric. This metric computes how far a layer weight matrix W is from being a random matrix: the larger the distance, the less random (or more correlated) W is. Generally speaking, smaller alpha means larger rand_distance. We can compute rand_distance using the randomize option:

details = watcher.analyze(plot=True, fix_fingers='clip_xmax', randomize=True)

alpha vs Rand_Distance

We can now plot the raw_alpha vs. rand_distance , and the (fixed) alpha vs rand_distance.

Once again, we see that the raw_alpha and the ‘fixed’ alpha cases look very different. The raw_alpha is not even a proper function of rand_distance, and shows a strange circular trend (red line). In contrast, while not a perfect relation, the ‘fixed’ alpha at least shows a noisy but proper trend (green line).

Both metrics are noisy estimators, but by comparing them, we can see that the fixed alpha is a more consistent estimator of the Heavy-Tailed (HT) quality metric alpha for each layer.

C) Fingers as Dragon Kings

Why do we observe so many outlier eigenvalues? Here, I posit the conjecture that these spuriously large eigenvalues correspond to Dragon Kings (DK). The Dragon King Theory was first postulated by Professor Didier Sornette to explain the appearance of enormously large outliers in the power law fits of a wide range of self-organizing phenomena.

Dragon Kings (DK) are thought to arise as a result of some kind of coherent collective event or some other deterministic mechanism that can act like a dynamical attractor, accelerating the learning for the features associated with those specific eigenvectors at the far end of the ESD.

In particular, Dragon Kings are thought to arise in quasi-critical systems exhibiting Self-Organized Criticality (SOC), such as in the avalanche patterns observed in biological, cultured neurons, due to the inherent dynamics and interactions within the system that push it to a critical point.

This can be due to long-range interactions and/or feedback loops that, when activated, lead to the emergence of these extreme signatures.

When it comes to neuro-dynamics, sometimes Dragon Kings help, and sometimes they hurt. In some cases, they are thought to suppress neural function, and, in others, they are characteristic of proper function. See this 2012 paper for a brief review; there are many recent scientific studies on this as well.

But most importantly, and in the context of training an LLM, the appearance of Dragon Kings indicates that some unique dynamical process is generating these extreme eigenvalues, and that this is fundamentally different from the normal dynamics of the SGD training process for LLMs.

In an LLM, when do DKs hurt and when do they help? I suspect that when they arise in the ESD like they do in Mistral-7b, they may indicate better performance. But I suspect they may also appear as Correlation Traps, and, in this case, they may hurt performance.

I am postulating that it is this unique process, whatever it is, that is giving Mistral-7b (and the MOE 8x7B model) such remarkable performance. And if correct, it’s possible we could identify the underlying driving process and amplify it during training.

D) Testing the Dragon King Hypothesis with weightwatcher

WeightWatcher is a one-of-a-kind must-have tool for anyone training, deploying, or monitoring Deep Neural Networks (DNNs).

As more and more powerful open source LLMs emerge, it will be possible to test this Dragon King Hypothesis by running the open source weightwatcher tool. I encourage you to do this yourself, and / or join our Community Discord and submit some LLMs to test.

Evaluating Fine-Tuned LLMs with WeightWatcher Part II: PEFT / LoRa Models

Charles H Martin, PhD — Sun, 28 Jan 2024 07:06:50 +0000

Evaluating LLMs is hard. Especially when you don’t have a lot of test data.
In the last post, we saw how to evaluate fine-tuned LLMs using the open-source weightwatcher tool. Specifically, we looked at models after the ‘deltas’ (or updates) have been merged into the base model.

In this post, we will look at LLMs fine-tuned using Parameter Efficient Fine-Tuning (PEFT), also called Low-Rank Adaptations (LoRA). The LoRA technique lets one update the weight matrices (W) of the LLM with a Low-Rank update (BA):

$W \rightarrow W + BA$

Here is a great blog post explaining all things LoRA
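To make the low-rank update concrete, here is a minimal numpy sketch. The matrix sizes, the rank r=8, and the random values are all made up for illustration; they are not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 512 x 512 base weight matrix and a LoRA rank of r = 8.
n_out, n_in, r = 512, 512, 8

W = rng.normal(size=(n_out, n_in))       # base model weights (kept frozen)
B = rng.normal(size=(n_out, r)) * 0.01   # LoRA matrix B, shape (n_out, r)
A = rng.normal(size=(r, n_in)) * 0.01    # LoRA matrix A, shape (r, n_in)

delta_W = B @ A           # the low-rank update, same shape as W
W_adapted = W + delta_W   # what you get after merging the update

base_params = W.size
lora_params = A.size + B.size
print(f"rank of the update : {np.linalg.matrix_rank(delta_W)}")
print(f"base parameters    : {base_params}")
print(f"LoRA parameters    : {lora_params} ({lora_params / base_params:.1%} of base)")
```

The update BA has the same shape as W but rank at most r, which is why only the two small matrices need to be stored.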

The A and B matrices are significantly smaller than the base model W matrix, and since many LLMs are quite large and require a lot of disk space to store, it is frequently convenient to only store the A and B matrices. If you use the HuggingFace peft package, you would then store these matrices in a file called either

adapter_model.bin, or
adapter_model.safetensors

These adapter model files can be loaded directly into weightwatcher so that the LoRA BA matrices can be analyzed separately from the base model. To do this, use the peft=True option. But first, let’s check the requirements on the model for doing this.

0.) Requirements for PEFT/LoRA models

The update or delta should be either loaded in memory, or stored in a directory/folder, in the pytorch or safetensors format
The LoRA rank (r) should be r>10; otherwise you also need to specify min_evals=2
The LoRA layer names for the A and B matrix updates should include the tokens ‘lora_A’ and/or ‘lora_B’
We do not specifically support the Lightning AI Lit-GPT framework yet
The weightwatcher version should be 0.7.4.3 or higher
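The naming and rank requirements above can be checked by hand before running the tool. The sketch below is not part of weightwatcher; the function check_lora_layers, the example layer names, and the shapes are all hypothetical. It assumes the adapter weights are already available as a dict mapping layer names to 2D arrays.

```python
import numpy as np

def check_lora_layers(state_dict, min_rank=10):
    """Report LoRA layer names and ranks found in a dict of name -> 2D array."""
    lora_names = [n for n in state_dict if "lora_A" in n or "lora_B" in n]
    # The LoRA rank r is the smaller dimension of each A or B matrix.
    ranks = {n: min(state_dict[n].shape) for n in lora_names}
    smallest = min(ranks.values()) if ranks else None
    return {
        "num_lora_matrices": len(lora_names),
        "smallest_rank": smallest,
        # r > 10 is fine; otherwise min_evals=2 is also needed
        "needs_min_evals": smallest is not None and smallest <= min_rank,
    }

# Hypothetical adapter with one adapted layer of rank 8.
example = {
    "layers.0.q_proj.lora_A.weight": np.zeros((8, 64)),
    "layers.0.q_proj.lora_B.weight": np.zeros((64, 8)),
    "layers.0.other.weight": np.zeros((64, 64)),
}
report = check_lora_layers(example)
print(report)
```

Here the rank is 8, so the report flags that min_evals=2 would be needed.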

Here’s a Google Colab Notebook with the code discussed below

1) A simple example: llama-7b-lora

To illustrate the tool, we will use this Llama-7b LoRA fine-tuned model. You may download this model with:

!git clone https://huggingface.co/DevaMalla/llama_7b_lora

The llama_7b_lora directory should contain the adapter_model.bin file.

2) Describe the model
I recommend you first check the model to ensure weightwatcher reads the model file correctly and finds all the layers.

2a) peft=False

First, let’s see what the raw model looks like:

import weightwatcher as ww
watcher = ww.WeightWatcher()
details = watcher.describe(model='llama_7b_lora')

Note: if you don’t specify peft=True, then weightwatcher will try to analyze the individual A and B matrices, which is NOT what you want. But we can check the layer names. The details dataframe should look like this:

Notice that this dataframe has 128 rows, and each layer name has the phrase ‘lora_A’ or ‘lora_B’ in it.

Also, notice that the matrix rank for A and B is M=64, which is large enough for weightwatcher to get a good estimate of the weightwatcher layer quality metric alpha

2b) peft=True

watcher = ww.WeightWatcher()
details = watcher.describe(model='llama_7b_lora', peft=True)

The peft=True dataframe now should look like this:

Now we see that the details dataframe has only 64 rows, and each layer name has the phrase ‘lora_BA’ in it. Each layer now represents the matrix:

$latex \Delta \mathbf{W}=\mathbf{BA}$

Note: we use the notation from the original LoRa paper, whereas other blogs and frameworks may use $latex \Delta \mathbf{W}=\mathbf{AB}$
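Conceptually, going from 128 rows to 64 rows means pairing each A matrix with its B matrix and multiplying them. The sketch below illustrates that idea only; it is not how weightwatcher implements peft=True, and the function combine_lora_pairs and the layer names are hypothetical.

```python
import numpy as np

def combine_lora_pairs(state_dict):
    """Return {name with 'lora_BA': B @ A} for each matching lora_A / lora_B pair."""
    combined = {}
    for name_a, A in state_dict.items():
        if "lora_A" not in name_a:
            continue
        name_b = name_a.replace("lora_A", "lora_B")
        if name_b not in state_dict:
            continue  # unpaired A matrix, skip it
        B = state_dict[name_b]
        combined[name_a.replace("lora_A", "lora_BA")] = B @ A
    return combined

rng = np.random.default_rng(0)
r, n = 16, 64  # made-up LoRA rank and layer width
adapter = {}
for proj in ("q_proj", "v_proj"):
    adapter[f"layers.0.{proj}.lora_A.weight"] = rng.normal(size=(r, n))
    adapter[f"layers.0.{proj}.lora_B.weight"] = rng.normal(size=(n, r))

ba_layers = combine_lora_pairs(adapter)
for name, BA in ba_layers.items():
    print(name, BA.shape)
```

Four A and B matrices go in, and two BA matrices come out, each with the shape of the adapted layer.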

3) Analyze the adapter_model.bin

watcher = ww.WeightWatcher()
details = watcher.analyze(model='llama_7b_lora', peft=True)

The details dataframe should now contain useful layer quality metrics such as alpha, alpha_weighted, etc. Here’s a peek:

We can now look at some of these metrics to see how well our LoRA fine-tuning worked.

3.a) Heavy-Tailed Layer Quality Metric alpha

Let’s plot a histogram of the Power Law exponent metric alpha.

import matplotlib.pyplot as plt

avg_alpha = details.alpha.mean()

title=f"Fine Tuned WeightWatcher Alpha\n Llama-7b-LoRA"
details.alpha.plot.hist(bins=50)
plt.title(title)
plt.axvline(x=avg_alpha, color='green', label=f"avg alpha = {avg_alpha:0.3f}")
plt.legend()

Notice that all of the layer alphas are less than 2 (alpha < 2). That is, all the layer ESDs live in the HTSR Very Heavy Tailed (VHT) Universality Class--which could be problematic. Typically, with well-performing models, we see alpha in [2,4], and we very rarely find alpha < 2. And this holds for the updated fully fine-tuned (not PEFT) models.

We have found that for many LoRA fine-tuned models, the alphas are < 2. What does it mean when a layer alpha < 2?

the layer is over-regularized ?
the layer could be overfitting its training data ?

We could potentially prevent these small alphas by decreasing the learning rate.

4) LoRA alphas vs Base Model alpha

Why are the LoRA layer alphas so small ? Are these layers overfitting their training data ? This remains an open question, but we can get some insights by comparing the LoRA layer alphas to the corresponding layers in the base model. Notice that in this model, the developer only fine tuned the Q and V layers.

If we have the base_model details for the llama_7b model, we can compare the layer alphas between the Base and the fine-tuned LoRA models:

B = base_details
filtered_B = B[B['name'].str.contains('q|v', na=False)]
base_alphas = filtered_B.alpha.to_numpy()
lora_alphas  = details.alpha.to_numpy()

plt.scatter(x=base_alphas, y=lora_alphas)
plt.title("Llama-7b: LoRA alphas vs Base Model alphas")
plt.xlabel("Base Model Layer alpha")
plt.ylabel("LoRA Layer alpha")

From this plot, we see immediately that the larger the Base model alphas are, the smaller the LoRA layer alphas are. This is a common pattern we have observed in other LoRA models, and is easy to reproduce even when fine-tuning a small model like BERT. (for a future blog post)
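One way to put a number on this trend is a correlation coefficient between the two sets of alphas. The sketch below assumes the two arrays are aligned layer by layer (as the scatter plot above also requires). The helper alpha_correlations and the alpha values are made up for illustration; they are not the Llama-7b results.

```python
import numpy as np

def alpha_correlations(base_alphas, lora_alphas):
    """Pearson and Spearman correlations between two aligned arrays of layer alphas."""
    base = np.asarray(base_alphas, dtype=float)
    lora = np.asarray(lora_alphas, dtype=float)
    if base.shape != lora.shape:
        raise ValueError("base and LoRA alphas must have one entry per matching layer")
    pearson = np.corrcoef(base, lora)[0, 1]

    # Spearman: Pearson correlation of the ranks (ties are not handled specially).
    def rank(v):
        return np.argsort(np.argsort(v)).astype(float)

    spearman = np.corrcoef(rank(base), rank(lora))[0, 1]
    return pearson, spearman

# Made-up values where larger base alphas go with smaller LoRA alphas.
base_example = [3.1, 4.0, 5.2, 6.5, 7.8]
lora_example = [1.9, 1.8, 1.6, 1.5, 1.3]
pearson, spearman = alpha_correlations(base_example, lora_example)
print(f"Pearson = {pearson:.2f}, Spearman = {spearman:.2f}")
```

A strongly negative value would match the pattern described above; a value near zero would mean no clear relationship.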

What does this mean? It appears that when the base model layer is less correlated (alpha > 6 say), the LoRA update tends to be more correlated. It is as if the LoRA fine-tuning procedure over-fits its training data when the layer being adapted is not well trained.

This is quite curious and suggests that one has to be very careful when fine-tuning base models like Llama with many under-correlated layers (i.e. alpha > 6). In contrast, the Falcon models do not show such behavior (see the weightwatcher LLM leaderboard page).

On the other hand, maybe this is not a bad thing, and you actually want your fine-tuned LLM to overfit / memorize your training data as much as possible ?

5) Next Steps

If you want to try this yourself:

pip install weightwatcher

and join our Community Discord channel to discuss the results and explore this and other features of the weightwatcher tool.

Evaluating Fine-Tuned LLMs with WeightWatcher

Charles H Martin, PhD — Wed, 24 Jan 2024 07:49:05 +0000

If you are fine-tuning your own LLMs, you need a way to evaluate them. And while there are over a dozen popular methods to choose from,

each of them is biased toward a specific, narrowly scoped measure,
none of them can identify potential internal problems in your model, and
in the end, you will probably need to design a custom metric for your LLM

Can you do better? Before you design a custom metric, there is a better, cheaper, and faster approach to help you get started: using the open-source weightwatcher tool.


How does it work? The weightwatcher tool assigns a quality metric (called alpha) to every layer in the model. The best models have layer alphas in the range 2 < alpha < 6. And the average layer alpha ($latex \langle\alpha\rangle$) is a general-purpose quality metric for your fine-tuned LLMs; when comparing models, smaller is better.

Let’s walk through an example. We consider the Falcon-7b-Instruct instruction fine-tuned model. But first,

0.) WeightWatcher Requirements for Fine-Tuned Models

For this blog, we consider models where the fine-tuning updates have been merged with the base model. In a future blog, we will show how to examine the update files (i.e. the adapter_model.bin) directly. For this:

The base_model should be stored in the HF safetensors format
The merged model does not need to be safetensors format, but the base model must have exactly the same layer names
The update should be for either fully fine-tuned models, or PEFT/LoRA models with a fairly large LoRA rank r>=32, preferably r>=64 or r>=128

On compute resources, weightwatcher can run on a generic GPU, CPU, or multi-core CPU, and will automatically detect the environment and adjust to it.

Here’s a Google Colab Notebook with the code discussed below

1.) Install weightwatcher

pip install weightwatcher

Just to be safe, you can print the weightwatcher version

import weightwatcher as ww
print(ww.__version__)

The version should be 0.7.4 or higher

2.) Download the models. Here, we will use the pre-trained Falcon-7b-instruct model and the Falcon-7b base_model. To save memory, we will use the versions stored in the safetensors format.

2.a) For Falcon-7b-instruct, you can just check out the git repo directly

git clone https://huggingface.co/tiiuae/falcon-7b-instruct

To save space, you may delete the large pytorch_model.bin files

2.b) For the base model, Falcon-7b, we need to do a little extra work to download the files (because the Falcon-7b safetensors branch is not available from git directly). First, git clone the main repo:

git clone https://huggingface.co/tiiuae/falcon-7b

To save space, you may also delete these large pytorch_model.bin files here

2.c) Manually download the individual model.*.safetensors files, and place them directly in your local falcon-7b directory.

wget https://huggingface.co/tiiuae/falcon-7b/resolve/d09af65857360b23079dc3dc721a2ed29f4423e0/model-00001-of-00002.safetensors
wget https://huggingface.co/tiiuae/falcon-7b/resolve/d09af65857360b23079dc3dc721a2ed29f4423e0/model-00002-of-00002.safetensors

You will also want to download the model.safetensors.index.json file and place this in your local falcon-7b directory.

Your local directory should now look like:

3.) Run weightwatcher

3.a) Just to confirm everything is working correctly, let’s run watcher.describe() first

import weightwatcher as ww
watcher = ww.WeightWatcher()
details = watcher.describe(base_model='falcon-7b', model='falcon-7b-instruct')

This will generate a pandas dataframe with a list of layer attributes. It should look like this:

3.b) If the details dataframe looks good, you can run watcher.analyze()

import weightwatcher as ww
watcher = ww.WeightWatcher()
details = watcher.analyze(base_model='falcon-7b', model='falcon-7b-instruct')

The weightwatcher tool will compute the weight matrix updates (W-W_base) for every layer by subtracting out the base_model weights (W_base) from the weights (W) of the fine-tuned model. This lets us evaluate how well the fine-tuning worked.
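The subtraction itself is simple to picture. The sketch below is an illustration only, not the weightwatcher implementation: the function weight_updates, the layer name, and the random weights are hypothetical. It also shows why the two models must have exactly the same layer names.

```python
import numpy as np

def weight_updates(model_weights, base_weights):
    """Return {layer name: W - W_base}; both inputs map layer names to arrays."""
    mismatched = set(model_weights) ^ set(base_weights)
    if mismatched:
        raise KeyError(f"layer names differ between models: {sorted(mismatched)}")
    return {name: model_weights[name] - base_weights[name] for name in model_weights}

rng = np.random.default_rng(0)
# Hypothetical one-layer "models": the tuned weights are the base plus a small change.
base = {"layer.0.weight": rng.normal(size=(64, 64))}
tuned = {"layer.0.weight": base["layer.0.weight"] + 0.01 * rng.normal(size=(64, 64))}

updates = weight_updates(tuned, base)
for name, delta in updates.items():
    print(name, "Frobenius norm of update:", round(float(np.linalg.norm(delta)), 3))
```

If a layer name is present in one model but not the other, there is nothing to subtract, so the sketch raises an error rather than guessing.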

3.c) Plotting a histogram of the layer alphas, we can see that almost every alpha lies in the weightwatcher safe-range of alpha in [2,4], indicating this is a pretty good fine-tuning.

Notice also the average layer alpha, $latex \langle\alpha\rangle$ = 2.814, which is pretty good.

And that’s all there is to it. You now have an evaluation metric, $latex \langle\alpha\rangle$ ~ 2.8, for your fine-tuned model (here, Falcon-7b-instruct). And you did not need to run any costly inference calculations, or even have access to the training data.

You don’t even need a GPU–weightwatcher can run on a single CPU or a shared memory, multi-core CPU machine. You can run weightwatcher on Google Colab, or even locally on your MacBook (which I did).

4.) Comparing Fine Tuned Models

We can compare 2 or more fine-tuned models directly by looking at the weightwatcher layer quality metrics. As an example, let’s compare the Falcon-7b-instruct model to the Falcon-7b-sft-mix-2000 model.

4.a) Comparing the layer averaged alpha

We can compute the layer average alpha for the 2 models directly from the details dataframe ($latex \langle\alpha\rangle$ = avg_alpha) using:

avg_alpha =  details.alpha.mean()

Doing this for both models, we find:

falcon-7b-instruct: $latex \langle\alpha\rangle$ =~ 2.8
falcon-7b-sft-mix-2000: $latex \langle\alpha\rangle$ =~ 2.6 (smaller is better)

From this, we see that the falcon-7b-sft-mix-2000, having a smaller average alpha, is a little bit better than the falcon-7b-instruct. Can we say more ?

4.b) Comparing layer alpha histogram plots

Comparing this plot to the one above, we see that the falcon-7b-sft-mix-2000 has a few undesirable layers with alpha < 2 (left of the red line), but, more importantly, has a maximum alpha of ~3.5. In contrast, while the falcon-7b-instruct model has fewer layers left of the red line, its maximum alpha is ~4.5, which is much higher. So, on average (and even if we ignored the layer alphas < 2), we find that the falcon-7b-sft-mix-2000 is a little better.
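The quantities used in this comparison (average alpha, maximum alpha, and the number of layers outside the preferred range) can be collected in a few lines. This is a sketch with made-up alpha values, not the actual Falcon results; summarize_alphas is a hypothetical helper, and the default range [2, 6] follows the range quoted earlier in this post.

```python
import numpy as np

def summarize_alphas(alphas, low=2.0, high=6.0):
    """Summary statistics for an array of layer alphas (NaNs are ignored)."""
    a = np.asarray(alphas, dtype=float)
    a = a[~np.isnan(a)]
    return {
        "avg_alpha": float(a.mean()),
        "max_alpha": float(a.max()),
        "num_below_low": int((a < low).sum()),
        "num_above_high": int((a > high).sum()),
    }

# Made-up layer alphas for two hypothetical fine-tuned models.
model_a = [2.1, 2.6, 3.0, 3.4, 4.5]
model_b = [1.8, 2.2, 2.7, 3.0, 3.5]

summary_a = summarize_alphas(model_a)
summary_b = summarize_alphas(model_b)
print("model_a:", summary_a)
print("model_b:", summary_b)
```

With a real details dataframe, the input would be details.alpha.to_numpy() for each model.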

5.) Summary

We have seen how to use the weightwatcher tool to evaluate and compare 2 different fine-tuned models (with the same base model, falcon-7b). Notice we did not need to run inference or even need any test data.

To run these calculations quickly, I used an A100 GPU on Google Colab, and they only took a few minutes. In fact, it took longer to upload these large model files to my Google Drive than to run weightwatcher.

But these are just 7B parameter models, and since these LLMs can get fairly large, in the next blog post, I will show how to analyze the much smaller adapter_model.bin model files (trained with or without PEFT/LoRA) directly with weightwatcher.

Stay tuned. And if you need help training or fine-tuning your own LLMs, please reach out. #talktochuck #theaiguy

WeightWatcher new feature: fix_fingers=’clip_xmax’

Charles H Martin, PhD — Tue, 21 Mar 2023 22:03:54 +0000

WeightWatcher 0.7 has just been released, and it includes the new and improved advanced feature for analyzing Deep Neural Networks (DNN) called fix_fingers. To activate this, simply use:

details = watcher.analyze(..., fix_fingers='clip_xmax', ...)

This will take a tiny bit longer, and will yield more reliable alpha for your model layers, along with a new column, num_fingers, which reports the number of outliers found

Note that other metrics, such as alpha_weighted (alpha-hat), will not be affected.

It is recommended that fix_fingers be added for all analysis moving forward; however, we have not completed a detailed study on the impact yet, so it has been included as an advanced feature.

So why do this ?

What was wrong; what’s been fixed ?

In our Nature paper (Nature Communications 2021), we looked at the layer alphas for the GPT2 models, and, among other things, noticed several large alphas, greater than 8 and even 10! These are outliers because no alpha should be greater than 8 (which is the minimum alpha for a random matrix).

Up until now, one just had to accept them as errant and/or remove them from our analysis. But now, I can explain where they come from and how to adjust the calculations when necessary.

Recall that the HTSR and SETOL theories describe the layer quality by analyzing the histogram of its eigenvalues–that is, we analyze the ESD (Empirical Spectral Density). The best-trained layers have a heavy-tailed ESD which can be well fit to a Power Law (PL)
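To give a feel for what fitting a Power Law to the tail means, here is a minimal sketch of the standard continuous maximum-likelihood estimator for the exponent, with the start of the tail (xmin) assumed to be known. This is not the weightwatcher code, which also has to find xmin; the function fit_alpha and the synthetic "eigenvalues" are assumptions made for illustration.

```python
import numpy as np

def fit_alpha(evals, xmin):
    """Continuous power-law MLE for the exponent alpha, using eigenvalues >= xmin."""
    evals = np.asarray(evals, dtype=float)
    tail = evals[evals >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

# Synthetic "eigenvalues" drawn from a pure power law with a known exponent,
# using inverse-transform sampling.
rng = np.random.default_rng(0)
true_alpha, xmin = 3.0, 1.0
u = rng.random(5000)
samples = xmin * (1.0 - u) ** (-1.0 / (true_alpha - 1.0))

alpha_hat = fit_alpha(samples, xmin)
print(f"true alpha = {true_alpha}, estimated alpha = {alpha_hat:.2f}")
```

On clean synthetic data like this the estimate lands close to the true exponent. Real layer ESDs are messier, which is where the choice of xmin and the handling of outliers start to matter.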

Fingers (i.e. outliers) tend to appear when the layer (Empirical Spectral Density) is really a Truncated Power Law (TPL), as opposed to a simple Power Law (PL). And that’s OK–it is actually predicted by the HTSR theory and for very large and/or high-quality layers. More precisely, we expect that, in the far tail, the (far) Tail Local stats will be Frechet.

So why not just fit a TPL ? Weightwatcher does support this (i.e. fit=’TPL’), but this is very slow, and, more importantly, we usually don’t have enough eigenvalues in the tail of the ESD to get a good TPL fit, and, instead, there are 1 or a few very large eigenvalues that degrade the PL fit.

We call these very large eigenvalues fingers because they look like fingers peeking out of the ESD tail.

How can we account for these outliers or fingers? Just remove them!

If we just remove the TPL fingers when they appear, and we are careful, we can very frequently recover a very good PL fit. (And this works better than using the fit=’TPL’ option.) Let’s look at an example…

But before we do this, let’s discuss why they might appear.

The Critical Brain Hypothesis

The weightwatcher project is motivated by the theory of Self Organized Criticality (SOC), and the amazing fact that it has been successfully applied to understand the observed behavior of real-world (biological) spiking neurons. This is called the Critical Brain Hypothesis.

The weightwatcher theory posits that Deep Neural Networks (DNNs) exhibit the same signatures of criticality–namely power law behavior–as frequently seen in neuroscience experiments on real neurons.

Moreover, I believe that as LLMs (Large Language Models) approach true Self-Organized Criticality, we will see even more amazing properties emerge.

New and Improved Metrics for GPT

Let’s see how the new and improved fix_fingers works on GPT and GPT2

First, we need to download the models (from HuggingFace)

import transformers
from transformers import OpenAIGPTModel,GPT2Model

gpt_model = OpenAIGPTModel.from_pretrained('openai-gpt')
gpt_model.eval();

gpt2_model = GPT2Model.from_pretrained('gpt2')
gpt2_model.eval();

Now, let’s run weightwatcher

import weightwatcher as ww
watcher = ww.WeightWatcher()
gpt_details = watcher.analyze(model=gpt_model, fix_fingers='clip_xmax')  
gpt2_details = watcher.analyze(model=gpt2_model, fix_fingers='clip_xmax')

Finally, we can generate a histogram of the layer alphas

import matplotlib.pyplot as plt

gpt_details.alpha.plot.hist(bins=100, color='red', alpha=0.5, density=True, label='gpt')
gpt2_details.alpha.plot.hist(bins=100, color='green', density=True, label='gpt2')
plt.legend()
plt.xlabel(r"alpha $(\alpha)$ PL exponent")
plt.title(r"GPT vs GPT2 layer alphas $(\alpha)$ w/Fixed Fingers")

Here are the results

We see that the GPT model still has several fingers (alphas >= 6), and, on average, larger alphas than GPT2. But there are fewer fingers than previously.

Summary: a new and improved Power Law estimator

We have introduced the new and improved option

details = watcher.analyze(..., fix_fingers='clip_xmax', ...)

which will provide more stable and smaller alphas for layers that are well-trained.

I encourage you to try it and see if it works better for you. And please let me know.

WeightWatcher 0.7: March 2023

Charles H Martin, PhD — Tue, 21 Mar 2023 00:20:03 +0000

First, let me say thanks to all the users in our great community — we have reached over 93K downloads as of March 2023 !

The latest release of the open-source weightwatcher tool includes several important advances, including

removing explicit dependence on tensorflow and torch on install
the ability to process very large models, directly from their pytorch state_dict files
GPU-enabled SVD calculations
Much faster and more stable power law calculations
Lower memory footprint on GPU enabled machines
an improved method for finding the weightwatcher shape-metric alpha, with the option fix_fingers='clip_xmax' (to remove structural outliers, called fingers)
a new landing page: https://weightwatcher.ai with lots of examples

To learn more now, join our Discord channel (documentation to come)


WeightWatcher (WW) is an open-source, diagnostic tool for analyzing Deep Neural Networks (DNN), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, specifically our Theory of Heavy-Tailed Self-Regularization (HT-SR). It uses ideas from Random Matrix Theory (RMT), Statistical Mechanics, and Strongly Correlated Systems.

It can be used to:

analyze pre-trained PyTorch and Keras DNN models (Conv2D and Dense layers)
monitor models, and the model layers, to see if they are over-trained or over-parameterized
predict test accuracies across different models, with or without training data
detect potential problems when compressing or fine-tuning pretrained models
layer warning labels: over-trained; under-trained

It is based on theoretical research into Why Deep Learning Works, using the new Theory of Heavy-Tailed Self-Regularization (HT-SR), published in JMLR and Nature Communications.

Deep Learning and Effective Correlation Spaces

Charles H Martin, PhD — Thu, 02 Feb 2023 04:32:03 +0000

AI has taken the world by storm. With recent advances like AlphaFold, Stable Diffusion, and ChatGPT, Deep Neural Networks (DNNs) have had their Sputnik moment. And yet, we really don’t understand why DNNs even work. Unless, of course, you follow this blog and use the widely popular open-source weightwatcher tool.

The open-source weightwatcher tool has been featured in Nature Communications and has over 86K downloads. It can help you diagnose problems in your DNN models, layer-by-layer, without even needing access to the test or training data. But how can weightwatcher possibly do this?

In a previous post, I presented the current working theory for the weightwatcher project, which explains where the weightwatcher power law shape metrics, alpha and alpha-hat, really come from. This post has been written up in a new paper, and an early draft is available upon request (email me).

I call this SETOL, a SemiEmpirical Theory of Learning.

We can write the weightwatcher metric as an approximate Free Energy/Likelihood, given as the log of the model Quality, where the model quality might be the model test accuracy, an average GLUE score, etc. The SETOL approach expresses this quality in terms of an HCIZ integral.

It is shown in the SETOL paper that when the layer is well trained, the weightwatcher alpha-hat metric is then the approximate log quality.

In order to formulate this, however, we need to invoke the change of measure from the distribution of the full space of all (Student) weight matrices to the space of all (Student) correlation matrices,

where represents an arbitrary ‘Student’ correlation matrix .

What does this mean? And, more importantly, how can we be sure it is correct ?

The Effective Correlation Space

First, some more familiar notation. More generally, given an arbitrary layer weight matrix $latex \mathbf{W}$, of dimension $latex N \times M$, we define the correlation matrix $latex \mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$. In this general case, we assume we can perform the change of measure from layer weight matrices $latex \mathbf{W}$ to general correlation matrices $latex \mathbf{X}$.
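For readers who want to see these objects numerically, here is a small numpy sketch. It assumes the normalization X = W^T W / N for an N x M weight matrix, which is my assumption about the convention in use here, and it uses a random matrix in place of a trained layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random N x M matrix standing in for a layer weight matrix (made-up sizes).
N, M = 300, 100
W = rng.normal(size=(N, M))

# Correlation matrix, with the 1/N normalization assumed above.
X = W.T @ W / N

# The ESD is the histogram of the eigenvalues of X.
evals = np.linalg.eigvalsh(X)   # ascending order, X is symmetric

# The same eigenvalues can be computed from the singular values of W.
sv = np.linalg.svd(W, compute_uv=False)
evals_from_svd = np.sort(sv**2 / N)

print("largest eigenvalue :", round(float(evals[-1]), 4))
print("match via SVD      :", bool(np.allclose(evals, evals_from_svd)))
```

Working from the singular values avoids forming X explicitly, which is convenient for large layers.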

In words, this means that, we need to change our concept of where the generalizing components live and how we effectively measure them.

For a very well-trained DNN layer, the correlations concentrate into a lower rank space, effectively defined by eigenvalues in the power law (PL) tail of the ESD. By this, we mean that there is some effective operator that spans the same space as the tail of the ESD (Empirical Spectral Density), and has the same eigenvalues as those in the tail of the original layer correlation matrix.

This is consistent with the HTSR theory (published in JMLR), which states that for any well-trained weight matrix, the ESD of the layer not only forms a power law, but, also, that this tail contains the dominant eigencomponents that allow the model to generalize.

More on what precisely this effective operator is in another post; suffice it to say that right now, we just care about the eigenvalues. And we can test this theory by simply looking at the eigenvalues of our actual weight matrices.

In the rigorous formulation of the weightwatcher theory, to accomplish this, we require that the Effective Correlation Space be defined by a Volume-Preserving Transformation

Volume-Preserving Transformation

If we (crudely) write this transformation in terms of a Jacobian, one can see that to change the measure from the uncorrelated to the correlated measure,

then the determinant of the Jacobian of this transformation should be one. This can be shown exactly (and is derived in the appendix of the new theory paper).

The key assumption of the weightwatcher theory is an empirically verifiable approximation:

The eigenvalues $latex \lambda_{i}$ in the tail of the ESD satisfy the relation $latex \prod_{i}\lambda_{i}=1$.
That is, the effective correlation matrix has determinant 1, i.e. $latex \det\mathbf{X}=1$.

This means we have now 2 independent methods to identify the power law tail (indicated with a red or purple line.)

MLE fit (red line): Apply the standard (Clauset MLE) PL estimator, and find the eigenvalue where the PL tail starts. This is the default weightwatcher approach.
det X=1 (purple line): Search for the first eigenvalue that satisfies the det X=1 condition.

When these 2 methods coincide, we can have great confidence the theory applies well. At least I do.

This constraint can be evaluated empirically by plotting the ESD on a log-linear plot, and comparing the point where the PL fit says the tail of the ESD starts (red line) to the point where the constraint is best satisfied (purple line). When these 2 lines overlap, the theory works best. And when they don’t overlap, the layer is not fully optimized.
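Here is one way to read the determinant condition numerically: sort the eigenvalues from largest to smallest, and find how many of the largest ones are needed for their product to come closest to 1 (equivalently, for the sum of their logs to come closest to 0). This is my own sketch of the idea, not the weightwatcher implementation; the function detx_tail_start and the toy eigenvalues are hypothetical.

```python
import numpy as np

def detx_tail_start(evals):
    """Find the tail of largest eigenvalues whose product is closest to 1.

    Returns (smallest eigenvalue in that tail, number of eigenvalues in it).
    All eigenvalues must be strictly positive.
    """
    lam = np.sort(np.asarray(evals, dtype=float))[::-1]   # descending order
    log_det = np.cumsum(np.log(lam))   # log of the product of the k largest
    k = int(np.argmin(np.abs(log_det))) + 1
    return float(lam[k - 1]), k

# Toy eigenvalues: 4.0 * 0.5 * 0.5 = 1, so the tail should hold 3 eigenvalues.
toy_evals = [0.01, 0.5, 4.0, 0.1, 0.5]
tail_start, tail_size = detx_tail_start(toy_evals)
print(f"tail starts at eigenvalue {tail_start} and holds {tail_size} eigenvalues")
```

The eigenvalue returned here plays the role of the purple line; comparing it to the xmin from a power law fit gives the overlap check described above.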

Remarkably, this is very frequently satisfied when the ESD is a power law and has the exponent alpha = 2. And this, according to the HTSR theory, corresponds to optimal learning in that layer!

Empirical Verification of the (SETOL) Theoretical Model

Here, we will provide some justification for these key assumptions, using the open-source weightwatcher tool. As always, all of these results are 100% reproducible, and, moreover, you can test the assumptions on other models yourself.

You can generate these results on any model you like using weightwatcher. Simply use the following commands in your favorite notebook (Jupyter, Google Colab, etc.):

import weightwatcher as ww
watcher = ww.WeightWatcher()
details = watcher.analyze(model=your_model, plot=True, detX=True)

When the theory is working perfectly, or at least as prescribed, then

the weightwatcher PL shape metric alpha will be about 2
the detX constant plot will show the purple (det X=1) and red (MLE fit) lines very close together, if not overlapping.

In the theory paper, we show this for a simple example. However, it turns out that this is very common even in many SOTA models!

Here are some examples from SOTA DNN models

ALBERT

ALBERT (A Lite BERT) is a SOTA LLM published by Google Research and the Toyota Technological Institute at Chicago; it is a lightweight variant of BERT. It even outperforms BERT, XLNet, and RoBERTa in some cases.

There are several ALBERT models (base, large, xlarge, xxlarge), and most have layers have . Here’s a plot of the layer averaged alphas for all 4 models, compared their average quality (see this notebook for more details).

Notice that the layer averaged alpha metric, $latex \langle\alpha\rangle$, is roughly correlated with model quality. This shows that the general weightwatcher approach works on this set of models.

Where does the SETOL theory work exactly?

Let’s take a look at 2 random layers from the ALBERT albert-xlarge-v2 model, with different alphas: a middle layer, with alpha = 2.387, and the last layer, with alpha = 1.829. Below you can see that with the smaller alpha of 1.829, closer to 2.0, the purple and red lines overlap.

alpha = 2.387

alpha = 1.829

VGG19

We have looked at the VGG series of models before, both in our JMLR paper, and in our Nature Communications paper. Here, we look at a couple of Fully Connected (FC) or Dense/Linear layers with alpha near 2.0. (Note, we did this analysis in Table 6 of the JMLR paper, but I have found that more recent versions of VGG give different results; here I report results from the current, default Keras VGG19 model.)

alpha = 2.14

alpha = 2.06

Analysis

Here, despite the 2 plots above looking very different, both alphas are close to 2, and in both cases the purple and red lines overlap, showing that these ESDs exhibit the signatures of the required Volume Preserving Transformation under-the-hood of the rigorous weightwatcher Statistical Mechanics-based Semi-Empirical Theory of Learning (SETOL).

More Models ?

To convince you I am not just cherry-picking examples, let me encourage you to download the open-source weightwatcher tool and try it yourself.

pip install weightwatcher

If you find it works, please let me know. And if something is wrong, I’d like to know that too.

Closing Points

WeightWatcher is an open-source, diagnostic tool for analyzing Deep Neural Networks (DNN), without needing access to training or even test data. It is based on theoretical research into Why Deep Learning Works, and is described in a new theory–the weightwatcher Semi-Empirical Theory of Learning (SETOL). Here, we have shown a key feature of the new SETOL approach–that DNNs can be described with an Effective Correlation Space at each layer, and that this space is characterized by a Volume Preserving Transformation that captures and concentrates the correlations in the data at each layer. And we have shown how you can test this yourself using the weightwatcher tool.

Most importantly, the empirical results show that when the layer is learning perfectly, the layer correlations can be fit to a heavy-tailed power law (PL) distribution, with PL exponent alpha of 2–exactly as prescribed by the earlier HTSR theory!

If you would like to read an early draft of the new paper, please reach out to me. I would greatly appreciate the feedback. Additionally, feel free to join the weightwatcher community Discord channel to discuss all things about the theory and how to use the tool to help you with the training and monitoring of your AI models.


The weightwatcher tool has been developed by Calculation Consulting. We provide consulting to companies looking to implement Data Science, Machine Learning, and/or AI solutions. Reach out today to learn how to get started with your own AI project. #talkToChuck #theAIguy Email: Info@CalculationConsulting.com

Better than BERT: Pick your best model

Charles H Martin, PhD — Fri, 22 Jul 2022 19:05:40 +0000

Have you ever had to sort through HuggingFace to find your best model ? There are over 54,000 models on HuggingFace! So it’s not an easy task.

Most people just choose the most popular model–and this is usually BERT. Or some BERT variant. BERT was created by Google, so it must be good.

But is BERT really the best choice for you?

How can you find out ? You can search through the literature, read blogs, ask on Reddit, etc, and try to find a better model. This is time consuming and imperfect. Fortunately, there is a better way.

The weightwatcher tool can tell you.

WeightWatcher is an open-source, data-free diagnostic tool that can estimate the quality of a DNN model like BERT, GPT, etc–without needing any data! (No training or test data–just the weights). It has been featured in JMLR, at ICML and KDD, and even in Nature Communications.

Here’s an example using weightwatcher to compare 3 NLP models: BERT, RoBERTa, and XLNet

The WeightWatcher Power-Law (PL) metric alpha is a DNN model quality metric; smaller is better. The plot above displays all the layer alpha values for the 3 models. It is immediately clear that the XLNet layers look much better than BERT or RoBERTa; the alpha values are smaller on average, and there are no alphas larger than 5. In contrast, the BERT and RoBERTa alphas are much larger on average, and both models have too many large alphas.

This is totally consistent with the published results: in the original paper (from Carnegie Mellon University and Google Brain), XLNet outperforms BERT on 20 different NLP tasks.

Do it yourself:

WeightWatcher will work with any HuggingFace Transformer (or CV) model.

Here is a Google Colab notebook that lets you reproduce this yourself

Give it a try. And if you need help with AI, ML, or just Data Science, please reach out. I provide strategy consulting, data science leadership, and hands-on, heads-down development. I will have availability in Q3 2022 for new projects. Reach out today. #talkToChuck #theAIguy

Is your layer over-fit? (part 2)

Charles H Martin, PhD — Tue, 14 Jun 2022 20:53:05 +0000

Say you are training a Deep Neural Network (DNN), and you see your model is over-trained. Or just not performing well. Is there a way to detect which layer is actually over-trained? (or over-fit, as some people call it)

In this post, we will show how to use the open-source weightwatcher tool to answer this.

WeightWatcher is an open-source, data-free diagnostic tool for analyzing (pre-)trained DNNs. It is based on my personal research into Why Deep Learning Works, in collaboration with UC Berkeley. It is based on ideas from the Statistical Mechanics of Learning (i.e theoretical physics and chemistry).

pip install weightwatcher

WeightWatcher lets you inspect your layer weight matrices to see if they are converging properly. And in some cases, it can even tell you if the layer is over-trained. The idea is simple. If you are training a model, and you over-regularize one of the layers, then you may observe that the weightwatcher alpha metric drops below 2 ($\alpha < 2$). This is predicted by our HTSR theory of learning (although we have not published this specific result yet). And, as far as we know, it is unique: no other approach can do this.

To see how this works, we will look at a very specific, carefully-designed experiment where the theory is known to work exactly as advertised.

BUT (and here’s the disclaimer)

Please be aware–training DNNs to State-of-the-Art (SOTA) is not easy, and applying the tool requires designing careful experiments that can isolate the problems you are trying to fix. It does not work in every case, and you may see unusual results that are difficult to interpret. In these cases, please feel free to reach out to me directly to get help.

Having said that, let’s get started

HERES THE GOOGLE COLAB NOTEBOOK

Experimental Design

We consider a very simple DNN, a 3-layer MLP (Multi-Layer Perceptron), trained on MNIST.

To induce the overtraining, we will train this model using different batch sizes, with batch_size in [1,2,4,8,16,32].

Why do we vary the batch size, and not a specific regularization hyper-parameter like Weight Decay or Dropout? A small batch size acts like a very strong regularizer, which can induce the Heavy-Tails we see in SOTA models, even in this very small and generally poorly performing model. This is shown in Figure 25 of our JMLR paper describing our theory of Heavy-Tailed Self-Regularization (HT-SR), the theory behind weightwatcher.

Moreover, with extremely small batch sizes, and a large number of epochs, we can even drive the model into a state of over-training, which is the goal here. So each model is trained for a very large number of epochs, until the training loss stabilizes, using a Keras EarlyStopping callback:

tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3, verbose=0, min_delta=0.001, restore_best_weights=True)
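To make the stopping rule concrete, here is a minimal pure-Python sketch of what a patience / min_delta rule on the training loss does. It is a simplified stand-in for the Keras callback, not its actual implementation; the function name early_stopping_epoch and the loss values are hypothetical and chosen only for illustration.

```python
def early_stopping_epoch(losses, patience=3, min_delta=0.001):
    """Return the epoch index at which training would stop, or None.

    Simplified rule: an epoch counts as an improvement only if the loss
    beats the best loss seen so far by more than min_delta. Training stops
    after `patience` consecutive epochs without an improvement.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss   # new best training loss
            wait = 0      # reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Illustrative (made-up) training losses: fast progress, then a plateau
losses = [0.90, 0.60, 0.45, 0.40, 0.3995, 0.3993, 0.3992, 0.3991]
stop_epoch = early_stopping_epoch(losses)
print(stop_epoch)
```

Because the rule watches the training loss rather than the validation loss, the run keeps going as long as the model is still fitting the training data, which is what lets the small-batch runs reach the over-trained regime.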

In your own models, the situation may be more complex.

The weightwatcher metrics work best when applied to SOTA models because this is when the layer weight matrices are best correlated, and the Power Law fits work the best. It takes some work to design experiments on small models that can flush out these features. So we choose to use the batch size to induce this effect. But let me encourage you to try other approaches.

The key to using the HTSR theory is to carefully control the training so that when you adjust some other knob (e.g., Dropout, momentum, weight decay), the training and test error change smoothly and systematically. If, however, the training accuracy or loss is unstable, and you are jumping all over the loss landscape, then HTSR theory is more difficult to apply. So, here, I follow the KISS mantra: “Keep It Super Simple!”

Reproducibility

To compare 2 or more models trained with different batch sizes, we need to ensure they have all been trained from the exact same initial conditions. To do this, we have to both set all the random seeds to a fixed value and tell the framework (here, Keras) to use deterministic options. This also, nicely, makes the experiments 100% reproducible.

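%env CUBLAS_WORKSPACE_CONFIG=:4096:8

import os
import random

import numpy as np
import tensorflow as tf

def reset_random_seeds(seed_value=42):
    # Fix every source of randomness so that each run starts from the same state
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    tf.random.set_seed(seed_value)
    tf.keras.utils.set_random_seed(seed_value)
    np.random.seed(seed_value)
    random.seed(seed_value)
    # Ask TensorFlow to use deterministic op implementations
    tf.config.experimental.enable_op_determinism()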

Every time we build the model, we will first run reset_random_seeds() to ensure that every run, with different batch sizes, regularization, etc., is started from the same spot and is reproducible.

Model Size and Shape: The Three (3) Layers

This model has 3 layers: input, hidden, and output. Note that each layer is initialized in the same way (i.e., with the GlorotNormal initializer, with the same seed). Also, here, to keep it super simple, no specific regularization is applied to the model (except for the changing of the batch size).

initializer = tf.keras.initializers.GlorotNormal(seed=1)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, activation='relu', kernel_initializer=initializer),
    tf.keras.layers.Dense(100, activation='relu', kernel_initializer=initializer),
    tf.keras.layers.Dense(10, activation='softmax', kernel_initializer=initializer),
])

We can inspect the model using weightwatcher to see how the layers are labeled (layer_id), what kind of layer they are (DENSE, Conv2D, etc), and what their shapes are (N, M).

import weightwatcher as ww
watcher = ww.WeightWatcher(model=model)
watcher.describe()

WeightWatcher Description DataFrame

In this experiment, we will analyze layer 1 (the Hidden Layer) and only layer 1. This layer is a DENSE layer, which has a single weight matrix of dimension 100×300. It will have 100 eigenvalues, which is a large enough size for weightwatcher to analyze. And for this super-simple experiment, this is the only layer that is trainable; all other layers are held fixed.
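To see where the 100 eigenvalues come from, here is a small NumPy sketch. It builds a random matrix with the shape of the hidden layer's kernel and computes the spectrum of its correlation matrix. The weights are a hypothetical stand-in: plain Gaussian entries with the Glorot standard deviation, not the trained layer and not Keras's exact (truncated-normal) initializer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape of a Dense(100) kernel fed by 300 inputs (stand-in for the hidden layer)
fan_in, fan_out = 300, 100
std = np.sqrt(2.0 / (fan_in + fan_out))   # Glorot-style scale (assumption)
W = rng.normal(0.0, std, size=(fan_in, fan_out))

# Layer correlation matrix: 100 x 100, so it has 100 eigenvalues
X = W.T @ W
eigenvalues = np.linalg.eigvalsh(X)

# The same spectrum can be obtained from the squared singular values of W
singular_values = np.linalg.svd(W, compute_uv=False)

print(X.shape, eigenvalues.shape)
print(np.allclose(np.sort(eigenvalues), np.sort(singular_values ** 2)))
```

The number of eigenvalues is set by the smaller dimension of the weight matrix, which is why this layer gives a spectrum of size 100.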

Training the model (with different batch sizes)

Again, we will train the same model, with the exact same initial conditions, in a deterministic way, while changing the batch size. For each fully trained model, we then compute the weightwatcher Power-Law capacity metric alpha ($\alpha$). We will then compare the layer 1 alpha ($\alpha$) to the model test accuracy for each run.

Notice first that, when decreasing the batch size, both the training accuracy and the test accuracy improve smoothly and systematically, and then drop off suddenly. For example, below, see that the test accuracy increases from 89.0% at batch size 32 to 89.4% at batch size 4, and then drops off suddenly to 88.5% at batch size 2. (The training accuracy behaves in a similar way when decreasing the batch size, as can be seen in the notebook.)

Likewise, the training loss is varying smoothly, and the optimizer is not jumping all over the energy landscape. This indicates a clean experiment, amenable to analysis.

Training and test losses for a sample run training the 3-layer MLP

(Notice that we apply early stopping to the training loss, not the validation loss. That is because, in this experiment, we are trying to drive the model to a state of over-training by reducing the batch size, and going past the perhaps more common early stopping criterion on the validation loss. Also, since we are changing the batch size, we want to ensure each model runs for enough epochs so that the runs can be compared to each other.)

The WeightWatcher Layer Capacity Metric Alpha ($\alpha$)

To compute the weightwatcher metrics, at the end of every training cycle, just run

results = watcher.analyze(layers=[1])

The watcher.analyze() method will generate a pandas dataframe, with layer by layer metrics.

What does alpha mean? Alpha ($\alpha$) is a measure of how Heavy-Tailed the layer is. It can be found, crudely, by simply plotting a histogram of the eigenvalues of the layer correlation matrix, X=np.dot(W.T,W), on a log-log scale, and calculating the slope of this plot in the tail region. Here is an example.
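As a rough illustration of what a Power-Law exponent fit involves, the sketch below estimates alpha from synthetic data. It swaps the crude log-log slope for the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / sum(ln(x / x_min)), and it assumes x_min is known. This is not the weightwatcher fitting procedure; the helper names and the synthetic samples are hypothetical.

```python
import numpy as np


def sample_power_law(alpha, x_min, size, rng):
    """Draw samples with density p(x) ~ x**(-alpha) for x >= x_min."""
    u = rng.random(size)  # uniform on [0, 1)
    return x_min * (1.0 - u) ** (-1.0 / (alpha - 1.0))


def estimate_alpha(values, x_min):
    """Maximum-likelihood estimate of the power-law exponent of the tail."""
    tail = values[values >= x_min]
    return 1.0 + tail.size / np.sum(np.log(tail / x_min))


rng = np.random.default_rng(42)

# Synthetic "eigenvalues" with a known exponent, so the fit can be checked
samples = sample_power_law(alpha=3.0, x_min=1.0, size=20000, rng=rng)
alpha_hat = estimate_alpha(samples, x_min=1.0)
print(round(alpha_hat, 2))
```

On a real layer there are only about 100 eigenvalues and x_min has to be chosen as part of the fit, so the estimate is much noisier than in this synthetic check.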

The smaller alpha is, the more Heavy-Tailed the layer matrix X is, and the better the layer performs for the model. But only up to a point. If the layer is too Heavy-Tailed, where $\alpha < 2$ (for simple models), then it may be over-trained.

Results: detecting an over-trained layer

We can now plot the alpha vs the test accuracy for layer 1, and the result is quite amazing.

Notice 2 key things

as the test accuracy increases, the alpha metric decreases
as soon as the test accuracy drops (with batch size = 1), alpha drops below 2 ($\alpha < 2$)

For simple models like this 3-layer MLP, the weightwatcher approach can, remarkably, detect which layer is over-trained! No other theory can do this.
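In practice, the check described above comes down to comparing each layer's alpha against the threshold of 2. Here is a minimal sketch of that comparison; the function name flag_overtrained_layers is hypothetical, and the alpha values are made up for illustration (they are not results from this experiment).

```python
def flag_overtrained_layers(layer_alphas, threshold=2.0):
    """Return the layer ids whose alpha falls below the threshold."""
    return [
        layer_id
        for layer_id, alpha in sorted(layer_alphas.items())
        if alpha < threshold
    ]


# Hypothetical layer_id -> alpha values, for illustration only
layer_alphas = {0: 3.4, 1: 1.8, 2: 2.9}
flagged = flag_overtrained_layers(layer_alphas)
print(flagged)  # [1]
```

A flagged layer is a candidate for closer inspection, not a verdict: as noted above, the alpha < 2 signal is cleanest in simple, carefully controlled experiments.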

For more complex models, with lots of parameters varying, the situation may be more complex.

Let me encourage you to try the weightwatcher tool for yourself, and join our Slack channel to discuss this and other aspects of training large models to SOTA.

Why does alpha < 2 mean the layer may be over-trained?

The weightwatcher alpha ($\alpha$) metric is the exponent found when fitting the empirical spectral density (ESD), or a histogram of the eigenvalues, to a Power-Law distribution. Moreover, when alpha lies between 2 and an upper bound (theoretically 4; in practice, up to 6), as shown in our JMLR paper, we can use our HTSR theory to characterize the layer weight matrix as being Moderately Heavy-Tailed. See Table 1:

When a Power Law distribution is simply Moderately Heavy-Tailed, this means that, in the limit, the variance may be unbounded, but the average (or mean) value is well defined. So, for Deep Learning, this implies that the model has learned a wide variety of correlations, but, on average, the correlations are reasonably bounded and, moreover, typical. Being typical, the layer weight matrix can be used to describe the information in both the training and the test data, as long as they come from the same data distribution.

But when the alpha is very small ($\alpha < 2$), this means the layer weight matrix is Very Heavy-Tailed, and the layer weight matrix is atypical. That is, the distribution of the correlations does not have a well-defined average (or mean) value, and the individual elements of W may even themselves be unbounded (i.e., when you have a Correlation Trap). Therefore, this layer weight matrix cannot be used to describe any data except the training data.
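The statement about the mean and the variance can be checked with a short calculation. This is a sketch for an idealized, pure power-law density with no upper cutoff; it is not taken from the JMLR paper, and the HTSR boundaries themselves (2, 4, and in practice 6) come from the theory and are not derived here. Any finite matrix has finite moments, so this describes the limiting behavior only.

```latex
% Idealized power-law ESD with lower cutoff \lambda_{\min} (assumption)
\rho(\lambda) = \frac{\alpha - 1}{\lambda_{\min}}
  \left(\frac{\lambda}{\lambda_{\min}}\right)^{-\alpha},
  \qquad \lambda \ge \lambda_{\min},\ \alpha > 1

% Mean: finite only when \alpha > 2
\langle \lambda \rangle
  = \int_{\lambda_{\min}}^{\infty} \lambda\, \rho(\lambda)\, d\lambda
  = \frac{\alpha - 1}{\alpha - 2}\, \lambda_{\min}
  \quad (\alpha > 2), \qquad
  \langle \lambda \rangle = \infty \quad (\alpha \le 2)

% Second moment (and hence the variance): finite only when \alpha > 3
\langle \lambda^{2} \rangle
  = \int_{\lambda_{\min}}^{\infty} \lambda^{2}\, \rho(\lambda)\, d\lambda
  = \frac{\alpha - 1}{\alpha - 3}\, \lambda_{\min}^{2}
  \quad (\alpha > 3), \qquad
  \langle \lambda^{2} \rangle = \infty \quad (\alpha \le 3)
```

So for $\alpha$ above 2 the mean exists even where the variance may not, while for $\alpha < 2$ even the mean diverges in the limit, which is the sense in which the layer becomes atypical.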

A Correlation Trap appears when the batch size = 1

Seeing this in practice is not necessarily easy, and interpreting it is harder. As here, one may have to design a very careful experiment to flush this out. Still, we encourage you to try the tool out, try to use it to identify and resolve such problems, and please give feedback.

Final Plug

And if you need help with AI, ML, or just Data Science, please reach out. I provide strategy consulting, data science leadership, and hands-on, heads-down development. I will have availability in Q3 2022 for new projects. Reach out today. #talkToChuck #theAIguy