# Towards a new Theory of Learning: Statistical Mechanics of Deep Neural Networks

## Introduction

For the past few years, we have talked a lot about how we can understand the properties of Deep Neural Networks by examining the spectral properties of the layer weight matrices $\mathbf{W}$. Specifically, we can form the correlation matrix $\mathbf{X} =\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$,

and compute the eigenvalues $\lambda$ $\mathbf{X}{\mathbf{e}}=\lambda{\mathbf{e}}$.

By plotting the histogram of the eigenvalues (i.e the spectral density $\rho(\lambda)$), we can monitor the training process and gain insight into the implicit regularization and convergence properties of DNN. Indeed, we have identified

### 5+1 Phases of Training

Each of these phases roughly corresponds to a Universality class from Random matrix Theory (RMT). And as we shall see below, we can use RMT to develop a new theory of learning.

First, however, we note that for nearly every pretrained DNNs we have examined (and we have examined thousands of them), the phase appears to be in somewhere between Bulk-Decay and/or Heavy-Tailed .

### Heavy Tailed Implicit Regularization

Moreover, for nearly all layers in all well trained, production quality DNNs, the layer spectral density $\rho(\lambda)$ can be fit to a truncated power law, with exponents frequently lying in the Fat Tailed range [2-4], and the maximum eigenvalue no larger than say 100 $\rho(\lambda)\sim\lambda^{-\alpha},$ $\text{where}\;\; \alpha\in[2,4] ,\;\;\lambda_{max}<100$,

Most importantly, in 80-90% of the DNN architectures studied, on average, smaller exponents $\alpha$ correspond to smaller test errors.

### DNN Quality Metrics in Practice

Our empirical results suggest that the power law exponent can be used as (part of) a practical quality metric. This led us to propose the $\hat{\alpha}$ metric for DNNs: $\hat{\alpha}:=\sum\alpha_{i}\log\lambda_{i,max}$

where we compute the exponent $\alpha$ and maximum eigenvalue $\lambda_{max}$ for each layer weight matrix (and Conv2D feature maps), and then form the total DNN quality as a simple weighted average of the exponents. Amazingly, this metric correlates very well with the reported test accuracy of pretrained DNNs (such as the VGG models, the ResNet models, etc)

#### WeightWatcher

We have even built a open source, python command line tool–weightwatcher–so that other researchers can both reproduce and leverage our results

pip install weightwatcher

And we have a Slack Channel for those who want to ask questions, dig deeper, and/or contribute to the work. Email me, or ping me on LinkedIn, to join our vibrant group.

All of this leads to a very basic question:

## Why does this work ?

To answer this, we will go back to the foundations of the theory of learning, from the physics perspective, and rebuild the theory using in both our experimental observations, some older results from Theoretical Physics, and (fairly) recent results in Random Matrix Theory.

## Statistical Mechanics of Learning

Here, I am going to sketch out the ideas we are currently researching to develop a new theory of generalization for Deep Neural Networks. We have a lot of work to do, but I think we have made enough progress to present these ideas, informally, to flush out the basics.

What do we seek ? A practical theory that can be used to predict the generalization accuracy of a DNN solely by looking at the trained weight matrices, without looking at the test data.

Why ? Do you test a bridge by driving cars over it until it collapses ? Of course not! So why do we build DNNs and only rely on brute force testing ? Surely we can do better.

What is the approach ? We start with the classic Perceptron Student-Teacher model from Statistical Mechanics of the 1990s. The setup is similar, but the motivations are a bit different. We have discussed this model earlier here Remembering Generalization in DNNs. from our paper Understanding Deep Learning Requires Rethinking Generalization.

Here, let us review the mathematical setup in some detail:

### the Student-Teacher model

We start with the simple model presented in chapter 2, Engel and Van der Brock, interpreted in a modern context. See also this classic 1992 paper, Statistical Mechanics of Learning from Examples.

Here, we want to do something a little different, and use the formalism of Statistical Mechanics to both compute the average generalization error, and to interpret the global convergence properties of DNNs in light of this , giving us more insight into and to provide a new theory of Why Deep Learning Works (as proposed in 2015).

Suppose we have some trained or pretrained DNN (i.e. like VGG19). We want to compute the average or typical error that our Teacher DNN could make, just by examining the layer weight matrices. Without peeking at the data.

A Simple Layer Approach: We assume all layers are statistically independent, so that the average generalization quality $\mathcal{Q}$ (i.e. 1.0-error) is just the product of the contributions of from each layer weight matrix $\mathbf{W}_l$ . $\mathcal{Q}:=\underset{l}{\prod}\;\mathcal{Q}(\mathbf{W}_l)$

Example: The Product Norm is a Capacity (or Quality) measure for DNNs from traditional ML theory. $\mathcal{Q}\sim\underset{l}{\prod}\;\Vert\mathbf{W}_l\Vert=\Vert\mathbf{W}_{1}\Vert\Vert\mathbf{W}_{2}\Vert\cdots\Vert\mathbf{W}_{L}\Vert$

The Norm may be Frobenius Norm, the Spectral Norm, or even their ratio, the Stable Rank.

This independence assumption is probably not a great approximation but it gets us closer to a realistic theory. Indeed, even traditional ML theory recognizes this, and may use Path Norm to correct for this. For now, this will suffice.

A Log Product Norm If we take the logarithm of each side, we can write the log Quality as the sum of the layer contributions. More generally, we will express the log Quality as a weighted average of some (as yet unspecified) log norm of the weight matrix. $\log\mathcal{Q}\sim\sum\limits_{l}a_{l}\log\Vert\mathbf{W}_{l}\Vert$

#### Quality for a Single Layer: Perceptron model

We now set up the classic Student-Teacher model for a Perceptron–with a slight twist. That is, from now on, we assume our models have 1 layer, like a Perceptron.

Let’s call our trained or pretrained DNN the Teacher T. The Teacher maps data to labels. Of course, there could be many Teachers which map the same data to the same labels. For our specific purposes here, we just fix the Teacher T. We imagine that the learning process is for us to learn all possible Student Perceptrons J that also map the data to the labels, in the same way as the Teacher.

But for a pretrained model, we have no data, and we have no labels. And that’s ok. Following Engle and Van der Brock (and also Engle’s 2001 paper ), consider the following Figure, which depicts the vector space representations of T and J.

To compute the average generalization error for a given model, we write first need the Student-Teacher (ST) error function $\epsilon(R)$. This is simply the L2 loss between the 2 models, averaged over a random data set. $\epsilon(R)=\langle\frac{1}{2}[y_{T}-y_{S}]^{2}\rangle_{data}$

where $y_{S}$ and $y_{T}$ are the Student and Teacher labels, respectively.

For a Linear Perceptron, the labels are given by $y_{S}=\mathbf{S}^{T}\mathbf{x}$ $y_{T}=\mathbf{T}^{T}\mathbf{x}$

and the ST error function is simply 1 minus the ST overlap $\epsilon=1-R,\;\;R = \dfrac{1}{N}\mathbf{S}^{T}\mathbf{T}$

For the classic Boolean (Rosenblatt) Perceptron, the ST error function is with the inverse (arc cosine) of the vector dot product between S and T: $\epsilon=\frac{1}{\pi}\arccos\;R,\;\;R = \dfrac{1}{N}\mathbf{S}^{T}\mathbf{T}$

If we just use the linear error model, then the quality of the Teacher is simply 1 minus the error, or, trivially , the ST overlap $\mathcal{Q}_{T}(R)=R$

To estimate the average total generalization quality of the Teacher, we write the total Quality as an integral over all possible Student matrices S $\mathcal{Q}_{\mathbf{T}}\simeq\int d\mu(\mathbf{S})R,$

(and this will turn out to be an approximation to the Annealed Free Energy of the matrix-generalized Linear Percepton…stay tuned!)

Fixing the Teacher: That’s it. What’s so hard about this ? Normally in Statistical Mechanics, one also has to average over all possible Teachers (T) that look the same — this complicates the story immensely. What we have described this for a fixed Teacher, we will use this as a general formalism to derive the weightwatcher AlphaHat metric.

Our Proposal:  We let $\mathbf{S}, \mathbf{T}\$ be strongly correlated (NxM) real matrices, with truncated, Heavy Tailed ESDs. Specifically, we assume that we know the Teacher T weight matrices exactly, and seek all Student matrices S that have the same shape as T and the same average spectral properties as T. That is still a lot of Students.

We can think of the class of Student matrices S as all matrices that are close to T. What we really want is the best method for doing this, that has been tested experimentally. Fortunately, Hinton and coworkers have recently revisited Similarity of Neural Network Representations, and found the best matrix similarity method is

Canonical Correlation Analysis (CCA): $R=\Vert\mathbf{S}^{T}\mathbf{T}\Vert^{2}_{F}=Tr[(\mathbf{S}^{T}\mathbf{T})^{2}]$

Using CCA, we can formulate our new, Semi-Empirical Theory with the following conjectures

Conjecture 1: We can write the layer matrix contribution to the total average generalization error as an integral over all possible (random) matrices S that resemble the actual (pre-)trained weight matrices T (as given above),, such that $\mathcal{Q}_{\mathbf{T}}\simeq\int d\mu(\mathbf{S})\left(Tr[\mathbf{S}^{T}\mathbf{T}]^{2}\right),$

Conjecture 2: We can express this integral in the Annealed Approximation (AA) which lets us re-express the log Quality in exponential form $\mathcal{Q}_{\mathbf{T}}\simeq\log\int d\mu(\mathbf{S})\exp\left(Tr[\mathbf{S}^{T}\mathbf{T}]^{2}\right),$

Conjecture 3: We can evaluate this integral over an Effective Correlation Space, spanned by the tail of the ESD, and such that we can replace the integral over all Student weight matrices $d\mu(\mathbf{S})$ with an integral over all Student Correlation matrices $d\mu(\mathbf{A})$, where $\mathbf{A}=\mathbf{S}^{T}\mathbf{S}$ giving $\mathcal{Q}_{\mathbf{T}}\simeq\log\int d\mu(\mathbf{A})\exp\left(Tr[\mathbf{T}^{T}\mathbf{A}\mathbf{T}]\right),$

We explain these in more detail below

### RMT and HCIZ Integrals

These kinds of integrals traditionally appeared in Quantum Field Theory and String Theory, but also in the context of Random Matrix applied to Levy Spin Glasses, And it is this early work on Heavy-Tailed Random Matrices that has motivated our empirical work. Here, to complement and extend our studies, we lay out an (incomplete) overview of the Theory.

These integrals are called Harish Chandra–Itzykson–Zuber (HCIZ) integrals. A good introductory reference on both RMT and HCIZ integrals the recent book “A First Course in Random Matrix Theory”, although we will base our analysis here on the results of the 2008 paper by Tanaka,

First, we need to re-arrange a little of the algebra. We will call A the Student correlation matrix: $\mathbf{A} =\frac{1}{N}\mathbf{S}^{T}\mathbf{S}$

and let W, X be the original weight and correlation matrices for our pretrained DNN, as above: $\mathbf{X} =\frac{1}{N}\mathbf{W}^{T}\mathbf{W},\;\;\mathbf{X}{\mathbf{e}}=\lambda{\mathbf{e}}$,

and then expand the CCA Similarity metric as $Tr[(\mathbf{S}^{T}\mathbf{W})^{2}]=Tr[\mathbf{W}^{T}\mathbf{A}\mathbf{W}]$

We can now express the log HCIZ integral, in using Tanaka’s result, as an expectation value of all random Student correlations matrices A that resemble X. $\underset{N\rightarrow\infty}{lim}\dfrac{1}{N}\log\;\mathbb{E}_{A}\left[exp\left(\dfrac{\beta}{2}Tr[\mathbf{W}^{T}\mathbf{A}\mathbf{W}]\right)\right]=\dfrac{\beta}{2}\sum\limits_{i=1}^{M}\;G_{A}(\lambda_i)$

And this can be expressed as a sum over Generating functions $G_{A}(\lambda)$ that depends only the statistical properties of the random Student weight matrices A. Specifically $G_{A}(\lambda):=\int\limits_{0}^{\lambda}R_{A}(z)dz$

where $R_{A}(\lambda)$ is the R-Transform from RMT.

The R Transform is like an inverse Green’s function (i.e a Contour Integral), and is also a cumulant generating function. As such, we can write $R(z)$ as a series expansion $R(z)=\sum\limits_{k=1}^{\infty}c_{k}z^{k-1}$

where $c_{k}$ are Generalized Cumulants from RMT.

Now, since we expect the best Students matrices resemble the Teacher matrices, we expect the Student correlation matrix A to have similar spectral properties as our actual correlation matrices X. And this where we can use our classification of the 5+1 Phases of Training. Whatever phase X is in, we expect all the A to be in as well, and we therefore expect the R-Transform of A to have the same functional form as X.

That is, if our DNN weight matrix has a Heavy Tailed ESD $\rho_{X}(\lambda)\sim\lambda^{-\alpha}$

then we expect all of the students to likewise have a Heavy Tailed ESD, and with the same exponent (at least for now). $\rho_{A}(\lambda)\sim\lambda^{-\alpha}$

Quenched vs Annealed Averages

Formally, we just say we are averaging over all Students A. More technically, what really want to do is fix some Student matrix (i.e. say A = diagonal X), and then integrate over all possible Orthogonal transformations O of A (see 6.2.3 of Potters and Bouchaud) $\mathbb{E}\left[\log\left\langle\exp\left(\dfrac{\beta}{2}Tr[\mathbf{W}^{T}\mathbf{O}^{T}\mathbf{A}\mathbf{OW}]\right)\right\rangle_{\mathbf{O}}\right]_{\mathbf{A}}\simeq\log\mathbb{E}\left[\exp\left(\dfrac{\beta}{2}Tr[\mathbf{W}^{T}\mathbf{A}\mathbf{W}]\right)\right]_{\mathbf{A}}$

Then, we integrate over all possible A~diag(X) , which would account for fluctuations in the eigenvalues. We conceptually assume this is the same as integrating over all possible Students A, and then taking the log.

The LHS is called the Quenched Average, and the RHS is the Annealed. Technically, they are not the same, and in traditional Stat Mech theory, this makes a big difference. In fact, in the original Student-Teacher model, we would also average over all Teachers, chosen uniformly (to satisfy the spherical constraints)

Here, we are doing RMT a little differently, which may not be obvious until the end of the calculation. We do not assume a priori a model for the Student matrices. That is, instead of fixing A=diag(X), we will fit the ESD of X to a continuous (power law) distribution $\rho_{X}(\lambda)$ , and then effectively sample over all A as if we had drawn the eigenvalues of A from $\rho_{X}(\lambda)$. (In fact, I suppose we could actually do this numerically instead of doing all this fancy math–but what fun is that?).

The point is, we want to find an expression for the HCIZ integral (i.e the layer / matrix contribution to the Generalization Error) that only depends on observations of W, the weight matrix of the pretrained DNN (our Teacher network). The result only depends on the eigenvalues of X, and the R-transform of A , which is parameterized by statistical information from X.

In principle, I supposed we could measure the generalized cumulants $(c_{k})$ of X,. and assume we can plug these in for $R_{A}$. We will do something a little easier.

Let us consider 2 classes of matrices as models for X.

Gaussian (Wigner) Random Matrix: Random-Like Phase

The R-Transform for Gaussian Random matrix is well known: $R(z)=z$

Taking the integral and plugging this into the Generating function, we get $G_{A}(\lambda_{i})=\frac{1}{2}z^{2}\bigg|_{z=0}^{z=\lambda_i}=\frac{1}{2}\lambda^{2}_{i}$ $\sum\limits_{i=1}^{M}\;G_{A}(\lambda_i)=\frac{1}{2}\sum\limits_{i=1}^{M}\lambda^{2}_{i}=\frac{1}{2}Tr[\mathbf{A}^{2}]$

So when X is Random-Like , the layer / matrix contribution is like the Frobenius Norm (but squared), and thus average Generalization Error is given by a Frobenius Product Norm (squared). Thing is, this is never observed in any production model, as every single model we have examined has a Heavy-Tailed ESD. So we need something else.

Levy Random Matrix: Very Heavy Tailed Phase–but with $\alpha < 3$

Aswe have argued previously, due to finite size effects, we expect that the Very Heavy Tailed matrices appearing in DNNs will more resemble Levy Random matrices that the Random-Like Phase. So for now, we will close one eye and extend the results for $\alpha < 3$ to $\alpha\in[2,6]$.

The R-Transform for a Levy Random Matrix has been given by Burda $R(z)=z^{\alpha-1}$

Taking the integral and plugging this into the Generating function, we get $G_{A}(\lambda_{i})=\frac{1}{\alpha}z^{\alpha}\bigg|_{z=0}^{z=\lambda_i}=\frac{1}{\alpha}\lambda^{\alpha}_{i}$ $\sum\limits_{i=1}^{M}\;G_{A}(\lambda_i)=\frac{1}{\alpha}Tr[\mathbf{A}^{\alpha}]$

Towards our Heavy Tailed Quality Metric

1. Let us pull the power law exponent $\alpha$ out of the Trace, effectively ignoring cross terms in the sum over $\lambda^{\alpha}$ $Tr[\mathbf{A}^{\alpha}]\simeq Tr[\mathbf{A}]^{\alpha}$

2. We also assume we can replace the Trace of $\mathbf{A}$ with its largest eigenvalue $\lambda_{max}$ , which is actually a good approximation for very heavy tailed Levy matrices, when $\alpha\ll 2$ $Tr[\mathbf{A}]=\sum\lambda_{i}\simeq \lambda_{max}$

This gives an simple expression for the HCIZ integral expression for the layer contribution to the generalization error $\sum\limits_{i=1}^{M}\;G_{A}(\lambda_i)\simeq\lambda_{max}^{\alpha}$

Taking the logarithm of both sides, gives our expression $log\sum\limits_{i=1}^{M}\;G_{A}(\lambda_i)\simeq\;\alpha\log\lambda_{max}$

We have now derived the our Heavy Tailed Quality metric using a matrix generalization of the classic Student Teacher model, with the help of some modern Random Matrix Theory.

QED

I hope this has convince you that there is still a lot of very interesting theory to develop for AI / Deep Neural Networks. And that you will stay tuned for the published form of this work. And remember…

pip install weightwatcher

A big thanks to Michael Mahoney at UC Berkeley for collaborating with me on this work , and to Mirco Milletari’ (Microsoft), who has been extremely helpful. And to my good friend Matt Lee (Triaxiom Capital, LLC) for long discussions about theoretical physics, RMT, quant finance, etc., , and for encouraging me to publish this.

And here’s the Empirical Evidence that this actually works:

If you have gotten this far,

and would be willing to read the draft of the real theory paper, that would be great. Please just email me or ping me on Linkedin