Introduction
For the past few years, we have talked a lot about how we can understand the properties of Deep Neural Networks by examining the spectral properties of the layer weight matrices $\mathbf{W}$. Specifically, we can form the correlation matrix
$$\mathbf{X} \;=\; \frac{1}{N}\mathbf{W}^{T}\mathbf{W},$$
and compute its eigenvalues,
$$\mathbf{X}\,\mathbf{v}_{i} \;=\; \lambda_{i}\,\mathbf{v}_{i}.$$
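As a concrete illustration, here is a minimal numpy sketch (not the weightwatcher internals; the Gaussian W below is just a stand-in for a real trained layer):

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a trained (N x M) layer weight matrix W
N, M = 1000, 500
W = np.random.randn(N, M)

# Correlation matrix X = W^T W / N and its eigenvalues lambda_i
X = W.T @ W / N
evals = np.linalg.eigvalsh(X)

# The empirical spectral density (ESD): a histogram of the eigenvalues
plt.hist(evals, bins=100, density=True)
plt.xlabel("eigenvalue lambda")
plt.ylabel("spectral density rho(lambda)")
plt.show()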
By plotting the histogram of the eigenvalues (i.e. the spectral density $\rho(\lambda)$), we can monitor the training process and gain insight into the implicit regularization and convergence properties of DNNs. Indeed, we have identified
5+1 Phases of Training

Each of these phases roughly corresponds to a Universality class from Random Matrix Theory (RMT). And as we shall see below, we can use RMT to develop a new theory of learning.
First, however, we note that for nearly every pretrained DNN we have examined (and we have examined thousands of them), the layer ESDs appear to lie somewhere between the Bulk-Decay and Heavy-Tailed phases.
Heavy Tailed Implicit Regularization
Moreover, for nearly all layers in all well-trained, production-quality DNNs, the layer spectral density can be fit to a truncated power law, with exponents frequently lying in the Fat-Tailed range $[2,4]$, and the maximum eigenvalue no larger than, say, 100:
$$\rho(\lambda)\;\sim\;\lambda^{-\alpha},\qquad \alpha\in[2,4],\qquad \lambda_{max}\lesssim 100.$$
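Here is a hedged sketch of such a fit using the open-source powerlaw package (weightwatcher itself selects xmin, handles the truncation, and checks fit quality more carefully); the Cauchy-entry W is just a heavy-tailed stand-in layer:

import numpy as np
import powerlaw

# heavy-tailed stand-in for a trained layer weight matrix
W = np.random.standard_cauchy(size=(1000, 500))
X = W.T @ W / 1000
evals = np.linalg.eigvalsh(X)
evals = evals[evals > 1e-8]            # drop numerically-zero modes

fit = powerlaw.Fit(evals)              # fits rho(lambda) ~ lambda^(-alpha) above xmin
alpha = fit.power_law.alpha            # tail exponent alpha
lambda_max = evals.max()               # maximum eigenvalue
print(alpha, lambda_max)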
Most importantly, in 80-90% of the DNN architectures studied, on average, smaller exponents correspond to smaller test errors.
DNN Quality Metrics in Practice
Our empirical results suggest that the power law exponent $\alpha$ can be used as (part of) a practical quality metric. This led us to propose the AlphaHat metric $\hat{\alpha}$ for DNNs:
$$\hat{\alpha} \;=\; \frac{1}{L}\sum_{l=1}^{L}\alpha_{l}\,\log\lambda_{l}^{max},$$
where we compute the exponent $\alpha_{l}$ and maximum eigenvalue $\lambda_{l}^{max}$ for each layer weight matrix (and Conv2D feature map), and then form the total DNN quality as a simple weighted average of the exponents. Amazingly, this metric correlates very well with the reported test accuracies of pretrained DNNs (such as the VGG models, the ResNet models, etc.).
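A toy numpy sketch of this weighted average, using hypothetical per-layer values:

import numpy as np

# hypothetical per-layer fits (alpha_l, lambda_max_l) for a 4-layer model
alphas      = np.array([2.2, 3.0, 2.7, 3.4])
lambda_maxs = np.array([18.0, 9.5, 25.0, 12.0])

# AlphaHat: a simple average of alpha_l * log(lambda_max_l) over the layers
alpha_hat = np.mean(alphas * np.log(lambda_maxs))
print(alpha_hat)   # smaller values tend to track better reported test accuracy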

WeightWatcher
We have even built an open-source, Python command-line tool, weightwatcher, so that other researchers can both reproduce and leverage our results:
pip install weightwatcher
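A minimal usage sketch (options may differ between versions; see the weightwatcher README), analyzing a pretrained VGG model from torchvision:

import weightwatcher as ww
from torchvision import models

model = models.vgg19(pretrained=True)    # newer torchvision: weights="IMAGENET1K_V1"

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()              # per-layer dataframe (alpha, spectral norm, ...)
summary = watcher.get_summary(details)   # model-level averages of the quality metrics
print(summary)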
And we have a Slack Channel for those who want to ask questions, dig deeper, and/or contribute to the work. Email me, or ping me on LinkedIn, to join our vibrant group.
All of this leads to a very basic question:
Why does this work?
To answer this, we will go back to the foundations of the theory of learning, from the physics perspective, and rebuild the theory using our experimental observations, some older results from Theoretical Physics, and (fairly) recent results in Random Matrix Theory.
Statistical Mechanics of Learning
Here, I am going to sketch out the ideas we are currently researching to develop a new theory of generalization for Deep Neural Networks. We have a lot of work to do, but I think we have made enough progress to present these ideas, informally, to flesh out the basics.
What do we seek? A practical theory that can be used to predict the generalization accuracy of a DNN solely by looking at the trained weight matrices, without looking at the test data.
Why? Do you test a bridge by driving cars over it until it collapses? Of course not! So why do we build DNNs and only rely on brute-force testing? Surely we can do better.
What is the approach? We start with the classic Perceptron Student-Teacher model from the Statistical Mechanics of the 1990s. The setup is similar, but the motivations are a bit different. We discussed this model earlier, in Remembering Generalization in DNNs, and in our paper Understanding Deep Learning Requires Rethinking Generalization.
Here, let us review the mathematical setup in some detail:
the Student-Teacher model
We start with the simple model presented in chapter 2 of Engel and Van den Broeck, interpreted in a modern context. See also the classic 1992 paper, Statistical Mechanics of Learning from Examples.
Here, we want to do something a little different: use the formalism of Statistical Mechanics both to compute the average generalization error and to interpret the global convergence properties of DNNs in light of it, giving us more insight and providing a new theory of Why Deep Learning Works (as proposed in 2015).
Suppose we have some trained or pretrained DNN (e.g. VGG19). We want to compute the average or typical error that our Teacher DNN could make, just by examining the layer weight matrices. Without peeking at the data.
A Simple Layer Approach: We assume all layers are statistically independent, so that the average generalization quality (i.e. 1.0 - error) is just the product of the contributions from each layer weight matrix,
$$\mathcal{Q}_{DNN} \;\approx\; \prod_{l}\mathcal{Q}(\mathbf{W}_{l}).$$
Example: The Product Norm is a Capacity (or Quality) measure for DNNs from traditional ML theory,
$$\mathcal{C} \;\sim\; \prod_{l}\Vert\mathbf{W}_{l}\Vert.$$
The Norm may be the Frobenius Norm, the Spectral Norm, or even their ratio, the Stable Rank.
This independence assumption is probably not a great approximation, but it gets us closer to a realistic theory. Indeed, even traditional ML theory recognizes this, and may use the Path Norm to correct for it. For now, this will suffice.
A Log Product Norm: If we take the logarithm of each side, we can write the log Quality as the sum of the layer contributions. More generally, we will express the log Quality as a weighted average of some (as yet unspecified) log norm of the layer weight matrices, as sketched below.
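Schematically (my notation; the specific norm and the layer weights $b_{l}$ are left open at this point):
$$\log\mathcal{Q}_{DNN} \;\approx\; \sum_{l}\log\mathcal{Q}(\mathbf{W}_{l}) \;\approx\; \big\langle \log\Vert\mathbf{W}\Vert \big\rangle \;:=\; \frac{1}{L}\sum_{l} b_{l}\,\log\Vert\mathbf{W}_{l}\Vert .$$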
Quality for a Single Layer: Perceptron model
We now set up the classic Student-Teacher model for a Perceptron–with a slight twist. That is, from now on, we assume our models have 1 layer, like a Perceptron.
Let’s call our trained or pretrained DNN the Teacher T. The Teacher maps data to labels. Of course, there could be many Teachers which map the same data to the same labels. For our specific purposes here, we just fix the Teacher T. We imagine that the learning process is for us to learn all possible Student Perceptrons J that also map the data to the labels, in the same way as the Teacher.
But for a pretrained model, we have no data and no labels. And that's OK. Following Engel and Van den Broeck (and also Engel's 2001 paper), consider the following Figure, which depicts the vector space representations of T and J.

To compute the average generalization error for a given model, we first need the Student-Teacher (ST) error function $\epsilon(\mathbf{J},\mathbf{T})$. This is simply the L2 loss between the 2 models, averaged over a random data set $\{\mathbf{x}\}$,
$$\epsilon(\mathbf{J},\mathbf{T}) \;=\; \Big\langle \tfrac{1}{2}\big(\sigma_{J}-\sigma_{T}\big)^{2}\Big\rangle_{\mathbf{x}},$$
where $\sigma_{J}$ and $\sigma_{T}$ are the Student and Teacher labels, respectively.
For a Linear Perceptron, the labels are given by
$$\sigma_{J} \;=\; \tfrac{1}{\sqrt{N}}\,\mathbf{J}^{T}\mathbf{x},\qquad \sigma_{T} \;=\; \tfrac{1}{\sqrt{N}}\,\mathbf{T}^{T}\mathbf{x},$$
and the ST error function can be written, in the large-N approximation and at high Temperature, as simply 1 minus the ST overlap $R$,
$$\epsilon(\mathbf{J},\mathbf{T}) \;\simeq\; 1 - R,\qquad R \;=\; \tfrac{1}{N}\,\mathbf{J}^{T}\mathbf{T}.$$
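A quick numerical sanity check of this last step (a numpy sketch, assuming the linear labels and the spherical normalization $|\mathbf{J}|^{2}=|\mathbf{T}|^{2}=N$ used above):

import numpy as np

rng = np.random.default_rng(0)
N, n_samples = 1000, 20000

# Teacher and Student vectors on the sphere |v|^2 = N
T = rng.normal(size=N); T *= np.sqrt(N) / np.linalg.norm(T)
J = rng.normal(size=N); J *= np.sqrt(N) / np.linalg.norm(J)

# random Gaussian inputs and the corresponding linear labels
x = rng.normal(size=(n_samples, N))
sigma_T = x @ T / np.sqrt(N)
sigma_J = x @ J / np.sqrt(N)

eps_mc = 0.5 * np.mean((sigma_J - sigma_T) ** 2)   # data-averaged ST error
R = J @ T / N                                       # ST overlap
print(eps_mc, 1.0 - R)                              # these should agree closely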
The generalization error is then simply the Thermal average of this error, written as
$$\bar{\epsilon} \;=\; \frac{1}{Z}\int d\mu(\mathbf{J})\;\epsilon(\mathbf{J},\mathbf{T})\,e^{-\beta\,\epsilon(\mathbf{J},\mathbf{T})},\qquad Z \;=\; \int d\mu(\mathbf{J})\,e^{-\beta\,\epsilon(\mathbf{J},\mathbf{T})},$$
where Z is the Partition Function, and $\beta$ is the inverse-Temperature.
We call the Quality of the Teacher our approximation of the generalization accuracy. The ST Quality is simply 1 minus the generalization error, which is the Thermal average of the ST overlap,
$$\mathcal{Q}(\mathbf{T}) \;:=\; 1 - \bar{\epsilon} \;=\; \langle R \rangle_{\beta}.$$
This can also be written using a generating function, which is just 1 minus the (Annealed) Free Energy, $-\beta^{-1}\log\langle Z\rangle$.
(Although below we will actually compute the layer Quality squared, $\bar{\mathcal{Q}}^{2}$.)
Fixing the Teacher: That's it. What's so hard about this? Normally in Statistical Mechanics, one also has to average over all possible Teachers (T) that look the same; this complicates the story immensely. What we have described is for a fixed Teacher, and we will use it as a general formalism to derive the weightwatcher AlphaHat metric.
Our Proposal: We let the Teacher T and the Students be strongly correlated (N x M) real matrices, with truncated, Heavy-Tailed ESDs. Specifically, we assume that we know the Teacher T weight matrices exactly, and seek all Student matrices S that have the same shape as T and the same average spectral properties as T. That is still a lot of Students.
We can think of the class of Student matrices S as all matrices that are close to T. What we really want is the best method for measuring this closeness, one that has been tested experimentally. Fortunately, Hinton and coworkers have recently revisited the Similarity of Neural Network Representations, and found that the best matrix similarity method is
Canonical Correlation Analysis (CCA).
Using CCA, we can formulate our new, Semi-Empirical Theory with the following conjectures:
Conjecture 1: We can write the matrix-generalized Layer Quality in terms of a Thermal Average of the Overlap Squared, averaged over all possible (random) matrices S that resemble the actual (pre-)trained weight matrices T.
Let us call A the (N x N) Student correlation matrix,
$$\mathbf{A} \;:=\; \frac{1}{N}\,\mathbf{S}\mathbf{S}^{T}.$$
The Layer Quality for the Teacher is then obtained through the approximation
$$\bar{\mathcal{Q}}^{2}(\mathbf{T}) \;\approx\; \big\langle\!\big\langle\, \mathrm{Tr}[\mathbf{A}\,\mathbf{X}] \,\big\rangle\!\big\rangle,$$
where $\langle\!\langle\cdot\rangle\!\rangle$ represents the matrix generalization of the Thermal Average, and $\bar{\mathcal{Q}}^{2}$ is the Layer Quality (squared), defined in terms of the matrix-generalized overlap operator $\mathrm{Tr}[\mathbf{A}\,\mathbf{X}]$. Here $\mathbf{X}$ is the corresponding Teacher correlation matrix, formed from $\mathbf{T}$ in the same way (its nonzero eigenvalues are the same whether we use $\mathbf{W}^{T}\mathbf{W}$ or $\mathbf{W}\mathbf{W}^{T}$).
Conjecture 2: We can express this Thermal average in terms of an HCIZ integral.
Let us re-define an analogous layer quality (squared) generating function as an HCIZ integral, an integral over all random Student weight matrices S (i.e. over their correlation matrices A) that "resemble" the Teacher weight matrix T:
$$\Gamma_{\mathbf{T}} \;:=\; \frac{1}{N}\log \int d\mu(\mathbf{A})\; e^{\,N\,\mathrm{Tr}[\mathbf{A}\,\mathbf{X}]}.$$
Conjecture 3: We can evaluate this integral over an Effective Correlation Space (ECS), spanned by the Power Law tail of the ESD, subject to a volume-preserving transformation.
That is, we can replace the integral over all Student weight matrices S with an integral over all Student Correlation matrices A, where A spans only the ECS. This gives
$$\Gamma_{\mathbf{T}} \;\approx\; \frac{1}{N}\log \int_{ECS} d\mu(\mathbf{A})\; e^{\,N\,\mathrm{Tr}[\mathbf{A}\,\mathbf{X}]},$$
which we evaluate in the large-N limit, using a (slightly modified) result by Tanaka, shown below.
RMT and HCIZ Integrals
These kinds of integrals traditionally appeared in Quantum Field Theory and String Theory, but also in the context of Random Matrix Theory applied to Levy Spin Glasses. And it is this early work on Heavy-Tailed Random Matrices that motivated our empirical work. Here, to complement and extend our studies, we lay out an (incomplete) overview of the Theory.
These integrals are called Harish-Chandra–Itzykson–Zuber (HCIZ) integrals. A good introductory reference on both RMT and HCIZ integrals is the recent book "A First Course in Random Matrix Theory", although we will base our analysis here on the results of the 2008 paper by Tanaka.
We can now express the log HCIZ integral, using Tanaka's result, as an expectation value over all random Student correlation matrices A that resemble X. In the large-N approximation, we have
$$\Gamma_{\mathbf{T}} \;\approx\; \lim_{N\to\infty}\frac{1}{N}\log\,\mathbb{E}_{\mathbf{A}}\Big[e^{\,N\,\mathrm{Tr}[\mathbf{A}\,\mathbf{X}]}\Big].$$
And this can be expressed as a sum over Generating functions that depend only on the statistical properties of the random Student correlation matrices A. Specifically,
$$\Gamma_{\mathbf{T}} \;\approx\; \sum_{i} H_{\mathbf{A}}(\lambda_{i}),\qquad H_{\mathbf{A}}(\lambda) \;:=\; \int_{0}^{\lambda} R_{\mathbf{A}}(z)\,dz,$$
where $R_{\mathbf{A}}(z)$ is the R-Transform from RMT, and the $\lambda_{i}$ are the eigenvalues of $\mathbf{X}$.
The R-Transform is like an inverse Green's function (i.e. a Contour Integral), and is also a cumulant generating function. As such, we can write it as a series expansion,
$$R_{\mathbf{A}}(z) \;=\; \sum_{k\geq 1} c_{k}\, z^{k-1},$$
where the $c_{k}$ are Generalized Cumulants from RMT.
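For reference, the standard RMT relations behind that statement (nothing specific to this work): with $G_{\mathbf{A}}$ the Green's function (Stieltjes transform) of the ESD,
$$G_{\mathbf{A}}(z) \;=\; \int \frac{\rho_{\mathbf{A}}(\lambda)}{z-\lambda}\,d\lambda , \qquad R_{\mathbf{A}}(z) \;=\; G_{\mathbf{A}}^{-1}(z) \;-\; \frac{1}{z},$$
where $G_{\mathbf{A}}^{-1}$ denotes the functional inverse, and the series coefficients $c_{k}$ above are the free cumulants of $\rho_{\mathbf{A}}$.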
Now, since we expect the best Student matrices to resemble the Teacher matrices, we expect the Student correlation matrix A to have similar spectral properties to our actual correlation matrices X. And this is where we can use our classification of the 5+1 Phases of Training. Whatever phase X is in, we expect all the A to be in as well, and we therefore expect the R-Transform of A to have the same functional form as that of X.
Formal Expression for the Layer Quality (squared)
We can now evaluate the derivative of the layer quality squared generating function, in the large-N case, to obtain
$$\bar{\mathcal{Q}}^{2} \;\approx\; \sum_{i\,\in\,ECS}\int_{0}^{\lambda_{i}} R_{\mathbf{A}}(z)\,dz,$$
where the sum over eigenvalues $\lambda_{i}$ is restricted to the ECS.
Comments on HCIZ Integrals
Quenched vs Annealed Averages
Formally, we just say we are averaging over all Students A. More technically, what we really want to do is fix some Student matrix (say, A = diag(X)), and then integrate over all possible Orthogonal transformations O of A (see 6.2.3 of Potters and Bouchaud),
$$\int dO\; e^{\,N\,\mathrm{Tr}[\mathbf{X}\,O\mathbf{A}O^{T}]}.$$
Then, we integrate over all possible A ~ diag(X), which would account for fluctuations in the eigenvalues. We conceptually assume this is the same as integrating over all possible Students A, and then taking the log:
$$\mathbb{E}_{\mathbf{A}}\Big[\log\!\int dO\; e^{\,N\,\mathrm{Tr}[\mathbf{X}\,O\mathbf{A}O^{T}]}\Big] \;\approx\; \log\,\mathbb{E}_{\mathbf{A}}\Big[\int dO\; e^{\,N\,\mathrm{Tr}[\mathbf{X}\,O\mathbf{A}O^{T}]}\Big].$$
The LHS is called the Quenched average, and the RHS the Annealed. Technically, they are not the same, and in traditional Stat Mech theory this makes a big difference. In fact, in the original Student-Teacher model, we would also average over all Teachers, chosen uniformly (to satisfy the spherical constraints).
Here, we are doing the RMT a little differently, which may not be obvious until the end of the calculation. We do not assume a priori a model for the Student matrices. That is, instead of fixing A = diag(X), we will fit the ESD of X to a continuous (power law) distribution $\rho(\lambda)\sim\lambda^{-\alpha}$, and then effectively sample over all A as if we had drawn the eigenvalues of A from $\rho(\lambda)$. (In fact, I suppose we could actually do this numerically instead of doing all this fancy math, but what fun is that? See the toy sketch below.)
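For the curious, here is what that toy numerical route might look like (an assumption-laden sketch: inverse-CDF sampling of Student eigenvalues from a power law $\rho(\lambda)\sim\lambda^{-\alpha}$ truncated at $\lambda_{max}$, with hypothetical fitted values):

import numpy as np

rng = np.random.default_rng(0)
alpha, xmin, lam_max, n = 3.0, 1.0, 50.0, 400     # hypothetical fitted values

# inverse-CDF sampling from rho(lambda) ~ lambda^(-alpha) on [xmin, lam_max]
u = rng.uniform(size=n)
cdf_max = 1.0 - (xmin / lam_max) ** (alpha - 1.0)
student_evals = xmin * (1.0 - u * cdf_max) ** (-1.0 / (alpha - 1.0))

print(student_evals.min(), student_evals.max())    # all within [xmin, lam_max]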
The point is, we want to find an expression for the HCIZ integral (i.e. the layer / matrix contribution to the Generalization Error) that only depends on observations of W, the weight matrix of the pretrained DNN (our Teacher network). The result only depends on the eigenvalues of X and the R-transform of A, which is parameterized by statistical information from X.
In principle, I suppose we could measure the generalized cumulants of X, and assume we can plug these in for $R_{\mathbf{A}}(z)$. We will do something a little easier.
Let us consider 2 classes of matrices as models for X.
Models of the Teacher ESD
That is, if our DNN weight matrix has a Heavy-Tailed ESD,
$$\rho_{\mathbf{X}}(\lambda)\;\sim\;\lambda^{-\alpha},$$
then we expect all of the Students to likewise have a Heavy-Tailed ESD, with the same exponent $\alpha$ (at least for now).
Gaussian (Wigner) Random Matrix: Random-Like Phase
The R-Transform for a Gaussian (Wigner) Random matrix is well known:
$$R_{\mathbf{A}}(z) \;=\; \sigma^{2} z.$$
Taking the integral and plugging this into the Generating function (and assuming A spans the ECS, which it probably does not, BTW), we get
$$\bar{\mathcal{Q}}^{2} \;\approx\; \sum_{i}\int_{0}^{\lambda_{i}}\sigma^{2}z\,dz \;=\; \frac{\sigma^{2}}{2}\sum_{i}\lambda_{i}^{2} \;=\; \frac{\sigma^{2}}{2}\,\Vert\mathbf{X}\Vert_{F}^{2}.$$
So when X is Random-Like, the layer / matrix contribution is like the Frobenius Norm (but squared), and thus the average Generalization Error is given by a Frobenius Product Norm (squared). The thing is, this is never observed in any production model, as every single model we have examined has a Heavy-Tailed ESD. So we need something else.
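A quick numerical check of the identity used in the Random-Like case (a numpy sketch): for any symmetric X, the sum of its squared eigenvalues equals its squared Frobenius norm.

import numpy as np

W = np.random.randn(800, 400)
X = W.T @ W / 800                       # correlation matrix of a random-like layer
evals = np.linalg.eigvalsh(X)

# sum_i lambda_i^2  ==  ||X||_F^2   (up to floating-point round-off)
print(np.sum(evals ** 2), np.linalg.norm(X, 'fro') ** 2)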
Levy Random Matrix: Very Heavy Tailed Phase, but with $\alpha\in[2,4]$
As we have argued previously, due to finite-size effects, we expect that the Very Heavy-Tailed matrices appearing in DNNs will more resemble Levy Random matrices than the Random-Like Phase. So for now, we will close one eye and extend the results for $\alpha<2$ to $\alpha\in[2,4]$.
The R-Transform for a Levy Random Matrix has been given by Burda:
$$R_{\mathbf{A}}(z) \;=\; b\,z^{\alpha-1}.$$
Taking the integral and plugging this into the Generating function, we get
$$\bar{\mathcal{Q}}^{2} \;\approx\; \sum_{i\,\in\,tail}\int_{0}^{\lambda_{i}} b\,z^{\alpha-1}\,dz \;=\; \frac{b}{\alpha}\sum_{i\,\in\,tail}\lambda_{i}^{\alpha} \;=\; \frac{b}{\alpha}\,\mathrm{Tr}_{ECS}[\mathbf{A}^{\alpha}],$$
where now the matrix A and its eigenvalues (in the tail of the ESD) do indeed span, and even define, the ECS.
Towards our Heavy Tailed Quality Metric
1. Let us pull the power law exponent $\alpha$ out of the Trace, effectively ignoring cross terms in the sum over eigenvalues:
$$\mathrm{Tr}_{ECS}[\mathbf{A}^{\alpha}] \;\approx\; \big(\mathrm{Tr}_{ECS}[\mathbf{A}]\big)^{\alpha}.$$
2. We also assume we can replace the Trace of $\mathbf{A}$ with its largest eigenvalue $\lambda_{max}$, which is actually a good approximation for very heavy tailed Levy matrices, where $\lambda_{max}$ dominates the Trace:
$$\big(\mathrm{Tr}_{ECS}[\mathbf{A}]\big)^{\alpha} \;\approx\; \lambda_{max}^{\alpha}.$$
This gives a simple expression for the HCIZ integral, i.e. for the layer contribution to the generalization error,
$$\bar{\mathcal{Q}}^{2} \;\approx\; \lambda_{max}^{\alpha}\quad\text{(up to constants)}.$$
Taking the logarithm of both sides gives our expression,
$$\log\bar{\mathcal{Q}}^{2} \;\approx\; \alpha\log\lambda_{max},$$
which is exactly the per-layer AlphaHat term from above.
We have now derived our Heavy-Tailed Quality metric using a matrix generalization of the classic Student-Teacher model, with the help of some modern Random Matrix Theory.
QED
I hope this has convinced you that there is still a lot of very interesting theory to develop for AI / Deep Neural Networks, and that you will stay tuned for the published form of this work. And remember…
pip install weightwatcher
A big thanks to Michael Mahoney at UC Berkeley for collaborating with me on this work, and to Mirco Milletari' (Microsoft), who has been extremely helpful. And to my good friend Matt Lee (Triaxiom Capital, LLC) for long discussions about theoretical physics, RMT, quant finance, etc., and for encouraging me to publish this.
And here’s the Empirical Evidence that this actually works:
https://www.nature.com/articles/s41467-021-24025-8
If you have gotten this far, and would be willing to read the draft of the real theory paper, that would be great. Please just email me or ping me on LinkedIn.
