Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks. Most notably, we observe that in all pre-trained models, the layer weight matrices display near Universal power law behavior. That is, we can compute their eigenvalues, and fit the empirical spectral density (ESD) to a power law form:
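For a given weight matrix $\mathbf{W}$, we form the correlation matrix $\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$
and then compute the M eigenvalues $\lambda_{i}$ of $\mathbf{X}$.
We call the histogram of eigenvalues the Empirical Spectral Density (ESD), $\rho(\lambda)$. It can nearly always be fit to a power law $\rho(\lambda)\sim\lambda^{-\alpha}$
We call the Power Law Universal because 80-90% of the exponents lie in the range $\alpha\in[2,4]$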
For fully connected layers, we just take $\mathbf{W}$ as is. For Conv2D layers, we consider all of the 2D feature maps, i.e. the rectangular matrix slices of the 4-index weight tensor. For any large, modern, pretrained DNN, this can give a large number of eigenvalues. The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but the results are very easy to reproduce with this notebook.
As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4. Compared to the FC layers, however, the Conv2D layers do show more exponents outside this range. We will discuss the details and these results in a future paper. And while Universality is very theoretically interesting, a more practical question is
Are power law exponents correlated with better generalization accuracies ? … YES they are!
We can see this by looking at 2 or more versions of several pretrained models, available in pytorch, including
To compare these model versions, we can simply compute the average power law exponent $\langle\alpha\rangle$, averaged across all FC weight matrices and Conv2D feature maps. This is similar to considering the product norm, which has been used to test VC-like bounds for small NNs. In nearly every case, smaller $\langle\alpha\rangle$ is correlated with better test accuracy (i.e. generalization performance).
The only significant caveats are:
Predicting the test accuracy is a complicated task, and IMHO simple theories, with loose bounds, are unlikely to be useful in practice. Still, I think we are on the right track.
Let's first look at the DenseNet models
Here, we see that as Test Accuracy increases, the average power law exponent generally decreases. And this is across 4 different models.
The Inception models show similar behavior: InceptionV3 has smaller Test Accuracy than InceptionV4, and, likewise, the InceptionV3 $\langle\alpha\rangle$ is larger than that of InceptionV4.
Now consider the Resnet models, which are increasing in size and have more architectural differences between them:
Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents. The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a larger $\langle\alpha\rangle$ than FbResnet152, but they are very close. Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller $\langle\alpha\rangle$ across a wide range of architectures.
These results are easily reproduced with this notebook.
This is an amazing result !
You can think of the power law exponent $\alpha$ as a kind of information metric–the smaller $\alpha$, the more information is in this layer weight matrix.
Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better–without peeking at the test data.
In addition to DenseNet, Inception, ResNext, SqueezeNet, and the (larger) ResNet models, even more positive results are available here on ~40 more DNNs across ~10 more architectures, including MeNet, ShuffleNet, DPN, PreResNet, DenseNet, SE-Resnet, SqueezeNet, MobileNet, MobileNetV2, and FDMobileNet.
I hope it is useful to you in training your own Deep Neural Networks. And I hope to get feedback from you as to how useful this is in practice.
One broad question we can ask is:
How is information concentrated in Deep Neural Networks (DNNs)?
To get a handle on this, we can run ‘experiments’ on the pre-trained DNNs available in pyTorch.
In a previous post, we formed the Singular Value Decomposition (SVD) of the weight matrices of the linear, or fully connected (FC) layers. And we saw that nearly all the FC Layers display Power Law behavior. And, in fact, this behavior is Universal across both ImageNet and NLP models.
But this is only part of the story. Here, we ask a related question–do the weight matrices of well trained DNNs lose Rank ?
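Let's say $\mathbf{W}$ is an $N\times M$ matrix, with $N\ge M$. We can form the Singular Value Decomposition (SVD): $\mathbf{W}=\mathbf{U}\Sigma\mathbf{V}^{T}$
The Matrix Rank, or Hard Rank, is simply the number of non-zero singular values $\sigma_{i}$,
which expresses the decrease from the Full Rank M.
Notice the Full Rank M of the rectangular matrix $\mathbf{W}$ is the dimension of the square correlation matrix $\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$.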
In python, this can be computed using
rank = numpy.linalg.matrix_rank(W)
Of course, being a numerical method, we really mean the number of singular values above some tolerance …and we can get different results depending on whether we use the numpy default tolerance or the one from Numerical Recipes.
See the numpy documentation on matrix_rank for details.
Here, we will compute the rank ourselves, and use an extremely loose bound, and consider any . As we shall see, DNNs are so good at concentrating information that it will not matter
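To make the role of the tolerance concrete, here is a minimal sketch that counts singular values above a threshold and compares the result with numpy.linalg.matrix_rank. The helper name hard_rank, the matrix sizes, the seed, and the loose tolerance of 10.0 are all assumptions made up for illustration; they are not the values used for the pretrained models. The default tolerance in the helper is the one the numpy documentation gives for matrix_rank.

```python
import numpy as np

def hard_rank(W, tol=None):
    """Count singular values above tol (hypothetical helper)."""
    svals = np.linalg.svd(W, compute_uv=False)
    if tol is None:
        # default tolerance documented for numpy.linalg.matrix_rank
        tol = svals.max() * max(W.shape) * np.finfo(W.dtype).eps
    return int(np.sum(svals > tol))

rng = np.random.default_rng(0)
N, M, r = 200, 50, 10
# rank-deficient matrix: an (N x r) times (r x M) product has rank r
W_low = rng.normal(size=(N, r)) @ rng.normal(size=(r, M))
# a generic Gaussian matrix is full rank (rank M)
W_full = rng.normal(size=(N, M))

print(hard_rank(W_low), np.linalg.matrix_rank(W_low))
print(hard_rank(W_full), np.linalg.matrix_rank(W_full))
# a very loose tolerance treats the smaller singular values as zero
print(hard_rank(W_full, tol=10.0))
```

The point of the last line is only that the answer depends on the threshold: the same full rank matrix reports a lower rank once the tolerance is raised above its smallest singular values.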
If all the singular values are non-zero, we say $\mathbf{W}$ is Full Rank. If one or more $\sigma_{i}\sim 0$, then we say $\mathbf{W}$ is Singular. It has lost expressiveness, and the model has undergone Rank Collapse.
When a model undergoes Rank Collapse, it traditionally needs to be regularized. Say we are solving a simple linear system of equations / linear regression $\mathbf{W}\mathbf{x}=\mathbf{y}$
The simple solution is to use a little linear algebra to get the optimal values for the unknown $\mathbf{x}=(\mathbf{W}^{T}\mathbf{W})^{-1}\mathbf{W}^{T}\mathbf{y}$
But when $\mathbf{W}$ is Singular, we can not form the matrix inverse. To fix this, we simply add some small constant $\gamma$ to the diagonal of $\mathbf{W}^{T}\mathbf{W}$, giving $\mathbf{W}^{T}\mathbf{W}+\gamma\mathbf{I}$
So that all the singular values will now be greater than zero, and we can form a generalized pseudo-inverse, called the Moore-Penrose Inverse $(\mathbf{W}^{T}\mathbf{W}+\gamma\mathbf{I})^{-1}\mathbf{W}^{T}$
This procedure is also called Tikhonov Regularization. The constant $\gamma$, or Regularizer, sets the Noise Scale for the model. The information in $\mathbf{W}$ is concentrated in the singular vectors associated with the larger singular values, and the noise is left over in those associated with the smaller singular values.
In cases where is Singular, regularization is absolutely necessary. But even when it is not singular, Regularization can be useful in traditional machine learning. (Indeed, VC theory tells us that Regularization is a first class concept)
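Here is a small numerical sketch of the regularized inverse described above. It builds a Singular matrix by duplicating a column, and then solves the Tikhonov-regularized normal equations. The names W, y, and gamma, the sizes, the noise level, and the value gamma = 1e-3 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 100, 5
W = rng.normal(size=(N, M))
W[:, 4] = W[:, 3]            # duplicate a column, so W is Singular
y = W @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.01 * rng.normal(size=N)

print("rank of W:", np.linalg.matrix_rank(W))   # less than M

WtW = W.T @ W
gamma = 1e-3                 # assumed small regularization constant
# Tikhonov: add gamma to the diagonal, then solve the linear system
x_reg = np.linalg.solve(WtW + gamma * np.eye(M), W.T @ y)

# every eigenvalue of the regularized matrix is at least (about) gamma
evals_reg = np.linalg.eigvalsh(WtW + gamma * np.eye(M))
print("smallest regularized eigenvalue:", evals_reg.min())
print("residual norm:", np.linalg.norm(W @ x_reg - y))
print("weights on the duplicated columns:", x_reg[3], x_reg[4])
```

Because the two duplicated columns are indistinguishable, the regularized solution splits their weight evenly between them, which is one way to see that the regularizer is picking a particular solution out of the many that fit.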
But we know that Understanding deep learning requires rethinking generalization. Which leads to the question:
Do the weight matrices of well trained DNNs undergo Rank Collapse ?
Answer: They DO NOT — as we now see:
We can easily examine the numerous pre-trained models available in PyTorch. We simply need to get the layer weight matrices and compute the SVD. We then compute the minimum singular value and compute a histogram of the minimums across different models.
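    import numpy as np
    import torch

    # model: any pretrained torch.nn.Module
    for im, m in enumerate(model.modules()):
        if isinstance(m, torch.nn.Linear):
            W = np.array(m.weight.data.clone().cpu())
            M, N = np.min(W.shape), np.max(W.shape)
            # singular values only; skips building U and V
            svals = np.linalg.svd(W, compute_uv=False)
            minsval = np.min(svals)
            ...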
We do this here for numerous models trained on ImageNet and available in pyTorch, such as AlexNet, VGG16, VGG19, ResNet, DenseNet201, etc.– as shown in this Jupyter Notebook.
We also examine the NLP models available in AllenNLP. This is a little bit trickier; we have to install AllenNLP from source, then create an analyze.py command class, and rebuild AllenNLP. Then, to analyze, say, the AllenNLP pre-trained NER model, we run
allennlp analyze https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.04.26.tar.gz
This prints out the ranks (and other information, like power law fits), and then plots the results. The code for all this is here.
Notice that many of the AllenNLP models include Attention matrices, which can be quite large and very rectangular, as compared to the smaller (and less rectangular) weight matrices used in the ImageNet models.
Note: We restrict our analysis to rectangular layer weight matrices with an aspect ratio $Q=N/M>1$, and really larger than 1.1. This is because the Marchenko Pastur (MP) Random Matrix Theory (RMT) tells us that $\lambda_{min}>0$ only when $Q>1$. We will review this in a future blog.
For the ImageNet models, most fully connected (FC) weight matrices have a large minimum singular value $\sigma_{min}$. Only 6 of the 24 matrices looked at have $\sigma_{min}\sim 0$–and we have not carefully tested the numerical threshold–we are just eyeballing it here.
For the AllenNLP models, none of the FC matrices show any evidence of Rank Collapse. All of the singular values for every linear weight matrix are non-zero.
It is conjectured that fully optimized DNNs–those with the best generalization accuracy–will not show Rank Collapse in any of their linear weight matrices.
If you are training your own model and you see Rank Collapse, you are probably over-regularizing.
It is, in fact, very easy to induce Rank Collapse. We can do this in a Mini version of AlexNet, coded in Keras 2, and available here.
To induce rank collapse in our FC weight matrices, we can add large weight norm constraints to the FC1 linear layer, using the kernel_regularizer=… argument
... model.add(Dense(384, kernel_initializer='glorot_normal', bias_initializer=Constant(0.1), activation='relu', kernel_regularizer=l2(...))) ...
We train this smaller MiniAlexnet model on CIFAR10 for 20 epochs, save the final weight matrix, and plot a histogram of the eigenvalues of the weight correlation matrix
$\mathbf{X}=\mathbf{W}^{T}\mathbf{W}$.
We call $\rho(\lambda)$ the Empirical Spectral Density (ESD). Recall that the eigenvalues are simply the square of the singular values
$\lambda_{i}=\sigma_{i}^{2}$.
Here is what happens to $\rho(\lambda)$ when we turn up the amount of L2 Regularization from 0.0003 to 0.0005. The $\lambda_{min}$ decreases from 0.0414 to 0.008.
As we increase the weight norm constraints, the minimum eigenvalue approaches zero.
Note that adding too much regularization causes nearly all of the eigenvalues/singular values to collapse to zero–as well as the norm of the matrix.
We conjecture that DNNs have zero singular/eigenvalues because there is too much regularization on the layer.
And that…
Fully optimized Deep Neural Networks do not have Rank Collapse
We believe this is a unique property of DNNs, and related to how Regularization works in these models. We will discuss this and more in an upcoming paper
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
by Charles H. Martin (Calculation Consulting) and Michael W. Mahoney (UC Berkeley).
And presented at UC Berkeley this Monday at the Simons Institute
and see our long form paper
Please stay tuned and subscribe to this blog for more updates
Here we are just looking at the distribution to estimate the rank loss. We could be more precise.
In the numpy.linalg.matrix_rank() function, “By default, we identify singular values less than S.max() * max(M.shape) * eps
as indicating rank deficiency” when using SVD.
But there is some ambiguity here as well, since there is a different default from Numerical Recipes. I will leave it up to the reader to select the best rank loss metric and explore further. And I would be very interested in your findings.
We have computed the minimum singular value for all the Conv2D layers in the ImageNet models deployed with pyTorch. This covers nearly 7500 layers across ~40 different models.
Very generously, we can say there is rank collapse with . Only 10%-13% of the layers show any form of rank collapse, using this simple heuristic, as easily seen on a log histogram.
In a previous post, we saw that the Fully Connected (FC) layers of the most common pre-trained Deep Learning models display power law behavior. Specifically, for each FC weight matrix $\mathbf{W}$, we compute the eigenvalues $\lambda_{i}$ of the correlation matrix $\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$
For every FC matrix, the eigenvalue frequencies, or Empirical Spectral Density (ESD), can be fit to a power law $\rho(\lambda)\sim\lambda^{-\alpha}$
where the exponents $\alpha$ all lie in the range 2-4
Remarkably, the FC matrices all lie within the Universality Class of Fat Tailed Random Matrices!
We define a random matrix by defining a matrix $\mathbf{W}$ of size $N\times M$, and drawing the matrix elements $W_{ij}$ from a random distribution. We can choose a Gaussian distribution
or a heavy (fat) tailed, power law distribution.
In either case, Random Matrix Theory tells us what the asymptotic form of ESD should look like. But first, let’s see what model works best.
First, let's look at the ESD for AlexNet for layer FC3, and zoomed in:
Recall that AlexNet FC3 fits a power law with exponent $\alpha$, so we also plot the ESD on a log-log scale
Notice that the distribution is linear in the central region, and the long tail cuts off sharply. This is typical of the ESDs for the fully connected (FC) layers of all the pretrained models we have looked at so far. We now ask…
What kind of Random Matrix would make a good model for this ESD ?
We first generate a few Gaussian Random matrices (mean 0, variance 1), for different aspect ratios Q, and plot the histogram of their eigenvalues.
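    import numpy as np
    import matplotlib.pyplot as plt

    N, M = 1000, 500
    Q = N / M
    # W is N x M, so that X below is M x M
    W = np.random.normal(0, 1, size=(N, M))
    X = (1/N)*np.dot(W.T, W)
    # X is symmetric, so eigvalsh returns real eigenvalues
    evals = np.linalg.eigvalsh(X)
    plt.hist(evals, bins=100, density=True)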
Notice that the shape of the ESD depends only on Q, and is tightly bounded; there is, in fact, effectively no tail at all to the distributions (except, perhaps, misleadingly for Q=1)
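The tight bounds can be checked against the Marchenko Pastur prediction for the bulk edges, $\lambda_{\pm}=\sigma^{2}\left(1\pm 1/\sqrt{Q}\right)^{2}$. The sketch below is just a sanity check under assumed sizes (N=1000, M=500), unit variance, and an arbitrary seed; the variable names lam_minus and lam_plus are mine.

```python
import numpy as np

rng = np.random.default_rng(42)
N, M = 1000, 500
Q = N / M
sigma2 = 1.0                                # variance of the matrix elements

W = rng.normal(0.0, np.sqrt(sigma2), size=(N, M))
X = (1.0 / N) * W.T @ W                     # M x M correlation matrix
evals = np.linalg.eigvalsh(X)               # X is symmetric

# Marchenko Pastur bulk edges for aspect ratio Q = N / M >= 1
lam_minus = sigma2 * (1.0 - 1.0 / np.sqrt(Q)) ** 2
lam_plus = sigma2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2

print(f"MP edges:  [{lam_minus:.3f}, {lam_plus:.3f}]")
print(f"empirical: [{evals.min():.3f}, {evals.max():.3f}]")
```

For a finite matrix the empirical edges only approximate the asymptotic ones, but at this size they should already sit close to the predicted values.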
We can generate a heavy, or fat-tailed, random matrix just as easily using the numpy Pareto function
W=np.random.pareto(mu,size=(N,M))
Heavy Tailed Random matrices have very different ESDs. They have very long tails–so long, in fact, that it is better to plot them on a log log Histogram
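As a rough sketch of how such a log log histogram could be built, the snippet below bins the eigenvalues of a Pareto random matrix on logarithmically spaced bins. The sizes, the seed, the value mu = 2.0, and the choice of 50 bin edges are assumptions for illustration only. Note that the numpy pareto function draws from the Lomax (Pareto II) form of the distribution.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, mu = 1000, 500, 2.0                   # assumed sizes and tail exponent

W = rng.pareto(mu, size=(N, M))             # Lomax / Pareto II samples
X = (1.0 / N) * W.T @ W
evals = np.linalg.eigvalsh(X)
evals = evals[evals > 0]                    # guard against round-off

# logarithmically spaced bins resolve the long tail
bins = np.logspace(np.log10(evals.min()), np.log10(evals.max()), 50)
density, edges = np.histogram(evals, bins=bins, density=True)
centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers

# largest eigenvalue vs. the MP bulk edge for unit-variance Gaussian entries
mp_edge = (1.0 + 1.0 / np.sqrt(N / M)) ** 2
print("max eigenvalue:", evals.max(), " Gaussian MP edge:", mp_edge)
```

Plotting density against centers on log-log axes (for example with matplotlib's loglog) then gives the kind of histogram described here.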
Do any of these look like a plausible model for the ESDs of the weight matrices of a big DNN, like AlexNet ?
Let's overlay the ESD of fat-tailed W with the actual empirical ESD from AlexNet for layer FC3
We see a pretty good match to a Fat-tailed random matrix with .
Turns out, there is something very special about $\mu$ being in the range 2-4.
Random Matrix Theory predicts the shape of the ESD $\rho(\lambda)$, in the asymptotic limit, for several kinds of Random Matrix, called Universality Classes. The 3 different values of $\mu$ each represent a different Universality Class:
In particular, if we draw from any heavy tailed / power law distribution, the empirical (i.e. finite size) eigenvalue density is likewise a power law (PL), either globally, or at least locally.
What is more, the predicted ESDs have different, characteristic global and local shapes, for specific ranges of $\mu$. And the amazing thing is that
the ESDs of the fully connected (FC) layers of pretrained DNNs all resemble the ESDs of the Fat-Tailed Universality Classes of Random Matrix Theory
But this is a little tricky to show, because we need to show that we fit to the theoretical . We now look at the
RMT tells us that, for , the ESD takes the limiting form
, where
And this works pretty well in practice for the Heavy Tailed Universality Class, for . But for any finite matrix, as soon as , the finite size effects kick in, and we can not naively apply the infinite limit result.
RMT not only tells us about the shape of the ESD; it makes statements about the statistics of the edge and/or tails: the fluctuations in the maximum eigenvalue $\lambda_{max}$. Specifically, we have
For standard, Gaussian RMT, the $\lambda_{max}$ (near the bulk edge) is governed by the famous Tracy Widom distribution. And for , RMT is governed by the Tao Four Moment Theorem.
But for , the tail fluctuations follow Frechet statistics, and the maximum eigenvalue has Power Law finite size effects
In particular, the effects of M and Q kick in as soon as . If we underestimate , (small Q, large M), the power law will look weaker, and we will overestimate in our fits.
And, for us, this affects how we estimate from and assign the Universality Class
Here, we generate ESDs for 3 different Pareto Heavy tailed random matrices, with the fixed M (left) or N (right), but different Q. We fit each ESD to a Power Law. We then plot , as fit, to .
The red lines are predicted by Heavy Tailed RMT (MP) theory, which works well for Heavy Tailed ESDs with . For Fat Tails, with , the finite size effects are difficult to interpret. The main take-away is…
We can identify finite size matrices W that behave like the Fat Tailed Universality Class of RMT () with Power Law fits, even with exponents , ranging up to 4 (and even up to 5-6).
It is amazing that Deep Neural Networks display this Universality in their weight matrices, and this suggests some deeper reason for Why Deep Learning Works.
In statistical physics, if a system displays Power Laws, this can be evidence that it is operating near a critical point. It is known that real, spiking neurons display this behavior, called Self Organized Criticality.
It appears that Deep Neural Networks may be operating under similar principles, and in future work, we will examine this relation in more detail.
The code for this post is in this github repo on ImplicitSelfRegularization
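In pretrained, production quality DNNs, the weight matrices for the Fully Connected (FC) layers display Fat Tailed Power Law behavior.
Deep Neural Networks (DNNs) minimize the Energy function defined by their architecture. We define the layer weight matrices $\mathbf{W}_{L}$, biases $\mathbf{b}_{L}$, and activation functions $h_{L}$, giving $E_{DNN}=h_{L}(\mathbf{W}_{L}\times h_{L-1}(\mathbf{W}_{L-1}\times\cdots+\mathbf{b}_{L-1})+\mathbf{b}_{L})$
We train the DNN on a labeled data set (d,y), giving the optimization problem $\min_{\mathbf{W}_{l},\mathbf{b}_{l}}\mathcal{L}\left(\sum_{i}E_{DNN}(d_{i})-y_{i}\right)$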
We call this the Energy Landscape because the DNN optimization problem is only parameterized by the weights and biases. Of course, in any real DNN problem, we do have other adjustable parameters, such as the amount of Dropout, the learning rate, the batch size, etc. But these regularization effects simply change the global
The Energy Landscape function changes on each epoch–and we do care about how. In fact, I have argued that the Energy Landscape must form an Energy Funnel:
But here, for now, we only look at the final result. Once a DNN is trained, what is left are the weights (and biases). We can reuse the weights (and biases) of a pre-trained DNN to build new DNNs with transfer learning. And if we train a bunch of DNNs, we want to know which one is better ?
But, practically, we would really like to identify a very good DNN without peeking at the test data, since every time we peek, and retrain, we risk overtraining our DNN.
I now show we can at least start to do this by looking at the weight matrices themselves. So let us look at the weights of some pre-trained DNNs.
Pytorch comes with several pretrained models, such as AlexNet. To start, we just examine the weight matrices of the Linear / Fully Connected (FC) layers.
    import numpy as np
    import torch.nn as nn
    from torchvision import models

    # newer torchvision releases use the weights= argument instead of pretrained=True
    pretrained_model = models.alexnet(pretrained=True)
    for module in pretrained_model.modules():
        if isinstance(module, nn.Linear):
            ...
The Linear layers have the simplest weight matrices $\mathbf{W}$; they are 2-dimensional tensors, or just rectangular matrices.
Let $\mathbf{W}$ be an $N\times M$ matrix, where $N\ge M$. We can get the matrix from the pretrained model using:
    W = np.array(module.weight.data.clone().cpu())
    M, N = np.min(W.shape), np.max(W.shape)
How is information concentrated in $\mathbf{W}$? For any rectangular matrix, we can form the Singular Value Decomposition (SVD), $\mathbf{W}=\mathbf{U}\Sigma\mathbf{V}^{T}$,
which is readily computed in scikit learn. We will use the faster TruncatedSVD method, and compute the singular values $\sigma_{i}$:
    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=M-1, n_iter=7, random_state=10)
    svd.fit(W)
    svals = svd.singular_values_
(Technically, we do miss the smallest singular value doing this, but that’s ok. It won’t matter here, and we can always use the pure svd method to be exact)
We can, alternatively, form the eigenvalues of the correlation matrix $\mathbf{X}=\frac{1}{N}\mathbf{W}^{T}\mathbf{W}$
The eigenvalues are just the square of the singular values.
Notice here we normalize them by N.
evals = (1/N)*svals*svals
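As a quick check that the two routes agree, the sketch below compares the normalized squared singular values of a random matrix with the eigenvalues of its correlation matrix. The sizes and the seed are arbitrary assumptions, and a Gaussian matrix stands in for a real layer weight matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 300, 100                        # assumed sizes, N >= M
W = rng.normal(size=(N, M))

# route 1: singular values of W, squared and normalized by N
svals = np.linalg.svd(W, compute_uv=False)
evals_from_svd = np.sort((1.0 / N) * svals * svals)

# route 2: eigenvalues of the M x M correlation matrix X
X = (1.0 / N) * W.T @ W
evals_from_X = np.linalg.eigvalsh(X)   # returned in ascending order

print(np.allclose(evals_from_svd, evals_from_X))
```

In practice the SVD route is usually preferable, since it avoids forming X explicitly and so does not square the condition number.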
We now form the Empirical Spectral Density (ESD), which is, formally $\rho(\lambda)=\frac{1}{M}\sum_{i=1}^{M}\delta(\lambda-\lambda_{i})$
This notation just means compute a histogram of the eigenvalues
    import matplotlib.pyplot as plt

    plt.hist(evals, bins=100, density=True)
We could also compute the spectral density using a Kernel Density Estimator (KDE); we save this for a future post.
We now look at the ESD of
Here, we examine just FC3, the last Linear layer, connecting the model to the labels. The other linear layers, FC1 and FC2, look similar. Below is a histogram for the ESD. Notice it is very sharply peaked and has a long tail, extending out past 40.
We can get a better view of the heavy tailed behavior by zooming in.
The red curve is a fit of the ESD to the Marchenko Pastur (MP) Random Matrix Theory (RMT) result; it is not a very good fit. This means the ESD does not resemble that of a Gaussian Random matrix. Instead, it looks heavy tailed. Which leads to the question…
(Yes do, as we shall see…)
Physicists love to claim they have discovered data that follows a power law:
But this is harder to do than it seems. And statisticians love to point this out. Don’t be fooled–we physicists knew this; Sornette’s book has a whole chapter on it. Still, we have to use best practices.
The first thing to do: plot the data on a log-log histogram, and check that this plot is linear–at least in some region. Let’s look at our ESD for AlexNet FC3:
Yes, it is linear–in the central region, for eigenvalue frequencies between roughly ~1 and ~100–and that is most of the distribution.
Why is it not linear everywhere? Because it is finite size–there are min and max cutoffs. In the infinite limit, a powerlaw diverges at $x\rightarrow 0$, and the tail extends indefinitely as $x\rightarrow\infty$. In any finite size data set, there will be an $x_{min}$ and $x_{max}$.
Second, fit the data to a power law, with $x_{min}$ and $x_{max}$ in mind. The most widely available and accepted method is the Maximum Likelihood Estimator (MLE), developed by Clauset et. al., and available in the python powerlaw package.
    import numpy as np
    import powerlaw

    fit = powerlaw.Fit(evals, xmax=np.max(evals))
    alpha, D = fit.alpha, fit.D
The D value is a quality metric, the KS distance. There are other options as well. The smaller D, the better. The table below shows typical values of good fits.
The powerlaw package also makes some great plots. Below is a log log plot generated for our fit of FC3, for the central region of the ESD. The filled lines represent our fits, and the dotted lines are the actual power law PDF (blue) and CCDF (red). The filled lines look like straight lines and overlap the dotted lines–so this fit looks pretty good.
Is this enough ? Not yet…
We still need to know, do we have enough data to get a good estimate for $\alpha$, what are our error bars, and what kind of systematic errors might we get?
We can calibrate the estimator by generating some modest size (N=1000) random power law datasets using the numpy Pareto function,
and then fitting these with the PowerLaw package. We get the following curve
The green line is a perfect estimate. The Powerlaw package overestimates small $\alpha$ and underestimates large $\alpha$. Fortunately, most of our fits lie in the good range.
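A calibration loop of this kind could be sketched as follows. To keep it self-contained, it uses the plain continuous MLE with a known $x_{min}=1$ rather than the powerlaw package, so it will not reproduce the package's systematic bias; it only shows how synthetic data with a known exponent can be generated and how a leading-order error bar, $(\hat{\alpha}-1)/\sqrt{n}$ from Clauset et. al., can be attached. The helper name mle_alpha, the seed, and the list of exponents are assumptions.

```python
import numpy as np

def mle_alpha(x, xmin):
    """Continuous power law MLE for the exponent, given xmin."""
    tail = x[x >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

rng = np.random.default_rng(11)
n = 1000                                    # modest data set size
results = {}
for alpha_true in (1.5, 2.0, 3.0, 4.0, 5.0):
    # pareto(a) + 1 is a classical Pareto with xmin = 1 and
    # density exponent alpha = a + 1
    x = rng.pareto(alpha_true - 1.0, size=n) + 1.0
    alpha_hat = mle_alpha(x, xmin=1.0)
    # leading-order standard error of the MLE
    stderr = (alpha_hat - 1.0) / np.sqrt(n)
    results[alpha_true] = (alpha_hat, stderr)
    print(f"true={alpha_true:.1f}  fit={alpha_hat:.3f} +/- {stderr:.3f}")
```

Notice that the error bar grows with the exponent, so for a fixed data set size the larger exponents are inherently harder to pin down.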
A good fit is not enough. We also should ensure that no other obvious model is a better fit. The power law package lets us test our fit against other common (long tailed) choices, namely
For example, to check if our data is better fit by a log normal distribution, we run
R, p = fit.distribution_compare('powerlaw', 'lognormal', normalized_ratio=True)
This returns the log likelihood ratio R and the p-value. If R > 0 and p <= 0.05, then we can conclude that a power law is a better model; a negative R would instead favor the log normal.
Note that sometimes the best model may be a truncated power law (TPL). This happens because our data sets are pretty small, and the tails of our ESDs fall off pretty fast. A TPL is just a power law (PL) with an exponentially decaying tail, and it may seem to be a better fit for small, fixed size data sets. Also, the TPL has 2 parameters, whereas the PL has only 1, so it is not unexpected that the TPL would fit the data better.
Below we fit the linear layers of the many pretrained models in pytorch. All fit a power law (PL) or truncated power law (TPL)
The table below lists all the fits, with the Best Fit and the KS statistic D.
Notice that all the power law exponents lie in the range 2-4, except for a couple outliers.
This is pretty fortunate since our PL estimator only works well around this range. This is also pretty remarkable because it suggests there is…
There is a deep connection between this range of exponents and some (relatively) recent results in Random Matrix Theory (RMT). Indeed, this seems to suggest that Deep Learning systems display the kind of Universality seen in Self Organized systems, like real spiking neurons. We will examine this in a future post.
Training DNNs is hard. There are many tricks one has to use. And you have to monitor the training process carefully.
Here we suggest that the ESDs of the Linear / FC layers of a well trained DNN will display power law behavior. And the exponents will be between 2 and 4.
A counter example is Inception V3. Both FC layers 226 and 302 have unusually large exponents. Looking closer at Layer 222, the ESD is not a power law at all, but rather a bi-modal heavy tailed distribution.
We conjecture that as good as Inception V3 is, perhaps it could be further optimized. It would be interesting to see if anyone can show this.
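The commonly accepted method uses the Maximum Likelihood Estimator (MLE). This starts from the normalized power law density, for $x\ge x_{min}$: $p(x|\alpha)=\frac{\alpha-1}{x_{min}}\left(\frac{x}{x_{min}}\right)^{-\alpha}$
The conditional probability is defined by evaluating $p(x|\alpha)$ on a data set (of size n): $p(\mathbf{x}|\alpha)=\prod_{i=1}^{n}\frac{\alpha-1}{x_{min}}\left(\frac{x_{i}}{x_{min}}\right)^{-\alpha}$
The log likelihood is $\mathcal{L}=\ln p(\mathbf{x}|\alpha)=\sum_{i=1}^{n}\left[\ln(\alpha-1)-\ln x_{min}-\alpha\ln\frac{x_{i}}{x_{min}}\right]$
Which is easily reduced to $\mathcal{L}=n\ln(\alpha-1)-n\ln x_{min}-\alpha\sum_{i=1}^{n}\ln\frac{x_{i}}{x_{min}}$
The maximum likelihood occurs when $\frac{\partial\mathcal{L}}{\partial\alpha}=\frac{n}{\alpha-1}-\sum_{i=1}^{n}\ln\frac{x_{i}}{x_{min}}=0$
The MLE estimate, for a given $x_{min}$, is: $\hat{\alpha}=1+n\left[\sum_{i=1}^{n}\ln\frac{x_{i}}{x_{min}}\right]^{-1}$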
We can either input $x_{min}$, or search for the best possible estimate (as explained in the Clauset et. al. paper).
Notice, however, that this estimator does not explicitly take $x_{max}$ into account. And for this reason it seems to be very limited in its application. A better method, such as developed recently by Thurner (but not yet coded in python), may prove more robust for a larger range of exponents.
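To illustrate the search for the lower cutoff, here is a simplified sketch in the spirit of the Clauset et. al. procedure: for each candidate $x_{min}$, fit $\hat{\alpha}=1+n\left[\sum_{i}\ln(x_{i}/x_{min})\right]^{-1}$ on the tail, compute a KS distance D between the empirical and fitted CDFs, and keep the candidate with the smallest D. This is not the powerlaw package implementation; the function name fit_power_law, its parameters min_tail and n_candidates, and the synthetic test data are all assumptions for illustration.

```python
import numpy as np

def fit_power_law(x, min_tail=500, n_candidates=50):
    """Scan candidate xmin values; return (alpha, xmin, D) with smallest D."""
    x = np.sort(np.asarray(x, dtype=float))
    # candidate xmin values, each leaving at least min_tail points in the tail
    last = len(x) - min_tail
    idx = np.linspace(0, last, n_candidates).astype(int)
    candidates = np.unique(x[idx])
    best = None
    for xmin in candidates:
        tail = x[x >= xmin]
        n = len(tail)
        alpha = 1.0 + n / np.sum(np.log(tail / xmin))
        # simplified KS distance between empirical and fitted CDFs on the tail
        model_cdf = 1.0 - (tail / xmin) ** (1.0 - alpha)
        emp_cdf = np.arange(1, n + 1) / n
        D = np.max(np.abs(emp_cdf - model_cdf))
        if best is None or D < best[2]:
            best = (alpha, xmin, D)
    return best

rng = np.random.default_rng(5)
x = rng.pareto(2.0, size=5000) + 1.0        # true alpha = 3, xmin = 1
alpha_hat, xmin_hat, D = fit_power_law(x)
print(alpha_hat, xmin_hat, D)
```

Restricting the candidates so the tail never gets too short is a design choice I made to keep the estimate stable; with very short tails the KS distance can be small by accident and the exponent becomes noisy.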
Similar results arise in the Linear and Attention layers in NLP models. Below we see the power law fits for 85 Linear layers in the 6 pretrained models from AllenNLP
80% of the linear layers have
We can naively extend these results to Conv2D layers by extracting all the Conv2D layers directly from the 4-index Tensors back, giving several rectangular 2D matrices per layer. Doing this, and repeating the fits, we find that 90% of the Conv2D matrices fit a power law with .
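One way to do this slicing is sketched below. It assumes the PyTorch Conv2d weight layout (out_channels, in_channels, kH, kW) and takes one rectangular matrix per kernel position; the helper name conv2d_slices is hypothetical, and a random array stands in for a real layer's weights.

```python
import numpy as np

def conv2d_slices(weight):
    """Split a 4-index Conv2D weight tensor into 2D matrices.

    Assumes the layout (out_channels, in_channels, kH, kW) and returns
    kH * kW matrices, each of shape (out_channels, in_channels).
    """
    out_ch, in_ch, kh, kw = weight.shape
    return [weight[:, :, i, j] for i in range(kh) for j in range(kw)]

# stand-in for a real layer's weights, converted to a numpy array
rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 32, 3, 3))

slices = conv2d_slices(weight)
all_evals = []
for Wk in slices:
    M, N = np.min(Wk.shape), np.max(Wk.shape)
    svals = np.linalg.svd(Wk, compute_uv=False)
    all_evals.append((1.0 / N) * svals * svals)   # same normalization as FC layers

print(len(slices), slices[0].shape, len(all_evals[0]))
```

Each slice can then be fit to a power law exactly as for a Linear layer, which is why a single Conv2D layer contributes several exponents.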
A few of the Conv2D layers show very high exponents. What could this mean?
It will be very interesting to dig into these results and determine if the Power Law exponents display Universality across layers, architectures, and data sets.
Even more pretrained models are now available in pyTorch…45 models and counting. I have analyzed as many as possible, giving 7500 layer weight matrices (including the Conv2D slices!)
The results are conclusive: ~80% of the exponents display Universality. An amazing find. With such great results, soon we will release an open source package so you can analyze your own models.
Code for this study, and future work, is available in this repo.
Research talks are available on my youtube channel. Please Subscribe
An early talk describing details in this paper
Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly-contradictory aspects of deep neural networks (DNNs). We apply RMT to several well known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small, toy models.
We show that the DNN training process itself implicitly implements a form of self-regularization associated with the entropy collapse / information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization
whereas large, modern deep networks display a new kind of heavy tailed self-regularization.
We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training.
Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomena unique to DNNs.
We argue that this heavy tailed self-regularization has both practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN Energy landscape / optimization problem.
Last year, at ICLR 2017, a very interesting paper came out of Google claiming that
Understanding Deep Learning Requires Rethinking Generalization
Which basically asks, why doesn’t VC theory apply to Deep Learning ?
Michael Mahoney, of UC Berkeley, and I published a response that says
Rethinking generalization requires revisiting old ideas [from] statistical mechanics …
In this post, I discuss part of our argument, looking at the basic ideas of Statistical Learning Theory (SLT) and how they actually compare to what we do in practice. In a future post, I will describe the arguments from Statistical Mechanics (Stat Mech), and why they provide a better, albeit more complicated, theory.
Our paper is a formal discussion of my Quora post on the subject
Of course, it would be nice to have the original code so the results could be reproduced–no such luck. The code is owned by Google–hence the name ‘Google paper’
And, Chiyuan Zhang, one of the authors, has kindly put together a PyTorch github repo–but it is old pyTorch and does not run. There is, however, a nice Keras package that does the trick.
In the traditional theory of supervised machine learning (i.e. VC, PAC), we try to understand what are the sufficient conditions for an algorithm to learn. To that end, we seek some mathematical guarantees that we can learn, even in the worst possible cases.
What is the worst possible scenario ? That we learn patterns in random noise, of course.
“In theory, theory and practice are the same. In practice, they are different.”
The Google paper analyzes the Rademacher complexity of a neural network. What is this and why is it relevant ? The Rademacher complexity measures how much a model fits random noise in the data. Let me provide a practical way of thinking about this:
Imagine someone gives you training examples, labeled data , and asks you to build a simple model. How can you know the data is not corrupted ?
If you can build a simple model that is pretty accurate on a hold out set, you may think you are good. Are you sure ? A very simple test is to just randomize all the labels, and retrain the model. The hold out results should be no better than random, or something is very wrong.
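The label randomization test is easy to try on a toy problem. The sketch below uses synthetic Gaussian data and a least-squares linear classifier in place of a real model; the data generator, the sizes, the mean shift of 0.2, and the helper names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)
n_train, n_test, d = 2000, 1000, 200

def make_data(n):
    """Two Gaussian blobs with labels -1 / +1 (synthetic, for illustration)."""
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, d)) + 0.2 * y[:, None]   # class-dependent mean shift
    return X, y

def holdout_accuracy(X_tr, y_tr, X_te, y_te):
    """Fit a least-squares linear classifier and score it on the hold out set."""
    A_tr = np.hstack([X_tr, np.ones((len(X_tr), 1))])   # add a bias column
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    A_te = np.hstack([X_te, np.ones((len(X_te), 1))])
    return np.mean(np.sign(A_te @ w) == y_te)

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

acc_true = holdout_accuracy(X_tr, y_tr, X_te, y_te)
# randomize the training labels and retrain; the hold out labels stay intact
acc_rand = holdout_accuracy(X_tr, rng.permutation(y_tr), X_te, y_te)
print(f"true labels: {acc_true:.3f}   randomized labels: {acc_rand:.3f}")
```

With the true labels the hold out accuracy should be high, and with randomized labels it should fall back to roughly chance (0.5 for two balanced classes). If it does not, something is leaking.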
Our practical example is a variant of this:
By regularizing the model, we expect we can decrease the model capacity, and avoid overtraining. So what can happen ?
Notice in both cases, the difference in training error and generalization error stayed nearly fixed or at least bounded
It appears that some deep nets can fit any data set, no matter what the labels, even with standard regularization. But computational learning theories, like VC and PAC theory, suggest that there should be some regularization scheme that decreases the capacity of the model, and prevents overtraining.
So how the heck do Deep Nets generalize so incredibly well ?! This is the puzzle.
To get a handle on this, in this post, I review the salient ideas of VC/PAC like theories.
The other day, a friend asked me, “how much data do I need for my AI app ?”
My answer: “it’s complicated !”
For any learning algorithm (ALG), like an SVM, Random Forest, or Deep Net, ideally, we would like to know the number of training examples necessary to learn a good model.
ML Theorists sometimes call this number the Sample Complexity. In Statistical Mechanics, it is called the loading parameter (or just the load).
Statistical learning theory (SLT), i.e. VC theory, provides a way to get the sample complexity. Moreover, it also tells us how bad a machine learning model might be, in the worst possible case. It states that the difference between the test and training error should be characterized simply by the VC dimension, which measures the effective capacity of the model, and the number of training examples N.
Statistics is about how frequencies converge to probabilities. That is, if we sample N training examples from a distribution , we seek a guarantee that the accuracy can be made arbitrarily small () with a very high probability
We call our model a class of functions . We associate the empirical generalization error with the true model error (or risk) , which would be obtained in the proper infinite limit. And the training error with the empirical risk . Of course, ; we want to know how bad the difference could possibly be, by bounding it
Also, in our paper, we use , but here we will follow the more familiar notation.
I follow the MIT OCW course on VC theory, as well as chapter 10 of Engel and Van den Broeck, Statistical Mechanics of Learning (2001).
In most problems, machine learning or otherwise, we want to predict an outcome. Usually a good estimate is the expected value. We use concentration bounds to give us some guarantee on the accuracy. That is, how close the expected value is to the actual outcome.
SLT , however, does not try to predict an expected or typical value. Instead, it uses the concentration bounds to explain how regularization provides a control over the capacity of a model, thereby reducing the generalization error. The amazing result of SLT is that capacity control emerges as a first class concept, arising even in the most abstract formulation of the bound–no data, no distributions, and not even a specific model.
In contrast, Stat Mech seeks to predict the average, typical values for a very specific model, and for the entire learning curve. That is, all practical values of the adjustable knobs of the model, the load , the Temperature, etc., in the Thermodynamic limit. Stat Mech does not provide concentration bounds or other guarantees, but it can provide both a better qualitative understanding of the learning process, and even good quantitative results for sufficiently large systems.
(In principle, the 2 methods are not totally incompatible, as one could examine the behavior of a specific bound in something like an effective Thermodynamic limit. See Engel …, chapter 10)
For now, let’s see what the bounding theorems say:
If we are estimating a single function , which does not depend on the data, then we just need to know the number of training examples m
The Law of Large Numbers tells us then that the difference goes to zero in the limit $latex m\rightarrow\infty $ (and the Central Limit Theorem describes the size of the fluctuations along the way).
This is a direct result of Hoeffding’s inequality, which is a well known bound on the confidence of the empirical frequencies converging to a probability.
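To make this concrete, here is a small sketch of the two-sided Hoeffding bound for a single fixed function with a loss bounded in [0, 1]. The function names are mine (hypothetical), and the bounded-loss assumption is what makes this particular form of the inequality apply.

```python
import math

def hoeffding_tail(m, eps):
    """Upper bound on P(|empirical mean - true mean| > eps) for m i.i.d. samples in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * eps ** 2)

def samples_needed(eps, delta):
    """Smallest m for which the two-sided Hoeffding tail is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# How many examples until the training error of ONE fixed function
# is within 0.05 of its true error, with 99% confidence?
m = samples_needed(eps=0.05, delta=0.01)
print(m, hoeffding_tail(m, 0.05))
```

Note that nothing about the model enters here; that is exactly why this bound stops working once the function is chosen using the data.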
However, if our function choices f depend on the data, as they do in machine learning, we can not use this bound. The problem: we can overtrain. Formally, we might find a function that optimizes training error but performs very poorly on test data , causing the bound to diverge. As in the pathological case above.
So we consider a Uniform bound over all functions , meaning we consider maximum (supremum) deviation.
We also now need to know the number of adjustable parameters N. For a finite size function class, and a general (unrealizable) , and we have
.
There is a 2-sided version of this theorem also, but the salient result is that the new term appears because we need all N bounds to hold simultaneously. We consider the maximum deviation, but we are bounded by the worst possible case. And not even typical worst case scenarios, but the absolute worst–causing the bound to be very loose. Still at least the deviations are bounded–in the finite case.
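For reference, one standard textbook way to write the one-sided finite-class result is sketched below. The symbols are my assumptions about notation: $R$ is the true risk, $R_{emp}$ the empirical risk, $\epsilon$ the accuracy, $\delta$ the confidence, the loss is bounded in [0, 1], and N counts the functions in the class.

```latex
% Union bound over N functions, each obeying a one-sided Hoeffding bound
P\Big( \sup_{f \in \mathcal{F}} \big[ R(f) - R_{emp}(f) \big] > \epsilon \Big)
  \;\le\; N \, e^{-2 m \epsilon^{2}}

% Equivalently, with probability at least 1 - \delta, for every f in the class:
R(f) \;\le\; R_{emp}(f) + \sqrt{ \frac{ \ln N + \ln (1/\delta) }{ 2 m } }
```

The $\ln N$ in the numerator is the price of asking for all N bounds at once.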
What happens when the function class is infinite, as for Kernel methods and Neural Networks? How do we treat the two limits ?
In VC theory, we fix m, and define a distribution-independent measure, the VC growth function , which tells us how many ways our training data can be classified (or shattered) by functions in
,
where is the set of all ways the data can be classified by the functions
Using this, we can bound even infinite size classes , with a finite growth function.
WLOG, we only consider binary classification. We can treat more general models using the Rademacher complexity instead of the VC dimension–which is why the Google paper talked so much about it.
For any , and for any random draw of the data, we have, with probability at least
.
So the simpler the function class, the smaller the true error (Risk) should be.
In fact, for a finite , we have a tighter bound than above, because .
Note that we usually see the VC bounds in terms of the VC dimension…but VC theory provides us more than bounds; it tells us why we need regularization.
We define the VC Dimension , which is simply the number of training examples m, for which
The VC dim, and the growth function, measure the effective size of the class . By effective, we mean not just the number of functions, but the geometric projection of the class onto finite samples.
If we plot the growth function, we find it has 2 regimes:
: exponential growth
: polynomial growth
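The two regimes are easy to see numerically from the Sauer-Shelah upper bound on the growth function, the sum of binomial coefficients up to the VC dimension. This is only a sketch: `sauer_bound` is my own helper name, and d = 5 is an arbitrary choice.

```python
from math import comb, e

def sauer_bound(m, d):
    """Sauer-Shelah upper bound on the growth function: sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(min(m, d) + 1))

d = 5  # hypothetical VC dimension
for m in (3, 5, 10, 50):
    # Compare against 2^m (every labelling possible) and the polynomial (e m / d)^d.
    print(m, sauer_bound(m, d), 2 ** m, (e * m / d) ** d)
```

Up to m = d the bound equals 2^m; beyond that it falls behind 2^m and is controlled by a polynomial of degree d.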
This leads to formal bounds on infinite size function classes (by Vapnik, Sauer, etc) based on the VC dim (as we mention in our paper)
Let be a class of functions with finite VC dim . Then for all
If has VC dim , and for all , with probability at least
Since bounds the risk of not generalizing, when we have more training examples than the VC dim (), the risk grows so much slower. So to generalize better, decrease , the effective capacity of .
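To get a feel for the numbers, here is a sketch using one commonly quoted form of the VC bound on the gap between risk and empirical risk. The constants differ between textbooks, so treat the formula and the name `vc_gap` as assumptions, not as the exact expression used above.

```python
import math

def vc_gap(m, d, delta=0.05):
    """One textbook form of the VC bound on (risk - empirical risk), valid for m > d:
    sqrt((d * (ln(2m/d) + 1) + ln(4/delta)) / m)."""
    return math.sqrt((d * (math.log(2.0 * m / d) + 1.0) + math.log(4.0 / delta)) / m)

for m in (1_000, 10_000, 100_000):
    # More data shrinks the gap; more capacity (larger d) widens it.
    print(m, round(vc_gap(m, d=10), 3), round(vc_gap(m, d=100), 3))
```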
Regularization–reducing the model complexity–should lead to better generalization.
So why does regularization not seem to work as well for Deep Learning ? (At least, that is what the Google paper suggests!)
Note: Perfectly solvable, or realizable, problems may have a tighter bound, but we ignore this special case for the present discussion.
Before we dive in Statistical Mechanics, I first mention an old paper by Vapnik, Levin and LeCun, Measuring the VC Dimension of a Learning Machine (1994) .
It is well known that the VC bounds are so loose as to be of no practical use. However, it is possible to measure an effective VC dimension–for linear classifiers. Just measure the maximal difference in the error while increasing the size of the data, and fit it to a reasonable function. In fact, this effective VC dim appears to be universal in many cases. But…[paraphrasing the last paragraph of the conclusion]…
“The extension of this work to multilayer networks faces [many] difficulties..the existing learning algorithms can not be viewed as minimizing the empirical risk over the entire set of functions implementable by the network…[because] it is likely…the search will be confined to a subset of [these] functions…The capacity of this set can be much lower than the capacity of the whole set…[and] may change with the number of observations. This may require a theory that considers the notion of a non-constant capacity with an ‘active’ subset of functions”
So even Vapnik himself suspected, way back in 1994, that his own theory did not directly apply to Neural Networks!
And this is confirmed in recent work looking at the empirical capacity of RNNs.
And the recent Google paper says things are even weirder. So what can we do ?
We argue that the whole idea of looking at worst-case bounds is at odds with what we actually do in practice, because we effectively consider a different limit than just fixing m and letting N grow (or vice versa).
Very rarely would we just add more data (m) to a Deep network. Instead, we usually increase the size of the net (N) as well, because we know that we can capture more detailed features / information from the data. So, in practice, we increase m and N simultaneously.
In Statistical Mechanics, we also consider the joint limit …but with the ratio m/N fixed.
The 2 ideas are not completely incompatible, however. In fact, Engel… give a nice example of applying the VC Bounds, in a Thermodynamic Limit, to a model problem.
In contrast to VC/PAC theories, which seek to bound the worst case of a model, Statistical Mechanics tries to describe the typical behavior of a model exactly. But typical does not just mean the most probable. We require that the probability of atypical cases be made arbitrarily small, in the Thermodynamic limit.
This works because the probability distributions of relevant thermodynamic quantities, such as the most likely average Energy, become sharply peaked around their maximal values.
Many results are well known in the Statistical Mechanics of Learning. The analysis is significantly more complicated but the results lead to a much richer structure that explains many phenomena in deep learning.
In particular, it is known that many bounds from statistics become either trivial or do not apply to non-smooth probability distributions, or when the variables take on discrete values. With neural networks, non-trivial behavior arises because of discontinuities (in the activation functions), leading to phase transitions (which arise in the thermodynamic limit).
For a typical neural network, we can identify 3 phases of the system, controlled by the load parameter , the amount of training data m, relative to the number of adjustable network parameters N (and ignoring other knobs)
The view from SLT contrasts with some researchers who argue that all deep nets are simply memorizing their data, and that generalization arises simply because the training data is so large it covers nearly every possible case. This seems very naive to me, but maybe ?
Generally speaking, memorization is akin to prototype learning, where only a single example is needed to describe each class of data. This arises in certain simple text classification problems, which can then be solved using Convex NMF (see my earlier blog post).
In SLT, over-training is a completely different phase of the system, characterized by a kind of pathological non-convexity–an infinite number of (degenerate) local minima, separated by infinitely high barriers. This is the so-called (mean field) Spin-Glass phase.
SLT Overtraining / Spin Glass Phase has an infinite number of minima
So why would this correspond to random labellings of the data ? Imagine we have a binary classifier, and we randomize labels. This gives new possible labellings:
Different Randomized Labellings of the Original Training Data
We argue in our paper, that this decreases the effective load . If is very small, then will not change much, and we stay in the Generalization phase. But if is of order N (say 10%), then may decrease enough to push us into the Overtraining phase.
Each Randomized Labelling corresponds to a different Local Minima
Moreover, we now have new, possibly unsatisfiable, classification problems. So the solutions will be nearly degenerate. But many of these will be difficult to learn because many of the labels are wrong, so the solutions could have high Energy barriers, i.e. they are difficult to find, and hard to get out of.
So we postulate that by randomizing a large fraction of our labels, we push the system into the Overtraining / Spin Glass phase–and this is why traditional VC style regularization can not work–it can not bring us out of this phase. At least, that’s the theory.
In my next paper / blog post, I will describe how to examine these ideas in practice. We develop a method to detect when phase transitions arise in the training of real world neural networks. And I will show how we can observe what I postulated 3 years ago–the Spin Glass of Minimal Frustration, and how this changes the simple picture from the Spin Glass / Statistical Mechanics of Learning. Stay tuned !
It would be dishonest to present statistical mechanics as a general theory of learning that resolves all problems untreatable by VC theory. Of particular concern is the theoretical analysis of online learning.
Of course, statistical mechanics only rigorously applies in the limit . Statistics requires infinite sampling, but does provide results for finite . But even in the infinite case, there are issues.
While online learning (ala SGD) is certainly amenable to a stat mech approach, there is a fundamental problem with saddle points that arise in the Thermodynamic limit. (See section 9.8, and also chapter 13, Engle and Van den Broeck)
It has been argued that an online algorithm may get trapped in saddle points which result from symmetries in the architecture of the network, or the underlying problem itself. In these cases, the saddle points may act as fixed points along an unstable manifold, causing a type of symmetry-induced paralysis. And while any finite size system may escape such saddles, in the limit , the dynamics can get trapped in an unstable manifold, and learning fails.
This is typically addressed in the context of symmetry-induced replica symmetry breaking in multilayer perceptrons, which would need to be the topic of another post.
Computational Learning Theory (i.e. VC theory) frames a learning algorithm or model as an infinite series of growing, distribution-independent hypothesis spaces , of VC dimension , such that
This lets us consider typical case behavior, when the limit converges (the so-called self-averaging case). And we can even treat typical worst case behaviors. And since we are not limited to the overly loose bounds, the analysis matches reality much more closely.
And when we do this for a spin glass model of deep nets, we see much richer behavior in the learning curve. And we argue, this agrees more with what we see in practice
Let’s formalize this in the context of .
Let us call the class of hypotheses, or function classes, . Notationally, sometimes we see . Here, we distinguish between our abstract, theoretical hypothesis and the actual functions the algorithm sees
A hypothesis represents, say, the set of all possible weights for a neural network. Or, more completely, the set of all weights, and learning rates, and dropout rates, and even initial weight distributions. It is a function . This may be confusing because, conceptually, we will want hypotheses with random labels–but we don’t actually do this. Instead, we measure the Rademacher complexity, which is a mathematical construct to let us measure all possible label randomizations for a given hypothesis .
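As a rough illustration, the empirical Rademacher complexity of a finite set of functions can be estimated by Monte Carlo: draw random +/-1 labellings and record how well the best function in the class correlates with each one. The two function classes below (a handful of threshold classifiers, and a large pile of arbitrary labellings) are hypothetical examples I made up for the sketch.

```python
import numpy as np

def empirical_rademacher(preds, n_draws=1000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_f (1/m) sum_i sigma_i f(x_i) ].

    preds: array of shape (n_functions, m) with each function's +/-1 outputs on the sample.
    """
    rng = np.random.default_rng(seed)
    m = preds.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, m))   # random labellings of the sample
    corr = sigma @ preds.T / m                            # shape (n_draws, n_functions)
    return float(corr.max(axis=1).mean())

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=50))

# Small class: threshold classifiers sign(x - t) at 10 thresholds.
thresholds = np.linspace(0, 1, 10)
small = np.where(x[None, :] > thresholds[:, None], 1.0, -1.0)

# Rich class: 1000 functions whose outputs on the sample are arbitrary +/-1 patterns.
rich = rng.choice([-1.0, 1.0], size=(1000, 50))

r_small = empirical_rademacher(small)
r_rich = empirical_rademacher(rich)
print(r_small, r_rich)
```

The richer class fits random labels better, so its estimate is larger; that is all "capacity" means here.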
We also fix the function class , so that learning process is, conceptually, a fixed point iteration over this fixed set:
and not some infinite set. This is critical because the No Free Lunch Theorem says that the worst-case sample complexity of an infinite size model is infinite.
For a given distribution , define the expected risk of a hypothesis as the expected Loss over the actual training data
and the optimal risk as the minimum error (infimum) we encounter for all possible hypotheses–like in our practical case (2) above.
As above, we identify the training error with the data dependent expected risk, and the generalization error as the maximum risk we could ever expect, giving:
Recall we want the minimum N training examples necessary to learn. What is necessary ?
Suppose we pick a set of n labeled pairs. We define a hypothesis in by applying our ALG on this set: . And assume that we drew our set from some distribution, so
Then, for all , we seek a positive integer , such that for all
Notice that explicitly depends on the distribution, accuracy, and confidence.
As in the Central Limit Theorem, we consider the limit where the number of training examples .
But the No Free Lunch Theorem says that unless we restrict the hypothesis/function space , there always exist “bad” distributions for which the sample complexity is arbitrarily large.
Of course, in practice, we know that for very large data sets, and very high capacity models, we need to regularize our models to avoid overtraining. At least we thought we knew this.
Moreover, by restricting the complexity of , we expect to produce more uniformly consistent results. And this is what we do above.
We don’t necessarily need all the machinery of microscopic Statistical Mechanics and Spin Glasses (ala Engle…) to develop SLT-like learning bounds. There is a classic paper, Retarded Learning: Rigorous Results from Statistical Mechanics, which shows how to get similar results using a variational approach.
Here are the associated slides
If you enjoyed this presentation, let me invite you to subscribe to my YouTube channel
My graduate advisor used to say:
“If you can’t invent something new, invent your own notation”
Variational Inference is foundational to Unsupervised and Semi-Supervised Deep Learning, in particular Variational Auto Encoders (VAEs). There are many, many tutorials and implementations on Variational Inference, which I collect on my YouTube channel and below in the references. In particular, I look at modern ideas coming out of Google Deep Mind.
The thing is, Variational Inference comes in 5 or 6 different flavors, and it is a lot of work just to keep all the notation straight.
We can trace the basic idea back to Hinton and Zemel (1994)– to minimize a Helmholtz Free Energy.
What is missing is how Variational Inference is related to the Variational Free Energy from statistical physics. Or even how an RBM Free Energy is related to a Variational Free Energy.
This holiday weekend, I hope to review these methods and to clear some of this up. This is a long post filled with math and physics–enjoy !
Years ago I lived in Boca Raton, Florida to be near my uncle, who was retired and on his last legs. I was working with the famous George White, one of the Dealers of Lightning, from the famous Xerox PARC. One day George stopped by to say hi, and he found me hanging out at the local Wendy’s and reading the book generatingfunctionology. It’s a great book. And even more relevant today.
The Free Energy is a Generating function. It generates thermodynamic relations. It generates expected Energies through weight gradients . It generates the Kullback-Leibler variational bound, and its corrections, as cumulants. And, simply put, in unsupervised learning, the Free Energy generates data.
Let’s see how it all ties together.
We first review inference in RBMs, which is one of the few Deep Learning examples that is fully expressed with classical Statistical Mechanics.
Suppose we have some (unlabeled) data . We know now that we need to learn a good hidden representation () with an RBM, or, say, latent () representation with a VAE.
Before we begin, let us try to keep the notation straight. To compare different methods, I need to mix the notation a bit, and may be a little sloppy sometimes. Here, I may interchange the RBM and VAE conventions
and, WLOG, may interchange the log functions
and drop the parameters on the distributions
Also, the stat mech Physics Free Energy convention is the negative log Z,
and I sometimes use the bra-ket notation for expectation values
.
Finally I might mix up the minus signs in the early draft of this blog; please let me know.
In an RBM, we learn an Energy function , explicitly:
Inference means gradient learning, along the variational parameters , for the expected log likelihood
.
This is actually a form of Free Energy minimization. Let’s see why…
The joint probability is given by a Boltzmann distribution
.
To get , we have to integrate out the hidden variables
log likelihood = – clamped Free Energy + equilibrium Free Energy
(note the minus sign convention)
We recognize the second term as the total, or equilibrium, Free Energy from the partition function . This is just like in Statistical Mechanics (stat mech), but with . We call the first term the clamped Free Energy because it is like a Free Energy, but clamped to the data (the visible units). This gives
.
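For a tiny RBM we can check this decomposition by brute force, summing over every hidden and visible configuration. This is a sketch under assumptions: binary units, T = 1, and the standard bilinear energy with biases `a`, `b` and weights `W` (my names, chosen for the example).

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """Assumed RBM energy E(v, h) = -(a.v + b.h + v^T W h) for binary units."""
    return -(a @ v + b @ h + v @ W @ h)

def logsumexp(x):
    x = np.asarray(x)
    return float(x.max() + np.log(np.exp(x - x.max()).sum()))

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W, a, b = rng.normal(size=(n_v, n_h)), rng.normal(size=n_v), rng.normal(size=n_h)
V = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_v)]
H = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_h)]

def clamped_free_energy(v):
    # Free Energy with the visible units clamped to v: sum over hidden states only.
    return -logsumexp([-energy(v, h, W, a, b) for h in H])

# Equilibrium Free Energy: -log Z, summing over all visible AND hidden states.
equilibrium_free_energy = -logsumexp([-energy(v, h, W, a, b) for v in V for h in H])

# log likelihood = - clamped Free Energy + equilibrium Free Energy
log_p = np.array([-clamped_free_energy(v) + equilibrium_free_energy for v in V])
print(np.exp(log_p).sum())   # the probabilities of all visible states should sum to one
```

The brute-force sum over all states is exactly what becomes intractable for a real RBM.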
We see that the partition function Z is not just the normalization–it is a generating function. In statistical thermodynamics, derivatives of yield the expected energy
Since , we can associate an effective T with the norm of the weights
So if we take weight gradients of the Free Energies, we expect to get something like expected Energies. And this is exactly the result.
The gradients of the clamped Free Energy give an expectation value over the conditional
and the equilibrium Free Energy gradient yields an expectation of the joint distribution :
The derivatives do resemble expected Energies, with a unit weight matrix , which is also, effectively, . See the Appendix for the full derivations.
The clamped Free Energy is easy to evaluate numerically, but the equilibrium distribution is intractable. Hinton’s approach, Contrastive Divergence, takes a point estimate
,
where and are taken from one or more iterations of Gibbs Sampling– which is easily performed on the RBM model.
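A bare-bones sketch of one CD step (CD-1) for a binary RBM looks something like the following. The function and variable names are mine, the update uses hidden probabilities rather than sampled states in the statistics (a common but not universal choice), and only the weight gradient is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_weight_gradient(v0, W, a, b, rng):
    """One-step Contrastive Divergence estimate of d(log likelihood)/dW for a binary RBM."""
    ph0 = sigmoid(v0 @ W + b)                          # p(h=1 | v0): the clamped phase
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample the hidden units
    pv1 = sigmoid(h0 @ W.T + a)                        # one Gibbs step back to the visibles
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # <v h>_data - <v h>_reconstruction, averaged over the mini-batch
    return (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]

rng = np.random.default_rng(0)
v0 = (rng.random((8, 6)) < 0.5).astype(float)   # toy batch: 8 binary vectors, 6 visible units
W = 0.01 * rng.normal(size=(6, 4))               # 4 hidden units
a, b = np.zeros(6), np.zeros(4)

grad_W = cd1_weight_gradient(v0, W, a, b, rng)
W_new = W + 0.1 * grad_W                          # gradient ascent on the log likelihood
```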
Unsupervised learning appears to be a problem in statistical mechanics — to evaluate the equilibrium partition function. There are lots of methods here to consider, including
Not to mention the very successful Deep Learning approach, which appears to be to simply guess, and then learn deterministic fixed point equations (i.e. SegNet) , via Convolutional AutoEncoders.
Unsupervised Deep Learning today looks like an advanced graduate curriculum in non-equilibrium statistical mechanics, all coded up in tensorflow.
We would need a year or more of coursework to go through this all, but I will try to impart some flavor as to what is going on here.
VAEs are a kind of generative deep learning model–they let us model and generate fake data. There are at least 10 different popular models right now, all easily implemented (see the links) in tensorflow, keras, and Edward.
The vanilla VAE, ala Kingma and Welling, is foundational to unsupervised deep learning.
As in an RBM, in a VAE, we seek the joint probability . But we don’t want to evaluate the intractable partition function , or the equilibrium Free Energy , directly. That is, we can not evaluate , but perhaps there is some simpler, model distribution which we can sample from.
There are several starting points although, in the end, we still end up minimizing a Free Energy. Let’s look at a few:
This is an autoencoder, so we are minimizing something like a reconstruction error. What we need is a score between the empirical and model distributions, where the partition function cancels out.
.
This is called score matching (2005). It has been shown to be closely related to auto-encoders.
I bring this up because if we look at supervised Deep Nets, and even unsupervised Nets like convolutional AutoEncoders, they are minimizing some kind of Energy or Free energy, implicitly at , deterministically. There is no partition function–it seems to have just canceled out.
We can also consider just minimizing the expected negative log likelihood, under the model
.
And with some re-arrangements, we can extract out a Helmholtz-like Free Energy. It is presented nicely in the Stanford class on Deep Learning, Lecture 13 on Generative Models.
We can also start by just minimizing the KL divergence between the posteriors
although we don’t actually minimize this KL divergence directly.
In fact, there is a great paper / video on Sequential VAEs which asks–are we trying to make q model p, or p model q ? The authors note that a good VAE, like a good RBM, should not just generate good data, but should also give a good latent representation . And the reason VAEs generate fuzzy data is because we over optimize recovering the exact spatial information and don’t try hard enough to get right.
The most important paper in the field today is by Kingma and Welling, where they lay out the basics of VAEs. The video presentation is excellent also.
We form a continuous, Variational Lower Bound, which is a negative Free Energy ()
And either minimizing the divergence of the posteriors, or maximizing the marginal likelihood, we end up minimizing a (negative) Variational Helmholtz Free Energy:
Free Energy = Expected Energy – Entropy
There are numerous derivations of the bound, including the Stanford class and the original lecture by Kingma. The take-away is
Maximizing the Variational Lower Bound minimizes the Free Energy
This is, again, actually an old idea from statistical mechanics, traced back to Feynman’s book (available in Hardback on Amazon for $1300!)
We make it sound fancy by giving it a Russian name, the Gibbs-Bogoliubov relation (described nicely here). It is a finite Temperature generalization of the Rayleigh-Ritz theorem for the more familiar Hamiltonians and Hermitian matrices.
The idea is to approximate the (Helmholtz) Free Energy with a guess, model, or trial Free Energy , defined by expectations such that
is always greater than the true , and as our guess expectation gets better, our approximation improves.
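Here is a quick numerical check of the inequality on a toy system with 16 states, at T = 1. The energies are random numbers I invented for the sketch; the point is only that the trial Free Energy plus the average perturbation never dips below the true Free Energy, and that the gap is exactly the KL divergence between the trial and true Boltzmann distributions.

```python
import numpy as np

def free_energy(E):
    """F = -log sum_s exp(-E_s), with temperature T = 1."""
    return -float(np.log(np.exp(-E).sum()))

rng = np.random.default_rng(0)
H = rng.normal(size=16)                   # "true" energies of 16 states (toy numbers)
H_q = H + 0.5 * rng.normal(size=16)       # trial energies: the true ones, perturbed

q = np.exp(-H_q) / np.exp(-H_q).sum()     # trial Boltzmann distribution
p = np.exp(-H) / np.exp(-H).sum()         # true Boltzmann distribution

F_true = free_energy(H)
F_trial = free_energy(H_q) + float(q @ (H - H_q))   # F_q + <H - H_q>_q
kl_q_p = float(q @ np.log(q / p))

print(F_true, F_trial, F_trial - F_true, kl_q_p)
```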
This is also very physically intuitive and reflects our knowledge of the fluctuation theorems of non-equilibrium stat mech. It says that any small fluctuation away from equilibrium will relax back to equilibrium. In fact, this is a classic way to prove the variational bound…
and it introduces the idea of conservation of volume in phase space (i.e. the Liouville equation), which, I believe, is related to Normalizing Flows for VAEs. But that is a future post.
Stochastic gradient descent for VAEs is a deep subject; it is described in detail here
The gradient descent problem is to find the Free Energy gradient in the generative and variational parameters . The trick, however, is to specify the problem so we can bring variational gradient inside the expectation value
This is not trivial since the expected value depends on the variational parameters. For the simple Free Energy objective above, we can show that
Although we will make even further approximations to get working code.
We would like to apply BackProp to the variational lower bound; writing it in these 2 terms makes this possible. We can evaluate the first term, the reconstruction error, using mini-batch SGD sampling, whereas the KL regularizer term is evaluated analytically.
We specify a tractable distribution , where we can numerically sample the posterior to get the latent variables using either a point estimate on 1 instance , or a mini-batch estimate .
As in statistical physics, we do what we can, and take a mean field approximation. We then apply the reparameterization trick to let us apply BackProp. I review this briefly in the Appendix.
This leads to several questions which I will address in this blog:
As in Deep Learning, in almost all problems in statistical mechanics, we don’t know the actual Energy function, or Hamiltonian, . So we can’t form the Partition Function , and we can’t solve for the true Free Energy . So, instead, we solve what we can.
For a VAE, instead of trying to find the joint distribution , as in an RBM, we want the associated Energy function, also called a Hamiltonian . The unknown VAE Energy is presumably more complicated than a simple RBM quadratic function, so instead of learning it flat out, we start by guessing some simpler Energy function . More importantly, we want to avoid computing the equilibrium partition function. The key is, is something we know, something tractable.
And, as in physics, q will also be a mean field approximation, but we don’t need that here.
We decompose the total Hamiltonian into a model Hamiltonian Energy plus perturbation
Energy = Model + Perturbation
The perturbation is the difference between the true and model Energy functions, and assumed to be small in some sense. That is, we expect our initial guess to be pretty good already. Whatever that means. We have
The constant $latex \lambda\le1&bg=ffffff$ is used to formally construct a power series (cumulant expansion); it is set to 1 at the end.
Write the equilibrium Free Energy in terms of the total Hamiltonian Energy function
There are numerous expressions for the Free Energy–see the Appendix. From above, we have
and we define equilibrium averages as
Recall we can not evaluate equilibrium averages, but we can presumably evaluate model averages . Given
,
where , and dropping the indices, to write
Insert inside the log, where giving
Using the property , and the definition of , we have expressed the Free Energy as an expectation in q.
This is formally exact–but hard to evaluate even with a tractable model.
We can approximate with using a cumulant expansion, giving us both the Kullback-Leibler Variational Free Energy, and corrections giving a Perturbation Theory for Variational Inference.
Cumulants can be defined most simply by a power series of the Cumulant generating function
although it can be defined and applied more generally, and is a very powerful modeling tool.
As I warned you, I will use the bra-ket notation for expectations here, and switch to natural log
We immediately see that
the stat mech Free Energy has the form of a Cumulant generating function.
Being a generating function, the cumulants are generated by taking derivatives (as in this video), and expressed using double bra-ket notation.
The first cumulant is just the mean expected value
whereas the second cumulant is the variance–the “mean of square minus square of mean”
(yup, cumulants are so common in physics that they have their own bra-ket notation)
This is a classic perturbative approximation. It is a weak-coupling expansion for the equilibrium Free Energy, appropriate for small , and/or high Temperature. Since we always, naively, assume , it is seemingly applicable when the distribution is a good guess for
Since the log expectation is a cumulant generating function, we can express the equilibrium Free Energy as a power series, or cumulants, in the perturbation V
Setting , the first order terms combine with log Z to form the model Helmholtz, or Kullback Leibler, Free Energy
The total equilibrium Free Energy is expressed as the model Free Energy plus perturbative corrections.
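The expansion is easy to test on a toy discrete system at T = 1: take a model Energy, add a small perturbation, and compare the exact Free Energy with the first and second order cumulant approximations. All the numbers and names below are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
H_q = rng.normal(size=32)                  # model energies of 32 states (toy numbers)
V = rng.normal(size=32)                    # perturbation
q = np.exp(-H_q) / np.exp(-H_q).sum()      # model Boltzmann distribution
F_q = -float(np.log(np.exp(-H_q).sum()))   # model Free Energy

def exact_F(lam):
    """Exact Free Energy of the total Hamiltonian H_q + lam * V."""
    return -float(np.log(np.exp(-(H_q + lam * V)).sum()))

def cumulant_F(lam, order):
    """F_q + lam <V>_q - (lam^2 / 2) <<V^2>>_q, truncated at the requested order."""
    mean = float(q @ V)                     # first cumulant: the mean
    var = float(q @ V ** 2) - mean ** 2     # second cumulant: the variance
    F = F_q + lam * mean
    if order >= 2:
        F -= 0.5 * lam ** 2 * var
    return F

lam = 0.05
err1 = abs(exact_F(lam) - cumulant_F(lam, 1))
err2 = abs(exact_F(lam) - cumulant_F(lam, 2))
print(err1, err2)
```

The first order result is the variational (upper) bound; the second cumulant is the first correction to it.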
And now, for some
We now see the connection between the RBMs and VAEs, or, rather between the statistical physics formulation, with Energy and Partition functions, and the Bayesian probability formulation of VAEs.
Statistical mechanics has a very long history, over 100 years old, and there are many techniques now being lifted or rediscovered in Deep Learning, and then combined with new ideas. The post introduces the ideas being used today at Deep Mind, with some perspective from their origins, and some discussion about their utility and effectiveness brought from having seen and used these techniques in different contexts in theoretical chemistry and physics.
Of course, cumulants are not the only statistical physics tool. There are other Free Energy approximations, such as the TAP theory we used in the deterministic EMF-RBM.
Both the cumulant expansion and TAP theory are classic methods from non-equilibrium statistical physics. Neither is convex. Neither is exact. In fact, it is unclear if these expansions even converge, although they may be asymptotically convergent. The cumulants are very old, and applicable to general distributions. TAP theory is specific to spin glass theory, and can be applied to neural networks with some modifications.
The cumulants play a critical role in statistical physics and quantum chemistry because they provide a size-extensive approximation. That is, in the limit of a very large deep net (), the Energy function we learn scales linearly in N.
For example, mean field theories obey this scaling. Variational theories generally do not obey this scaling when they include correlations, but perturbative methods do.
The variational theorem is easily proven using Jensen’s inequality, as in David Blei’s notes.
In the context of spin glass theory, for those who remember this old stuff, this means that we have expressions like
which, for a given spin glass model, occurs at the boundary (i.e. the Nishimori line) of the spin glass phase. I will discuss this more in a future post.
Well, it has been a long post, which seems appropriate for Labor Day.
But there is more, in the
I will try to finish this soon; the derivation is found in Ali Ghodsi, Lec [7], Deep Learning , Restricted Boltzmann Machines (RBMs)
I think it is easier to understand the Kingma and Welling paper Auto-Encoding Variational Bayes by looking at the equations next to the Keras Blog and code. We are minimizing the Variational Free Energy, but reformulate it using the mean field approximation and the reparameterization trick.
We choose a model Q that factorizes into Gaussians
We can also use other distributions, such as
Being mean field, the VAE model Energy function $latex \mathcal{H}_{q}(\mathbf{x},\mathbf{z})&bg=ffffff$ is effectively an RBM-like quadratic Energy function, although we don’t specify it explicitly. On the other hand, the true $latex \mathcal{H}(\mathbf{x},\mathbf{z})&bg=ffffff$ is presumably more complicated.
We use a factored distribution to reexpress the KL regularizer using
We cannot backpropagate through a literal stochastic node z because we cannot form the gradient. So we just replace the innermost hidden layer with a continuous latent space, and form z by sampling from this.
We reparameterize z with explicit random values $latex \epsilon&bg=ffffff$, sampled from a Normal distribution N(0,I)
In Keras, we define z with a (Lambda) sampling function, eval’d on each batch step
and use this z in the last decoder hidden layer
Of course, this slows down execution since we have to call K.random_normal on every SGD batch.
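To make the reparameterization concrete, here is a minimal sketch in plain Python rather than Keras. The names `reparameterize`, `z_mean`, and `z_log_var` are my own assumptions for illustration, and I am assuming the encoder outputs a log-variance per latent dimension. The point is only that all the randomness sits in an explicit epsilon, so z is a deterministic function of the mean and variance that we could differentiate through.

```python
import math
import random

def reparameterize(z_mean, z_log_var, rng):
    """Return z = mean + sigma * epsilon, with epsilon ~ N(0, 1) per dimension."""
    z = []
    for mu, log_var in zip(z_mean, z_log_var):
        epsilon = rng.gauss(0.0, 1.0)      # the only stochastic step
        sigma = math.exp(0.5 * log_var)    # log-variance -> standard deviation
        z.append(mu + sigma * epsilon)     # deterministic in (mu, sigma) given epsilon
    return z

# Hypothetical 2-dimensional latent space: means (1, -2), variances (1, 0.25)
rng = random.Random(0)
z_mean, z_log_var = [1.0, -2.0], [0.0, math.log(0.25)]
samples = [reparameterize(z_mean, z_log_var, rng) for _ in range(20000)]

# The empirical mean of the samples should sit close to z_mean
emp_mean = [sum(s[j] for s in samples) / len(samples) for j in range(2)]
print(emp_mean)
```

A fresh epsilon is drawn on every call, which mirrors the per-batch sampling cost mentioned above.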
We estimate the mean and variance vectors on each mini-batch, and sample from these vectors. The KL regularizer can then be expressed analytically as
This is inserted directly into the VAE Loss function. For each mini-batch (of size L), the loss is
where the KL Divergence (kl_loss) is approximated in terms of the mini-batch estimates for the mean and variance.
In Keras, the loss looks like:
We can now apply BackProp using SGD, RMSProp, etc. to minimize the VAE Loss, resampling the random values on every mini-batch step.
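As a hedged sketch of the analytic KL term, here is the standard closed form for a factorized Gaussian against an N(0,I) prior, written in plain Python. This is not the Keras code itself: the function names are hypothetical, and I am assuming the encoder outputs a log-variance (some implementations output log-sigma instead, which changes the factors).

```python
import math

def gaussian_kl(z_mean, z_log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + log_var - mu * mu - math.exp(log_var)
        for mu, log_var in zip(z_mean, z_log_var)
    )

def vae_loss(reconstruction_loss, z_mean, z_log_var):
    """Reconstruction term plus the KL regularizer, for a single example."""
    return reconstruction_loss + gaussian_kl(z_mean, z_log_var)

# The regularizer vanishes when the encoder matches the prior exactly,
# and grows as the mean moves away from 0 or the variance away from 1.
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))
print(gaussian_kl([1.0, 0.0], [0.0, 0.0]))
print(vae_loss(10.0, [1.0, 0.0], [0.0, math.log(4.0)]))
```

In a real model the same expression would be written with tensor operations and averaged over the mini-batch.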
In machine learning, we use expected value notation, such as
but in physics and chemistry there are 5 or 6 other notations. I jotted them down here for my own sanity.
For RBMs and other discrete objects, we have
Of course, we may want the limit $latex N\rightarrow\infty&bg=ffffff$, but we have to be careful how we take this limit. Still, we may write
In the continuous case, we specify a density of states $latex \rho(E)&bg=ffffff $
which is not the same as specifying a distribution over the internal variables, giving
In quantum statistical mechanics, we replace the Energy with the Hamiltonian operator, and replace the expectation value with the Trace operation
and this is also expressed using a bra-ket notation
and usually use subscripts to represent non-equilibrium states
Raise the Posteriors
This paper, along with his wake-sleep algorithm, set the foundations for modern variational learning. They appear in his RBMs, and more recently, in Variational AutoEncoders (VAEs).
Of course, Free Energies come from Chemical Physics. And this is not surprising, since Hinton’s graduate advisor was a famous theoretical chemist.
They are so important that Karl Friston has proposed The Free Energy Principle: A Unified Brain Theory?
(see also the Wikipedia page and this 2013 review)
What are free Energies and why do we use them in Deep Learning ?
In (Unsupervised) Deep Learning, Energies are quadratic forms over the weights. In an RBM, one has
This is the T=0 configurational Energy, where each configuration is some (v,h) pair. In chemical physics, these Energies resemble an Ising model.
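For concreteness, here is a minimal sketch of an RBM-style configurational Energy in plain Python. I am assuming the usual textbook form, E(v,h) = -a·v - b·h - vᵀWh with binary visible and hidden units. The function name and the bias names `a` and `b` are my own choices for illustration, not taken from any particular library.

```python
def rbm_energy(v, h, W, a, b):
    """Configurational Energy of one (v, h) pair: -(a.v + b.h + v^T W h)."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    coupling = sum(
        v[i] * W[i][j] * h[j]
        for i in range(len(v))
        for j in range(len(h))
    )
    return -(visible_term + hidden_term + coupling)

# Hypothetical toy RBM: 2 visible units, 1 hidden unit
W = [[2.0], [3.0]]
a = [0.5, 0.5]
b = [1.0]
print(rbm_energy([1, 0], [1], W, a, b))   # only the first visible unit is on
print(rbm_energy([0, 0], [0], W, a, b))   # the all-off configuration
```

Each distinct (v,h) pair is one configuration, so a toy model like this has only 8 Energies to enumerate.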
The Free Energy is a weighted average of all the global and local minima
Note: as $latex T\rightarrow 0&bg=ffffff$, the Free Energy becomes the T=0 global Energy minimum. In the limit of zero Temperature, all the terms in the sum approach zero
and only the largest term, the largest negative Energy, survives.
We may also see F written in terms of the partition function Z:
where the brackets denote an equilibrium average, an expected value over some equilibrium probability distribution. (We don’t normalize with 1/N here; in principle, the sum could be infinite.)
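The zero Temperature limit is easy to check numerically. The sketch below assumes the standard form F = -T log Z with Z a sum of exp(-E/T) over a small list of Energies, and uses a log-sum-exp shift for numerical stability. The function name and the toy Energies are hypothetical.

```python
import math

def free_energy(energies, T=1.0):
    """F = -T * log sum_i exp(-E_i / T), computed with a stable log-sum-exp."""
    scaled = [-e / T for e in energies]
    shift = max(scaled)                       # factor out the dominant term
    log_z = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return -T * log_z

energies = [-3.0, -2.5, -1.0, 0.5]            # toy Energy minima
for T in (1.0, 0.1, 0.001):
    print(T, free_energy(energies, T))
```

At T=1 the Free Energy lies below the lowest Energy, because several minima contribute to Z. As T shrinks, only the most negative Energy survives and F approaches it.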
Of course, in deep learning, we may be trying to determine the distribution , and/or we may approximate it with some simpler distribution during inference. (From now on, I just write P and Q for convenience)
But there is more to Free Energy learning than just approximating a distribution.
In a chemical system, the Free Energy averages over all global and local minima below the Temperature T–with barriers below T as well. It is the Energy available to do work.
For convenience, Hinton explicitly set T=1. Of course, he was doing inference, and did not know the scale of the weights W. Since we don’t specify the Energy scale, we learn the scale implicitly when we learn W. We call this being scale-free.
So in the T=1, scale free case, the Free Energy implicitly averages over all Energy minima where , as we learn the weights W. Free Energies solve the problem of Neural Nets being non-convex by averaging over the global minima and nearby local minima.
Because Free Energies provide an average solution, they can even provide solutions to highly degenerate non-convex optimization problems:
They will fail, however, when the barriers between Energy basins are larger than the Temperature.
This can happen if the effective Temperature drops close to zero during inference. Since T=1 implicitly in inference, this means when the weights W are exploding.
See: Normalization in Deep Learning
Systems may also get trapped if the Energy barriers grow very large –as, say, in the glassy phase of a mean field spin glass. Or a supercooled liquid–the so-called Adam-Gibbs phenomenon. I will discuss this in a future post.
In either case, if the system, or solver, gets trapped in a single Energy basin, it may appear to be convex, and/or flat (the Hessian has lots of zeros). But this is probably not the optimal solution to learning when using a Free Energy method.
It is sometimes argued that Deep Learning is a non-convex optimization problem. And yet, it has been known for over 20 years that networks like CNNs don’t suffer from the problems of local minima. How can this be ?
At least for unsupervised methods, it has been clear since 1987 that:
An important property of the effective [Free] Energy function E(V,0,T) is that it has a smoother landscape than E(S) [T=0] …
Hence, the probability of getting stuck in a local minima decreases
Although this is not specifically how Hinton argued for the Helmholtz Free Energy — a decade later.
Why do we use Free energy methods ? Hinton used the bits-back argument:
Imagine we are encoding some training data and sending it to someone for decoding. That is, we are building an Auto-Encoder.
If we have only 1 possible encoding, we can use any vanilla encoding method and the receiver knows what to do.
But what if we have 2 or more equally valid codes ?
Can we save 1 bit by being a little vague ?
Suppose we have N possible encodings , each with Energy . We say the data has stochastic complexity.
Pick a coding with probability and send it to the receiver. The expected cost of encoding is
Now the receiver must guess which encoding we used. The decoding cost of the receiver is
where H is the Shannon Entropy of the random encoding
The decoding cost looks just like a Helmholtz Free Energy.
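Here is a small numerical sketch of the bits-back argument, assuming T=1 and measuring everything in nats. The cost of a coding distribution p is taken to be the expected Energy minus the Shannon Entropy. The toy Energies and all function names are my own assumptions for illustration.

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i in nats; terms with p_i = 0 contribute nothing."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)

def coding_cost(p, energies):
    """Expected Energy minus Entropy: a T=1 Helmholtz-like Free Energy of p."""
    expected_energy = sum(p_i * e_i for p_i, e_i in zip(p, energies))
    return expected_energy - shannon_entropy(p)

def boltzmann(energies):
    """Equilibrium distribution p_i = exp(-E_i) / Z at T=1."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]

energies = [1.0, 1.0, 3.0]          # two equally good codes and one worse code
one_code = coding_cost([1.0, 0.0, 0.0], energies)    # always send the first code
two_codes = coding_cost([0.5, 0.5, 0.0], energies)   # be "a little vague"
best = coding_cost(boltzmann(energies), energies)    # Boltzmann-weighted choice
minus_log_z = -math.log(sum(math.exp(-e) for e in energies))
print(one_code, two_codes, best, minus_log_z)
```

Choosing at random between the two equally good codes lowers the cost by log 2 nats, which is exactly 1 bit, and the Boltzmann distribution lowers it further to -log Z.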
Moreover, we can use a sub-optimal encoding, and they suggest using a Factorized (i.e. mean field) Feed Forward Net to do this.
To understand this better, we need to relate
In 1957, Jaynes formulated the MaxEnt principle which considers equilibrium thermodynamics and statistical mechanics as inference processes.
In 1995, Hinton formulated the Helmholtz Machine and showed us how to define a quasi-Free Energy.
In Thermodynamics, the Helmholtz Free Energy F(T,V,N) is an Energy that depends on Temperature instead of Entropy. We need
and F is defined as
In ML, we set T=1. Really, the Temperature equals how much the Energy changes with a change in Entropy (at fixed V and N)
Variables like E and S depend on the system size N. That is,
as
We say S and T are conjugate pairs; S is extensive, T is intensive.
(see more on this in the Appendix)
The conjugate pairs are used to define Free Energies via the Legendre Transform:
Helmholtz Free Energy: F(T) = E(S) – TS
We switch the Energy from depending on S to T, where $latex T=\partial E/\partial S&bg=ffffff$.
Why ? In a physical system, we may know the Energy function E, but we can’t directly measure or vary the Entropy S. However, we are free to change and measure the Temperature–the derivative of E w/r.t. S:
This is a powerful and general mathematical concept.
Say we have a convex function f(x,y,z), but we can’t actually vary x. But we do know the slope, w, everywhere along x
$latex w=\partial f/\partial x&bg=ffffff$.
Then we can form the Legendre Transform , which gives g(w,y,z) as
the ‘Tangent Envelope‘ of f() along x
,
.
or, simply
Note: we have converted a convex function into a concave one. The Legendre transform is concave in the intensive variables and convex in the extensive variables.
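A quick numerical sketch of this, assuming the same sign convention as F = E – TS, that is g(w) = min over x of [f(x) – w x]. I take the convex function f(x) = x² on a grid, for which the transform should come out as g(w) = –w²/4, which is concave in the slope w. The function names and the grid are hypothetical choices.

```python
def legendre_transform(f, w, xs):
    """g(w) = min_x [ f(x) - w * x ], with the minimum taken over a grid xs."""
    return min(f(x) - w * x for x in xs)

def f(x):
    """A convex function of the variable we cannot vary directly."""
    return x * x

xs = [i / 1000.0 for i in range(-5000, 5001)]   # grid on [-5, 5]

def g(w):
    """Transform of f, now a function of the slope w."""
    return legendre_transform(f, w, xs)

for w in (-2.0, 0.0, 1.0, 2.0):
    print(w, g(w), -w * w / 4.0)   # numerical value next to the closed form
```

The minimum over x is reached where the slope of f equals w, so g only ever needs the slope and never x itself.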
Of course, the true Free Energy F is convex; this is central to Thermodynamics (see Appendix). But that is because while it is concave in T, we evaluate it at constant T.
But what if the Energy function is not convex in the Entropy ? Or, suppose we extract a pseudo-Entropy from sampling some data, and we want to define a free energy potential (i.e. as in protein folding). These postulates also fail in systems like the ones in my blog post on spin chains.
Answer: Take the convex hull
When a convex Free Energy cannot readily be defined as above, we can use the generalized Legendre-Fenchel Transform, which provides a convex relaxation via
the Tangent Envelope , a convex relaxation
The Legendre-Fenchel Transform can provide a Free Energy, convexified along the direction of the internal (configurational) Entropy, allowing the Temperature to control how many local Energy minima are sampled.
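To see the convex hull appear, here is a rough grid-based sketch. It uses the usual convex-analysis convention f*(w) = max over x of [w x – f(x)], which differs by a sign from F = E – TS, and applies the transform twice to a non-convex double well. The function name, the double-well example, and the grids are all my own assumptions.

```python
def conjugate(values, grid, dual_grid):
    """Legendre-Fenchel transform on a grid: f*(w) = max_x [ w * x - f(x) ]."""
    return [
        max(w * x - fx for x, fx in zip(grid, values))
        for w in dual_grid
    ]

xs = [i / 50.0 for i in range(-100, 101)]     # x in [-2, 2]
ws = [i / 5.0 for i in range(-125, 126)]      # slopes in [-25, 25]

f_vals = [(x * x - 1.0) ** 2 for x in xs]     # non-convex double well
f_star = conjugate(f_vals, xs, ws)            # first transform: function of the slope
f_hull = conjugate(f_star, ws, xs)            # second transform: convex envelope of f

mid = xs.index(0.0)
print(f_vals[mid], f_hull[mid])               # barrier height vs. convexified value
```

The barrier between the two wells at x=0 is flattened to the common tangent, while the outer, already convex, parts of the function are left alone.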
Extra stuff I just wanted to write down…
If we assume T=1 at all times, and we assume our Deep Learning Energies are extensive–as they would be in an actual thermodynamic system–then the weight norm constraints act to enforce the size-extensivity.
as ,
if ,
and ,
then W should remain bounded to prevent the Energy E(n) from growing faster than Mn. And, of course, most Deep Learning algorithms do bound W in some form.
where C denotes a contour integral.
Check out my recent chat with Max Mautner, the Accidental Engineer
http://theaccidentalengineer.com/charles-martin-principal-consultant-calculation-consulting/