Last year, ICLR 2017, a very interesting paper came out of Google claiming that
Understanding Deep Learning Requires Rethinking Generalization
Which basically asks, why doesn’t VC theory apply to Deep Learning ?
In response, Michael Mahoney, of UC Berkeley, and I, published a response that says
Rethinking generalization requires revisiting old ideas [from] statistical mechanics …
In this post, I discuss part of our argument, looking at the basic ideas of Statistical Learning Theory (SLT) and how they actually compare to what we do in practice. In a future post, I will describe the arguments from Statistical Mechanics (Stat Mech), and why they provide a better, albeit more complicated, theory.
Of course, it would be nice to have the original code so the results could be reproduced–no such luck. The code is owned by Google–hence the name ‘Google paper’
And, Chiyuan Zhang, one of the authors, has kindly put together a PyTorch github repo–but it is old pyTorch and does not run There is, however, a nice Keras package that does the trick.
In the traditional theory of supervised machine learning (i.e. VC, PAC), we try to understand what are the sufficient conditions for an algorithm to learn. To that end, we seek some mathematical guarantees that we can learn, even in the worst possible cases.
What is the worst possible scenario ? That we learn patterns in random noise, of course.
“In theory, theory and practice are the same. In practice, they are different.”
The Google paper analyzes the Rademacher complexity of a neural network. What is this and why is it relevant ? The Rademacher complexity measures how much a model fits random noise in the data. Let me provide a practical way of thinking about this:
Imagine someone gives you training examples, labeled data , and asks you to build a simple model. How can you know the data is not corrupted ?
If you can build a simple model that is pretty accurate on a hold out set, you may think you are good. Are you sure ? A very simple test is to just randomize all the labels, and retrain the model. The hold out results should be no better than random, or something is very wrong.
Our practical example is a variant of this:
By regularizing the model, we expect we can decrease the model capacity, and avoid overtraining. So what can happen ?
Notice in both cases, the difference in training error and generalization error stayed nearly fixed or at least bounded
It appears that some deep nets can fit any data set, no matter what the labels ? Even with standard regularization. But computational learning theories, like VC and PAC theory, suggests that there should be some regularization scheme that decreases the capacity of the model, and prevent overtraining.
So how the heck do Deep Nets generalize so incredibly well ?! This is the puzzle.
To get a handle on this, in this post, I review the salient ideas of VC/PAC like theories.
the other day, a friend asked me, “how much data do I need for my AI app ?”
My answer: “it’s complicated !”
For any learning algorithm (ALG), like an SVM, Random Forest, or Deep Net, ideally, we would like to know the number of training examples necessary to learn a good model.
ML Theorists sometimes call this number the Sample Complexity. In the Statistical Mechanics, it is called the loading parameter (or just the load) .
Statistical learning theory (SLT), i.e. VC theory, provides a way to get the sample complexity. Moreover, it also tells us how bad a machine learning model might be, in the worst possible case. It states the difference between the test and training error should be characterized by simply the VC dimension .–which measures the effective capacity of the model, and the number of training examples .
.
Statistics is about how frequencies converge to probabilities. That is, if we sample N training examples from a distribution , we seek a gaurentee that the accuracy can be made arbitrarily small () with a very high probability
We call our model a class of functions . We associate the empirical generalization error with the true model error (or risk) , which would be obtained in the proper infinite limit. And the training error with the empirical risk . Of course, ; we want to know how bad the difference could possibly be, by bounding it
Also, in our paper, we use , but here we will follow the more familiar notation.
I follow the MIT OCW course on VC theory, as well as chapter 10 of Engle and Van der Broeck, Statistical Mechanics of Learning (2001).
In most problems, machine learning or otherwise, we want to predict an outcome. Usually a good estimate is the expected value. We use concentration bounds to give us some gaurentee on the accuracy. That is, how close is the expected value is to the actual outcome.
SLT , however, does not try to predict an expected or typical value. Instead, it uses the concentration bounds to explain how regularization provides a control over the capacity of a model, thereby reducing the generalization error. The amazing result of SLT is that capacity control emerges as a first class concept, arising even in the most abstract formulation of the bound–no data, no distributions, and not even a specific model.
In contrast, Stat Mech seeks to predict the average, typical values for a very specific model, and for the entire learning curve. That is, all practical values of the adjustable knobs of the model, the load , the Temperature, etc., in the Thermodynamic limit. Stat Mech does not provide concentration bounds or other guarantees, but it can provide both a better qualitative understanding of the learning process, and even good quantitative results for sufficiently large systems.
(In principle, the 2 methods are not totally incompatible, as one could examine the behavior of a specific bound in something like an effective Thermodynamic limit. See Engle …, chapter 10)
For now, let’s see what the bounding theorems say:
If we are estimating a single function , which does not depend on the data, then we just need to know the number of training examples m
The Law of Large Numbers (Central Limit Theorem) tells us then that the difference goes to zero in the limit .
This is a direct result of Hoeffding’s inequality, which is a well known bound on the confidence of the empirical frequencies converging to a probability.
However, if our function choices f depend on the data, as they do in machine learning, we can not use this bound. The problem: we can overtrain. Formally, we might find a function that optimizes training error but performs very poorly on test data , causing the bound to diverge. As in the pathological case above.
So we consider a Uniform bound over all functions , meaning we consider maximum (supremum) deviation.
We also now need to know the number of adjustable parameters N. For a finite size function class, and a general (unrealizable) , and we have
.
The is a 2-sided version of this theorem also, but the salient result is that the new term appears because we need all N bounds to hold simultaneously. We consider the maximum deviation, but we are bounded by the worst possible case. And not even typical worst case scenarios, but the absolute worst–causing the bound to be very loose. Still at least the deviations are bounded–in the finite case.
What happens when the function class is infinite? As for Kernel methods and Neural Networks. How do we treat the two limits ?
In VC theory, we fix m, and define distribution-independent measure, the VC growth function , which tells us how many ways our training data can be classified (or shattered) by functions in
,
where is the set of all ways the data can be classified by the functions
Using this, we can bound even infinite size classes , with a finite growth function.
WLOG, we only consider binary classification. We can treat more general models using the Rademacher complexity instead of the VC dimension–which is why the Google paper talked so much about it.
For any , and for any random draw of the data, we have, with probability at least
.
So the simpler the function class, the smaller the true error (Risk) should be.
In fact, for a finite , we have a tighter bound than above, because .
Note that we usually see the VC bounds in terms of the VC dimension…but VC theory provides us more than bounds; it tells us why we need regularization.
We define the VC Dimension , which is simply the number of training examples m, for which
The VC dim, and the growth function, measure the effective size of the class . By effective, we mean not just the number of functions, but the geometric projection of the class onto finite samples.
If we plot the growth function, we find it has 2 regimes:
: exponential growth
: polynomial growth
This leads to formal bounds on infinite size function classes (by Vapnik, Sauer, etc) based on the VC dim (as we mention in our paper)
Let be a class of functions with finite VC dim . Then for all
If has VC dim , and for all , with probability at least
Since bounds the risk of not generalizing, when we have more training examples than the VC dim (), the risk grows so much slower. So to generalize better, decrease , the effective capacity of .
Regularization–reducing the model complexity–should lead to better generalization.
So why does regularization not seem to work as well for Deep Learning ? (At least, that is what the Google paper suggests!)
Note: Perfectly solvable, or realizable, problems may have a tighter bound, but we ignore this special case for the present discussion.
Before we dive in Statistical Mechanics, I first mention an old paper by Vapnik, Levin and LeCun, Measuring the VC Dimension of a Learning Machine (1994) .
It is well known that the VC bounds are so loose to be of no practical use. However, it is possible to measure an effective VC dimension–for linear classifiers. Just measure the maximal difference in the error while increasing size of the data, and fit it to a reasonable function. In fact, this effective VC dim appears to be universal in many cases. But…[paraphrasing the last paragraph of the conclusion]…
“The extension of this work to multilayer networks faces [many] difficulties..the existing learning algorithms can not be viewed as minimizing the empirical risk over the entire set of functions implementable by the network…[because] it is likely…the search will be confined to a subset of [these] functions…The capacity of this set can be much lower than the capacity of the whole set…[and] may change with the number of observations. This may require a theory that considers the notion of a non-constant capacity with an ‘active’ subset of functions”
So even Vapnik himself suspected, way back in 1994, that his own theory did not directly apply to Neural Networks!
And this is confirmed in recent work looking at the empirical capacity of RNNs.
And the recent Google paper says things are even weirder. So what can we do ?
We argue that the whole idea of looking at worst-case-bounds is at odds with what we actually do in practice because we take a effectively consider a different limit than just fixing m and letting N grow (or vice versa).
Very rarely would we just add more data (m) to a Deep network. Instead, we usually increase the size of the net (N) as well, because we know that we can capture more detailed features / information from the data. So, win practice, we increase m and N simultaneously.
In Statistical Mechanics, we also consider the join limit …but with the ratio m/N fixed.
The 2 ideas are not completely incompatible, however. In fact, Engle… give nice example of applying the VC Bounds, in a Thermodynamic Limit, to a model problem.
In contrast to VC/PAC theories, which seek to bound the worst case of a model, Statistical Mechanics tries to describes the typical behavior of a model exactly. But typical does not just mean the most probable. We require that atypical cases be made arbitrarily small — in the Thermodynamic limit.
This works because the probabilities distributions of relevant thermodynamic quantities, such as the most likely average Energy, become sharply peaked around their maximal values.
Many results are well known in the Statistical Mechanics of Learning. The analysis is significantly more complicated but the results lead to a much richer structure that explains many phenomena in deep learning.
In particular, it is known that many bounds from statistics become either trivial or do not apply to non-smooth probability distributions, or when the variables take on discrete values. With neural networks, non-trivial behavior arises because of discontinuities (in the activation functions), leading to phase transitions (which arise in the thermodynamic limit).
For a typical neural network, can identify 3 phases of the system, controlled by the load parameter , the amount of training data m, relative to the number of adjustable network parameters N (and ignoring other knobs)
The view from SLT contrasts with some researchers who argue that all deep nets are simply memorizing their data, and that generalization arises simply because the training data is so large it covers nearly every possible case. This seems very naive to me, but maybe ?
Generally speaking, memorization is akin to prototype learning, where only a single examples is needed to describe each class of data. This arises in certain simple text classification problems, which can then be solved using Convex NMF (see my earlier blog post).
In SLT, over-training is a completely different phase of the system, characterized by a kind of pathological non-convexity–an infinite number of (degenerate) local minimum, separated by infinitely high barriers. This is the so-called (mean field) Spin-Glass phase.
SLT Overtraining / Spin Glass Phase has an infinite number of minima
So why would this correspond to random labellings of the data ? Imagine we have a binary classifier, and we randomize labels. This gives new possible labellings:
Different Randomized Labellings of the Original Training Data
We argue in our paper, that this decreases the effective load . If is very small, then will not change much, and we stay in the Generalization phase. But if is of order N (say 10%), then may decrease enough to push us into the Overtraining phase.
Each Randomized Labelling corresponds to a different Local Minima
Moreover, we now have new, possibly unsatisfiable, classifications problems. So the solutions will be nearly degenerate. But many of these will be difficult to learn because the many of label(s) are wrong, so the solutions could have high Energy barriers — i.e. they are difficult to find, and hard to get out of.
So we postulate by randomizing a large fraction of our labels, we push the system into the Overtraining / Spin Glass phase–and this is why traditional VC style regularization can not work–it can not bring us out of this phase. At least, that’s the theory.
In my next paper / blog post, I will describe how to examine these ideas in practice. We develop a method to when phase transitions arise in the training of real world neural networks. And I will show how we can observe what I postulated 3 years ago–the Spin Glass of Minimal Frustration, and how this changes the simple picture from the Spin Glass / Statistical Mechanics of Learning. Stay tuned !
It would be dishonest to present statistical mechanics as a general theory of learning that resolves all problems untreatable by VC theory. In particular, however, is the problem with theoretical analysis of online learning.
Of course, statistical mechanics only rigorously applies to the limit . Statistics requires infinite sampling, but does provides results for finite . But even in the infinite case, there are issues.
While online learning (ala SGD) is certainly amenable to a stat mech approach, there is a fundamental problem with saddle points that arise in the Thermodynamic limit. (See section 9.8, and also chapter 13, Engle and Van den Broeck)
It has been argued that an online algorithm may get trapped in saddle points which result from symmetries in the architecture of the network, or the underlying problem itself. In these cases, the saddle points may act as fixed points along an unstable manifold, causing a type of symmetry-induced paralysis. And while any finite size system may escape such saddles, in the limit , the dynamics can get trapped in an unstable manifold, and learning fails.
This is typically addressed in the context of symmetry-induced replica symmetry breaking in multilayer perceptrons, which would need be the topic of another post.
Computational Learning Theory (i.e VC theory) frames a learning algorithm or model as a infinite series of growing, distribution-independent hypothesis spaces , of VC dimension , such that
This lets us consider typical case behavior, when the limit converges (the so-called self-averaging case). And we can even treat typical worst case behaviors. And since we are not limited to the over loose bounds, the analysis matches reality much closer.
And we do this for a spin glass model of deep nets, we see much richer behavior in the learning curve. And we argue, this agrees more with what we see in practice
Let’s formalize this in the context of .
Let us call class of the hypothesis, or function classes, . Notationally, sometimes we see . Here, we distinguish between our abstract, theoretical hypothesis and the actual functions the algorithm sees
A hypothesis represents, say, the set of all possible weights for neural network. Or, more completely, the set of all weights, and learning rates, and dropout rates, and even initial weight distributions. It is a function . This may be confusing because, conceptually, we will want hypothesis with random labels–but we don’t actually do this. Instead, we measure the Rademacher complexity, which is a mathematical construct to let us measure all possible label randomizations for a given hypothesis .
We also fix the function class , so that learning process is, conceptually, a fixed point iteration over this fixed set:
and not some infinite set. This is critical because the No Free Lunch Theorem says that the worse-case sample complexity of an infinite size model is infinite.
For a given distribution , define the expected risk of a hypothesis as expected Loss over the actual training data
and the optimal risk as the maximum error (infimum) we encounter for all possible hypothesis–like in our practical case (2) above.
As above, we identify the training error with the data dependent expected risk, and the generalization error as the maximum risk we could ever expect, giving:
Recall we want the minimum N training examples necessary to learn. What is necessary ?
Suppose we pick a set of n labeled pairs. We define a hypothesis in by applying our ALG on this set: . And assume that we drew our set from some distribution, so
Then, for all , we seek a positive integer , such that for all
Notice that explicitly depends on the distribution, accuracy, and confidence.
As in the Central Limit Theorem, we consider the limit where the number of training examples .
But the No Free Lunch Theorem says that unless we restrict on the hypothesis/function space , there always exist “bad” distributions for which the sample complexity is arbitrarily large.
Of course, in practice, we know for that very large data sets, and very high capacity models, we need to regularize our models to avoid overtraining. At least we thought we knew this.
Moreover, by restricting the complexity of , we expect to produce more uniformly consistent results. And this is what we do above.
We don’t necessarily need all the machinery of microscopic Statistical Mechanics and Spin Glasses (ala Engle…) to develop SLT-like learning bounds. There is a classic paper, Retarded Learning: Rigorous Results from Statistical Mechanics, which shows how to get similar results using a variational approach.
]]>
Here are the associated slides
If you enjoyed this presentation, let me invite you to subscribe to my YouTube channel
]]>My graduate advisor used to say:
“If you can’t invent something new, invent your own notation”
Varitional Inference is foundational to Unsupervised and Semi-Supervised Deep Learning. In particular, Variational Auto Encoders (VAEs). There are many, many tutorials and implementations on Variational Inference, which I collect on my YouTube channel and below in the references. In particular, I look at modern ideas coming out of Google Deep Mind.
The thing is, Variational inference comes in 5 different comes in 5 or 6 different flavors, and it is a lot of work just to keep all the notation straight.
We can trace the basic idea back to Hinton and Zemel (1994)– to minimize a Helmholtz Free Energy.
What is missing is how Variational Inference is related the Variational Free Energy from statistical physics. Or even how an RBM Free Energy is related to a Variational Free Energy.
This holiday weekend, I hope to review these methods and to clear some of this up. This is a long post filled with math and physics–enjoy !
Years ago I lived in Boca Raton, Florida to be near my uncle, who was retired and on his last legs. I was working with the famous George White, one of the Dealers of Lightening, from the famous Xerox Parc. One day George stopped by to say hi, and he found me hanging out at the local Wendy’s and reading the book Generating Functionology. It’s a great book. And even more relevant today.
The Free Energy, is a Generating function. It generates thermodynamic relations. It generates expected Energies through weight gradients . It generates the Kullback-Liebler variational bound, and its corrections, as cumulants. And, simply put, in unsupervised learning, the Free Energy generates data.
Let’s see how it all ties together.
We first review inference in RBMs, which is one of the few Deep Learning examples that is fully expressed with classical Statistical Mechanics.
Suppose we have some (unlabeled) data . We know now that we need to learn a good hidden representation () with an RBM, or, say, latent () representation with a VAE.
Before we begin, let us try to keep the notation straight. To compare different methods, I need to mix the notation a bit, and may be a little sloppy sometimes. Here, I may interchange the RBM and VAE conventions
and, WLOG, may interchange the log functions
and drop the parameters on the distributions
Also, the stat mech Physics Free Energy convention is the negative log Z,
and I sometimes use the bra-ket notation for expectation values
.
Finally I might mix up the minus signs in the early draft of this blog; please let me know.
In an RBM, we learn an Energy function $, explicitly:
Inference means gradient learning. along the variational parameters , for the expected log likelihood
.
This is actually a form of Free Energy minimization. Let’s see why…
The joint probability is given by a Boltzmann distribution
.
To get , we have to integrate out the hidden variables
log likelihood = – clamped Free Energy + equilibrium Free Energy
(note the minus sign convention)
We recognize the second term as the total, or equilibrium, Free Energy from the partition function . This is just like in Statistical Mechanics (stat mech), but with . We call the first term the clamped Free Energy because it is like a Free Energy, but clamped to the data (the visible units). This gives
.
We see that the partition function Z is not just the normalization–it is a generating function. In statistical thermodynamics, derivatives of yield the expected energy
Since , we can associate an effective T with the norm of the weights
So if we take weight gradients of the Free Energies, we expect to get something like expected Energies. And this is exactly the result.
The gradients of the clamped Free Energy give an expectation value over the conditional
and the equilibrium Free Energy gradient yields an expectation of the joint distribution :
The derivatives do resemble expected Energies, with a unit weight matrix , which is also, effectively, . See the Appendix for the full derivations.
The clamped Free Energy is easy to evaluate numerically, but the equilibrium distribution is intractable. Hinton’s approach, Contrastive Divergence, takes a point estimate
,
where and are taken from one or more iterations of Gibbs Sampling– which is easily performed on the RBM model.
Unsupervised learning appears to be a problem in statistical mechanics — to evaluate the equilibrium partition function. There are lots of methods here to consider, including
Not to mention the very successful Deep Learning approach, which appears to be to simply guess, and then learn deterministic fixed point equations (i.e. SegNet) , via Convolutional AutoEncoders.
Unsupervised Deep Learning today looks like an advanced graduate curriculum in non-equilibirum statistical mechanics, all coded up in tensorflow.
We would need a year or more of coursework to go through this all, but I will try to impart some flavor as to what is going on here.
VAEs are a kind of generative deep learning model–they let us model and generate fake data. There are at least 10 different popular models right now, all easily implemented (see the links) in like tensorflow, keras, and Edward.
The vanilla VAE, ala Kingma and Welling, is foundational to unsupervised deep learning.
As in an RBM, in a VAE, we seek the joint probability . But we don’t want to evaluate the intractable partition function , or the equilibrium Free Energy , directly. That is, we can not evaluate , but perhaps there is some simpler, model distribution which we can sample from.
There are severals starting points although, in the end, we still end up minimizing a Free Energy. Let’s look at a few:
This an autoencoder, so we are minimizing a something like a reconstruction error. We need is a score between the empirical and model distributions, where the partition function cancels out.
.
This is a called score matching (2005). It has been shown to be closely related to auto-encoders.
I bring this up because if we look at supervised Deep Nets, and even unsupervised Nets like convolutional AutoEncoders, they are minimizing some kind of Energy or Free energy, implicitly at , deterministically. There is no partition function–it seems to have just canceled out.
We can also consider just minimizing the expected log likelihood, under the model
.
And with some re-arrangements, we can extract out a Helmholtz-like Free Energy. it is presented nicely in the Stanford class on Deep Learning, Lecture 13 on Generative Models.
We can also start by just minimizing the KL divergence between the posteriors
although we don’t actually minimize this KL divergence directly.
In fact, there is a great paper / video on Sequential VAEs which asks–are we trying to make q model p, or p model q ? The authors note that a good VAE, like a good RBM, should not just generate good data, but should also give a good latent representation . And the reason VAEs generate fuzzy data is because we over optimize recovering the exact spatial information and don’t try hard enough to get right.
The most important paper in the field today is by Kigma and Welling,where they lay out the basics of VAEs. The video presentation is excellent also.
We form a continuous, Variational Lower Bound, which is a negative Free Energy ()
And either minimizing the divergence of the posteriors, or maximizing the marginal likelihood, we end up minimizing a (negative) Variational Helmholtz Free Energy:
Free Energy = Expected Energy – Entropy
There are numerous derivations of the bound, including the Stanford class and the original lecture by Kingma. The take-away-is
Maximizing the Variational Lower Bound minimizes the Free Energy
This is, again, actually an old idea from statistical mechanics, traced back to Feynman’s book (available in Hardback on Amazon for $1300!)
We make it sound fancy by giving it a Russian name, the Gibbs-Bogoliubov relation (described nice here). It is finite Temp generalization of the Rayleigh-Ritz theorem for the more familiar Hamiltonians and Hermitian matrices.
The idea is to approximate the (Helmholtz) Free Energy with guess, model, or trial Free Energy , defined by expectations such that
is always greater than than true , and as our guess expectation gets better, our approximation improves.
This is also very physically intuitive and reflects our knowledge of the fluctuation theorems of non-equilibrium stat mech. It says that any small fluctuation away equilibrium will relax back to equilibrium. In fact, this is a classic way to prove the variational bound…
and it introduces the idea of conservation of volume in phase space (i.e. the Liouville equation), which, I believe, is related to an Normalizing Flows for VAEs.But that is a future post.
Stochastic gradient descent for VAEs is a deep subject; it is described in detail here
The gradient descent problem is to find the Free Energy gradient in the generative and variational parameters . The trick, however, is to specify the problem so we can bring variational gradient inside the expectation value
This is not trivial since the expected value depends on the variational parameters. For the simple Free Energy objective above, we can show that
Although we will make even further approximations to get working code.
We would like to apply BackProp to the variational lower bound; writing it in these 2 terms make this possible. We can evaluate the first term, the reconstruction error, using mini-batch SGD sampling, whereas the KL regularizer term is evaluated analytically.
We specify a tractable distribution , where we can numerically sample the posterior to get the latent variables using either a point estimate on 1 instance , or a mini-batch estimate .
As in statistical physics, we do what we can, and take a mean field approximation. We then apply the reparameterization trick to let us apply BackProp. I review this briefly in the Appendix.
This leads to several questions which I will adresss in this blog:
Like in Deep Learning, In almost all problems in statistical mechanics, we don’t know the actual Energy function, or Hamiltonian, . So we can’t form the instead of Partition Function , and we can’t solve for the true Free Energy . So, instead, we solve what we can.
For a VAE, instead of trying to find the joint distribution , as in an RBM, we want the associated Energy function, also called a Hamiltonian . The unknown VAE Energy is presumably more complicated than a simple RBM quadratic function, so instead of learning it flat out, we start by guessing some simpler Energy function . More importantly,We want to avoid computing the equilibrium partition function. The key is, is something we know — something tractable.
And, as in physics, q will also be a mean field approximation— but we don’t need that here.
We decompose the total Hamiltonian into a model Hamiltonian Energy plus perturbation
Energy = Model + Perturbation
The perturbation is the difference between the true and model Energy functions, and assumed to be small in some sense. That is, we expect our initial guess to be pretty good already. Whatever that means. We have
The constant $latex \lambda\le1&bg=ffffff$ is used to formally construct a power series (cumulant expansion); it is set to 1 at the end.
Write the equilibrium Free Energy in terms of the total Hamiltonian Energy function
There are numerous expressions for the Free Energy–see the Appendix. From above, we have
and we define equilibrium averages as
Recall we can not evaluate equilibrium averages, but we can presumably evaluate model averages . Given
,
where , and dropping the indices, to write
Insert inside the log, where giving
Using the property , and the definition of , we have expressed the Free Energy as an expectations in q.
This is formally exact–but hard to evaluate even with a tractable model.
We can approximate with using a cumulant expansion, giving us both the Kullback-Leibler Variational Free Energy, and corrections giving a Perturbation Theory for Variational Inference.
Cumulants can be defined most simply by a power series of the Cumulant generating function
although it can be defined and applied more generally, and is a very powerful modeling tool.
As I warned you, I will use the bra-ket notation for expectations here, and switch to natural log
We immediately see that
the stat mech Free Energy has the form of a Cumulant generating function.
Being a generating function, the cumulants are generated by taking derivatives (as in this video), and expressed using double bra-ket notation.
The first cumulant is just the mean expected value
whereas the second cumulant is the variance–the “mean of square minus square of mean”
(yup, cumulants are so common in physics that they have their own bra-ket notation)
This a classic perturbative approximation. It is a weak-coupling expansion for the equilibrium Free Energy, appropriate for small , and/or high Temperature. Since we always, naively, assume , it is seemingly applicable when the distribution is a good guess for
Since log expectation is a cumulant generating function; we can express the equilibrium Free Energy in a power series, or cumulants, in the perturbation V
Setting , the first order terms combine with log Z to form the model Helmholtz, or Kullback Leibler, Free Energy
The total equilibrium Free Energy is expressed as the model Free Energy plus perturbative corrections.
And now, for some
We now see the connection between the RBMs and VAEs, or, rather between the statistical physics formulation, with Energy and Partition functions, and the Bayesian probability formulation of VAEs.
Statistical mechanics has a very long history, over 100 years old, and there are many techniques now being lifted or rediscovered in Deep Learning, and then combined with new ideas. The post introduces the ideas being used today at Deep Mind, with some perspective from their origins, and some discussion about their utility and effectiveness brought from having seen and used these techniques in different contexts in theoretical chemistry and physics.
Of course, cumulants are not the only statistical physics tool. There are other Free Energy approximations, such as the TAP theory we used in the deterministic EMF-RBM.
Both the cumulant expansion and TAP theory are classic methods from non-equilibrium statistical physics. Neither is convex. Neither is exact. In fact, it is unclear if these expansions even converge, although they may be asymptotically convergent. The cumulants are very old, and applicable to general distributions. TAP theory is specific to spin glass theory, and can be applied to neural networks with some modifications.
The cumulants play a critical role in statistical physics and quantum chemistry because they provide a size-extensive approximation. That is, in the limit of a very large deep net (), the Energy function we learn scales linearly in N.
For example, mean field theories obey this scaling. Variational theories generally do not obey this scaling when they include correlations, but perturbative methods do.
The variational theorem is easily proven using jensen’s inequality, as in David Beli’s notes.
In the context of spin glass theory, for those who remember this old stuff, this means that we have expressions like
which, for a given spin glass model, occurs at the boundary (i.e. the Nishimori line) of the spin glass phase. I will discuss this more in an further post.
Well, it has been a long post, which seems appropriate for Labor Day.
But there is more, in the
I will try to finish this soon; the derivation is found in Ali Ghodsi, Lec [7], Deep Learning , Restricted Boltzmann Machines (RBMs)
I think it is easier to understand the Kigma and Welling paper AutoEncoding Variational Bayes by looking at the equations next to Keras Blog and code. We are minimizing the Variational Free Energy, but reformulate it using the mean field approximation and the reparameterization trick.
We choose a model Q that factorizes into Gaussians
We can also use other distributions, such as
Being mean field, the VAE model Energy function $mathcal{H}_{q}(\mathbf{x},\mathbf{z})&bg=ffffff$ is effectively an RBM-like quadratic Energy function , although we don’t specify it explicitly. On the other hand, the true $mathcal{H}(\mathbf{x},\mathbf{z})&bg=ffffff$ is presumably more complicated.
We use a factored distribution to reexpress the KL regularizer using
We can not backpropagate through a literal stochastic node z because we can not form the gradient. So we just replace the innermost hidden layer with a continuous latent space, and form z by sampling from this.
We reparameterize z with explicit random values , sampled from a Normal distribution N(0,I)
In Keras, we define z with a (Lambda) sampling function, eval’d on each batch step
and use this z in the last decoder hidden layer
Of course, this slows down execution since we have to call K.random_normal on every SGD batch.
We estimate mean and variance for the in the mini-batch , and sample from these vectors. The KL regularizer can then be expressed analytically as
This is inserted directly into the VAE Loss function. For each a minibatch (of size L), L is
where the KL Divergence (kl_loss) is approximated in terms of the mini-batch estimates for the mean and variance .
in Keras, the loss looks like:
We can now apply BackProp using SGD, RMSProp, etc. to minimize the VAE Loss, with on every mini-batch step.
In machine learning, we use expected value notation, such as
but in physics and chemistry there at 5 or 6 other notations. I jotted them down here for my own sanity.
For RBMs and other discrete objects, we have
Of course, we may want the limit , but we have to be careful how we take this limit. Still, we may write
In the continuous case, we specify a density of states $latex \rho(E)&bg=ffffff $
which is not the same as specifying a distribution over the internal variables, giving
In quantum statistical mechanics, we replace the Energy with the Hamiltonian operator, and replace the expectation value with the Trace operation
and this is also expressed using a bra-ket notation
and usually use subscripts to represent non-equilibrium states
Raise the Posteriors
]]>This paper, along with his wake-sleep algorithm, set the foundations for modern variational learning. They appear in his RBMs, and more recently, in Variational AutoEncoders (VAEs) .
Of course, Free Energies come from Chemical Physics. And this is not surprising, since Hinton’s graduate advisor was a famous theoretical chemist.
They are so important that Karl Friston has proposed the The Free Energy Principle : A Unified Brain Theory ?
(see also the wikipedia and this 2013 review)
What are free Energies and why do we use them in Deep Learning ?
In (Unsupervised) Deep Learning, Energies are quadratic forms over the weights. In an RBM, one has
This is the T=0 configurational Energy, where each configuration is some pair. In chemical physics, these Energies resemble an Ising model.
The Free Energy is a weighted average of the all the global and local minima
Note: as , the the Free Energy becomes the T=0 global energy minima . In limit of zero Temperature, all the terms in the sum approach zero
and only the largest term, the largest negative Energy, survives.
We may also see F written in terms of the partition function Z:
where the brakets denote an equilibrium average, and expected value over some equilibrium probability distribution . (we don’t normalize with 1/N here; in principle, the sum could be infinite.)
Of course, in deep learning, we may be trying to determine the distribution , and/or we may approximate it with some simpler distribution during inference. (From now on, I just write P and Q for convenience)
But there is more to Free Energy learning than just approximating a distribution.
In a chemical system, the Free Energy averages over all global and local minima below the Temperature T–with barriers below T as well. It is the Energy available to do work.
For convenience, Hinton explicitly set T=1. Of course, he was doing inference, and did not know the scale of the weights W. Since we don’t specify the Energy scale, we learn the scale implicitly when we learn W. We call this being scale-free
So in the T=1, scale free case, the Free Energy implicitly averages over all Energy minima where , as we learn the weights W. Free Energies solve the problem of Neural Nets being non-convex by averaging over the global minima and nearby local minima.
Because Free Energies provide an average solution, they can even provide solutions to highly degenerate non-convex optimization problems:
They will fail, however, when the barriers between Energy basins are larger than the Temperature.
This can happen if the effective Temperature drops close to zero during inference. Since T=1 implicitly in inference, this means when the weights W are exploding.
See: Normalization in Deep Learning
Systems may also get trapped if the Energy barriers grow very large –as, say, in the glassy phase of a mean field spin glass. Or a supercooled liquid–the co-called Adam Gibbs phenomena. I will discuss this in a future post.
In either case, if the system, or solver, gets trapped in a single Energy basin, it may appear to be convex, and/or flat (the Hessian has lots of zeros). But this is probably not the optimal solution to learning when using a Free Energy method.
It is sometimes argued that Deep Learning is a non-convex optimization problem. And, yet, it has been known for over 20 years that networks like CNNs don’t suffer from the problems of local minima? How can this be ?
At least for unsupervised methods, it has been clear since 1987 that:
An important property of the effective [Free] Energy function E(V,0,T) is that it has a smoother landscape than E(S) [T=0] …
Hence, the probability of getting stuck in a local minima decreases
Although this is not specifically how Hinton argued for the Helmholtz Free Energy — a decade later.
Why do we use Free energy methods ? Hinton used the bits-back argument:
Imagine we are encoding some training data and sending it to someone for decoding. That is, we are building an Auto-Encoder.
If have only 1 possible encoding, we can use any vanilla encoding method and the receiver knows what to do.
But what if have 2 or more equally valid codes ?
Can we save 1 bit by being a little vague ?
Suppose we have N possible encodings , each with Energy . We say the data has stochastic complexity.
Pick a coding with probability and send it to the receiver. The expected cost of encoding is
Now the receiver must guess which encoding we used. The decoding cost of the receiver is
where H is the Shannon Entropy of the random encoding
The decoding cost looks just like a Helmholtz Free Energy.
Moreover, we can use a sub-optimal encoding, and they suggest using a Factorized (i.e. mean field) Feed Forward Net to do this.
To understand this better, we need to relate
In 1957, Jaynes formulated the MaxEnt principle which considers equilibrium thermodynamics and statistical mechanics as inference processes.
In 1995, Hinton formulated the Helmholtz Machine and showed us how to define a quasi-Free Energy.
In Thermodynamics, the Helmholtz Free Energy F(T,V,N) is an Energy that depends on Temperature instead of Entropy. We need
and F is defined as
In ML, we set T=1. Really, the Temperature equals how much the Energy changes with a change in Entropy (at fixed V and N)
Variables like E and S depend on the system size N. That is,
as
We say S and T are conjugate pairs; S is extensive, T is intensive.
(see more on this in the Appendix)
The conjugate pairs are used to define Free Energies via the Legendre Transform:
Helmholtz Free Energy: F(T) = E(S) – TS
We switch the Energy from depending on S to T, where .
Why ? In a physical system, we may know the Energy function E, but we can’t directly measure or vary the Entropy S. However, we are free to change and measure the Temperature–the derivative of E w/r.t. S:
This is a powerful and general mathematical concept.
Say we have a convex function f(x,y,z), but we can’t actually vary x. But we do know the slope, w, everywhere along x
.
Then we can form the Legendre Transform , which gives g(w,y,z) as
the ‘Tangent Envelope‘ of f() along x
,
.
or, simply
Note: we have converted a convex function into a concave one. The Legendre transform is concave in the intensive variables and convex in the extensive variables.
Of course, the true Free Energy F is convex; this is central to Thermodynamics (see Appendix). But that is because while it is concave in T, we evaluate it at constant T.
But what if the Energy function is not convex in the Entropy ? Or, suppose we extract an pseudo-Entropy from sampling some data, and we want to define a free energy potential (i.e. as in protein folding). These postulates also fail in systems like blog post on spin chains.
Answer: Take the convex hull
When a convex Free Energy can not be readily be defined as above, we can use the the generalized the Legendre Fenchel Transform, which provides a convex relaxation via
the Tangent Envelope , a convex relaxation
The Legendre-Fenchel Transform can provide a Free Energy, convexified along the direction internal (configurational) Entropy, allowing the Temperature to control how many local Energy minima are sampled.
Extra stuff I just wanted to write down…
If we assume T=1 at all times, and we assume our Deep Learning Energies are extensive–as they would be in an actual thermodynamic system–then the weight norm constraints act to enforce the size-extensivity.
as ,
if ,
and ,
then W should remain bounded to prevent the Energy E(n) from growing faster than Mn. And, of course, most Deep Learning algorithms do bound W in some form.
where C denotes a contour integral.
]]>
Check out my recent chat with Max Mautner, the Accidental Engineer
http://theaccidentalengineer.com/charles-martin-principal-consultant-calculation-consulting/
]]>It builds upon a Batch Normalization (BN), introduced in 2015– and is now the defacto standard for all CNNs and RNNs. But not so useful for FNNs.
What makes normalization so special? It makes very Deep Networks easier to train, by damping out oscillations in the distribution of activations.
To see this, the diagram below uses data from Figure 1 (from the BN paper) to depict how the distribution evolves for a typical node outputs in the last hidden layer of a typical network:
Very Deep nets can be trained faster and generalize better when the distribution of activations is kept normalized during BackProp.
We regularly see Ultra-Deep ConvNets like Inception, Highway Networks, and ResNet. And giant RNNs for speech recognition, machine translation, etc. But we don’t see powerful Feedforward Neural Nets (FNNS) with more than 4 layers. Until now.
Batch Normalization is great for CNNs and RNNs.
But we still can not build deep MLPs
This new method — Self-Normalization — has been proposed for building very deep MultiLayer Perceptions (MLPs) and other Feed Forward Nets (FNNs).
The idea is just to tweak a the Exponential Linear Unit (ELU) activation function to obtain a Scaled ELU (SELU):
With this new SELU activation function, and a new, alpha Dropout method, it appears we can, now, build very deep MLPs. And this opens the door for Deep Learning applications on very general data sets. That would be great!
The paper is, however, ~100 pages long of pure math! Fun stuff.. but a summary is in order.
I review Normalization in Neural Networks, including Batch Normalization, Self-Normalization, and, of course, some statistical mechanics (it’s kinda my thing).
This is an early draft of the post: comments and questions are welcome
WLOG, consider an MLP, where we call the input to each layer u
The linear transformations at each layer is
,
and we apply standard point-wise activations, like a sigmoid
so that the total set of activations (at each layer) takes the form
The problem is that during SGD training, the distribution of weights W and/or the outputs x can vary widely from iteration to iteration. These large variations lead to instabilities in training that require small learning rates. In particular, if the layer weights W or inputs u blow up, the activations can become saturated:
,
leading to vanishing gradients. Traditionally, this was in MLPs avoided by using larger learning rates, and/or early stopping.
One solution is better activation functions, such as a Rectified Linear Unit (ReLu)
or, for larger networks (depth > 5), an Exponential Linear Unit (ELU):
Which look like:
Indeed, sigmoid and tanh activations came from early work in computational neuroscience. Jack Cowan proposed first the sigmoid function as a model for neuronal activity, and sigmoid and tanh functions arise naturally in statistical mechanics. And sigmoids are still widely used for RBMs and MLPs–ReLUs don’t help much here.
SGD training introduces perturbations in training that propagate through the net, causing large variations in weights and activations. For FNNs, this is a huge problem. But for CNNs and RNNs..not so much. why ?
It has been said the no real theoretical progress has been made in deep nets in 30 years. That is absurd. We did not have ReLus or ELUs. In fact, up until Batch Normalization, we were still using SVM-style regularization techniques for Deep Nets. It is clear now that we need to rethink generalization in deep learning.
We can regularize a network, like a Restricted Boltzmann Machine (RBM), by applying max norm constraints to the weights W.
This can be implemented in training by tweaking the weight update at the end of pass over all the training data
where is an L1 or L2 norm.
I have conjectured that this is actually kind of Temperature control, and prevents the effective Temperature of the network from collapsing to zero.
,
By avoiding a low Temp, and possibly any glassy regimes, we can use a larger effective annealing rate–in modern parlance, larger SGD step sizes.
It makes the network more resilient to changes in scale.
After 30 years of research neural nets, we can now achieve an analogous network normalization automagically.
But first, what is current state-of-the-art in code ? What can we do today with Keras ?
Batch Normalization (BN) Transformation
Tensorflow and other Deep Learning frameworks now include Batch Normalization out-of-the-box. Under-the-hood, this is the basic idea:
At the end of every mini-batch , the layers are whitened. For each node output x (and before activation):
the BN Transform maintains the (internal) zero mean and unit variance ().
We evaluate the sample mini-batch mean and variance , and then normalize, scale, and shift the values:
The final transformation is applied inside the activation function g():
although we can absorb the original layer bias term b into the BN transform, giving
So now, instead of renormalizing the weights W after passing over all the data, we can normalize the node output x=Wu explicitly, for each mini-batch, in the BackProp pass.
Note that
so bounding the weights with max-norm constraints got us part of the way already.
Note that extra scaling and shift parameters appear for each batch (k), and it is necessary to optimize these parameters as a side step during training.
At the end of the transform, we can normalize the network outputs (shown above) of the entire training set (population)
,
where the final statistics are computed as, say, an unbiased estimate over all (m) mini-batches of the training data
.
The key to Batch Normalization (BN) is that:
BN allows us to manipulate the activation function of the network. It is a differentiable transformation that normalizes activations in the network.
It makes the network (even) more resilient to the parameter scale.
It has been known for some time that Deep Nets perform better if the inputs are whitened. And max-norm constraints do re-normalize the layer weights after every mini-batch.
Batch normalization appears to be more stable internally, with the advantages that it:
Still, Batch Norm training slows down BackProp. Can we speed it up ?
A few days ago, the Interwebs was buzzing about the paper Self-Normalizing Neural Networks. HackerNews. Reddit. And my LinkedIn Feed.
These nets use Scaled Exponential Linear Units (SELU), which have implicit self-normalizing properties. Amazingly, the SELU is just a ELU multiplied by
where .
The paper authors have optimized the values as:
**a comment on reddit suggests tanh may work as well
The SELUs have the explicit properties of:
Amazingly, the implicit self-normalizing properties are actually proved–in only about 100 pages–using the Banach Fixed Point Theorem.
They show that, for an FNN using selu(x) actions, there exists a unique attracting and stable fixed point for the mean and variance. (Curiously, this resembles the argument that Deep Learning (RBMs at least) the Variational Renormalization Group (VRG) Transform.
There are, of course, conditions on the weights–things can’t get too crazy. This is hopefully satisfied by selecting initial weights with zero mean and unit variance.
,
(depending how we define terms).
To apply SELUs, we need a special initialization procedure, and a modified version of Dropout, alpha-Dropout,
We select initial weights from a Gaussian distribution with mean 0 and variance , where N is number of weights:
In Statistical Mechanics, this the Temperature is proportional to the variance of the Energy, and therefore sets the Energy scale. Since E ~ W,
SELU Weight initialization is similar in spirit to fixing T=1.
Note that to apply Dropout with an SELU, we desire that the mean and variance are invariant.
We must set random inputs to saturated negative value of SELU, Then, apply an affine transformation, computing relative to dropout rate.
(thanks to ergol.com for the images and discussion).
All of this is provided, in code, with implementations already on github for Tensorflow, PyTorch, Caffe, etc. Soon…Keras?
The key results are presented in Figure 1 of the paper, where SNN = Self Normalizing Networks, and the data sets studies are MNIST and CIFAR.
The original code is available on github
Great discussions on HackerNews and Reddit
We have reviewed several variants of normalization in deep nets, including
Along the way, I have tried to convince you that recent developments in the normalization of Deep Nets represent a culmination over 30 years of research into Neural Network theory, and that early ideas about finite Temperature methods from Statistical Mechanics have evolved into and are deeply related to the Normalization methods employed today to create very Deep Neural Networks
Very early research in Neural Networks lifted idea from statistical mechanics. Early work by Hinton formulated AutoEncoders and the principle of the Minimum Description Length (MDL) as minimizing a Helmholtz Free Energy:
,
where the expected (T=0) Energy is
,
S is the Entropy,
and the Temperature implicitly.
Minimizing F yields the familiar Boltzmann probability distribution
$latex p_{i}=\dfrac{e^{-\beta E_{i}}}{\sum\limits_{j}e^{-\beta E_{j}}}&bg=ffffff $.
When we define an RBM, we parameterize the Energy levels in terms of the configuration of visible and hidden units
,
giving the probability
where is the Partition Function, and, again, T=1 implicitly.
In Stat Mech, we call RBMs a Mean Field model because we can decompose the total Energy and/or conditional probabilities using sigmoid activations for each node
In my 2016 MMDS talk, I proposed that without some explicit Temperature control, RBMs could collapse into a glassy state.
And now, some proof I am not completely crazy:
Another recent 2017 study on the Emergence of Compositional Representations in Restricted Boltzmann Machines, we do indeed see that the RBM effective Temperature does indeed drop well below 1 during training
and that RBMs can exhibit glassy behavior.
I also proposed that RBMs could undergo Entropy collapse at very low Temperatures. This has also now been verified in a recent 2016 paper.
Finally, this 2017 paper:Train longer, generalize better: closing the generalization gap in large batch training of neural networks” proposes that many networks exhibit something like glassy behavior described as “ultra-slow” diffusion behavior.
I will sketch out the proof in some detail if there is demand. Intuitively (& citing comments in HackerNews): ”
Indeed, we train a neural network by running BackProp, thereby minimizing the model error–which is like minimizing an Energy.
But what is this Energy ? Deep Learning (DL) Energy functions look nothing like a typical chemistry or physics Energy. Here, we have Free Energy landscapes, frequently which form funneled landscapes–a trade off between energetic and entropic effects.
And yet, some researchers, like LeCun, have even compared Neural Network Energies functions to spin glass Hamiltonians. To me, this seems off.
The confusion arises from assuming Deep Learning is a non-convex optimization problem that looks similar to the zero-Temperature Energy Landscapes from spin glass theory.
I present a different view. I believe Deep Learning is really optimizing an effective Free Energy function. And this has profound implications on Why Deep Learning Works.
This post will attempt to relate recent ideas in RBM inference to Backprop, and argue that Backprop is minimizing a dynamic, temperature dependent, ruggedly convex, effective Free Energy landscape.
This is a fairly long post, but at least is basic review. I try to present these ideas in a semi-pedagogic way, to the extent I can in a blog post, discussing both RBMs, MLPs, Free Energies, and all that entails.
The Backprop algorithm lets us train a model directly on our data (X) by minimizing the predicted error , where the parameter set includes the weights , biases , and activations of the network.
.
Let’s write
,
where the error could be a mean squared error (MSE), cross entropy, etc. For example, in simple regression, we can minimize the MSE
,
whereas for multi-class classification, we might minimize a categorical cross entropy
where are the labels and is the network output for each training instance .
Notice that is the training error for instance , not a test or holdout error. Notice that, unlike an Support Vector Machine (SVM) or Logistic Regression (LR), we don’t use Cross Validation (CV) during training. We simply minimize the training error– whatever that is.
Of course, we can adjust the network parameters, regularization, etc, to tune the architecture of the network. Although it appears that Understanding deep learning requires rethinking generalization.
At this point, many people say that BackProp leads to a complex, non-convex optimization problem; IMHO, this is naive.
It has been known for 20 years that Deep Learning does not suffer from local minima.
Anyone who thinks it does has never read a research paper or book on neural networks. So what we really would like to know is, Why does Deep Learning Scale ? Or, maybe, why does it work at all ?!
To implement Backprop, we take derivatives and apply the the chain rule to the network outputs , applying it layer-by-layer.
Let’s take a closer look at the layers and activations. Consider a simple 1 layer net:
The Hidden activations are thought to mimic the function of actual neurons, and are computed by applying an activation function , to a linear Energy function ,
Indeed, the sigmoid activation function was first proposed in 1968 by Jack Cowan at the University of Chicago , still used today in models of neural dynamics
Moreover, Cowan pioneered using Statistical Mechanics to study the Neocortex.
And we will need a little Stat Mech to explain what our Energy functions are..but just a little.
While it seems we are simply proposing an arbitrary activation function, we can, in fact, derive the appearance of sigmoid activations–at least when performing inference on a single layer (mean field) Restricted Boltzmann Machine (RBM).
Hugo Larochelle has derived the sigmoid activations nicely for an RBM.
Given the (total) RBM Energy function
The log Energy is an un-normalized probability, such that
Where the normalization factor, Z, is an object from statistical mechanics called the (total) partition function Z
and is an inverse Temperature. In modern machine learning, we implicitly set .
Following Larochelle, we can factor by explicitly writing in terms of sums over the binary hidden activations . This lets us write the conditional probabilities, for each individual neuron as
.
We note that, this formulation was not obvious, and early work on RBMs used methods from statistical field theory to get this result.
We use and in Contrastive Divergence (CD) or other solvers as part of the Gibbs Sampling step for (unsupervised) RBM inference.
CD has been a puzzling algorithm to understand. When first proposed, it was unclear what optimization problem is CD solving? Indeed, Hinton is to have said
Specifically, we run several epochs of:
We will see below that we can cast RBM inference as directly minimizing a Free Energy–something that will prove very useful to related RBMs to MLPs
The sigmoid, and tanh, are an old-fashioned activation(s); today we may prefer to use ReLUs (and Leaky ReLUs).
The sigmoid itself was, at first, just an approximation to the heavyside step function used in neuron models. But the presence of sigmoid activations in the total Energy suggests, at least to me, that Deep Learning Energy functions are more than just random (Morse) functions.
RBMs are a special case of unsupervised nets that still use stochastic sampling. In supervised nets, like MLPs and CNNs (and in unsupervised Autoencoders like VAEs), we use Backprop. But the activations are not conditional probabilities. Let’s look in detail:
Consider a MultiLayer Perceptron, with 1 Hidden layer, and 1 output node
where for each data point, leading to the layer output
and total MLP output
where .
If we add a second layer, we have the iterated layer output:
where .
The final MLP output function has a similar form:
So with a little bit of stat mech, we can derive the sigmoid activation function from a general energy function. And we have activations it in RBMs as well as MLPs.
So when we apply Backprop, what problem are we actually solving ?
Are we simply finding a minima on random high dimensional manifold ? Or can we say something more, given the special structure of these layers of activated energies ?
To train an MLP, we run several epochs of Backprop. Backprop has 2 passes: forward and backward:
Each epoch usually runs small batches of inputs at time. (And we may need to normalize the inputs and control the variances. These details may be important for out analysis, and we will consider them in a later post).
After each pass, we update the weights, using something like an SGD step (or Adam, RMSProp, etc)
For an MSE loss, we evaluate the partial derivatives over the Energy parameters .
Backprop works by the chain rule, and given the special form of the activations, lets us transform the Energy derivatives into a sum of Energy gradients–layer by layer
I won’t go into the details here; there are 1000 blogs on BackProp today (which is amazing!). I will say…
Backprop couples the activation states of the neurons to the Energy parameter gradients through the cycle of forward-backward phases.
In a crude sense, Backprop resembles our more familiar RBM training procedure, where we equilibrate to set the activations, and run gradient descent to set the weights. Here, I show a direct connection, and derive the MLP functional form directly from an RBM.
RBMs are unsupervised; MLPs are supervised. How can we connect them? Crudely, we can think of an MLP as a single layer RBM with a softmax tacked on the end. More rigorously, we can look at Generalized Discriminative RBMs, which solve the conditional probability directly, in terms of the Free Energies, cast in the soft-max form
So the question is, can we extract Free Energy for an MLP ?
I now consider the Backward phase, using the deterministic EMF RBM, as a starting point for understanding MLPs.
An earlier post discusses the EMF RBM, from the context of chemical physics. For a traditional machine learning perspective, see this thesis.
In some sense, this is kind-of obvious. And yet, I have not seen a clear presentation of the ideas in this way. I do rely upon new research, like the EMF RBM, although I also draw upon fundamental ideas from complex systems theory–something popular in my PhD studies, but which is perhaps ancient history now.
The goal is to relate RBMs, MLPs, and basic Stat Mech under single conceptual umbrella.
In the EMF approach, we see RBM inference as a sequence of deterministic annealing steps, from 1 quasi-equilibrium state to another, consisting of 2 steps for each epoch:
At the end of each epoch, we update the weights, with weight (temperature) constraints (i.e. reset the L1 or L2 norm). BTW, it may not obvious that weight regularization is like a Temperature control; I will address this in a later post.
(1) The so-called Forward step solves a fixed point equation (which is similar in spirit to taking n steps of Gibbs sampling). This leads to a pair of coupled, recursion relations for the TAP magnetizations (or just nodes). Suppose we take t+1 iterations. Let us ignore the second order Onsager correction, and consider the mean field updates:
Because these are deterministic steps, we can express the in terms of :
At the end of the recursion, we will have a forward pass that resembles a multi-layer MLP, but that shares weights and biases between layers:
We can now associate an n-layer MLP, with tied weights,
,
to an approximate (mean field) EMF RBM, with n fixed point iterations (ignoring the Onsager correction for now). Of course, an MLP is supervised, and an RBM is unsupervised, so we need to associate the RBM hidden nodes with the MLP output function at the last layer (), prior to adding the MLP output node
This leads naturally to the following conjecture:
The EMF RBM and the BackProp Forward and Backward steps effectively do the same thing–minimize the Free Energy
This is a work in progress
Formally, it is simple and compelling. Is it the whole story…probably not. It is merely an observation–food for thought.
So far, I have only removed the visible magnetizations to obtain the MLP layer function as a function of the original visible units. The unsupervised EMF RBM Free Energy, however, contains expressions in terms of both the hidden and visible magnetizations ( ). To get a final expression, it is necessary to either
The result itself should not be so surprising, since it has already been pointed out by Kingma and Welling, Auto-Encoding Variational Bayes, that a Bernoulli MLP is like a variational decoder. And, of course, VAEs can be formulated with BackProp.
Nore importantly, It is unclear how good the RBM EMF really is. Some followup studies indicate that second order is not as good as, say, AIS, for estimating the partition function. I have coded a python emf_rbm.py module using the scikit-learn interface, and testing is underway. I will blog this soon.
Note that the EMF RBM relies on the Legendre Transform, which is like a convex relaxation. Early results indicates that this does degrade the RBM solution compared to traditional Cd. Maybe BackProp may be effective relaxing the convexity constraint by, say, relaxing the condition that the weights are tied between layers.
Still, I hope this can provide some insight. And there are …
Free Energy is a first class concept in Statistical Mechanics. In machine learning, not always so much. It appears in much of Hinton’s work, and, as a starting point to deriving methods like Variational Auto Encoders and Probabilistic Programing.
But Free Energy minimization plays an important role in non-convex optimization as well. Free energies are a Boltzmann average of the zero-Temperature Energy landscape, and, therefore, convert a non-convex surface into something at least less non-convex.
Indeed, in one of the very first papers on mean field Boltzmann Machines (1987), it is noted that
“An important property of the effective [free] energy function E'(V,0,T) is that it has a smoother landscape than E(S) due to the extra terms. Hence, the probability of getting stuck in a local minima decreases.”
Moreover, in protein folding, we have even stronger effects, which can lead to a ruggedly convex, energy landscape. This arises when the system runs out of configurational entropy (S), and energetic effects (E) dominate.
Most importantly, we want to understand, when does Deep Learning generalize well, and when does it overtrain ?
LeCun has very recently pointed out that Deep Nets fail when they run out of configuration entropy–an argument I also have made from theoretical analysis using the Random Energy Model. So it is becoming more important to understand what the actual energy landscape of a deep net is, how to separate out the entropic and energetic terms, and how to characterize the configurational entropy.
Hopefully the small insight will be useful and lead to a further understanding of Why Deep Learning Works.
]]>A Mean Field Theory Learning Algorithm for Neural Networks
just a couple years after Hinton’s seminal 1985 paper , “A Learning Algorithm for Boltzmann Machines“.
What I really like is how we see the foundations of deep learning arose from statistical physics and theoretical chemistry. My top 10 favorite take-a-ways are:
Happy New Year everyone!
]]>
They are basically a solved problem, and while of academic interest, not really used in complex modeling problems. They were, upto 10 years ago, used for pretraining deep supervised nets. Today, we can train very deep, supervised nets directly.
RBMs are the foundation of unsupervised deep learning–
an unsolved problem.
RBMs appear to outperform Variational Auto Encoders (VAEs) on simple data sets like the Omniglot set–a data set developed for one shot learning, and used in deep learning research.
RBM research continues in areas like semi-supervised learning with deep hybrid architectures, Temperature dependence, infinitely deep RBMs, etc.
Many of basic concepts of Deep Learning are found in RBMs.
Sometimes clients ask, “how is Physical Chemistry related to Deep Learning ?”
In this post, I am going to discuss a recent advanced in RBM theory based on ideas from theoretical condensed matter physics and physical chemistry,
the Extended Mean Field Restricted Boltzmann Machine: EMF_RBM
(see: Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy )
[Along the way, we will encounter several Nobel Laureates, including the physicists David J Thouless (2016) and Philip W. Anderson (1977), and the physical chemist Lars Onsager (1968).]
RBMs are pretty simple, and easily implemented from scratch. The original EMF_RBM is in Julia; I have ported EMF_RBM to python, in the style of the scikit-learn BernoulliRBM package.
https://github.com/charlesmartin14/emf-rbm/blob/master/EMF_RBM_Test.ipynb
We examined RBMs in the last post on Cheap Learning: Partition Functions and RBMs. I will build upon that here, within the context of statistical mechanics.
RBMs are defined by the Energy function
To train an RBM, we minimize the log likelihood ,
the sum of the clamped and (actual) Free Energies, where
and
The sums range over a space of ,
which is intractable in most cases.
Training an RBM requires computing the log Free Energy; this is hard.
When training an RBM, we
We don’t include label information, although a trained RBM can provide features for a down-stream classifier.
The Extended Mean Field (EMF) RBM is a straightforward application of known statistical mechanics theories.
There are, literally, thousands of papers on spin glasses.
The EMF RBM is a great example of how to operationalize spin glass theory.
The Restricted Boltzmann Machine has a very simple Energy function, which makes it very easy to factorize the partition function Z , explained by Hugo Larochelle, to obtain the conditional probabilities
The conditional probabilities let us apply Gibbs Sampling, which is simply
In statistical mechanics, this is called a mean field theory. This means that the Free Energy (in ) can be written as a simple linear average over the hidden units
.
where is the mean field of the hidden units.
At high Temp., for a spin glass, a mean field model seems very sensible because the spins (i.e. activations) become uncorrelated.
Theoreticians use mean field models like the p-spin spherical spin glass to study deep learning because of their simplicity. Computationally, we frequently need more.
How we can go beyond mean field theory ?
Onsager was awarded 1968 the Nobel Prize in Chemistry for the development of the Onsager Reciprocal Relations, sometimes called the ‘4th law of Thermodynamics’
The Onsager relations provides the theory to treat thermodynamic systems that are in a quasi-stationary, local equilibrium.
Onsager was the first to show how to relate the correlations in the fluctuations to the linear response. And by tying a sequence of quasi-stationary systems together, we can describe an irreversible process…
..like learning. And this is exactly what we need to train an RBM.
In an RBM, the fluctuations are variations in the hidden and visible nodes.
In a BernoulliRBM, the activations can be 0 or 1, so the fluctuation vectors are
The simplest correction to the mean field Free Energy, at each step in training, are the correlations in these fluctuations:
where W is the Energy weight matrix.
Unlike normal RBMs, here is we work in an Interaction Ensemble, so the hidden and visible units become hidden and visible magnetizations:
To simplify (or confuse?) the presentations here, I don’t write magnetizations (until the Appendix).
The corrections make sense under the stationarity constraints, that the Extended Mean Field RBM Free Energy () is at a critical point
,
That is, small changes in the activations do not change the Free Energy.
We will show that we can write
as a Taylor series in , the inverse Temperature, where is the Entropy
,
is the standard, mean field RBM Free energy
,
and is the Onsager correction
.
Given the expressions for the Free Energy, we must now evaluate it.
The Taylor series above is a result of the TAP theory — the Thouless-Anderson-Palmer approach developed for spin glasses.
The TAP theory is outlined in the Appendix; here it is noted that
Thouless just shared the 2016 Nobel Prize in Physics (for his work in topological phase transitions)
Being a series in inverse Temperature , the theory applies at low , or high Temperature. For fixed , this also corresponds to small weights W.
Specifically, the expansion applies at Temperatures above the glass transition–a concept which I describe in a recent video blog.
Here, to implement the EMF_RBM, we set
,
and, instead, apply weight decay to keep the weights W from exploding
where may be an L1 or L2 norm.
Weight Decay acts to keep the Temperature high.
Early RBM computational models were formulated using statistical mechanics (see the Appendix) language, and so included a Temperature parameter, and were solved using techniques like simulated annealing and the (mean field) TAP equations (described below).
Adding Temperature allowed the system to ‘jump’ out of the spurious local minima. So any usable model required a non-zero Temp, and/or some scheme to avoid local minima that generalized poorly. (See: Learning Deep Architectures for AI, by Bengio)
These older approaches did not work well –then — so Hinton proposed the Contrastive Divergence (CD) algorithm. Note that researchers struggled for some years to ‘explain’ what optimization problem CD actually solves.
More that recent work on Temperature Bases RBMs also suggests that higher T solutions perform better, and that
“temperature is an essential parameter controlling the selectivity of the firing neurons in the hidden layer.”
Standard RBM training approximates the (unconstrained) Free Energy, F=ln Z, in the mean field approximation, using (one or more steps of) Gibbs Sampling. This is usually implemented as Contrastive Divergence (CD), or Persistent Contrastive Divergence (PCD).
Using techniques of statistical mechanics, however, it is possible to train an RBM directly, without sampling, by solving a set of deterministic fixed point equations.
Indeed, this approach clarifies how to view an RBM as solving a (determinisitic) fixed point equation of the form
Consider each step, at at fixed (), as a Quasi-Stationary system, which is close to equilibrium, but we don’t need to evaluate ln Z(v,h) exactly.
We can use the stationary conditions to derive a pair of coupled, non-linear equations
They extend the standard formula of sigmoid linear activations with additional, non-linear, inter-layer interactions.
They differs significantly from (simple) Deep Learning activation functions because the activation for each layer explicitly includes information from other layers.
This extension couples the mean () and total fluctuations () between layers. Higher order correlations could also be included, even to infinite order, using techniques from field theory.
We can not satisfy both equations simultaneously, but we can satisfy each condition individually, letting us write a set of recursion relations
These fixed point equations converge to the stationary solution, leading to a local equilibrium. Like Gibbs Sampling, however, we only need a few iterations (say t=3 to 5). Unlike Sampling, however, the EMF RBM is deterministic.
https://github.com/charlesmartin14/emf-rbm/
If there is enough interest, I can do a pull request on sklearn to include it.
The next blog post will demonstrate how the python code in action.
Most older physicists will remember the Hopfield model. They peaked in 1986, although interesting work continued into the late 90s (when I was a post doc).
Originally, Boltzmann machines were introduced as a way to avoid spurious local minima while including ‘hidden’ features into Hopfield Associative Memories (HAM).
Hopfield himself was a theoretical chemist, and his simple model HAMs were of great interest to theoretical chemists and physicists.
Hinton explains Hopfield nets in his on-line lectures on Deep Learning.
The Hopfield Model is a kind of spin glass, which acts like a ‘memory’ that can recognize ‘stored patterns’. It was originally developed as a quasi-stationary solution of more complex, dynamical models of neuronal firing patterns (see the Cowan-Wilson model).
Early theoretical work on HAMs studied analytic approximations to ln Z to compute their capacity (), and their phase diagram. The capacity is simply the number of patterns a network of size N can memorize without getting confused.
The Hopfield model was traditionally run at T=0.
Looking at the T=0 line, at extremely low capacity, the system has stable mixed states that correspond to ‘frozen’ memories. But this is very low capacity, and generally unusable. Also, when the capacity too large, , (which is really not that large), the system abruptly breaks down completely.
There is a small window of capacity, ,with stable pattern equilibria, dominated by frozen out, spin glass states. The problem is, for any realistic system, with correlated data, the system is dominated by spurious local minima which look like low energy spin glass states.
So the Hopfield model suggested that
glass states can be useful minima, but
we want to avoid low energy (spurious) glassy states.
One can try to derive direct mapping between Hopfield Nets and RBMs (under reasonable assumptions). Then the RBM capacity is proportional to the number of Hidden nodes. After that, the analogies stop.
The intuition about RBMs is different since (effectively) they operate at a non-zero Temperature. Additionally, it is unclear to this blogger if the proper description of deep learning is a mean field spin glass, with many useful local minima, or a strongly correlated system, which may behave very differently, and more like a funneled energy landscape.
Thouless-Anderson-Palmer Theory of Spin Glasses
The TAP theory is one of the classic analytic tools used to study spin glasses and even the Hopfield Model.
We will derive the EMF RBM method following the Thouless-Anderson-Palmer (TAP) approach to spin glasses.
On a side note he TAP method introduces us to 2 more Nobel Laureates:
The TAP theory, published in 1977, presented a formal approach to study the thermodynamics of mean field spin glasses. In particular, the TAP theory provides an expression for the average spin
where is like an activation function, C is a constant, and the MeanField and Onsager terms are like our terms above.
In 1977, they argued that the TAP approach would hold for all Temperatures (and external fields), although it was only proven until 25 years later by Talagrand. It is these relatively new, rigorous approaches that are cited by Deep Learning researchers like LeCun , Chaudhari, etc. But many of the cited results have been suggested using the TAP approach. In particular, the structure of the Energy Landscape, has been understood looking at the stationary points of the TAP free energy.
More importantly, the TAP approach can be operationalized, as a new RBM solver.
We start with the RBM Free Energy .
Introduce an ‘external field’ which couples to the spins, adding a linear term to the Free Energy
Physically, would be an external magnetic field which drives the system out-of-equilibrium.
As is standard in statistical mechanics, we take the Legendre Transform, in terms of a set of conjugate variables . These are the magnetizations of each spin under the applied field, and describe how the spins behave outside of equilibrium
The transform which effectively defines a new interaction ensemble . We now set , and note
Define an interaction Free Energy to describe the interaction ensemble
which equals the original Free Energy when .
Note that because we have visible and hidden spins (or nodes), we will identify magnetizations for each
Now, recall we want to avoid the glassy phase; this means we keep the Temperature high. Or low.
We form a low Taylor series expansion in the new ensemble
which, at low order in , we expect to be reasonably accurate even away from equilibrium, at least at high Temp.
This leads to an order-by-order expansion for the Free Energy . The first order () correction is the mean field term. The second order term () is the Onsager correction.
Upto , we have
or
Rather than assume equilibrium, we assume that, at each step during inference –at fixed (–the system satisfies a quasi-stationary condition. Each step reaches a a local saddle point in phase space, s.t.
Applying the stationary conditions lets us write coupled equations for the individual magnetizations that effectively define the (second order), high Temp, quasi-equilibrium states
Notice that the resemble the RBM conditional probabilities; in fact, at first order in , they are the same.
gives a mean field theory, and
And in the late 90’s, mean field TAP theory was attempted, unsuccessfully, to create an RBM solver.
At second order, the magnetizations are coupled through the Onsager corrections. To solve them, we can write down the fixed point equations, shown above.
We can include higher order correction the Free Energy by including more terms in the Taylor Series. This is called a Plefka expansion. The terms can be represented using diagrams
Plefka derived these terms in 1982, although it appears he only published up to the Onsager correction; a recent paper shows how to obtain all high order terms.
The Diagrammatic expansion appears not to have been fully worked out, and is only sketched above.
I can think of at least 3 ways to include these higher terms:
This is similar, in some sense, to the infinite RBM by Larochelle, which uses an resummation trick to include an infinite number of Hidden nodes.
, where
,
Obviously there are lots of interesting things to try.
The current python EMF_RBM only treats binary data, just like the scikit learn BernoulliRBM. So for say MNIST, we have to use the binarized MNIST.
There is some advantage to using Binarized Neural Networks on GPUs.
Still, a non-binary RBM may be useful. Tremel et. al. have suggested how to use real-valued data in the EMF_RBM, although in the context of Compressed Sensing Using Generalized Boltzmann Machines.
]]>