My graduate advisor used to say:
“If you can’t invent something new, invent your own notation”
Variational Inference is foundational to Unsupervised and Semi-Supervised Deep Learning, and in particular to Variational Auto Encoders (VAEs). There are many, many tutorials and implementations on Variational Inference, which I collect on my YouTube channel and below in the references. In particular, I look at modern ideas coming out of Google DeepMind.
The thing is, Variational Inference comes in 5 or 6 different flavors, and it is a lot of work just to keep all the notation straight.
We can trace the basic idea back to Hinton and Zemel (1994): to minimize a Helmholtz Free Energy.
What is missing is how Variational Inference is related to the Variational Free Energy from statistical physics. Or even how an RBM Free Energy is related to a Variational Free Energy.
This holiday weekend, I hope to review these methods and to clear some of this up. This is a long post filled with math and physics. Enjoy!
Years ago I lived in Boca Raton, Florida to be near my uncle, who was retired and on his last legs. I was working with the famous George White, one of the Dealers of Lightning, from the famous Xerox PARC. One day George stopped by to say hi, and he found me hanging out at the local Wendy’s and reading the book Generatingfunctionology. It’s a great book. And even more relevant today.
The Free Energy is a Generating function. It generates thermodynamic relations. It generates expected Energies through weight gradients. It generates the Kullback-Leibler variational bound, and its corrections, as cumulants. And, simply put, in unsupervised learning, the Free Energy generates data.
Let’s see how it all ties together.
We first review inference in RBMs, which is one of the few Deep Learning examples that is fully expressed with classical Statistical Mechanics.
Suppose we have some (unlabeled) data $latex \mathbf{x} $. We know now that we need to learn a good hidden representation ($latex \mathbf{h} $) with an RBM, or, say, a latent ($latex \mathbf{z} $) representation with a VAE.
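Before we begin, let us try to keep the notation straight. To compare different methods, I need to mix the notation a bit, and may be a little sloppy sometimes. Here, I may interchange the RBM and VAE conventions

$latex \mathbf{v}\leftrightarrow\mathbf{x},\quad\mathbf{h}\leftrightarrow\mathbf{z} $

and, WLOG, may interchange the log functions

$latex \ln\leftrightarrow\log $

and drop the parameters on the distributions

$latex p_{\theta}(\mathbf{x})\rightarrow p(\mathbf{x}) $

Also, the stat mech Physics Free Energy convention is the negative log Z,

$latex F=-\ln Z $

and I sometimes use the bra-ket notation for expectation values

$latex \langle f\rangle_{q}=\mathbb{E}_{q}[f] $.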
Finally I might mix up the minus signs in the early draft of this blog; please let me know.
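In an RBM, we learn an Energy function $latex E(\mathbf{v},\mathbf{h}) $ explicitly:

$latex E(\mathbf{v},\mathbf{h})=-\mathbf{v}^{T}W\mathbf{h}-\mathbf{a}^{T}\mathbf{v}-\mathbf{b}^{T}\mathbf{h} $

Inference means gradient learning, along the variational parameters $latex (W,\mathbf{a},\mathbf{b}) $, for the expected log likelihood

$latex \langle\ln p(\mathbf{v})\rangle_{data} $.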
This is actually a form of Free Energy minimization. Let’s see why…
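The joint probability is given by a Boltzmann distribution

$latex p(\mathbf{v},\mathbf{h})=\dfrac{e^{-E(\mathbf{v},\mathbf{h})}}{Z},\quad Z=\sum_{\mathbf{v},\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h})} $.

To get $latex p(\mathbf{v}) $, we have to integrate out the hidden variables

$latex \ln p(\mathbf{v})=\ln\sum_{\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h})}-\ln Z $
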
log likelihood = – clamped Free Energy + equilibrium Free Energy
(note the minus sign convention)
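We recognize the second term as the total, or equilibrium, Free Energy from the partition function $latex Z $. This is just like in Statistical Mechanics (stat mech), but with $latex T=1 $. We call the first term the clamped Free Energy $latex F^{c}(\mathbf{v}) $ because it is like a Free Energy, but clamped to the data (the visible units). This gives

$latex \ln p(\mathbf{v})=-F^{c}(\mathbf{v})+F,\quad F^{c}(\mathbf{v})=-\ln\sum_{\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h})},\quad F=-\ln Z $.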
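To make the decomposition concrete, here is a minimal NumPy sketch on a toy RBM small enough to enumerate every state. The energy convention E(v,h) = -v.W.h - a.v - b.h, the sizes, and the random parameters are my assumptions for illustration only; the check is that log p(v) = -(clamped Free Energy) + (equilibrium Free Energy) gives marginals that sum to one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                          # tiny sizes so every state can be enumerated
W = rng.normal(scale=0.5, size=(n_v, n_h))
a = rng.normal(scale=0.1, size=n_v)      # visible biases (illustrative values)
b = rng.normal(scale=0.1, size=n_h)      # hidden biases (illustrative values)

def energy(v, h):
    """Assumed RBM energy: E(v,h) = -v.W.h - a.v - b.h."""
    return -(v @ W @ h) - a @ v - b @ h

def states(n):
    """All binary vectors of length n."""
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

def clamped_free_energy(v):
    """F_c(v) = -ln sum_h exp(-E(v,h)), with the visible units clamped to v."""
    return -np.log(sum(np.exp(-energy(v, h)) for h in states(n_h)))

def clamped_free_energy_closed_form(v):
    """Same quantity, using the factorization over hidden units (softplus form)."""
    return -(a @ v) - np.sum(np.logaddexp(0.0, b + v @ W))

def equilibrium_free_energy():
    """F = -ln Z, summing over every (v,h) configuration."""
    return -np.log(sum(np.exp(-energy(v, h))
                       for v in states(n_v) for h in states(n_h)))

F_eq = equilibrium_free_energy()
log_p = {tuple(v): -clamped_free_energy(v) + F_eq for v in states(n_v)}
total_prob = sum(np.exp(lp) for lp in log_p.values())
print(total_prob)   # the marginals p(v) should sum to 1, up to rounding
```

The brute-force sum over hidden states is only feasible at this toy scale; the closed form is what makes the clamped term cheap in practice, while the equilibrium term stays exponential in the number of units.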
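We see that the partition function Z is not just the normalization; it is a generating function. In statistical thermodynamics, derivatives of $latex \ln Z $ yield the expected energy

$latex \langle E\rangle=-\dfrac{\partial\ln Z}{\partial\beta},\quad\beta=\dfrac{1}{T} $

Since $latex E\sim W $, we can associate an effective T with the norm of the weights

$latex T_{eff}\sim\dfrac{1}{\Vert W\Vert} $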
So if we take weight gradients of the Free Energies, we expect to get something like expected Energies. And this is exactly the result.
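The gradients of the clamped Free Energy give an expectation value over the conditional $latex p(\mathbf{h}|\mathbf{v}) $

$latex \dfrac{\partial F^{c}(\mathbf{v})}{\partial w_{ij}}=-\langle v_{i}h_{j}\rangle_{p(\mathbf{h}|\mathbf{v})} $

and the equilibrium Free Energy gradient yields an expectation of the joint distribution $latex p(\mathbf{v},\mathbf{h}) $:

$latex \dfrac{\partial F}{\partial w_{ij}}=-\langle v_{i}h_{j}\rangle_{p(\mathbf{v},\mathbf{h})} $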
The derivatives do resemble expected Energies, with a unit weight matrix $latex W=I $, which is also, effectively, $latex T=1 $. See the Appendix for the full derivations.
The clamped Free Energy is easy to evaluate numerically, but the equilibrium distribution is intractable. Hinton’s approach, Contrastive Divergence, takes a point estimate

$latex \langle v_{i}h_{j}\rangle_{p(\mathbf{v},\mathbf{h})}\approx v'_{i}h'_{j} $,

where $latex \mathbf{v}' $ and $latex \mathbf{h}' $ are taken from one or more iterations of Gibbs Sampling, which is easily performed on the RBM model.
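As a rough illustration of the point estimate, the following NumPy sketch computes a one-step Contrastive Divergence (CD-1) weight gradient for binary units. The function name, the shapes, and the use of hidden probabilities rather than hidden samples in the statistics are my choices for this sketch, not something the post prescribes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_weight_gradient(v0, W, a, b, rng):
    """One-step Contrastive Divergence estimate of d(log p)/dW for a batch v0."""
    # Positive (clamped) phase: <v h> under p(h | v0), available in closed form.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step v0 -> h0 -> v1 -> h1 stands in for the equilibrium average.
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    positive = v0.T @ ph0 / len(v0)
    negative = v1.T @ ph1 / len(v0)
    return positive - negative

rng = np.random.default_rng(0)
v0 = (rng.random((8, 6)) < 0.5).astype(float)   # made-up batch of 8 binary vectors
W = rng.normal(scale=0.1, size=(6, 4))
a, b = np.zeros(6), np.zeros(4)
grad = cd1_weight_gradient(v0, W, a, b, rng)
print(grad.shape)
```

A gradient ascent step would then be `W += learning_rate * grad`; running more Gibbs iterations before the negative phase gives CD-k.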
Unsupervised learning appears to be a problem in statistical mechanics — to evaluate the equilibrium partition function. There are lots of methods here to consider, including
Not to mention the very successful Deep Learning approach, which appears to be to simply guess, and then learn deterministic fixed point equations (e.g. SegNet), via Convolutional AutoEncoders.
Unsupervised Deep Learning today looks like an advanced graduate curriculum in non-equilibrium statistical mechanics, all coded up in TensorFlow.
We would need a year or more of coursework to go through this all, but I will try to impart some flavor as to what is going on here.
VAEs are a kind of generative deep learning model: they let us model and generate fake data. There are at least 10 different popular models right now, all easily implemented (see the links) in TensorFlow, Keras, and Edward.
The vanilla VAE, à la Kingma and Welling, is foundational to unsupervised deep learning.
As in an RBM, in a VAE, we seek the joint probability $latex p(\mathbf{x},\mathbf{z}) $. But we don’t want to evaluate the intractable partition function $latex Z $, or the equilibrium Free Energy $latex F=-\ln Z $, directly. That is, we can not evaluate $latex p $, but perhaps there is some simpler, model distribution $latex q $ which we can sample from.
There are several starting points although, in the end, we still end up minimizing a Free Energy. Let’s look at a few:
This is an autoencoder, so we are minimizing something like a reconstruction error. What we need is a score between the empirical and model distributions, where the partition function cancels out.

$latex J=\dfrac{1}{2}\,\mathbb{E}_{p_{data}(\mathbf{x})}\left\Vert\nabla_{\mathbf{x}}\ln p_{data}(\mathbf{x})-\nabla_{\mathbf{x}}\ln p_{model}(\mathbf{x})\right\Vert^{2} $.

This is called score matching (2005). It has been shown to be closely related to auto-encoders.
I bring this up because if we look at supervised Deep Nets, and even unsupervised Nets like convolutional AutoEncoders, they are minimizing some kind of Energy or Free energy, implicitly at $latex T=0 $, deterministically. There is no partition function; it seems to have just canceled out.
We can also consider just maximizing the expected log likelihood, under the model

$latex \max_{\theta}\;\mathbb{E}_{\mathbf{x}}\left[\ln p_{\theta}(\mathbf{x})\right] $.

And with some re-arrangements, we can extract out a Helmholtz-like Free Energy. It is presented nicely in the Stanford class on Deep Learning, Lecture 13 on Generative Models.
We can also start by just minimizing the KL divergence between the posteriors

$latex KL\left[q(\mathbf{z}|\mathbf{x})\,\Vert\,p(\mathbf{z}|\mathbf{x})\right] $

although we don’t actually minimize this KL divergence directly.
In fact, there is a great paper / video on Sequential VAEs which asks: are we trying to make q model p, or p model q? The authors note that a good VAE, like a good RBM, should not just generate good data, but should also give a good latent representation $latex \mathbf{z} $. And the reason VAEs generate fuzzy data is because we over-optimize recovering the exact spatial information and don’t try hard enough to get $latex \mathbf{z} $ right.
The most important paper in the field today is by Kingma and Welling, where they lay out the basics of VAEs. The video presentation is also excellent.
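We form a continuous, Variational Lower Bound, which is a negative Free Energy ($latex -\mathcal{F} $)

$latex \ln p(\mathbf{x})\ge-\mathcal{F}=\langle\ln p(\mathbf{x},\mathbf{z})-\ln q(\mathbf{z}|\mathbf{x})\rangle_{q} $

And either minimizing the divergence of the posteriors, or maximizing the marginal likelihood, we end up minimizing a (negative) Variational Helmholtz Free Energy:

$latex \mathcal{F}=\langle-\ln p(\mathbf{x},\mathbf{z})\rangle_{q}-S[q],\quad S[q]=-\langle\ln q(\mathbf{z}|\mathbf{x})\rangle_{q} $
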
Free Energy = Expected Energy – Entropy
There are numerous derivations of the bound, including the Stanford class and the original lecture by Kingma. The take-away is
Maximizing the Variational Lower Bound minimizes the Free Energy
This is, again, actually an old idea from statistical mechanics, traced back to Feynman’s book (available in Hardback on Amazon for $1300!)
We make it sound fancy by giving it a Russian name, the Gibbs-Bogoliubov relation (described nicely here). It is a finite-Temperature generalization of the Rayleigh-Ritz theorem for the more familiar Hamiltonians and Hermitian matrices.
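The idea is to approximate the (Helmholtz) Free Energy $latex F $ with a guess, model, or trial Free Energy $latex F_{q} $, defined by expectations $latex \langle\cdot\rangle_{q} $ such that

$latex F_{q}+\langle\mathcal{H}-\mathcal{H}_{q}\rangle_{q} $

is always greater than (or equal to) the true $latex F $, and as our guess expectation $latex \langle\cdot\rangle_{q} $ gets better, our approximation improves.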
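The bound is easy to check numerically on a toy system. In this NumPy sketch the 16 "true" energies and the noisy trial energies are made up, T = 1, and q is taken to be the Boltzmann distribution of the trial energies; the gap between the trial expression and the true Free Energy is then the KL divergence between q and p, so it should never be negative and should vanish when the guess is exact.

```python
import numpy as np

def free_energy(H):
    """F = -ln sum_i exp(-H_i) at T = 1."""
    return -np.log(np.sum(np.exp(-H)))

def bogoliubov_bound(H, H_q):
    """Trial value F_q + <H - H_q>_q, with q the Boltzmann distribution of H_q."""
    q = np.exp(-H_q)
    q = q / q.sum()
    return free_energy(H_q) + np.sum(q * (H - H_q))

rng = np.random.default_rng(1)
H = rng.normal(size=16)            # made-up "true" energies of 16 configurations
F_true = free_energy(H)

gaps = []
for noise in (2.0, 0.5, 0.0):      # progressively better trial Hamiltonians
    H_q = H + noise * rng.normal(size=16)
    gaps.append(bogoliubov_bound(H, H_q) - F_true)
print(gaps)                         # every gap is >= 0; the last one is ~0
```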
This is also very physically intuitive and reflects our knowledge of the fluctuation theorems of non-equilibrium stat mech. It says that any small fluctuation away from equilibrium will relax back to equilibrium. In fact, this is a classic way to prove the variational bound…
and it introduces the idea of conservation of volume in phase space (i.e. the Liouville equation), which, I believe, is related to Normalizing Flows for VAEs. But that is a future post.
Stochastic gradient descent for VAEs is a deep subject; it is described in detail here
The gradient descent problem is to find the Free Energy gradient in the generative and variational parameters $latex (\theta,\phi) $. The trick, however, is to specify the problem so we can bring the variational gradient inside the expectation value.
This is not trivial since the expected value depends on the variational parameters. For the simple Free Energy objective above, we can show that

$latex -\mathcal{F}=\langle\ln p_{\theta}(\mathbf{x}|\mathbf{z})\rangle_{q_{\phi}}-KL\left[q_{\phi}(\mathbf{z}|\mathbf{x})\,\Vert\,p_{\theta}(\mathbf{z})\right] $

Although we will make even further approximations to get working code.
We would like to apply BackProp to the variational lower bound; writing it in these 2 terms makes this possible. We can evaluate the first term, the reconstruction error, using mini-batch SGD sampling, whereas the KL regularizer term is evaluated analytically.
We specify a tractable distribution $latex q(\mathbf{z}|\mathbf{x}) $, where we can numerically sample the posterior to get the latent variables $latex \mathbf{z} $ using either a point estimate on 1 instance, or a mini-batch estimate.
As in statistical physics, we do what we can, and take a mean field approximation. We then apply the reparameterization trick to let us apply BackProp. I review this briefly in the Appendix.
This leads to several questions which I will address in this blog:
As in Deep Learning, in almost all problems in statistical mechanics we don’t know the actual Energy function, or Hamiltonian, $latex \mathcal{H} $. So we can’t form the Partition Function $latex Z $, and we can’t solve for the true Free Energy $latex F $. So, instead, we solve what we can.
For a VAE, instead of trying to find the joint distribution $latex p(\mathbf{x},\mathbf{z}) $, as in an RBM, we want the associated Energy function, also called a Hamiltonian, $latex \mathcal{H}(\mathbf{x},\mathbf{z}) $. The unknown VAE Energy is presumably more complicated than a simple RBM quadratic function, so instead of learning it flat out, we start by guessing some simpler Energy function $latex \mathcal{H}_{q}(\mathbf{x},\mathbf{z}) $. More importantly, we want to avoid computing the equilibrium partition function. The key is, $latex \mathcal{H}_{q} $ is something we know, something tractable.
And, as in physics, q will also be a mean field approximation, but we don’t need that here.
We decompose the total Hamiltonian into a model Hamiltonian Energy plus a perturbation
Energy = Model + Perturbation
The perturbation is the difference between the true and model Energy functions, and is assumed to be small in some sense. That is, we expect our initial guess to be pretty good already. Whatever that means. We have

$latex \mathcal{H}(\mathbf{x},\mathbf{z})=\mathcal{H}_{q}(\mathbf{x},\mathbf{z})+\lambda V(\mathbf{x},\mathbf{z}) $
The constant $latex \lambda\le1&bg=ffffff$ is used to formally construct a power series (cumulant expansion); it is set to 1 at the end.
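Write the equilibrium Free Energy in terms of the total Hamiltonian Energy function

$latex F=-\ln Z=-\ln\sum_{\mathbf{x},\mathbf{z}}e^{-\mathcal{H}(\mathbf{x},\mathbf{z})} $

There are numerous expressions for the Free Energy (see the Appendix). From above, we have

$latex Z=\sum_{\mathbf{x},\mathbf{z}}e^{-\mathcal{H}(\mathbf{x},\mathbf{z})} $

and we define equilibrium averages as

$latex \langle f\rangle=\dfrac{1}{Z}\sum_{\mathbf{x},\mathbf{z}}f(\mathbf{x},\mathbf{z})\,e^{-\mathcal{H}(\mathbf{x},\mathbf{z})} $

Recall we can not evaluate equilibrium averages $latex \langle\cdot\rangle $, but we can presumably evaluate model averages $latex \langle\cdot\rangle_{q} $. Given

$latex \langle f\rangle_{q}=\dfrac{1}{Z_{q}}\sum_{\mathbf{x},\mathbf{z}}f(\mathbf{x},\mathbf{z})\,e^{-\mathcal{H}_{q}(\mathbf{x},\mathbf{z})} $,

where $latex Z_{q}=\sum_{\mathbf{x},\mathbf{z}}e^{-\mathcal{H}_{q}(\mathbf{x},\mathbf{z})} $, and dropping the indices, we write

$latex F=-\ln\sum e^{-\mathcal{H}_{q}}e^{-\lambda V} $

Insert $latex Z_{q}/Z_{q} $ inside the log, where $latex Z_{q}/Z_{q}=1 $, giving

$latex F=-\ln\left[Z_{q}\,\dfrac{1}{Z_{q}}\sum e^{-\mathcal{H}_{q}}e^{-\lambda V}\right] $

Using the property $latex \ln(ab)=\ln a+\ln b $, and the definition of $latex \langle\cdot\rangle_{q} $, we have expressed the Free Energy as an expectation in q.

$latex F=-\ln Z_{q}-\ln\langle e^{-\lambda V}\rangle_{q} $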
This is formally exact–but hard to evaluate even with a tractable model.
We can approximate $latex \ln\langle e^{-\lambda V}\rangle_{q} $ using a cumulant expansion, giving us both the Kullback-Leibler Variational Free Energy, and corrections, giving a Perturbation Theory for Variational Inference.
Cumulants can be defined most simply by a power series of the Cumulant generating function

$latex K(\lambda)=\ln\langle e^{\lambda X}\rangle=\sum_{n=1}^{\infty}\kappa_{n}\dfrac{\lambda^{n}}{n!} $

although it can be defined and applied more generally, and is a very powerful modeling tool.
As I warned you, I will use the bra-ket notation for expectations here, and switch to natural log
We immediately see that
the stat mech Free Energy has the form of a Cumulant generating function.
Being a generating function, the cumulants are generated by taking derivatives (as in this video), and expressed using double bra-ket notation.
The first cumulant is just the mean expected value

$latex \langle\langle X\rangle\rangle=\langle X\rangle $

whereas the second cumulant is the variance, the “mean of square minus square of mean”

$latex \langle\langle X^{2}\rangle\rangle=\langle X^{2}\rangle-\langle X\rangle^{2} $

(yup, cumulants are so common in physics that they have their own bra-ket notation)
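To see the generating function at work, this small NumPy sketch differentiates K(lambda) = ln of the expected exp(lambda X) numerically at lambda = 0 and compares the first two derivatives with the mean and the variance. The discrete distribution and the finite-difference step are arbitrary choices for illustration.

```python
import numpy as np

def cgf(x, p, lam):
    """Cumulant generating function K(lam) = ln <exp(lam * x)> for a discrete distribution."""
    return np.log(np.sum(p * np.exp(lam * x)))

x = np.array([0.0, 1.0, 2.0, 5.0])     # made-up outcomes
p = np.array([0.4, 0.3, 0.2, 0.1])     # made-up probabilities (sum to 1)

eps = 1e-3                              # finite-difference step
k1 = (cgf(x, p, eps) - cgf(x, p, -eps)) / (2 * eps)                       # K'(0)
k2 = (cgf(x, p, eps) - 2 * cgf(x, p, 0.0) + cgf(x, p, -eps)) / eps**2     # K''(0)

mean = np.sum(p * x)
var = np.sum(p * x**2) - mean**2
print(k1, mean)    # first cumulant vs. mean
print(k2, var)     # second cumulant vs. variance
```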
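This is a classic perturbative approximation. It is a weak-coupling expansion for the equilibrium Free Energy, appropriate for small $latex V $, and/or high Temperature. Since we always, naively, assume $latex T=1 $, it is seemingly applicable when the distribution $latex q $ is a good guess for $latex p $.
Since the log expectation is a cumulant generating function, we can express the equilibrium Free Energy as a power series, in cumulants, of the perturbation V

$latex F=-\ln Z_{q}+\lambda\langle\langle V\rangle\rangle_{q}-\dfrac{\lambda^{2}}{2}\langle\langle V^{2}\rangle\rangle_{q}+\cdots $

Setting $latex \lambda=1 $, the first order term combines with $latex -\ln Z_{q} $ to form the model Helmholtz, or Kullback-Leibler, Free Energy

$latex -\ln Z_{q}+\langle V\rangle_{q}=\langle\mathcal{H}\rangle_{q}-S[q] $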
The total equilibrium Free Energy is expressed as the model Free Energy plus perturbative corrections.
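Here is a toy numerical check of the expansion, assuming T = 1 and lambda = 1. The four model energies and the small perturbation are made-up numbers; the sketch compares the exact Free Energy with the first-order (variational) estimate and with the estimate that adds the second cumulant as a correction.

```python
import numpy as np

H_q = np.array([0.0, 1.0, 2.0, 3.0])          # tractable model energies (made up)
V = 0.1 * np.array([1.0, -1.0, 2.0, 0.0])     # small perturbation (made up)
H = H_q + V                                    # total energies, with lambda = 1

F_exact = -np.log(np.sum(np.exp(-H)))
F_model = -np.log(np.sum(np.exp(-H_q)))        # -ln Z_q

q = np.exp(-H_q)
q = q / q.sum()                                # Boltzmann distribution of the model
mean_V = np.sum(q * V)                         # first cumulant of V under q
var_V = np.sum(q * V**2) - mean_V**2           # second cumulant of V under q

F_first = F_model + mean_V                     # variational (first-order) estimate
F_second = F_first - 0.5 * var_V               # plus the second-order correction

print(F_exact, F_first, F_second)
```

The first-order value sits above the exact Free Energy, as the variational bound requires, while the second-order value is no longer a bound but, for a perturbation this small, lands much closer.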
And now, for some
We now see the connection between the RBMs and VAEs, or, rather between the statistical physics formulation, with Energy and Partition functions, and the Bayesian probability formulation of VAEs.
Statistical mechanics has a very long history, over 100 years old, and there are many techniques now being lifted or rediscovered in Deep Learning, and then combined with new ideas. This post introduces the ideas being used today at DeepMind, with some perspective from their origins, and some discussion about their utility and effectiveness brought from having seen and used these techniques in different contexts in theoretical chemistry and physics.
Of course, cumulants are not the only statistical physics tool. There are other Free Energy approximations, such as the TAP theory we used in the deterministic EMF-RBM.
Both the cumulant expansion and TAP theory are classic methods from non-equilibrium statistical physics. Neither is convex. Neither is exact. In fact, it is unclear if these expansions even converge, although they may be asymptotically convergent. The cumulants are very old, and applicable to general distributions. TAP theory is specific to spin glass theory, and can be applied to neural networks with some modifications.
The cumulants play a critical role in statistical physics and quantum chemistry because they provide a size-extensive approximation. That is, in the limit of a very large deep net ($latex N\rightarrow\infty $), the Energy function we learn scales linearly in N.
For example, mean field theories obey this scaling. Variational theories generally do not obey this scaling when they include correlations, but perturbative methods do.
The variational theorem is easily proven using Jensen’s inequality, as in David Blei’s notes.
In the context of spin glass theory, for those who remember this old stuff, this means that we have expressions like
which, for a given spin glass model, occurs at the boundary (i.e. the Nishimori line) of the spin glass phase. I will discuss this more in a future post.
Well, it has been a long post, which seems appropriate for Labor Day.
But there is more, in the
I will try to finish this soon; the derivation is found in Ali Ghodsi, Lec [7], Deep Learning , Restricted Boltzmann Machines (RBMs)
I think it is easier to understand the Kingma and Welling paper AutoEncoding Variational Bayes by looking at the equations next to the Keras Blog and code. We are minimizing the Variational Free Energy, but reformulate it using the mean field approximation and the reparameterization trick.
We choose a model Q that factorizes into Gaussians

$latex q(\mathbf{z}|\mathbf{x})=\prod_{j}\mathcal{N}(z_{j};\mu_{j},\sigma_{j}^{2}) $
We can also use other distributions, such as
Being mean field, the VAE model Energy function $latex \mathcal{H}_{q}(\mathbf{x},\mathbf{z})&bg=ffffff$ is effectively an RBM-like quadratic Energy function, although we don’t specify it explicitly. On the other hand, the true $latex \mathcal{H}(\mathbf{x},\mathbf{z})&bg=ffffff$ is presumably more complicated.
We use a factored distribution to reexpress the KL regularizer using
We can not backpropagate through a literal stochastic node z because we can not form the gradient. So we just replace the innermost hidden layer with a continuous latent space, and form z by sampling from this.
We reparameterize z with explicit random values $latex \epsilon $, sampled from a Normal distribution N(0,I)

$latex \mathbf{z}=\mu+\sigma\odot\epsilon,\quad\epsilon\sim N(0,I) $
In Keras, we define z with a (Lambda) sampling function, eval’d on each batch step
and use this z in the last decoder hidden layer
Of course, this slows down execution since we have to call K.random_normal on every SGD batch.
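For readers without the Keras code at hand, here is a framework-free NumPy sketch of the reparameterization trick. It is not the Keras blog code; the function name `sample_z`, the parameter values, and the test function z^2 are my own illustrative choices. Because z is a deterministic function of the mean and sigma once the noise is drawn, derivatives with respect to both pass straight through the sample.

```python
import numpy as np

def sample_z(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps, eps

rng = np.random.default_rng(0)
n = 100_000
mu = np.full((n, 2), 1.5)                    # illustrative mean
log_var = np.full((n, 2), np.log(0.25))      # illustrative variance, so sigma = 0.5

z, eps = sample_z(mu, log_var, rng)

# Pathwise (reparameterized) gradients of E[z^2] = mu^2 + sigma^2:
#   d(z^2)/d(mu)    = 2 z        -> averages to 2 * mu
#   d(z^2)/d(sigma) = 2 z * eps  -> averages to 2 * sigma
grad_mu = np.mean(2 * z, axis=0)
grad_sigma = np.mean(2 * z * eps, axis=0)
print(grad_mu, grad_sigma)
```

An autodiff framework does the same thing implicitly: the random draw is treated as a constant input, and BackProp differentiates only through the deterministic shift and scale.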
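We estimate the mean $latex \mu $ and variance $latex \sigma^{2} $ for the examples in the mini-batch, and sample $latex \mathbf{z} $ from these vectors. The KL regularizer can then be expressed analytically as

$latex KL=-\dfrac{1}{2}\sum_{j}\left(1+\ln\sigma_{j}^{2}-\mu_{j}^{2}-\sigma_{j}^{2}\right) $

This is inserted directly into the VAE Loss function. For each mini-batch (of size L), the loss is
where the KL Divergence (kl_loss) is approximated in terms of the mini-batch estimates for the mean $latex \mu $ and variance $latex \sigma^{2} $.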
In Keras, the loss looks like:
We can now apply BackProp using SGD, RMSProp, etc. to minimize the VAE Loss, with $latex \epsilon $ resampled on every mini-batch step.
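The structure of the loss can be sketched in plain NumPy. This is not the Keras code; the function names are mine, and I am assuming a binary cross-entropy reconstruction term and a standard Normal prior on z, which is the usual setup for MNIST-style examples.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Analytic KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)

def vae_loss(x, x_decoded, mu, log_var):
    """Per-example loss = reconstruction error (binary cross-entropy) + KL regularizer."""
    tiny = 1e-7                                        # avoids log(0)
    x_decoded = np.clip(x_decoded, tiny, 1.0 - tiny)
    xent = -np.sum(x * np.log(x_decoded) + (1 - x) * np.log(1 - x_decoded), axis=-1)
    return xent + kl_to_standard_normal(mu, log_var)

# Tiny usage example with made-up numbers: one 2-pixel "image", 2 latent dimensions.
x = np.array([[1.0, 0.0]])
x_decoded = np.array([[0.5, 0.5]])
loss = vae_loss(x, x_decoded, mu=np.zeros((1, 2)), log_var=np.zeros((1, 2)))
print(loss)
```

The KL term vanishes when the encoder outputs exactly the prior (zero mean, unit variance), so in that case the loss reduces to the reconstruction error alone.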
In machine learning, we use expected value notation, such as
but in physics and chemistry there are 5 or 6 other notations. I jotted them down here for my own sanity.
For RBMs and other discrete objects, we have
Of course, we may want the limit $latex N\rightarrow\infty $, but we have to be careful how we take this limit. Still, we may write
In the continuous case, we specify a density of states $latex \rho(E)&bg=ffffff $
which is not the same as specifying a distribution over the internal variables, giving
In quantum statistical mechanics, we replace the Energy with the Hamiltonian operator, and replace the expectation value with the Trace operation
and this is also expressed using a bra-ket notation
and usually use subscripts to represent non-equilibrium states
Raise the Posteriors
This paper, along with his wake-sleep algorithm, set the foundations for modern variational learning. They appear in his RBMs, and more recently, in Variational AutoEncoders (VAEs) .
Of course, Free Energies come from Chemical Physics. And this is not surprising, since Hinton’s graduate advisor was a famous theoretical chemist.
They are so important that Karl Friston has proposed The Free Energy Principle: A Unified Brain Theory?
(see also the wikipedia and this 2013 review)
What are free Energies and why do we use them in Deep Learning ?
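In (Unsupervised) Deep Learning, Energies are quadratic forms over the weights. In an RBM, one has

$latex E(\mathbf{v},\mathbf{h})=-\mathbf{v}^{T}W\mathbf{h}-\mathbf{a}^{T}\mathbf{v}-\mathbf{b}^{T}\mathbf{h} $

This is the T=0 configurational Energy, where each configuration is some $latex (\mathbf{v},\mathbf{h}) $ pair. In chemical physics, these Energies resemble an Ising model.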
The Free Energy is a weighted average of all the global and local minima

$latex F=-T\ln\sum_{i}e^{-E_{i}/T} $

Note: as $latex T\rightarrow0 $, the Free Energy becomes the T=0 global energy minimum $latex E_{0} $. In the limit of zero Temperature, all the terms in the sum approach zero
and only the largest term, the largest negative Energy, survives.
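The zero-Temperature limit is easy to watch numerically. This sketch assumes the form F(T) = -T ln of the sum over exp(-E_i / T), with a made-up list of energies, and factors out the minimum energy so the exponentials do not overflow at small T.

```python
import numpy as np

def free_energy(E, T):
    """F(T) = -T ln sum_i exp(-E_i / T), computed stably by factoring out min(E)."""
    E = np.asarray(E, dtype=float)
    E_min = E.min()
    return E_min - T * np.log(np.sum(np.exp(-(E - E_min) / T)))

E = [-3.0, -2.5, -1.0, 0.0, 2.0]            # made-up energy levels
F_by_T = {T: free_energy(E, T) for T in (10.0, 1.0, 0.1, 0.01)}
for T, F in F_by_T.items():
    print(T, F)     # F rises toward min(E) as T is lowered, never exceeding it
```

At high Temperature every level contributes and F sits well below the lowest energy; as T drops, the entropic contribution is squeezed out and only the global minimum survives.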
We may also see F written in terms of the partition function Z:
where the bra-kets denote an equilibrium average, an expected value over some equilibrium probability distribution. (We don’t normalize with 1/N here; in principle, the sum could be infinite.)
Of course, in deep learning, we may be trying to determine the distribution , and/or we may approximate it with some simpler distribution during inference. (From now on, I just write P and Q for convenience)
But there is more to Free Energy learning than just approximating a distribution.
In a chemical system, the Free Energy averages over all global and local minima below the Temperature T–with barriers below T as well. It is the Energy available to do work.
For convenience, Hinton explicitly set T=1. Of course, he was doing inference, and did not know the scale of the weights W. Since we don’t specify the Energy scale, we learn the scale implicitly when we learn W. We call this being scale-free.
So in the T=1, scale free case, the Free Energy implicitly averages over all Energy minima where , as we learn the weights W. Free Energies solve the problem of Neural Nets being non-convex by averaging over the global minima and nearby local minima.
Because Free Energies provide an average solution, they can even provide solutions to highly degenerate non-convex optimization problems:
They will fail, however, when the barriers between Energy basins are larger than the Temperature.
This can happen if the effective Temperature drops close to zero during inference. Since T=1 implicitly in inference, this means when the weights W are exploding.
See: Normalization in Deep Learning
Systems may also get trapped if the Energy barriers grow very large, as, say, in the glassy phase of a mean field spin glass. Or a supercooled liquid, the so-called Adam-Gibbs phenomenon. I will discuss this in a future post.
In either case, if the system, or solver, gets trapped in a single Energy basin, it may appear to be convex, and/or flat (the Hessian has lots of zeros). But this is probably not the optimal solution to learning when using a Free Energy method.
It is sometimes argued that Deep Learning is a non-convex optimization problem. And yet, it has been known for over 20 years that networks like CNNs don’t suffer from the problems of local minima. How can this be?
At least for unsupervised methods, it has been clear since 1987 that:
An important property of the effective [Free] Energy function E(V,0,T) is that it has a smoother landscape than E(S) [T=0] …
Hence, the probability of getting stuck in a local minima decreases
Although this is not specifically how Hinton argued for the Helmholtz Free Energy — a decade later.
Why do we use Free energy methods ? Hinton used the bits-back argument:
Imagine we are encoding some training data and sending it to someone for decoding. That is, we are building an Auto-Encoder.
If we have only 1 possible encoding, we can use any vanilla encoding method and the receiver knows what to do.
But what if we have 2 or more equally valid codes?
Can we save 1 bit by being a little vague ?
Suppose we have N possible encodings $latex i=1\ldots N $, each with Energy $latex E_{i} $. We say the data has stochastic complexity.
Pick a coding with probability $latex p_{i} $ and send it to the receiver. The expected cost of encoding is

$latex \langle E\rangle=\sum_{i}p_{i}E_{i} $

Now the receiver must guess which encoding we used. The decoding cost of the receiver is

$latex F=\sum_{i}p_{i}E_{i}-H $

where H is the Shannon Entropy of the random encoding

$latex H=-\sum_{i}p_{i}\ln p_{i} $
The decoding cost looks just like a Helmholtz Free Energy.
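A small numerical example makes the bits-back saving visible. The three code costs below are made up (two equally good codes and one worse one), costs are measured in nats, and the function name is mine. The sketch compares always sending one code, mixing the two good codes evenly, and using the Boltzmann distribution over all codes.

```python
import numpy as np

def coding_free_energy(p, E):
    """F(p) = sum_i p_i E_i - H(p): expected code cost minus the entropy we get back."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                                   # treat 0 * ln(0) as 0
    entropy = -np.sum(p[nz] * np.log(p[nz]))
    return np.sum(p * E) - entropy

E = np.array([1.0, 1.0, 3.0])                    # made-up code costs, in nats
one_hot = np.array([1.0, 0.0, 0.0])              # always send the first code
fifty_fifty = np.array([0.5, 0.5, 0.0])          # be "a little vague" between the two best
boltzmann = np.exp(-E) / np.sum(np.exp(-E))      # p_i proportional to exp(-E_i)

F_one_hot = coding_free_energy(one_hot, E)
F_fifty = coding_free_energy(fifty_fifty, E)
F_boltzmann = coding_free_energy(boltzmann, E)
print(F_one_hot, F_fifty, F_boltzmann)
```

Mixing the two equally good codes saves exactly ln 2 nats, i.e. 1 bit, relative to committing to one of them, and the Boltzmann choice does slightly better still, reaching the Helmholtz value -ln of the sum over exp(-E_i).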
Moreover, we can use a sub-optimal encoding, and they suggest using a Factorized (i.e. mean field) Feed Forward Net to do this.
To understand this better, we need to relate
In 1957, Jaynes formulated the MaxEnt principle which considers equilibrium thermodynamics and statistical mechanics as inference processes.
In 1995, Hinton formulated the Helmholtz Machine and showed us how to define a quasi-Free Energy.
In Thermodynamics, the Helmholtz Free Energy F(T,V,N) is an Energy that depends on Temperature instead of Entropy. We need

$latex E(S,V,N)\rightarrow F(T,V,N) $

and F is defined as

$latex F=E-TS $

In ML, we set T=1. Really, the Temperature equals how much the Energy changes with a change in Entropy (at fixed V and N)

$latex T=\left(\dfrac{\partial E}{\partial S}\right)_{V,N} $

Variables like E and S depend on the system size N. That is,

$latex E\sim N,\quad S\sim N $

as $latex N\rightarrow\infty $

We say S and T are conjugate pairs; S is extensive, T is intensive.
(see more on this in the Appendix)
The conjugate pairs are used to define Free Energies via the Legendre Transform:
Helmholtz Free Energy: F(T) = E(S) – TS
We switch the Energy from depending on S to T, where $latex T=\partial E/\partial S $.
Why ? In a physical system, we may know the Energy function E, but we can’t directly measure or vary the Entropy S. However, we are free to change and measure the Temperature–the derivative of E w/r.t. S:
This is a powerful and general mathematical concept.
Say we have a convex function f(x,y,z), but we can’t actually vary x. But we do know the slope, w, everywhere along x

$latex w=\dfrac{\partial f}{\partial x} $.

Then we can form the Legendre Transform, which gives g(w,y,z) as
the ‘Tangent Envelope‘ of f() along x

$latex g(w,y,z)=\min_{x}\left[f(x,y,z)-wx\right] $,

or, simply

$latex g=f-wx $
Note: we have converted a convex function into a concave one. The Legendre transform is concave in the intensive variables and convex in the extensive variables.
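A brute-force NumPy sketch shows the convex-to-concave flip. I am assuming the sign convention g(w) = min over x of [f(x) - w x], which mirrors F = E - TS, and using f(x) = x^2 on a grid as a stand-in for a convex Energy; for this choice the transform is g(w) = -w^2 / 4.

```python
import numpy as np

def legendre_transform(f, xs, w):
    """g(w) = min_x [ f(x) - w x ], evaluated by brute force on a grid of x values."""
    return np.min(f(xs) - w * xs)

def f(x):
    return x**2                            # convex in the "extensive" variable x

xs = np.linspace(-10.0, 10.0, 200_001)     # grid over x (step 1e-4)
ws = np.linspace(-4.0, 4.0, 9)             # slopes w = df/dx to evaluate
g = np.array([legendre_transform(f, xs, w) for w in ws])

print(g)                                    # compare with -ws**2 / 4
print(np.diff(g, 2))                        # second differences are negative: g is concave
```

When f is not convex, the same minimization still returns a concave g; transforming back then recovers the convex hull of f rather than f itself.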
Of course, the true Free Energy F is convex; this is central to Thermodynamics (see Appendix). But that is because while it is concave in T, we evaluate it at constant T.
But what if the Energy function is not convex in the Entropy? Or, suppose we extract a pseudo-Entropy from sampling some data, and we want to define a free energy potential (e.g. as in protein folding). These postulates also fail in systems like those in my blog post on spin chains.
Answer: Take the convex hull
When a convex Free Energy can not readily be defined as above, we can use the generalized Legendre-Fenchel Transform, which provides a convex relaxation via
the Tangent Envelope, a convex relaxation
The Legendre-Fenchel Transform can provide a Free Energy, convexified along the direction of the internal (configurational) Entropy, allowing the Temperature to control how many local Energy minima are sampled.
Extra stuff I just wanted to write down…
If we assume T=1 at all times, and we assume our Deep Learning Energies are extensive–as they would be in an actual thermodynamic system–then the weight norm constraints act to enforce the size-extensivity.
as ,
if ,
and ,
then W should remain bounded to prevent the Energy E(n) from growing faster than Mn. And, of course, most Deep Learning algorithms do bound W in some form.
where C denotes a contour integral.
Check out my recent chat with Max Mautner, the Accidental Engineer
http://theaccidentalengineer.com/charles-martin-principal-consultant-calculation-consulting/
It builds upon Batch Normalization (BN), introduced in 2015, which is now the de facto standard for all CNNs and RNNs. But it is not so useful for FNNs.
What makes normalization so special? It makes very Deep Networks easier to train, by damping out oscillations in the distribution of activations.
To see this, the diagram below uses data from Figure 1 (from the BN paper) to depict how the distribution evolves for typical node outputs in the last hidden layer of a typical network:
Very Deep nets can be trained faster and generalize better when the distribution of activations is kept normalized during BackProp.
We regularly see Ultra-Deep ConvNets like Inception, Highway Networks, and ResNet. And giant RNNs for speech recognition, machine translation, etc. But we don’t see powerful Feedforward Neural Nets (FNNs) with more than 4 layers. Until now.
Batch Normalization is great for CNNs and RNNs.
But we still can not build deep MLPs
This new method, Self-Normalization, has been proposed for building very deep MultiLayer Perceptrons (MLPs) and other Feed Forward Nets (FNNs).
The idea is just to tweak the Exponential Linear Unit (ELU) activation function to obtain a Scaled ELU (SELU):

$latex \mathrm{selu}(x)=\lambda\begin{cases}x & x>0\\ \alpha e^{x}-\alpha & x\le0\end{cases} $
With this new SELU activation function, and a new, alpha Dropout method, it appears we can, now, build very deep MLPs. And this opens the door for Deep Learning applications on very general data sets. That would be great!
The paper is, however, ~100 pages long of pure math! Fun stuff.. but a summary is in order.
I review Normalization in Neural Networks, including Batch Normalization, Self-Normalization, and, of course, some statistical mechanics (it’s kinda my thing).
This is an early draft of the post: comments and questions are welcome
WLOG, consider an MLP, where we call the input to each layer u
The linear transformation at each layer is

$latex \mathbf{x}=W\mathbf{u}+\mathbf{b} $,

and we apply standard point-wise activations, like a sigmoid

$latex g(x)=\dfrac{1}{1+e^{-x}} $

so that the total set of activations (at each layer) takes the form

$latex g(W\mathbf{u}+\mathbf{b}) $
The problem is that during SGD training, the distribution of weights W and/or the outputs x can vary widely from iteration to iteration. These large variations lead to instabilities in training that require small learning rates. In particular, if the layer weights W or inputs u blow up, the activations can become saturated:

$latex g'(W\mathbf{u}+\mathbf{b})\rightarrow0 $,

leading to vanishing gradients. Traditionally, this was avoided in MLPs by using smaller learning rates, and/or early stopping.
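One solution is better activation functions, such as a Rectified Linear Unit (ReLU)

$latex \mathrm{relu}(x)=\max(0,x) $

or, for larger networks (depth > 5), an Exponential Linear Unit (ELU):

$latex \mathrm{elu}(x)=\begin{cases}x & x>0\\ \alpha(e^{x}-1) & x\le0\end{cases} $

Which look like: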
Indeed, sigmoid and tanh activations came from early work in computational neuroscience. Jack Cowan proposed first the sigmoid function as a model for neuronal activity, and sigmoid and tanh functions arise naturally in statistical mechanics. And sigmoids are still widely used for RBMs and MLPs–ReLUs don’t help much here.
SGD training introduces perturbations in training that propagate through the net, causing large variations in weights and activations. For FNNs, this is a huge problem. But for CNNs and RNNs, not so much. Why?
It has been said that no real theoretical progress has been made in deep nets in 30 years. That is absurd. We did not have ReLUs or ELUs. In fact, up until Batch Normalization, we were still using SVM-style regularization techniques for Deep Nets. It is clear now that we need to rethink generalization in deep learning.
We can regularize a network, like a Restricted Boltzmann Machine (RBM), by applying max norm constraints to the weights W.
This can be implemented in training by tweaking the weight update at the end of pass over all the training data
where is an L1 or L2 norm.
I have conjectured that this is actually kind of Temperature control, and prevents the effective Temperature of the network from collapsing to zero.
,
By avoiding a low Temp, and possibly any glassy regimes, we can use a larger effective annealing rate–in modern parlance, larger SGD step sizes.
It makes the network more resilient to changes in scale.
After 30 years of research on neural nets, we can now achieve an analogous network normalization automagically.
But first, what is current state-of-the-art in code ? What can we do today with Keras ?
Batch Normalization (BN) Transformation
Tensorflow and other Deep Learning frameworks now include Batch Normalization out-of-the-box. Under-the-hood, this is the basic idea:
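At the end of every mini-batch $latex \mathcal{B} $, the layers are whitened. For each node output x (and before activation):
the BN Transform maintains the (internal) zero mean and unit variance ($latex \mu=0,\ \sigma^{2}=1 $).
We evaluate the sample mini-batch mean $latex \mu_{\mathcal{B}} $ and variance $latex \sigma_{\mathcal{B}}^{2} $, and then normalize, scale, and shift the values:

$latex \hat{x}=\dfrac{x-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2}+\epsilon}},\quad BN(x)=\gamma\hat{x}+\beta $

The final transformation is applied inside the activation function g():

$latex g(BN(W\mathbf{u}+\mathbf{b})) $

although we can absorb the original layer bias term b into the BN transform, giving

$latex g(BN(W\mathbf{u})) $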
So now, instead of renormalizing the weights W after passing over all the data, we can normalize the node output x=Wu explicitly, for each mini-batch, in the BackProp pass.
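The training-time transform is short enough to write out directly. This NumPy sketch normalizes each node output over a mini-batch and then applies a scale and shift; the names `gamma`, `beta`, and `eps`, and the made-up input statistics, are illustrative, and the running population statistics used at inference time are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each node output over the mini-batch, then scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                     # per-node mini-batch mean
    var = x.var(axis=0)                     # per-node mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
u = rng.normal(loc=3.0, scale=5.0, size=(64, 10))   # badly scaled layer inputs (made up)
W = rng.normal(size=(10, 4))
x = u @ W                                           # node outputs, before activation
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))

print(x.mean(axis=0), x.var(axis=0))   # far from zero mean / unit variance
print(y.mean(axis=0), y.var(axis=0))   # approximately 0 and 1 for every node
```

With `gamma = 1` and `beta = 0` this is pure whitening per node; letting the network learn `gamma` and `beta` is what allows it to undo the normalization where that helps.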
Note that
so bounding the weights with max-norm constraints got us part of the way already.
Note that extra scaling and shift parameters appear for each batch (k), and it is necessary to optimize these parameters as a side step during training.
At the end of the transform, we can normalize the network outputs (shown above) of the entire training set (population)
,
where the final statistics are computed as, say, an unbiased estimate over all (m) mini-batches of the training data
.
The key to Batch Normalization (BN) is that:
BN allows us to manipulate the activation function of the network. It is a differentiable transformation that normalizes activations in the network.
It makes the network (even) more resilient to the parameter scale.
It has been known for some time that Deep Nets perform better if the inputs are whitened. And max-norm constraints do re-normalize the layer weights after every mini-batch.
Batch normalization appears to be more stable internally, with the advantages that it:
Still, Batch Norm training slows down BackProp. Can we speed it up ?
A few days ago, the Interwebs was buzzing about the paper Self-Normalizing Neural Networks. HackerNews. Reddit. And my LinkedIn Feed.
These nets use Scaled Exponential Linear Units (SELU), which have implicit self-normalizing properties. Amazingly, the SELU is just an ELU multiplied by $latex \lambda $,
where $latex \lambda>1 $.
The paper authors have optimized the values as:
**a comment on reddit suggests tanh may work as well
The SELUs have the explicit properties of:
Amazingly, the implicit self-normalizing properties are actually proved–in only about 100 pages–using the Banach Fixed Point Theorem.
They show that, for an FNN using selu(x) activations, there exists a unique attracting and stable fixed point for the mean and variance. (Curiously, this is reminiscent of the argument that Deep Learning (RBMs at least) resembles the Variational Renormalization Group (VRG) Transform.)
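We can at least watch the fixed point emerge numerically. The sketch below pushes badly normalized inputs through a deep stack of random SELU layers and prints the mean and variance of the final activations. The alpha and lambda constants are the published SELU values; the layer width, depth, and input statistics are arbitrary choices of mine, and the weights are drawn with zero mean and variance 1/n_units.

```python
import numpy as np

ALPHA = 1.6732632423543772    # SELU alpha
LAMBDA = 1.0507009873554805   # SELU lambda (> 1)

def selu(x):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

rng = np.random.default_rng(0)
n_units, n_layers = 1024, 32

# Inputs that are NOT normalized: mean 0.5, standard deviation 1.5
x = rng.normal(loc=0.5, scale=1.5, size=(256, n_units))

for _ in range(n_layers):
    # zero-mean Gaussian weights with variance 1 / fan-in
    W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    x = selu(x @ W)

final_mean, final_var = x.mean(), x.var()
print(final_mean, final_var)
```

The activations should drift toward zero mean and unit variance, up to finite-width fluctuations. This is only a numerical illustration, of course, not the 100 page proof.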
There are, of course, conditions on the weights–things can’t get too crazy. This is hopefully satisfied by selecting initial weights with zero mean and unit variance.
,
(depending how we define terms).
To apply SELUs, we need a special initialization procedure, and a modified version of Dropout, alpha-Dropout.
We select initial weights from a Gaussian distribution with mean 0 and variance $latex 1/N $, where N is the number of weights:
In Statistical Mechanics, the Temperature is proportional to the variance of the Energy, and therefore sets the Energy scale. Since E ~ W,
SELU Weight initialization is similar in spirit to fixing T=1.
Note that to apply Dropout with an SELU, we desire that the mean and variance are invariant.
We must set randomly chosen inputs to the saturated negative value of the SELU, and then apply an affine transformation whose parameters are computed from the dropout rate.
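Here is a small NumPy sketch of alpha-Dropout as I understand it from the SNN paper: dropped units are set to the saturated negative SELU value (minus lambda times alpha), and an affine map a*x + b restores zero mean and unit variance. The function name and the variable names q, a, b are my own; treat this as an illustration rather than a reference implementation.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805
ALPHA_PRIME = -LAMBDA * ALPHA        # saturated negative value of the SELU

def alpha_dropout(x, rate, rng):
    """Drop units to ALPHA_PRIME, then rescale so mean 0 / variance 1 survive."""
    q = 1.0 - rate                                   # keep probability
    keep = rng.random(x.shape) < q
    a = (q + ALPHA_PRIME**2 * q * (1.0 - q)) ** -0.5 # affine scale
    b = -a * ALPHA_PRIME * (1.0 - q)                 # affine shift
    return a * np.where(keep, x, ALPHA_PRIME) + b

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)          # activations already at the fixed point
y = alpha_dropout(x, rate=0.1, rng=rng)
print(y.mean(), y.var())
```

For inputs with zero mean and unit variance, the output mean and variance stay at 0 and 1, which is exactly the invariance we asked for.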
(thanks to ergol.com for the images and discussion).
All of this is provided, in code, with implementations already on github for Tensorflow, PyTorch, Caffe, etc. Soon…Keras?
The key results are presented in Figure 1 of the paper, where SNN = Self Normalizing Networks, and the data sets studied are MNIST and CIFAR.
The original code is available on github
Great discussions on HackerNews and Reddit
We have reviewed several variants of normalization in deep nets, including
Along the way, I have tried to convince you that recent developments in the normalization of Deep Nets represent the culmination of over 30 years of research into Neural Network theory, and that early ideas about finite Temperature methods from Statistical Mechanics have evolved into, and are deeply related to, the Normalization methods employed today to create very Deep Neural Networks.
Very early research in Neural Networks lifted ideas from statistical mechanics. Early work by Hinton formulated AutoEncoders and the principle of the Minimum Description Length (MDL) as minimizing a Helmholtz Free Energy:
$latex F=\langle E\rangle-TS $,
where the expected (T=0) Energy is
$latex \langle E\rangle=\sum\limits_{i}p_{i}E_{i} $,
S is the Entropy,
and the Temperature $latex T=1 $ implicitly.
Minimizing F yields the familiar Boltzmann probability distribution
$latex p_{i}=\dfrac{e^{-\beta E_{i}}}{\sum\limits_{j}e^{-\beta E_{j}}}&bg=ffffff $.
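A quick numerical sanity check of this statement: for a handful of random Energy levels, the Boltzmann distribution gives a Free Energy equal to -T ln Z, and no other distribution does better. The helper name free_energy and the use of random Dirichlet distributions as competitors are my own choices for the sketch.

```python
import numpy as np

def free_energy(p, E, T=1.0):
    """Helmholtz Free Energy F = <E> - T*S for a distribution p over levels E."""
    S = -np.sum(p * np.log(p))          # Entropy
    return np.sum(p * E) - T * S

rng = np.random.default_rng(0)
E = rng.normal(size=6)                  # six arbitrary Energy levels
T = 1.0
beta = 1.0 / T

p_boltz = np.exp(-beta * E)
p_boltz /= p_boltz.sum()                # Boltzmann distribution

F_boltz = free_energy(p_boltz, E, T)
F_from_Z = -T * np.log(np.sum(np.exp(-beta * E)))   # -T ln Z

# Free Energies of 1000 random competing distributions
F_others = [free_energy(rng.dirichlet(np.ones(6)), E, T) for _ in range(1000)]

print(F_boltz, F_from_Z, min(F_others))
```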
When we define an RBM, we parameterize the Energy levels in terms of the configuration of visible and hidden units
,
giving the probability
where Z is the Partition Function, and, again, T=1 implicitly.
In Stat Mech, we call RBMs a Mean Field model because we can decompose the total Energy and/or conditional probabilities using sigmoid activations for each node
In my 2016 MMDS talk, I proposed that without some explicit Temperature control, RBMs could collapse into a glassy state.
And now, some proof I am not completely crazy:
In another recent (2017) study, on the Emergence of Compositional Representations in Restricted Boltzmann Machines, we do indeed see that the RBM effective Temperature drops well below 1 during training
and that RBMs can exhibit glassy behavior.
I also proposed that RBMs could undergo Entropy collapse at very low Temperatures. This has also now been verified in a recent 2016 paper.
Finally, this 2017 paper, “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”, proposes that many networks exhibit something like glassy behavior, described as “ultra-slow” diffusion behavior.
I will sketch out the proof in some detail if there is demand. Intuitively (& citing comments in HackerNews): ”
Indeed, we train a neural network by running BackProp, thereby minimizing the model error–which is like minimizing an Energy.
But what is this Energy ? Deep Learning (DL) Energy functions look nothing like a typical chemistry or physics Energy. Here, we have Free Energy landscapes, which frequently form funneled landscapes, a trade off between energetic and entropic effects.
And yet, some researchers, like LeCun, have even compared Neural Network Energy functions to spin glass Hamiltonians. To me, this seems off.
The confusion arises from assuming Deep Learning is a non-convex optimization problem that looks similar to the zero-Temperature Energy Landscapes from spin glass theory.
I present a different view. I believe Deep Learning is really optimizing an effective Free Energy function. And this has profound implications on Why Deep Learning Works.
This post will attempt to relate recent ideas in RBM inference to Backprop, and argue that Backprop is minimizing a dynamic, temperature dependent, ruggedly convex, effective Free Energy landscape.
This is a fairly long post, but at least it is a basic review. I try to present these ideas in a semi-pedagogic way, to the extent I can in a blog post, discussing RBMs, MLPs, Free Energies, and all that entails.
The Backprop algorithm lets us train a model directly on our data (X) by minimizing the predicted error , where the parameter set includes the weights , biases , and activations of the network.
.
Let’s write
,
where the error could be a mean squared error (MSE), cross entropy, etc. For example, in simple regression, we can minimize the MSE
,
whereas for multi-class classification, we might minimize a categorical cross entropy
where are the labels and is the network output for each training instance .
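For concreteness, here are the two training errors written out in NumPy. The function names, and the convention that labels are one-hot rows and network outputs are softmax probabilities, are assumptions of this sketch.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over the training instances."""
    return np.mean((y_true - y_pred) ** 2)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    """Average of -sum_k y_k ln(p_k) over the training instances."""
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))

# Regression example: a single wrong prediction, off by 2
mse_value = mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))

# Classification example: a network that outputs equal logits for 3 classes
labels = np.eye(3)                           # one instance of each class
probs = softmax(np.zeros((3, 3)))            # uniform predictions
ce_value = categorical_cross_entropy(labels, probs)

print(mse_value, ce_value)
```

A network that knows nothing about 3 classes pays ln 3 per instance, and that is the number Backprop starts to push down.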
Notice that is the training error for instance , not a test or holdout error. Notice that, unlike a Support Vector Machine (SVM) or Logistic Regression (LR), we don’t use Cross Validation (CV) during training. We simply minimize the training error, whatever that is.
Of course, we can adjust the network parameters, regularization, etc, to tune the architecture of the network. Although it appears that Understanding deep learning requires rethinking generalization.
At this point, many people say that BackProp leads to a complex, non-convex optimization problem; IMHO, this is naive.
It has been known for 20 years that Deep Learning does not suffer from local minima.
Anyone who thinks it does has never read a research paper or book on neural networks. So what we really would like to know is, Why does Deep Learning Scale ? Or, maybe, why does it work at all ?!
To implement Backprop, we take derivatives and apply the chain rule to the network outputs , applying it layer-by-layer.
Let’s take a closer look at the layers and activations. Consider a simple 1 layer net:
The Hidden activations are thought to mimic the function of actual neurons, and are computed by applying an activation function , to a linear Energy function ,
Indeed, the sigmoid activation function was first proposed in 1968 by Jack Cowan at the University of Chicago, and it is still used today in models of neural dynamics.
Moreover, Cowan pioneered using Statistical Mechanics to study the Neocortex.
And we will need a little Stat Mech to explain what our Energy functions are..but just a little.
While it seems we are simply proposing an arbitrary activation function, we can, in fact, derive the appearance of sigmoid activations–at least when performing inference on a single layer (mean field) Restricted Boltzmann Machine (RBM).
Hugo Larochelle has derived the sigmoid activations nicely for an RBM.
Given the (total) RBM Energy function
The exponential of the (negative) Energy is an un-normalized probability, such that
where the normalization factor, Z, is an object from statistical mechanics called the (total) partition function
and $latex \beta $ is an inverse Temperature. In modern machine learning, we implicitly set $latex \beta=1 $.
Following Larochelle, we can factor by explicitly writing in terms of sums over the binary hidden activations . This lets us write the conditional probabilities, for each individual neuron as
.
We note that this formulation was not obvious, and early work on RBMs used methods from statistical field theory to get this result.
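We can check the factorization by brute force on a tiny RBM. The sketch below assumes the usual binary RBM Energy E(v,h) = -a.v - b.h - v.W.h at inverse Temperature 1 (the symbols a, b, W are my labels for the visible biases, hidden biases, and weights), sums explicitly over all hidden configurations, and compares the result with the sigmoid formula.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W, a, b):
    """Assumed RBM Energy: E(v,h) = -a.v - b.h - v.W.h"""
    return -(a @ v) - (b @ h) - (v @ W @ h)

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
W = rng.normal(size=(n_v, n_h))
a, b = rng.normal(size=n_v), rng.normal(size=n_h)
v = rng.integers(0, 2, size=n_v).astype(float)     # one visible configuration

# Brute force: p(h|v) is proportional to exp(-E(v,h)), summed over all 2^n_h states
hidden_states = [np.array(s, dtype=float)
                 for s in itertools.product([0, 1], repeat=n_h)]
weights = np.array([np.exp(-energy(v, h, W, a, b)) for h in hidden_states])
p_h_given_v = weights / weights.sum()
brute_force = sum(p * h for p, h in zip(p_h_given_v, hidden_states))

# Closed form: each hidden unit is an independent sigmoid
closed_form = sigmoid(b + v @ W)

print(brute_force)
print(closed_form)
```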
We use and in Contrastive Divergence (CD) or other solvers as part of the Gibbs Sampling step for (unsupervised) RBM inference.
CD has been a puzzling algorithm to understand. When it was first proposed, it was unclear what optimization problem CD is solving. Indeed, Hinton is reported to have said
Specifically, we run several epochs of:
We will see below that we can cast RBM inference as directly minimizing a Free Energy, something that will prove very useful to relate RBMs to MLPs.
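For readers who have not seen it written out, here is a minimal CD-1 sketch in NumPy: one Gibbs step up, one step down, one step up again, and a weight update from the difference between data and reconstruction statistics. The toy data set, learning rate, and the names a, b, W are my own choices; this follows the common textbook recipe, not any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_epoch(v0, W, a, b, lr, rng):
    """One CD-1 update on the batch v0 (in place); returns reconstruction error."""
    ph0 = sigmoid(b + v0 @ W)                          # p(h=1 | data)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden units
    pv1 = sigmoid(a + h0 @ W.T)                        # p(v=1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)   # sample a reconstruction
    ph1 = sigmoid(b + v1 @ W)                          # p(h=1 | reconstruction)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n            # data minus model statistics
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return float(np.mean((v0 - pv1) ** 2))

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)                 # toy data set, 20 rows

W = 0.01 * rng.normal(size=(6, 4))                     # 6 visible, 4 hidden units
a, b = np.zeros(6), np.zeros(4)

errors = [cd1_epoch(data, W, a, b, lr=0.1, rng=rng) for _ in range(1000)]
print(np.mean(errors[:20]), np.mean(errors[-20:]))
```

The reconstruction error is only a convenient progress monitor; it is not the objective CD optimizes, which is part of why the algorithm was puzzling.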
The sigmoid and tanh are old-fashioned activations; today we may prefer to use ReLUs (and Leaky ReLUs).
The sigmoid itself was, at first, just an approximation to the Heaviside step function used in neuron models. But the presence of sigmoid activations in the total Energy suggests, at least to me, that Deep Learning Energy functions are more than just random (Morse) functions.
RBMs are a special case of unsupervised nets that still use stochastic sampling. In supervised nets, like MLPs and CNNs (and in unsupervised Autoencoders like VAEs), we use Backprop. But the activations are not conditional probabilities. Let’s look in detail:
Consider a MultiLayer Perceptron, with 1 Hidden layer, and 1 output node
where for each data point, leading to the layer output
and total MLP output
where .
If we add a second layer, we have the iterated layer output:
where .
The final MLP output function has a similar form:
So with a little bit of stat mech, we can derive the sigmoid activation function from a general energy function. And we have seen these activations in RBMs as well as MLPs.
So when we apply Backprop, what problem are we actually solving ?
Are we simply finding a minimum on a random high dimensional manifold ? Or can we say something more, given the special structure of these layers of activated energies ?
To train an MLP, we run several epochs of Backprop. Backprop has 2 passes: forward and backward:
Each epoch usually runs small batches of inputs at a time. (And we may need to normalize the inputs and control the variances. These details may be important for our analysis, and we will consider them in a later post.)
After each pass, we update the weights, using something like an SGD step (or Adam, RMSProp, etc)
For an MSE loss, we evaluate the partial derivatives over the Energy parameters .
Backprop works by the chain rule, and given the special form of the activations, lets us transform the Energy derivatives into a sum of Energy gradients–layer by layer
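As a sanity check on the chain rule, here is a tiny NumPy sketch for a net with one sigmoid hidden layer, a linear output node, and an MSE loss (the linear output and all variable names are assumptions of the sketch). It computes the layer-by-layer gradients, compares one of them with a finite difference, and takes a single SGD step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(params, X, t):
    W1, b1, w2, b2 = params
    h = sigmoid(X @ W1 + b1)                  # forward pass: hidden activations
    y = h @ w2 + b2                           # linear output node
    return 0.5 * np.mean((y - t) ** 2)

def gradients(params, X, t):
    """Backward pass: chain rule applied layer by layer."""
    W1, b1, w2, b2 = params
    h = sigmoid(X @ W1 + b1)
    y = h @ w2 + b2
    dy = (y - t) / len(t)                     # dL/dy
    dz = np.outer(dy, w2) * h * (1.0 - h)     # dL/d(hidden pre-activation)
    return [X.T @ dz, dz.sum(axis=0), h.T @ dy, dy.sum()]

rng = np.random.default_rng(0)
X, t = rng.normal(size=(32, 3)), rng.normal(size=32)
params = [rng.normal(size=(3, 5)), np.zeros(5), rng.normal(size=5), 0.0]
grads = gradients(params, X, t)

def loss_with_bump(delta):
    W1 = params[0].copy()
    W1[0, 0] += delta                         # perturb a single weight
    return loss([W1, params[1], params[2], params[3]], X, t)

eps = 1e-5
numeric = (loss_with_bump(eps) - loss_with_bump(-eps)) / (2 * eps)
analytic = grads[0][0, 0]

lr = 0.05                                     # one plain SGD step
new_params = [p - lr * g for p, g in zip(params, grads)]
loss_before, loss_after = loss(params, X, t), loss(new_params, X, t)
print(analytic, numeric, loss_before, loss_after)
```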
I won’t go into the details here; there are 1000 blogs on BackProp today (which is amazing!). I will say…
Backprop couples the activation states of the neurons to the Energy parameter gradients through the cycle of forward-backward phases.
In a crude sense, Backprop resembles our more familiar RBM training procedure, where we equilibrate to set the activations, and run gradient descent to set the weights. Here, I show a direct connection, and derive the MLP functional form directly from an RBM.
RBMs are unsupervised; MLPs are supervised. How can we connect them? Crudely, we can think of an MLP as a single layer RBM with a softmax tacked on the end. More rigorously, we can look at Generalized Discriminative RBMs, which solve the conditional probability directly, in terms of the Free Energies, cast in the soft-max form
So the question is, can we extract a Free Energy for an MLP ?
I now consider the Backward phase, using the deterministic EMF RBM, as a starting point for understanding MLPs.
An earlier post discusses the EMF RBM, from the context of chemical physics. For a traditional machine learning perspective, see this thesis.
In some sense, this is kind-of obvious. And yet, I have not seen a clear presentation of the ideas in this way. I do rely upon new research, like the EMF RBM, although I also draw upon fundamental ideas from complex systems theory–something popular in my PhD studies, but which is perhaps ancient history now.
The goal is to relate RBMs, MLPs, and basic Stat Mech under a single conceptual umbrella.
In the EMF approach, we see RBM inference as a sequence of deterministic annealing steps, from 1 quasi-equilibrium state to another, consisting of 2 steps for each epoch:
At the end of each epoch, we update the weights, with weight (temperature) constraints (i.e. reset the L1 or L2 norm). BTW, it may not be obvious that weight regularization is like a Temperature control; I will address this in a later post.
(1) The so-called Forward step solves a fixed point equation (which is similar in spirit to taking n steps of Gibbs sampling). This leads to a pair of coupled, recursion relations for the TAP magnetizations (or just nodes). Suppose we take t+1 iterations. Let us ignore the second order Onsager correction, and consider the mean field updates:
Because these are deterministic steps, we can express the in terms of :
At the end of the recursion, we will have a forward pass that resembles a multi-layer MLP, but that shares weights and biases between layers:
We can now associate an n-layer MLP, with tied weights,
,
to an approximate (mean field) EMF RBM, with n fixed point iterations (ignoring the Onsager correction for now). Of course, an MLP is supervised, and an RBM is unsupervised, so we need to associate the RBM hidden nodes with the MLP output function at the last layer (), prior to adding the MLP output node
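To see the correspondence in code, the sketch below runs the mean field recursion (no Onsager correction) for 2 iterations and compares it with the same computation written as an explicit 5 layer sigmoid MLP whose layers alternate between W and its transpose. The symbols a, b, W and the choice of 2 iterations are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_iterations(v, W, a, b, n_iter):
    """Alternate the two mean field updates, starting from the visible data."""
    m_v = v
    for _ in range(n_iter):
        m_h = sigmoid(b + m_v @ W)        # hidden magnetizations
        m_v = sigmoid(a + m_h @ W.T)      # visible magnetizations
    return sigmoid(b + m_v @ W)           # hidden layer at the last step

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(6, 4))
a, b = rng.normal(size=6), rng.normal(size=4)
v = rng.integers(0, 2, size=(3, 6)).astype(float)   # 3 binary data points

# The same thing as an explicit MLP with tied weights and biases
layer1 = sigmoid(b + v @ W)
layer2 = sigmoid(a + layer1 @ W.T)
layer3 = sigmoid(b + layer2 @ W)
layer4 = sigmoid(a + layer3 @ W.T)
layer5 = sigmoid(b + layer4 @ W)

unrolled = mean_field_iterations(v, W, a, b, n_iter=2)
print(np.abs(unrolled - layer5).max())
```

The identity is trivial once written down, but it makes the point: the recursion is a forward pass, just with every layer sharing the same parameters.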
This leads naturally to the following conjecture:
The EMF RBM and the BackProp Forward and Backward steps effectively do the same thing–minimize the Free Energy
This is a work in progress
Formally, it is simple and compelling. Is it the whole story…probably not. It is merely an observation–food for thought.
So far, I have only removed the visible magnetizations to obtain the MLP layer function as a function of the original visible units. The unsupervised EMF RBM Free Energy, however, contains expressions in terms of both the hidden and visible magnetizations ( ). To get a final expression, it is necessary to either
The result itself should not be so surprising, since it has already been pointed out by Kingma and Welling, Auto-Encoding Variational Bayes, that a Bernoulli MLP is like a variational decoder. And, of course, VAEs can be formulated with BackProp.
More importantly, it is unclear how good the RBM EMF really is. Some followup studies indicate that second order is not as good as, say, AIS, for estimating the partition function. I have coded a python emf_rbm.py module using the scikit-learn interface, and testing is underway. I will blog this soon.
Note that the EMF RBM relies on the Legendre Transform, which is like a convex relaxation. Early results indicate that this does degrade the RBM solution compared to traditional CD. Perhaps BackProp is effectively relaxing the convexity constraint by, say, relaxing the condition that the weights are tied between layers.
Still, I hope this can provide some insight. And there are …
Free Energy is a first class concept in Statistical Mechanics. In machine learning, not always so much. It appears in much of Hinton’s work, and as a starting point for deriving methods like Variational Auto Encoders and Probabilistic Programming.
But Free Energy minimization plays an important role in non-convex optimization as well. Free energies are a Boltzmann average of the zero-Temperature Energy landscape, and, therefore, convert a non-convex surface into something at least less non-convex.
Indeed, in one of the very first papers on mean field Boltzmann Machines (1987), it is noted that
“An important property of the effective [free] energy function E'(V,0,T) is that it has a smoother landscape than E(S) due to the extra terms. Hence, the probability of getting stuck in a local minima decreases.”
Moreover, in protein folding, we have even stronger effects, which can lead to a ruggedly convex, energy landscape. This arises when the system runs out of configurational entropy (S), and energetic effects (E) dominate.
Most importantly, we want to understand, when does Deep Learning generalize well, and when does it overtrain ?
LeCun has very recently pointed out that Deep Nets fail when they run out of configuration entropy–an argument I also have made from theoretical analysis using the Random Energy Model. So it is becoming more important to understand what the actual energy landscape of a deep net is, how to separate out the entropic and energetic terms, and how to characterize the configurational entropy.
Hopefully this small insight will be useful and lead to a further understanding of Why Deep Learning Works.
A Mean Field Theory Learning Algorithm for Neural Networks
just a couple years after Hinton’s seminal 1985 paper , “A Learning Algorithm for Boltzmann Machines“.
What I really like is how we see that the foundations of deep learning arose from statistical physics and theoretical chemistry. My top 10 favorite takeaways are:
Happy New Year everyone!
They are basically a solved problem, and while of academic interest, not really used in complex modeling problems. They were, up to 10 years ago, used for pretraining deep supervised nets. Today, we can train very deep, supervised nets directly.
RBMs are the foundation of unsupervised deep learning–
an unsolved problem.
RBMs appear to outperform Variational Auto Encoders (VAEs) on simple data sets like the Omniglot set–a data set developed for one shot learning, and used in deep learning research.
RBM research continues in areas like semi-supervised learning with deep hybrid architectures, Temperature dependence, infinitely deep RBMs, etc.
Many of the basic concepts of Deep Learning are found in RBMs.
Sometimes clients ask, “how is Physical Chemistry related to Deep Learning ?”
In this post, I am going to discuss a recent advance in RBM theory based on ideas from theoretical condensed matter physics and physical chemistry,
the Extended Mean Field Restricted Boltzmann Machine: EMF_RBM
(see: Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy )
[Along the way, we will encounter several Nobel Laureates, including the physicists David J Thouless (2016) and Philip W. Anderson (1977), and the physical chemist Lars Onsager (1968).]
RBMs are pretty simple, and easily implemented from scratch. The original EMF_RBM is in Julia; I have ported EMF_RBM to python, in the style of the scikit-learn BernoulliRBM package.
https://github.com/charlesmartin14/emf-rbm/blob/master/EMF_RBM_Test.ipynb
We examined RBMs in the last post on Cheap Learning: Partition Functions and RBMs. I will build upon that here, within the context of statistical mechanics.
RBMs are defined by the Energy function
To train an RBM, we minimize the negative log likelihood ,
the sum of the clamped and (actual) Free Energies, where
and
The sums range over a space of ,
which is intractable in most cases.
Training an RBM requires computing the Free Energy (the log of the partition function); this is hard.
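To see why this is hard, here is a brute force sketch for a tiny RBM. It assumes the convention E(v,h) = -a.v - b.h - v.W.h, a total Free Energy F = -ln Z, and a clamped Free Energy obtained by summing over hidden units only, so that ln p(v) = F - F_clamped(v). These sign conventions and symbol names are my assumptions for the sketch. The number of terms in Z doubles with every unit we add.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """Assumed RBM Energy: E(v,h) = -a.v - b.h - v.W.h"""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def all_states(n):
    return [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
W = rng.normal(size=(n_v, n_h))
a, b = rng.normal(size=n_v), rng.normal(size=n_h)

def clamped_free_energy(v):
    """-ln sum_h exp(-E(v,h)), with the visible units clamped to v."""
    return -np.logaddexp.reduce([-energy(v, h, W, a, b) for h in all_states(n_h)])

# Total Free Energy F = -ln Z: a sum over every (v, h) configuration
F = -np.logaddexp.reduce([-energy(v, h, W, a, b)
                          for v in all_states(n_v) for h in all_states(n_h)])

log_p = {tuple(v): F - clamped_free_energy(v) for v in all_states(n_v)}
total_probability = sum(np.exp(lp) for lp in log_p.values())
n_terms = 2 ** (n_v + n_h)

print(F, total_probability, n_terms)
```

With 4 visible and 3 hidden units this is 128 terms; with MNIST-sized layers it is hopeless, which is why we need approximations.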
When training an RBM, we
We don’t include label information, although a trained RBM can provide features for a down-stream classifier.
The Extended Mean Field (EMF) RBM is a straightforward application of known statistical mechanics theories.
There are, literally, thousands of papers on spin glasses.
The EMF RBM is a great example of how to operationalize spin glass theory.
The Restricted Boltzmann Machine has a very simple Energy function, which makes it very easy to factorize the partition function Z , explained by Hugo Larochelle, to obtain the conditional probabilities
The conditional probabilities let us apply Gibbs Sampling, which is simply
In statistical mechanics, this is called a mean field theory. This means that the Free Energy (in ) can be written as a simple linear average over the hidden units
.
where is the mean field of the hidden units.
At high Temp., for a spin glass, a mean field model seems very sensible because the spins (i.e. activations) become uncorrelated.
Theoreticians use mean field models like the p-spin spherical spin glass to study deep learning because of their simplicity. Computationally, we frequently need more.
How can we go beyond mean field theory ?
Onsager was awarded the 1968 Nobel Prize in Chemistry for the development of the Onsager Reciprocal Relations, sometimes called the ‘4th law of Thermodynamics’.
The Onsager relations provide the theory to treat thermodynamic systems that are in a quasi-stationary, local equilibrium.
Onsager was the first to show how to relate the correlations in the fluctuations to the linear response. And by tying a sequence of quasi-stationary systems together, we can describe an irreversible process…
..like learning. And this is exactly what we need to train an RBM.
In an RBM, the fluctuations are variations in the hidden and visible nodes.
In a BernoulliRBM, the activations can be 0 or 1, so the fluctuation vectors are
The simplest correction to the mean field Free Energy, at each step in training, comes from the correlations in these fluctuations:
where W is the Energy weight matrix.
Unlike normal RBMs, here we work in an Interaction Ensemble, so the hidden and visible units become hidden and visible magnetizations:
To simplify (or confuse?) the presentations here, I don’t write magnetizations (until the Appendix).
The corrections make sense under the stationarity constraints, that the Extended Mean Field RBM Free Energy () is at a critical point
,
That is, small changes in the activations do not change the Free Energy.
We will show that we can write
as a Taylor series in , the inverse Temperature, where is the Entropy
,
is the standard, mean field RBM Free energy
,
and is the Onsager correction
.
Given the expressions for the Free Energy, we must now evaluate it.
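Here is how I would evaluate the three pieces numerically. The sketch assumes the second order TAP form as I read it from the paper cited above, written for -F at inverse Temperature 1: a binary Entropy term, a mean field term a.m_v + b.m_h + m_v.W.m_h, and an Onsager term built from the squared weights and the variances m - m^2. The symbols are my labels, so check them against the paper before relying on this.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_entropy(m):
    return -np.sum(m * np.log(m) + (1.0 - m) * np.log(1.0 - m))

def tap_terms(m_v, m_h, W, a, b):
    """Entropy, mean field, and Onsager contributions to -F (beta = 1)."""
    entropy = binary_entropy(m_v) + binary_entropy(m_h)
    mean_field = a @ m_v + b @ m_h + m_v @ W @ m_h
    onsager = 0.5 * (m_v - m_v**2) @ (W**2) @ (m_h - m_h**2)
    return entropy, mean_field, onsager

rng = np.random.default_rng(0)
n_v, n_h = 5, 3
a, b = rng.normal(size=n_v), rng.normal(size=n_h)

# Check 1: with W = 0 the units decouple, and the expansion is exact
W0 = np.zeros((n_v, n_h))
S0, MF0, ON0 = tap_terms(sigmoid(a), sigmoid(b), W0, a, b)
exact_lnZ_W0 = np.sum(np.logaddexp(0.0, a)) + np.sum(np.logaddexp(0.0, b))

# Check 2: with small random weights the Onsager term is a positive correction
W = rng.normal(scale=0.1, size=(n_v, n_h))
S, MF, ON = tap_terms(sigmoid(a), sigmoid(b), W, a, b)

print(S0 + MF0 + ON0, exact_lnZ_W0, ON)
```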
The Taylor series above is a result of the TAP theory — the Thouless-Anderson-Palmer approach developed for spin glasses.
The TAP theory is outlined in the Appendix; here it is noted that
Thouless just shared the 2016 Nobel Prize in Physics (for his work in topological phase transitions)
Being a series in the inverse Temperature $latex \beta $, the theory applies at low $latex \beta $, or high Temperature. For fixed $latex \beta $, this also corresponds to small weights W.
Specifically, the expansion applies at Temperatures above the glass transition–a concept which I describe in a recent video blog.
Here, to implement the EMF_RBM, we set
$latex \beta=1 $,
and, instead, apply weight decay to keep the weights W from exploding
where may be an L1 or L2 norm.
Weight Decay acts to keep the Temperature high.
Early RBM computational models were formulated using statistical mechanics (see the Appendix) language, and so included a Temperature parameter, and were solved using techniques like simulated annealing and the (mean field) TAP equations (described below).
Adding Temperature allowed the system to ‘jump’ out of the spurious local minima. So any usable model required a non-zero Temp, and/or some scheme to avoid local minima that generalized poorly. (See: Learning Deep Architectures for AI, by Bengio)
These older approaches did not work well then, so Hinton proposed the Contrastive Divergence (CD) algorithm. Note that researchers struggled for some years to ‘explain’ what optimization problem CD actually solves.
Note that recent work on Temperature Based RBMs also suggests that higher T solutions perform better, and that
“temperature is an essential parameter controlling the selectivity of the firing neurons in the hidden layer.”
Standard RBM training approximates the (unconstrained) Free Energy, F=-ln Z, in the mean field approximation, using (one or more steps of) Gibbs Sampling. This is usually implemented as Contrastive Divergence (CD), or Persistent Contrastive Divergence (PCD).
Using techniques of statistical mechanics, however, it is possible to train an RBM directly, without sampling, by solving a set of deterministic fixed point equations.
Indeed, this approach clarifies how to view an RBM as solving a (deterministic) fixed point equation of the form
Consider each step, at fixed (), as a Quasi-Stationary system, which is close to equilibrium, but we don’t need to evaluate ln Z(v,h) exactly.
We can use the stationary conditions to derive a pair of coupled, non-linear equations
They extend the standard formula of sigmoid linear activations with additional, non-linear, inter-layer interactions.
They differ significantly from (simple) Deep Learning activation functions because the activation for each layer explicitly includes information from other layers.
This extension couples the mean () and total fluctuations () between layers. Higher order correlations could also be included, even to infinite order, using techniques from field theory.
We can not satisfy both equations simultaneously, but we can satisfy each condition individually, letting us write a set of recursion relations
These fixed point equations converge to the stationary solution, leading to a local equilibrium. Like Gibbs Sampling, however, we only need a few iterations (say t=3 to 5). Unlike Sampling, however, the EMF RBM is deterministic.
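A minimal sketch of the recursion, assuming the second order (Onsager corrected) update as I read it from the TAP RBM paper: each magnetization is a sigmoid of the usual mean field input, minus a term proportional to (m - 1/2) times the squared weights times the variance of the other layer. The function names and the small-weight test setup are mine; this is not the emf-rbm package code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_hidden(m_v, m_h, W, b):
    onsager = (m_h - 0.5) * ((m_v - m_v**2) @ W**2)
    return sigmoid(b + m_v @ W - onsager)

def update_visible(m_v, m_h, W, a):
    onsager = (m_v - 0.5) * ((m_h - m_h**2) @ (W**2).T)
    return sigmoid(a + m_h @ W.T - onsager)

def emf_fixed_point(v, W, a, b, n_iter):
    """Deterministic EMF iterations, starting from a visible data vector."""
    m_v = v.copy()
    m_h = sigmoid(b + m_v @ W)               # plain mean field start
    for _ in range(n_iter):
        m_h = update_hidden(m_v, m_h, W, b)
        m_v = update_visible(m_v, m_h, W, a)
    return m_v, m_h

def residual(m_v, m_h, W, a, b):
    """How far (m_v, m_h) is from satisfying both fixed point equations."""
    r_h = np.abs(m_h - update_hidden(m_v, m_h, W, b)).max()
    r_v = np.abs(m_v - update_visible(m_v, m_h, W, a)).max()
    return max(r_h, r_v)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))       # small weights = high Temperature
a, b = rng.normal(size=6), rng.normal(size=4)
v = rng.integers(0, 2, size=6).astype(float)

res_3 = residual(*emf_fixed_point(v, W, a, b, n_iter=3), W, a, b)
m_v, m_h = emf_fixed_point(v, W, a, b, n_iter=50)
res_50 = residual(m_v, m_h, W, a, b)
print(res_3, res_50)
```

With small weights, a handful of iterations already gets very close to the stationary point, and nothing in the loop is random.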
https://github.com/charlesmartin14/emf-rbm/
If there is enough interest, I can do a pull request on sklearn to include it.
The next blog post will demonstrate the python code in action.
Most older physicists will remember the Hopfield model. Interest in these models peaked in 1986, although interesting work continued into the late 90s (when I was a post doc).
Originally, Boltzmann machines were introduced as a way to avoid spurious local minima while including ‘hidden’ features into Hopfield Associative Memories (HAM).
Hopfield himself was a theoretical chemist, and his simple model HAMs were of great interest to theoretical chemists and physicists.
Hinton explains Hopfield nets in his on-line lectures on Deep Learning.
The Hopfield Model is a kind of spin glass, which acts like a ‘memory’ that can recognize ‘stored patterns’. It was originally developed as a quasi-stationary solution of more complex, dynamical models of neuronal firing patterns (see the Cowan-Wilson model).
Early theoretical work on HAMs studied analytic approximations to ln Z to compute their capacity (), and their phase diagram. The capacity is simply the number of patterns a network of size N can memorize without getting confused.
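For anyone who has never played with one, here is a tiny Hopfield memory in NumPy: Hebbian weights built from a few random patterns, and T=0 updates that pull a corrupted pattern back to the stored one. The network size, the number of patterns (a deliberately low load), and the synchronous update rule are my choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 5                                    # 200 spins, 5 stored patterns
patterns = rng.choice([-1, 1], size=(P, N))

W = (patterns.T @ patterns) / N                  # Hebbian learning rule
np.fill_diagonal(W, 0.0)                         # no self-interaction

def recall(state, W, n_steps=10):
    """Zero-Temperature dynamics: align every spin with its local field."""
    for _ in range(n_steps):
        state = np.where(W @ state >= 0, 1, -1)
    return state

# Corrupt the first pattern by flipping 10% of its spins
noisy = patterns[0].copy()
flipped = rng.choice(N, size=20, replace=False)
noisy[flipped] *= -1

recovered = recall(noisy, W)
overlap_before = (noisy @ patterns[0]) / N       # 0.8 by construction
overlap_after = (recovered @ patterns[0]) / N
print(overlap_before, overlap_after)
```

At this low load the memory is recovered essentially perfectly; push P up relative to N and the spurious states discussed here start to take over.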
The Hopfield model was traditionally run at T=0.
Looking at the T=0 line, at extremely low capacity, the system has stable mixed states that correspond to ‘frozen’ memories. But this is very low capacity, and generally unusable. Also, when the capacity is too large, , (which is really not that large), the system abruptly breaks down completely.
There is a small window of capacity, ,with stable pattern equilibria, dominated by frozen out, spin glass states. The problem is, for any realistic system, with correlated data, the system is dominated by spurious local minima which look like low energy spin glass states.
So the Hopfield model suggested that
glass states can be useful minima, but
we want to avoid low energy (spurious) glassy states.
One can try to derive a direct mapping between Hopfield Nets and RBMs (under reasonable assumptions). Then the RBM capacity is proportional to the number of Hidden nodes. After that, the analogies stop.
The intuition about RBMs is different since (effectively) they operate at a non-zero Temperature. Additionally, it is unclear to this blogger if the proper description of deep learning is a mean field spin glass, with many useful local minima, or a strongly correlated system, which may behave very differently, and more like a funneled energy landscape.
Thouless-Anderson-Palmer Theory of Spin Glasses
The TAP theory is one of the classic analytic tools used to study spin glasses and even the Hopfield Model.
We will derive the EMF RBM method following the Thouless-Anderson-Palmer (TAP) approach to spin glasses.
On a side note, the TAP method introduces us to 2 more Nobel Laureates:
The TAP theory, published in 1977, presented a formal approach to study the thermodynamics of mean field spin glasses. In particular, the TAP theory provides an expression for the average spin
where is like an activation function, C is a constant, and the MeanField and Onsager terms are like our terms above.
In 1977, they argued that the TAP approach would hold for all Temperatures (and external fields), although this was only proven some 25 years later by Talagrand. It is these relatively new, rigorous approaches that are cited by Deep Learning researchers like LeCun, Chaudhari, etc. But many of the cited results had already been suggested using the TAP approach. In particular, the structure of the Energy Landscape has been understood by looking at the stationary points of the TAP free energy.
More importantly, the TAP approach can be operationalized, as a new RBM solver.
We start with the RBM Free Energy .
Introduce an ‘external field’ which couples to the spins, adding a linear term to the Free Energy
Physically, would be an external magnetic field which drives the system out-of-equilibrium.
As is standard in statistical mechanics, we take the Legendre Transform, in terms of a set of conjugate variables . These are the magnetizations of each spin under the applied field, and describe how the spins behave outside of equilibrium
The transform effectively defines a new interaction ensemble . We now set , and note
Define an interaction Free Energy to describe the interaction ensemble
which equals the original Free Energy when .
Note that because we have visible and hidden spins (or nodes), we will identify magnetizations for each
Now, recall we want to avoid the glassy phase; this means we keep the Temperature high, or $latex \beta $ low.
We form a low $latex \beta $ Taylor series expansion in the new ensemble
which, at low order in $latex \beta $, we expect to be reasonably accurate even away from equilibrium, at least at high Temp.
This leads to an order-by-order expansion for the Free Energy . The first order ($latex O(\beta) $) correction is the mean field term. The second order term ($latex O(\beta^{2}) $) is the Onsager correction.
Up to $latex O(\beta^{2}) $, we have
or
Rather than assume equilibrium, we assume that, at each step during inference (at fixed ), the system satisfies a quasi-stationary condition. Each step reaches a local saddle point in phase space, s.t.
Applying the stationary conditions lets us write coupled equations for the individual magnetizations that effectively define the (second order), high Temp, quasi-equilibrium states
Notice that these resemble the RBM conditional probabilities; in fact, at first order in $latex \beta $, they are the same.
gives a mean field theory, and
And in the late 90’s, mean field TAP theory was tried, unsuccessfully, as a way to create an RBM solver.
At second order, the magnetizations are coupled through the Onsager corrections. To solve them, we can write down the fixed point equations, shown above.
We can include higher order corrections to the Free Energy by including more terms in the Taylor Series. This is called a Plefka expansion. The terms can be represented using diagrams
Plefka derived these terms in 1982, although it appears he only published up to the Onsager correction; a recent paper shows how to obtain all high order terms.
The Diagrammatic expansion appears not to have been fully worked out, and is only sketched above.
I can think of at least 3 ways to include these higher terms:
This is similar, in some sense, to the infinite RBM by Larochelle, which uses a resummation trick to include an infinite number of Hidden nodes.
, where
,
Obviously there are lots of interesting things to try.
The current python EMF_RBM only treats binary data, just like the scikit-learn BernoulliRBM. So for, say, MNIST, we have to use the binarized MNIST.
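For readers who have not binarized data before, here is a minimal sketch of two common ways to do it, assuming the pixel intensities have already been scaled to [0, 1]. The helper name `binarize` and its parameters are hypothetical, not part of EMF_RBM or scikit-learn.

```python
import numpy as np

def binarize(X, threshold=0.5, stochastic=False, seed=0):
    """Map grey-level values in [0, 1] to {0, 1}.

    stochastic=False: hard threshold.
    stochastic=True: treat each value as a Bernoulli probability and sample.
    """
    X = np.asarray(X, dtype=float)
    if stochastic:
        rng = np.random.default_rng(seed)
        return (rng.random(X.shape) < X).astype(np.uint8)
    return (X > threshold).astype(np.uint8)
```

Thresholding is deterministic; the stochastic variant keeps the grey levels, on average, as sampling frequencies.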
There is some advantage to using Binarized Neural Networks on GPUs.
Still, a non-binary RBM may be useful. Tramel et al. have suggested how to use real-valued data in the EMF_RBM, although in the context of Compressed Sensing Using Generalized Boltzmann Machines.
This is the question posed by a recent article. Deep Learning seems to require knowing the Partition Function–at least in old fashioned Restricted Boltzmann Machines (RBMs).
Here, I will discuss some aspects of this paper, in the context of RBMs.
We can use RBMs for unsupervised learning, as a clustering algorithm, for pretraining larger nets, and for generating sample data. Mostly, however, RBMs are an older, toy model useful for understanding unsupervised Deep Learning.
We define an RBM with an Energy Function
and its associated Partition Function
The joint probability is then
and the probability of the visible units is computed by marginalizing over the hidden units
Note that this is also the probability of observing the data X={v}, given the weights W.
The log Likelihood is just the log of this probability
We can break this into 2 parts: $\ln p(\mathbf{v}) = -F^c(\mathbf{v}) + F$
One part, $F = -\ln Z$, is just the standard Free Energy.
We call the other part, $F^c(\mathbf{v}) = -\ln \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$, the clamped Free Energy
because it is like a Free Energy, but with the visible units clamped to the data X.
The clamped $F^c$ is easy to evaluate in the RBM formalism, whereas $F$ is computationally intractable.
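To make the split concrete, here is a small brute force sketch for a toy RBM. It assumes binary (0/1) units and the common energy $E(\mathbf{v},\mathbf{h}) = -\mathbf{a}\cdot\mathbf{v} - \mathbf{b}\cdot\mathbf{h} - \mathbf{v}^T W \mathbf{h}$, with the standard Free Energy taken as $-\ln Z$ and the clamped Free Energy as $-\ln \sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}$; the function names are mine. The clamped term has a closed form, while the full Free Energy needs a sum over all $2^{n_v + n_h}$ states, which is only feasible here because the model is tiny.

```python
import itertools
import numpy as np

def all_states(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def free_energy(W, a, b):
    """F = -ln Z, with Z summed over every (v, h) configuration."""
    V, H = all_states(len(a)), all_states(len(b))
    E = -(V @ a)[:, None] - (H @ b)[None, :] - V @ W @ H.T
    return -np.log(np.exp(-E).sum())

def clamped_free_energy_brute(v, W, a, b):
    """F_c(v) = -ln sum_h exp(-E(v, h)), summing over hidden states only."""
    H = all_states(len(b))
    E = -(v @ a) - H @ b - H @ (W.T @ v)
    return -np.log(np.exp(-E).sum())

def clamped_free_energy(v, W, a, b):
    """Closed form: F_c(v) = -a.v - sum_j ln(1 + exp(b_j + (v W)_j))."""
    return -(v @ a) - np.sum(np.logaddexp(0.0, b + v @ W))
```

With these definitions, `free_energy(W, a, b) - clamped_free_energy(v, W, a, b)` is the log probability of `v`, and exponentiating it over all visible states sums to one.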
Knowing the Partition function $Z$ is ‘like’ knowing the equilibrium distribution function, and methods like RBMs appear to approximate $Z$ in some form or another.
Training an RBM proceeds iteratively by approximating the Free Energies at each step,
and then updating W with a gradient step
RBMs are usually trained via Contrastive Divergence (CD or PCD). The Energy function, being quadratic, lets us readily factor Z using a mean field approximation, leading to simple expressions for the conditional probabilities
and the weight update rule
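For reference, these are the standard textbook forms of the two expressions for binary units, written in my own notation (visible biases $a_i$, hidden biases $b_j$, learning rate $\epsilon$); the original figures may have used different symbols or sign conventions.

```latex
p(h_j = 1 \mid \mathbf{v}) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big), \qquad
p(v_i = 1 \mid \mathbf{h}) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big), \qquad
\sigma(x) = \frac{1}{1 + e^{-x}}

\Delta W_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}}
                              - \langle v_i h_j \rangle_{\text{model}} \right)
```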
RBM codes may use the terminology of positive and negative phases:
Positive phase: The expectation is evaluated, or clamped, on the data.
Negative phase: The expectation is to be evaluated on the prior distribution. We also say it is evaluated in the limit of infinite sampling, at the so-called equilibrium distribution. But we don’t take the infinite limit.
CD approximates the negative phase expectation, effectively evaluating the (mean field) Free Energy, by running only 1 (or more) steps of Gibbs Sampling.
So we may see
or, more generally, and in some code bases, something effectively like
Initialize the positive and negative
Run N iterations of:
Run 1 Step of Gibbs Sampling to get the negative :
sample the hiddens given the (current) visibles:
sample the visibles given the hiddens (above):
Calculate the weight gradient:
Apply Weight decay or other regularization (optional):
Apply a momentum (optional):
Update the Weights:
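The steps above can be sketched in NumPy roughly as follows. This is a minimal CD-1 loop under my own assumptions, not any particular code base: full batch updates, binary units, hidden probabilities (rather than samples) in the gradient, and hypothetical names and default hyperparameters throughout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_error(X, W, a, b):
    """Mean squared error of a deterministic (mean field) reconstruction."""
    h_prob = sigmoid(b + X @ W)
    v_prob = sigmoid(a + h_prob @ W.T)
    return float(np.mean((X - v_prob) ** 2))

def train_cd1(X, n_hidden, n_iter=500, lr=0.5, momentum=0.5,
              weight_decay=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n_samples, n_visible = X.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)   # visible biases
    b = np.zeros(n_hidden)    # hidden biases
    dW = np.zeros_like(W)     # previous update, reused by the momentum term
    for _ in range(n_iter):
        # Positive phase: hiddens driven by the data (clamped visibles).
        h_pos = sigmoid(b + X @ W)
        # One step of Gibbs sampling to get the negative phase.
        h_sample = (rng.random(h_pos.shape) < h_pos).astype(float)
        v_prob = sigmoid(a + h_sample @ W.T)
        v_neg = (rng.random(v_prob.shape) < v_prob).astype(float)
        h_neg = sigmoid(b + v_neg @ W)
        # Weight gradient: <v h>_data - <v h>_reconstruction.
        grad_W = (X.T @ h_pos - v_neg.T @ h_neg) / n_samples
        grad_W -= weight_decay * W          # optional weight decay
        dW = momentum * dW + lr * grad_W    # optional momentum
        W += dW
        a += lr * np.mean(X - v_neg, axis=0)
        b += lr * np.mean(h_pos - h_neg, axis=0)
    return W, a, b
```

For example, `train_cd1(X, n_hidden=2)` on a binary array `X` of shape `(n_samples, n_visible)` returns the learned weights and biases, and `reconstruction_error` gives a quick sanity check on them.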
What is Cheap about learning? A technical proof in the Appendix notes that
knowing the Partition function $Z$ is not the same as knowing the underlying distribution $p(\mathbf{v})$.
This is because the Energy can be rescaled, or renormalized, in many different ways, without changing $Z$.
This is also a key idea in Statistical Mechanics.
The Partition function is a generating function; we can write all the macroscopic, observable thermodynamic quantities as partial derivatives of $\ln Z$. And we can do this without knowing the exact distribution functions or energies, just their renormalized forms.
Of course, our W update rule is a derivative of $\ln Z$.
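The generating function property is easy to check numerically on a toy model. The sketch below assumes binary units and the energy $E(\mathbf{v},\mathbf{h}) = -\mathbf{a}\cdot\mathbf{v} - \mathbf{b}\cdot\mathbf{h} - \mathbf{v}^T W \mathbf{h}$ (my notation), and compares a finite difference derivative of $\ln Z$ with respect to each weight against the exact model expectation $\langle v_i h_j \rangle$, which is the negative phase term of the weight update.

```python
import itertools
import numpy as np

def all_states(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def log_Z(W, a, b):
    V, H = all_states(len(a)), all_states(len(b))
    neg_E = (V @ a)[:, None] + (H @ b)[None, :] + V @ W @ H.T
    return np.log(np.exp(neg_E).sum())

def model_correlations(W, a, b):
    """<v_i h_j> under the exact joint distribution p(v, h) = exp(-E) / Z."""
    V, H = all_states(len(a)), all_states(len(b))
    neg_E = (V @ a)[:, None] + (H @ b)[None, :] + V @ W @ H.T
    p = np.exp(neg_E - log_Z(W, a, b))
    return V.T @ p @ H

def finite_difference_grad(W, a, b, eps=1e-6):
    """Central difference estimate of d ln Z / d W_ij."""
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            grad[i, j] = (log_Z(Wp, a, b) - log_Z(Wm, a, b)) / (2 * eps)
    return grad
```

The two arrays agree to numerical precision, which is the sense in which $\ln Z$ generates the model expectations.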
The proof is technically straightforward, albeit a bit odd at first.
Let’s start with the visible units $\mathbf{v}$. Write
We now introduce the hidden units, , into the model, so that we have a new, joint probability distribution
and a new, Renormalized, partition function
Where RG means Renormalization Group. We have already discussed that the general RBM approach resembles the Kadanoff Variational Renormalization Group (VRG) method, circa 1975. This new paper points out a small but important technical oversight made in the ML literature, namely that
having $Z_{RG} = Z$ does not imply $p_{RG}(\mathbf{v}) = p(\mathbf{v})$
That is, just because we can estimate the Partition function well does not mean we know the probability distributions.
Why? Define an arbitrary non-constant function $K(\mathbf{v})$ and write
.
K is for Kadanoff RG Transform, and ln K is the normalization.
We can now write a joint Energy with the same Partition function as our RBM, but with completely different joint probability distributions. Let
Notice what we are actually doing. We use the K matrix to define the RBM joint Energy function. In RBM theory, we restrict to a quadratic form, and use a variational procedure to learn the weights $W$, thereby learning K.
In a VRG approach, we have the additional constraint that we restrict the form of K to satisfy constraints on its partition function, or, really, how the Energy function is normalized. Hence the name ‘Renormalization.’ This is similar, in spirit, but not necessarily in form, to how the RBM training regularizes the weights (above).
Write the total, or renormalized, Z as
Expanding the Energy function explicitly, we have
where the Kadanoff normalization factor now appears in the denominator.
We can break the double sum into sums over v and h
Identify in the numerator
which factors out, giving a very simple expression in h
In the technical proof in the paper, the idea is that since h is just a dummy variable, we can replace h with v. We have to be careful here, since this seems to apply only to the case where we have the same number of hidden and visible units, which is a rare case. In an earlier post on VRG, I explain more clearly how to construct an RG transform for RBMs. Still, the paper is presenting a counterargument for argument’s sake, so, following the argument in the paper, let’s say
This is like saying we constrain the Free Energy at each layer to be the same.
This is also another kind of Layer Normalization–a very popular method for modern Deep Learning methods these days.
So, by construction, the renormalized and data Partition functions are identical
The goal of Renormalization Group theory is to redefine the Energy function on a different scale, while retaining the macroscopic observables.
But, and apparently this has been misstated in some ML papers and books, the marginalized probabilities can be different!
To get the marginals, let’s integrate out only the h variables
Looking above, we can write this in terms of K and its normalization
which implies
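The whole argument can be checked numerically on a toy example. The sketch below is my reading of the construction, under stated assumptions: the hidden variable ranges over the same configurations as the visible one and uses the same energy function, and the joint energy is $E(\mathbf{v}) + K(\mathbf{v}) + E(\mathbf{h}) + \ln Z_K$, where $Z_K = \sum_{\mathbf{v}} e^{-[E(\mathbf{v}) + K(\mathbf{v})]}$ is the normalization. The name `Z_K` and the function name are hypothetical.

```python
import numpy as np

def same_Z_different_marginals(E_v, K_v):
    """E_v, K_v: arrays holding E(v) and K(v) for each visible configuration.

    The hidden variable h is assumed to range over the same configurations
    as v, with the same energy function E.
    """
    Z = np.exp(-E_v).sum()                    # data partition function
    Z_K = np.exp(-(E_v + K_v)).sum()          # assumed Kadanoff normalization
    # Joint energy E_RG(v, h) = E(v) + K(v) + E(h) + ln Z_K
    E_joint = (E_v + K_v)[:, None] + E_v[None, :] + np.log(Z_K)
    Z_RG = np.exp(-E_joint).sum()             # sum over both v and h
    p_data = np.exp(-E_v) / Z
    p_RG = np.exp(-E_joint).sum(axis=1) / Z_RG   # integrate out h only
    return Z, Z_RG, p_data, p_RG
```

For any non-constant `K_v`, the two partition functions come out equal while the marginals `p_data` and `p_RG` differ; for a constant `K_v` the marginals coincide.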
RBMs let us represent data using a smaller set of hidden features. This is, effectively, a Variational Renormalization Group algorithm, in which we approximate the Partition function, at each step in the RBM learning procedure, without having to learn the underlying joint probability distribution. And this is easier. Cheaper.
In other words, Deep Learning is not Statistics. It is more like Statistical Mechanics.
And the hope is that we can learn from this old scientific field — which is foundational to chemistry and physics — to improve our deep learning models.
Shortly after this paper came out, a Comment on “Why does deep and cheap learning work so well?” argued that the proof in the Appendix is indeed wrong, as I suspected and pointed out above.
It is noted that the point of the RG theory is to preserve the Free Energy from one layer to another, and, in VRG, this is expressed as a trace condition on the Transfer operator
where
It is, however, technically possible to preserve the Free Energy and not preserve the trace condition. Indeed, this is the case here, because $K(\mathbf{v})$ is non-constant, thereby violating the trace condition.
From this blogger’s perspective, the idea of preserving Free Energy, via either a trace condition, or, say, by layer normalization, is the important point. And this may mean only approximately satisfying the trace condition.
In Quantum Chemistry, there is a similar requirement, referred to as a Size-Consistency and/or Size-Extensivity condition. And these requirements have proven essential to obtaining highly accurate, ab initio solutions of the molecular electronic Schrödinger equation, whether implemented exactly or approximately.
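Here is a small numerical sketch of that last point. It rests on several assumptions of mine: the trace condition is taken in the usual Kadanoff form $\sum_{\mathbf{h}} e^{T(\mathbf{v},\mathbf{h})} = 1$ for every $\mathbf{v}$, the joint energy is written as $E(\mathbf{v}) - T(\mathbf{v},\mathbf{h})$, and the counterexample corresponds to $T(\mathbf{v},\mathbf{h}) = -K(\mathbf{v}) - E(\mathbf{h}) - \ln Z_K$ with $Z_K = \sum_{\mathbf{v}} e^{-[E(\mathbf{v}) + K(\mathbf{v})]}$. All names are hypothetical.

```python
import numpy as np

def trace_over_hidden(E_v, K_v):
    """Return sum_h exp(T(v, h)) for each v, plus Z and the renormalized Z.

    E_v, K_v: arrays holding E(v) and K(v) for each visible configuration;
    h is assumed to range over the same configurations, with the same E.
    """
    Z = np.exp(-E_v).sum()
    Z_K = np.exp(-(E_v + K_v)).sum()
    # Assumed transfer operator T(v, h) = -K(v) - E(h) - ln Z_K
    T = -K_v[:, None] - E_v[None, :] - np.log(Z_K)
    trace_h = np.exp(T).sum(axis=1)            # one value per v
    Z_RG = (np.exp(-E_v) * trace_h).sum()      # sum_v exp(-E(v)) * trace
    return trace_h, Z, Z_RG
```

With a constant `K_v` every entry of `trace_h` equals one; with a non-constant `K_v` the entries vary with `v`, yet `Z_RG` still equals `Z`.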
And, I suspect, a similar argument, at least in spirit if not in proof, is at play in Deep Learning.
Please chime in on my YouTube Community Channel
see also: https://m.reddit.com/r/MachineLearning/comments/4zbr2k/what_is_your_opinion_why_is_the_concept_of/
Stay tuned for a video link, and a blog post to accompany this.
Comments, questions, and bug fixes are welcome.