This paper, along with Hinton’s wake-sleep algorithm, set the foundations for modern variational learning. These ideas appear in his RBMs, and more recently, in Variational AutoEncoders (VAEs).
Of course, Free Energies come from Chemical Physics. And this is not surprising, since Hinton’s graduate advisor was a famous theoretical chemist.
They are so important that Karl Friston has proposed The Free Energy Principle: A Unified Brain Theory?
(see also the Wikipedia article and this 2013 review)
What are Free Energies, and why do we use them in Deep Learning?
In (Unsupervised) Deep Learning, Energies are quadratic forms over the weights. In an RBM, one has $E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{T}\mathbf{v} - \mathbf{b}^{T}\mathbf{h} - \mathbf{v}^{T}\mathbf{W}\mathbf{h}$. This is the T=0 configurational Energy, where each configuration is some pair $(\mathbf{v},\mathbf{h})$. In chemical physics, these Energies resemble an Ising model.
The Free Energy is a weighted average over all the global and local minima, $F = -T\ln\sum_{i}e^{-E_{i}/T}$. Note: as $T\rightarrow 0$, the Free Energy becomes the T=0 global Energy minimum. In the limit of zero Temperature, all the terms in the sum approach zero, and only the largest term, the largest negative Energy, survives.
We may also see F written in terms of the partition function Z: $F = -T\ln Z$, where $Z = \sum_{i}e^{-E_{i}/T}$, and where brackets $\langle\cdot\rangle$ denote an equilibrium average, an expected value over some equilibrium probability distribution. (We don’t normalize with 1/N here; in principle, the sum could be infinite.)
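To make the T → 0 limit concrete, here is a small numerical sketch (my own toy energies, not from any paper) of $F = -T\ln Z$, showing the Free Energy approach the global Energy minimum as the Temperature drops:

```python
import numpy as np

def free_energy(energies, T):
    """F = -T log Z, with Z = sum_i exp(-E_i / T)."""
    E = np.asarray(energies, dtype=float)
    # log-sum-exp trick for numerical stability at small T
    m = (-E / T).max()
    return -T * (m + np.log(np.exp(-E / T - m).sum()))

E = [-4.0, -3.9, -1.0, 2.0]   # one global minimum plus some local minima

# At T=1, F is a weighted average over all the basins (it lies below min E)
print(free_energy(E, 1.0))
# As T -> 0, only the largest negative Energy survives
print(free_energy(E, 0.01))   # ~ -4.0, the global minimum
```

The T=1 value sits below the global minimum because the Entropy of the nearby local minima lowers F.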
Of course, in deep learning, we may be trying to determine the distribution P, and/or we may approximate it with some simpler distribution Q during inference. (From now on, I just write P and Q for convenience.)
But there is more to Free Energy learning than just approximating a distribution.
In a chemical system, the Free Energy averages over all global and local minima below the Temperature T–with barriers below T as well. It is the Energy available to do work.
For convenience, Hinton explicitly set T=1. Of course, he was doing inference, and did not know the scale of the weights W. Since we don’t specify the Energy scale, we learn the scale implicitly when we learn W. We call this being scale-free.
So in the T=1, scale-free case, the Free Energy implicitly averages over all Energy minima below the implicit Temperature, as we learn the weights W. Free Energies solve the problem of Neural Nets being non-convex by averaging over the global minima and nearby local minima.
Because Free Energies provide an average solution, they can even provide solutions to highly degenerate non-convex optimization problems:
They will fail, however, when the barriers between Energy basins are larger than the Temperature.
This can happen if the effective Temperature drops close to zero during inference. Since T=1 implicitly during inference, this means the weights W are exploding.
See: Normalization in Deep Learning
Systems may also get trapped if the Energy barriers grow very large–as, say, in the glassy phase of a mean field spin glass, or a supercooled liquid–the so-called Adam–Gibbs phenomenon. I will discuss this in a future post.
In either case, if the system, or solver, gets trapped in a single Energy basin, it may appear to be convex, and/or flat (the Hessian has lots of zeros). But this is probably not the optimal solution to learning when using a Free Energy method.
It is sometimes argued that Deep Learning is a non-convex optimization problem. And yet, it has been known for over 20 years that networks like CNNs don’t suffer from the problems of local minima. How can this be?
At least for unsupervised methods, it has been clear since 1987 that:
An important property of the effective [Free] Energy function E(V,0,T) is that it has a smoother landscape than E(S) [T=0] …
Hence, the probability of getting stuck in a local minima decreases
Although this is not specifically how Hinton argued for the Helmholtz Free Energy — a decade later.
Why do we use Free energy methods ? Hinton used the bits-back argument:
Imagine we are encoding some training data and sending it to someone for decoding. That is, we are building an Auto-Encoder.
If we have only 1 possible encoding, we can use any vanilla encoding method, and the receiver knows what to do.
But what if we have 2 or more equally valid codes?
Can we save 1 bit by being a little vague?
Suppose we have N possible encodings, each with Energy $E_{i}$. We say the data has stochastic complexity.
Pick a coding with probability $p_{i}$ and send it to the receiver. The expected cost of encoding is $\langle E \rangle = \sum_{i}p_{i}E_{i}$.
Now the receiver must guess which encoding we used. The decoding cost of the receiver is $F = \langle E \rangle - H$ (with T=1 implicitly),
where H is the Shannon Entropy of the random encoding, $H = -\sum_{i}p_{i}\ln p_{i}$.
The decoding cost looks just like a Helmholtz Free Energy.
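A small sketch of the bits-back idea (my own toy numbers): choosing among the codes with Boltzmann probabilities gives a lower decoding cost $F = \langle E \rangle - TH$ than deterministically sending the single cheapest code:

```python
import numpy as np

def decoding_cost(p, E, T=1.0):
    """F = <E> - T*H : expected coding cost, minus the bits we get back."""
    p = np.asarray(p, float); E = np.asarray(E, float)
    H = -np.sum(p * np.log(p + 1e-12))      # Shannon entropy (in nats)
    return np.sum(p * E) - T * H

E = np.array([1.0, 1.1, 3.0])               # energies of 3 equally valid codes

# Deterministic choice: always send the cheapest code (entropy ~ 0)
det = decoding_cost([1.0, 1e-12, 1e-12], E)

# Boltzmann choice at T=1 minimizes F, down to F = -log Z
boltz = np.exp(-E) / np.exp(-E).sum()
print(decoding_cost(boltz, E) < det)        # True: being 'vague' saves bits
```

The minimum of F over all distributions is exactly $-\ln Z$, achieved by the Boltzmann distribution.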
Moreover, we can use a sub-optimal encoding, and they suggest using a Factorized (i.e. mean field) Feed Forward Net to do this.
To understand this better, we need to relate these coding ideas to classical thermodynamics.
In 1957, Jaynes formulated the MaxEnt principle which considers equilibrium thermodynamics and statistical mechanics as inference processes.
In 1995, Hinton formulated the Helmholtz Machine and showed us how to define a quasi-Free Energy.
In Thermodynamics, the Helmholtz Free Energy F(T,V,N) is an Energy that depends on Temperature instead of Entropy. We need the Temperature, $T = \left(\frac{\partial E}{\partial S}\right)_{V,N}$,
and F is defined as $F = E - TS$.
In ML, we set T=1. Really, the Temperature equals how much the Energy changes with a change in Entropy (at fixed V and N).
Variables like E and S depend on the system size N. That is, $E, S \sim \mathcal{O}(N)$ as $N \rightarrow \infty$.
We say S and T are conjugate pairs; S is extensive, T is intensive.
(see more on this in the Appendix)
The conjugate pairs are used to define Free Energies via the Legendre Transform:
Helmholtz Free Energy: F(T) = E(S) – TS
We switch the Energy from depending on S to T, where $T = \frac{\partial E}{\partial S}$.
Why? In a physical system, we may know the Energy function E, but we can’t directly measure or vary the Entropy S. However, we are free to change and measure the Temperature–the derivative of E w.r.t. S: $T = \left(\frac{\partial E}{\partial S}\right)_{V,N}$.
This is a powerful and general mathematical concept.
Say we have a convex function f(x,y,z), but we can’t actually vary x. But we do know the slope, w, everywhere along x: $w = \frac{\partial f}{\partial x}$.
Then we can form the Legendre Transform, which gives g(w,y,z) as
the ‘Tangent Envelope‘ of f() along x,
$g(w,y,z) = f(x,y,z) - wx$, evaluated at the x where $w = \frac{\partial f}{\partial x}$,
or, simply, $g = f - wx$.
Note: we have converted a convex function into a concave one. The Legendre transform is concave in the intensive variables and convex in the extensive variables.
Of course, the true Free Energy F is convex; this is central to Thermodynamics (see Appendix). But that is because while it is concave in T, we evaluate it at constant T.
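A tiny numerical sketch of the transform, using the convex-analysis sign convention $f^{*}(w) = \sup_{x}[wx - f(x)]$ (my own example, not from the original discussion); for $f(x) = x^{2}$ the transform is known analytically, $f^{*}(w) = w^{2}/4$:

```python
import numpy as np

def legendre_fenchel(f, w, xs):
    """f*(w) = sup_x [ w*x - f(x) ], evaluated numerically on a grid xs."""
    return np.max(w * xs - f(xs))

f = lambda x: x**2                    # a simple convex function
xs = np.linspace(-10, 10, 200001)     # fine grid over x

# Analytically, f*(w) = w^2 / 4 for f(x) = x^2 (sup attained at x = w/2)
for w in [0.0, 1.0, 2.0, -3.0]:
    assert abs(legendre_fenchel(f, w, xs) - w**2 / 4) < 1e-4
```

For a non-convex f, the same sup-based formula returns the transform of the convex hull, which is the "convex relaxation" referred to below.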
But what if the Energy function is not convex in the Entropy? Or suppose we extract a pseudo-Entropy from sampling some data, and we want to define a free energy potential (i.e. as in protein folding). These postulates also fail in systems like the spin chains discussed in an earlier blog post.
Answer: Take the convex hull
When a convex Free Energy can not readily be defined as above, we can use the generalized Legendre–Fenchel Transform, which provides a convex relaxation via the Tangent Envelope: $f^{*}(w) = \sup_{x}\,[\,wx - f(x)\,]$.
The Legendre–Fenchel Transform can provide a Free Energy, convexified along the direction of the internal (configurational) Entropy, allowing the Temperature to control how many local Energy minima are sampled.
Extra stuff I just wanted to write down…
If we assume T=1 at all times, and we assume our Deep Learning Energies are extensive–as they would be in an actual thermodynamic system–then the weight norm constraints act to enforce the size-extensivity: as $n \rightarrow \infty$, if $E(n) \sim \mathcal{O}(n)$, and $\Vert\mathbf{W}\Vert \leq M$, then W should remain bounded to prevent the Energy E(n) from growing faster than Mn. And, of course, most Deep Learning algorithms do bound W in some form.
Check out my recent chat with Max Mautner, the Accidental Engineer
http://theaccidentalengineer.com/charles-martin-principal-consultant-calculation-consulting/
It builds upon Batch Normalization (BN), introduced in 2015, which is now the de facto standard for all CNNs and RNNs–but not so useful for FNNs.
What makes normalization so special? It makes very Deep Networks easier to train, by damping out oscillations in the distribution of activations.
To see this, the diagram below uses data from Figure 1 (from the BN paper) to depict how the distribution evolves for typical node outputs in the last hidden layer of a typical network:
Very Deep nets can be trained faster and generalize better when the distribution of activations is kept normalized during BackProp.
We regularly see Ultra-Deep ConvNets like Inception, Highway Networks, and ResNet. And giant RNNs for speech recognition, machine translation, etc. But we don’t see powerful Feedforward Neural Nets (FNNs) with more than 4 layers. Until now.
Batch Normalization is great for CNNs and RNNs.
But we still cannot build deep MLPs.
This new method — Self-Normalization — has been proposed for building very deep MultiLayer Perceptions (MLPs) and other Feed Forward Nets (FNNs).
The idea is just to tweak the Exponential Linear Unit (ELU) activation function to obtain a Scaled ELU (SELU):
With this new SELU activation function, and a new, alpha Dropout method, it appears we can, now, build very deep MLPs. And this opens the door for Deep Learning applications on very general data sets. That would be great!
The paper is, however, ~100 pages long of pure math! Fun stuff.. but a summary is in order.
I review Normalization in Neural Networks, including Batch Normalization, Self-Normalization, and, of course, some statistical mechanics (it’s kinda my thing).
This is an early draft of the post: comments and questions are welcome
WLOG, consider an MLP, where we call the input to each layer u
The linear transformation at each layer is $\mathbf{x} = \mathbf{W}\mathbf{u} + \mathbf{b}$,
and we apply standard point-wise activations, like a sigmoid, $g(x) = \dfrac{1}{1+e^{-x}}$,
so that the total set of activations (at each layer) takes the form $\mathbf{z} = g(\mathbf{W}\mathbf{u} + \mathbf{b})$.
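A minimal numpy sketch of one such layer (shapes and names are my own, purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer(u, W, b):
    """One MLP layer: point-wise activation of the linear transform W u + b."""
    return sigmoid(W @ u + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # 3 hidden nodes, 4 inputs
b = np.zeros(3)
u = rng.normal(size=4)

z = layer(u, W, b)
print(z.shape)                  # (3,) -- one activation per hidden node
```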
The problem is that during SGD training, the distribution of weights W and/or the outputs x can vary widely from iteration to iteration. These large variations lead to instabilities in training that require small learning rates. In particular, if the layer weights W or inputs u blow up, the activations can become saturated, $g(x) \rightarrow 0 \text{ or } 1$,
leading to vanishing gradients. Traditionally, this was avoided in MLPs by using smaller learning rates and/or early stopping.
One solution is better activation functions, such as a Rectified Linear Unit (ReLU), $\mathrm{relu}(x) = \max(0, x)$,
or, for larger networks (depth > 5), an Exponential Linear Unit (ELU): $\mathrm{elu}(x) = x$ for $x > 0$, and $\alpha(e^{x}-1)$ for $x \leq 0$.
Indeed, sigmoid and tanh activations came from early work in computational neuroscience. Jack Cowan first proposed the sigmoid function as a model for neuronal activity, and sigmoid and tanh functions arise naturally in statistical mechanics. And sigmoids are still widely used for RBMs and MLPs–ReLUs don’t help much here.
SGD training introduces perturbations in training that propagate through the net, causing large variations in weights and activations. For FNNs, this is a huge problem. But for CNNs and RNNs... not so much. Why?
It has been said that no real theoretical progress has been made in deep nets in 30 years. That is absurd. We did not have ReLUs or ELUs. In fact, up until Batch Normalization, we were still using SVM-style regularization techniques for Deep Nets. It is clear now that we need to rethink generalization in deep learning.
We can regularize a network, like a Restricted Boltzmann Machine (RBM), by applying max norm constraints to the weights W.
This can be implemented in training by tweaking the weight update at the end of a pass over all the training data: $\mathbf{W} \leftarrow \mathbf{W}\,\dfrac{c}{\Vert\mathbf{W}\Vert}$ whenever $\Vert\mathbf{W}\Vert > c$,
where $\Vert\cdot\Vert$ is an L1 or L2 norm.
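A sketch of one common variant of this constraint (here I clip the whole-matrix L2 norm; per-unit row norms are also used in practice):

```python
import numpy as np

def max_norm(W, c=3.0):
    """Renormalize W after an update if its L2 norm exceeds the bound c."""
    norm = np.linalg.norm(W)
    return W if norm <= c else W * (c / norm)

# An 'exploded' weight matrix gets rescaled back onto the norm ball
W = np.random.default_rng(1).normal(size=(10, 10)) * 5.0
W = max_norm(W, c=3.0)
assert np.linalg.norm(W) <= 3.0 + 1e-9
```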
I have conjectured that this is actually a kind of Temperature control, which prevents the effective Temperature of the network from collapsing to zero.
By avoiding a low Temp, and possibly any glassy regimes, we can use a larger effective annealing rate–in modern parlance, larger SGD step sizes.
It makes the network more resilient to changes in scale.
After 30 years of research on neural nets, we can now achieve an analogous network normalization automagically.
But first, what is the current state-of-the-art in code? What can we do today with Keras?
Batch Normalization (BN) Transformation
Tensorflow and other Deep Learning frameworks now include Batch Normalization out-of-the-box. Under-the-hood, this is the basic idea:
At the end of every mini-batch, the layers are whitened. For each node output x (and before activation),
the BN Transform maintains the (internal) zero mean and unit variance.
We evaluate the sample mini-batch mean $\mu_{B}$ and variance $\sigma_{B}^{2}$, and then normalize, scale, and shift the values: $\hat{x} = \dfrac{x - \mu_{B}}{\sqrt{\sigma_{B}^{2} + \epsilon}}$, $y = \gamma\hat{x} + \beta \equiv BN(x)$.
The final transformation is applied inside the activation function g(): $g(BN(\mathbf{W}\mathbf{u}) + b)$,
although we can absorb the original layer bias term b into the BN transform, giving $g(BN(\mathbf{W}\mathbf{u}))$.
So now, instead of renormalizing the weights W after passing over all the data, we can normalize the node output x=Wu explicitly, for each mini-batch, in the BackProp pass.
Note that $BN(a\mathbf{W}\mathbf{u}) = BN(\mathbf{W}\mathbf{u})$ for any scalar a,
so bounding the weights with max-norm constraints got us part of the way already.
Note that extra scaling and shift parameters appear for each batch (k), and it is necessary to optimize these parameters as a side step during training.
At the end of the transform, we can normalize the network outputs (shown above) over the entire training set (population), $y = \gamma\,\dfrac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta$,
where the final statistics are computed as, say, unbiased estimates over all (m) mini-batches of the training data: $\mathrm{E}[x] = \mathrm{E}_{B}[\mu_{B}]$, $\mathrm{Var}[x] = \frac{m}{m-1}\mathrm{E}_{B}[\sigma_{B}^{2}]$.
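Putting the mini-batch step together, here is a minimal numpy sketch of the BN transform (the code and names are my own illustration; epsilon and the learnable gamma, beta follow the BN paper):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """BN over a mini-batch of node outputs x: normalize, then scale and shift.
    x has shape (batch_size, n_nodes)."""
    mu = x.mean(axis=0)                    # mini-batch mean, per node
    var = x.var(axis=0)                    # mini-batch variance, per node
    x_hat = (x - mu) / np.sqrt(var + eps)  # whiten
    return gamma * x_hat + beta            # learnable scale and shift

# A badly-scaled batch comes out with zero mean and unit variance
x = np.random.default_rng(2).normal(loc=5.0, scale=3.0, size=(64, 8))
y = batch_norm(x)
print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))   # True
```

In a real framework, gamma and beta are optimized by BackProp along with the weights, which is the "side step" mentioned above.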
The key to Batch Normalization (BN) is that:
BN allows us to manipulate the activation function of the network. It is a differentiable transformation that normalizes activations in the network.
It makes the network (even) more resilient to the parameter scale.
It has been known for some time that Deep Nets perform better if the inputs are whitened. And max-norm constraints do re-normalize the layer weights after every mini-batch.
Batch normalization appears to be more stable internally, with the advantages that it:
Still, Batch Norm training slows down BackProp. Can we speed it up ?
A few days ago, the Interwebs was buzzing about the paper Self-Normalizing Neural Networks. HackerNews. Reddit. And my LinkedIn Feed.
These nets use Scaled Exponential Linear Units (SELU), which have implicit self-normalizing properties. Amazingly, the SELU is just an ELU multiplied by a scalar $\lambda > 1$: $\mathrm{selu}(x) = \lambda x$ for $x > 0$, and $\lambda\alpha(e^{x}-1)$ for $x \leq 0$,
where $\lambda$ is fixed.
The paper authors have optimized the values as $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$.
**a comment on reddit suggests tanh may work as well
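A minimal sketch of the SELU, using the α and λ constants from the paper; applied to standard-normal inputs, the outputs stay close to zero mean and unit variance, which is the self-normalizing fixed point:

```python
import numpy as np

# Constants from the Self-Normalizing Neural Networks paper
ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805

def selu(x):
    """SELU: a scaled ELU, lambda * elu(x, alpha)."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# Standard-normal pre-activations map to (approximately) mean 0, variance 1
x = np.random.default_rng(3).standard_normal(1_000_000)
y = selu(x)
print(y.mean(), y.var())   # both close to 0.0 and 1.0 respectively
```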
The SELUs have the explicit properties of:
Amazingly, the implicit self-normalizing properties are actually proved–in only about 100 pages–using the Banach Fixed Point Theorem.
They show that, for an FNN using selu(x) activations, there exists a unique, attracting, and stable fixed point for the mean and variance. (Curiously, this resembles the argument that Deep Learning–RBMs at least–implements the Variational Renormalization Group (VRG) Transform.)
There are, of course, conditions on the weights–things can’t get too crazy. This is hopefully satisfied by selecting initial weights with zero mean and unit variance.
(depending on how we define terms).
To apply SELUs, we need a special initialization procedure, and a modified version of Dropout, alpha-Dropout,
We select initial weights from a Gaussian distribution with mean 0 and variance 1/N, where N is the number of weights:
In Statistical Mechanics, the Temperature is proportional to the variance of the Energy, and therefore sets the Energy scale. Since E ~ W,
SELU weight initialization is similar in spirit to fixing T=1.
Note that to apply Dropout with an SELU, we desire that the mean and variance are invariant.
We must set randomly dropped inputs to the saturated negative value of the SELU, $\alpha' = -\lambda\alpha$, and then apply an affine transformation, with parameters computed relative to the dropout rate.
(thanks to ergol.com for the images and discussion).
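A sketch of alpha-Dropout, following the affine-correction formulas in the SNN paper (variable names are mine): dropped units are set to the saturation value α′, then a scale a and shift b restore zero mean and unit variance in expectation.

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805
ALPHA_P = -LAMBDA * ALPHA              # alpha': the SELU's saturated negative value

def alpha_dropout(x, rate=0.1, rng=np.random.default_rng(4)):
    """Alpha-dropout: drop to alpha', then affine-correct mean and variance."""
    q = 1.0 - rate                                   # keep probability
    keep = rng.random(x.shape) < q
    y = np.where(keep, x, ALPHA_P)                   # dropped units saturate
    a = (q + ALPHA_P**2 * q * (1 - q)) ** -0.5       # restores unit variance
    b = -a * (1 - q) * ALPHA_P                       # restores zero mean
    return a * y + b

# Self-normalized activations stay (approximately) mean 0, variance 1
x = np.random.default_rng(5).standard_normal(1_000_000)
y = alpha_dropout(x)
print(abs(y.mean()) < 0.01, abs(y.var() - 1.0) < 0.05)   # True True
```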
All of this is provided, in code, with implementations already on github for Tensorflow, PyTorch, Caffe, etc. Soon…Keras?
The key results are presented in Figure 1 of the paper, where SNN = Self Normalizing Networks, and the data sets studied are MNIST and CIFAR.
The original code is available on github
Great discussions on HackerNews and Reddit
We have reviewed several variants of normalization in deep nets, including
Along the way, I have tried to convince you that recent developments in the normalization of Deep Nets represent a culmination of over 30 years of research into Neural Network theory, and that early ideas about finite Temperature methods from Statistical Mechanics have evolved into, and are deeply related to, the Normalization methods employed today to create very Deep Neural Networks.
Very early research in Neural Networks lifted ideas from statistical mechanics. Early work by Hinton formulated AutoEncoders and the principle of Minimum Description Length (MDL) as minimizing a Helmholtz Free Energy, $F = \langle E \rangle - TS$,
where the expected (T=0) Energy is $\langle E \rangle = \sum_{i}p_{i}E_{i}$,
S is the Entropy, $S = -\sum_{i}p_{i}\ln p_{i}$,
and the Temperature T=1 implicitly.
Minimizing F yields the familiar Boltzmann probability distribution
$p_{i} = \dfrac{e^{-\beta E_{i}}}{\sum_{j}e^{-\beta E_{j}}}$.
When we define an RBM, we parameterize the Energy levels in terms of the configuration of visible and hidden units, $E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{T}\mathbf{v} - \mathbf{b}^{T}\mathbf{h} - \mathbf{v}^{T}\mathbf{W}\mathbf{h}$,
giving the probability $p(\mathbf{v},\mathbf{h}) = \dfrac{e^{-E(\mathbf{v},\mathbf{h})}}{Z}$,
where Z is the Partition Function, and, again, T=1 implicitly.
In Stat Mech, we call RBMs a Mean Field model because we can decompose the total Energy and/or conditional probabilities using sigmoid activations for each node
In my 2016 MMDS talk, I proposed that without some explicit Temperature control, RBMs could collapse into a glassy state.
And now, some proof I am not completely crazy:
In another recent 2017 study, on the Emergence of Compositional Representations in Restricted Boltzmann Machines, we see that the RBM effective Temperature does indeed drop well below 1 during training,
and that RBMs can exhibit glassy behavior.
I also proposed that RBMs could undergo Entropy collapse at very low Temperatures. This has also now been verified in a recent 2016 paper.
Finally, this 2017 paper, “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”, proposes that many networks exhibit something like glassy behavior, described as “ultra-slow” diffusion behavior.
I will sketch out the proof in some detail if there is demand. Intuitively (& citing comments on HackerNews):
Indeed, we train a neural network by running BackProp, thereby minimizing the model error–which is like minimizing an Energy.
But what is this Energy? Deep Learning (DL) Energy functions look nothing like a typical chemistry or physics Energy. Here, we have Free Energy landscapes, which frequently form funneled landscapes–a trade-off between energetic and entropic effects.
And yet, some researchers, like LeCun, have even compared Neural Network Energy functions to spin glass Hamiltonians. To me, this seems off.
The confusion arises from assuming Deep Learning is a non-convex optimization problem that looks similar to the zero-Temperature Energy Landscapes from spin glass theory.
I present a different view. I believe Deep Learning is really optimizing an effective Free Energy function. And this has profound implications on Why Deep Learning Works.
This post will attempt to relate recent ideas in RBM inference to Backprop, and argue that Backprop is minimizing a dynamic, temperature dependent, ruggedly convex, effective Free Energy landscape.
This is a fairly long post, but at least is basic review. I try to present these ideas in a semi-pedagogic way, to the extent I can in a blog post, discussing both RBMs, MLPs, Free Energies, and all that entails.
The Backprop algorithm lets us train a model directly on our data (X) by minimizing the predicted training error, where the parameter set includes the weights W, biases b, and activations of the network.
Let’s write the total error as a sum over training instances, $\mathcal{E} = \sum_{d}\epsilon_{d}$,
where the error $\epsilon_{d}$ could be a mean squared error (MSE), cross entropy, etc. For example, in simple regression, we can minimize the MSE, $\epsilon_{d} = (y_{d} - \hat{y}_{d})^{2}$,
whereas for multi-class classification, we might minimize a categorical cross entropy, $\epsilon_{d} = -\sum_{c}y_{d,c}\ln\hat{y}_{d,c}$,
where $y$ are the labels and $\hat{y}$ is the network output for each training instance $d$.
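Both losses are a few lines of numpy (my own illustrative sketch, with a one-hot label convention for the cross entropy):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error for regression."""
    return np.mean((np.asarray(y, float) - np.asarray(y_hat, float)) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """Categorical cross entropy: y is one-hot, y_hat are predicted probabilities."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return -np.sum(y * np.log(y_hat + eps), axis=-1).mean()

print(mse([1.0, 2.0], [1.5, 2.0]))                     # 0.125
print(cross_entropy([[0, 1, 0]], [[0.1, 0.8, 0.1]]))   # -log(0.8) ~ 0.223
```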
Notice that $\epsilon_{d}$ is the training error for instance $d$, not a test or holdout error. Notice that, unlike a Support Vector Machine (SVM) or Logistic Regression (LR), we don’t use Cross Validation (CV) during training. We simply minimize the training error–whatever that is.
Of course, we can adjust the network parameters, regularization, etc, to tune the architecture of the network. Although it appears that Understanding deep learning requires rethinking generalization.
At this point, many people say that BackProp leads to a complex, non-convex optimization problem; IMHO, this is naive.
It has been known for 20 years that Deep Learning does not suffer from local minima.
Anyone who thinks it does has never read a research paper or book on neural networks. So what we really would like to know is, Why does Deep Learning Scale ? Or, maybe, why does it work at all ?!
To implement Backprop, we take derivatives and apply the chain rule to the network outputs, layer-by-layer.
Let’s take a closer look at the layers and activations. Consider a simple 1 layer net:
The Hidden activations are thought to mimic the function of actual neurons, and are computed by applying an activation function, such as a sigmoid, to a linear Energy function, $E = \mathbf{W}\mathbf{x} + \mathbf{b}$.
Indeed, the sigmoid activation function was first proposed in 1968 by Jack Cowan at the University of Chicago, and it is still used today in models of neural dynamics.
Moreover, Cowan pioneered using Statistical Mechanics to study the Neocortex.
And we will need a little Stat Mech to explain what our Energy functions are..but just a little.
While it seems we are simply proposing an arbitrary activation function, we can, in fact, derive the appearance of sigmoid activations–at least when performing inference on a single layer (mean field) Restricted Boltzmann Machine (RBM).
Hugo Larochelle has derived the sigmoid activations nicely for an RBM.
Given the (total) RBM Energy function, $E(\mathbf{v},\mathbf{h}) = -\mathbf{a}^{T}\mathbf{v} - \mathbf{b}^{T}\mathbf{h} - \mathbf{v}^{T}\mathbf{W}\mathbf{h}$,
the negative Energy acts as an un-normalized log probability, such that $p(\mathbf{v},\mathbf{h}) = \dfrac{e^{-\beta E(\mathbf{v},\mathbf{h})}}{Z}$,
where the normalization factor Z is an object from statistical mechanics called the (total) partition function, $Z = \sum_{\mathbf{v},\mathbf{h}}e^{-\beta E(\mathbf{v},\mathbf{h})}$,
and $\beta$ is an inverse Temperature. In modern machine learning, we implicitly set $\beta = 1$.
Following Larochelle, we can factor Z by explicitly writing it in terms of sums over the binary hidden activations $h_{j} \in \{0,1\}$. This lets us write the conditional probabilities, for each individual neuron, as
$p(h_{j}=1|\mathbf{v}) = \sigma\left(b_{j} + \sum_{i}W_{ij}v_{i}\right)$ and $p(v_{i}=1|\mathbf{h}) = \sigma\left(a_{i} + \sum_{j}W_{ij}h_{j}\right)$, where $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid.
We note that this formulation was not obvious, and early work on RBMs used methods from statistical field theory to get this result.
We use $p(\mathbf{h}|\mathbf{v})$ and $p(\mathbf{v}|\mathbf{h})$ in Contrastive Divergence (CD) or other solvers as part of the Gibbs Sampling step for (unsupervised) RBM inference.
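A minimal sketch of one block Gibbs sampling step for a binary RBM, using the factorized conditionals (shapes and names are my own illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One block Gibbs step: sample h ~ p(h|v), then v' ~ p(v|h)."""
    p_h = sigmoid(b + v @ W)                         # p(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)  # sample binary hiddens
    p_v = sigmoid(a + h @ W.T)                       # p(v_i = 1 | h)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(6, 4))   # 6 visible x 4 hidden units
a, b = np.zeros(6), np.zeros(4)
v = (rng.random(6) < 0.5).astype(float)  # random binary visible state

v, h = gibbs_step(v, W, a, b, rng)
print(v.shape, h.shape)                  # (6,) (4,)
```

CD-k simply runs k such steps from the data before taking the weight gradient.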
CD has been a puzzling algorithm to understand. When first proposed, it was unclear what optimization problem CD was solving. Indeed, Hinton is said to have said:
Specifically, we run several epochs of:
We will see below that we can cast RBM inference as directly minimizing a Free Energy–something that will prove very useful in relating RBMs to MLPs.
The sigmoid and tanh are old-fashioned activations; today we may prefer to use ReLUs (and Leaky ReLUs).
The sigmoid itself was, at first, just an approximation to the Heaviside step function used in neuron models. But the presence of sigmoid activations in the total Energy suggests, at least to me, that Deep Learning Energy functions are more than just random (Morse) functions.
RBMs are a special case of unsupervised nets that still use stochastic sampling. In supervised nets, like MLPs and CNNs (and in unsupervised Autoencoders like VAEs), we use Backprop. But the activations are not conditional probabilities. Let’s look in detail:
Consider a MultiLayer Perceptron, with 1 Hidden layer, and 1 output node,
where, for each data point $\mathbf{x}$, the layer output is
$\mathbf{h}^{1} = \sigma(\mathbf{W}^{1}\mathbf{x} + \mathbf{b}^{1})$,
and the total MLP output is $y = \sigma(\mathbf{w}^{T}\mathbf{h}^{1} + b)$.
If we add a second layer, we have the iterated layer output
$\mathbf{h}^{2} = \sigma(\mathbf{W}^{2}\mathbf{h}^{1} + \mathbf{b}^{2})$,
and the final MLP output function has a similar form: $y = \sigma(\mathbf{w}^{T}\mathbf{h}^{2} + b)$.
So with a little bit of stat mech, we can derive the sigmoid activation function from a general energy function. And we see these activations in RBMs as well as MLPs.
So when we apply Backprop, what problem are we actually solving ?
Are we simply finding a minimum on a random, high dimensional manifold? Or can we say something more, given the special structure of these layers of activated energies?
To train an MLP, we run several epochs of Backprop. Backprop has 2 passes: forward and backward:
Each epoch usually runs small batches of inputs at a time. (And we may need to normalize the inputs and control the variances. These details may be important for our analysis, and we will consider them in a later post.)
After each pass, we update the weights, using something like an SGD step (or Adam, RMSProp, etc)
For an MSE loss, we evaluate the partial derivatives with respect to the Energy parameters (the weights and biases).
Backprop works by the chain rule, and given the special form of the activations, lets us transform the Energy derivatives into a sum of Energy gradients–layer by layer
I won’t go into the details here; there are 1000 blogs on BackProp today (which is amazing!). I will say…
Backprop couples the activation states of the neurons to the Energy parameter gradients through the cycle of forward-backward phases.
In a crude sense, Backprop resembles our more familiar RBM training procedure, where we equilibrate to set the activations, and run gradient descent to set the weights. Here, I show a direct connection, and derive the MLP functional form directly from an RBM.
RBMs are unsupervised; MLPs are supervised. How can we connect them? Crudely, we can think of an MLP as a single layer RBM with a softmax tacked on the end. More rigorously, we can look at Generalized Discriminative RBMs, which solve the conditional probability directly, in terms of the Free Energies, cast in the soft-max form
So the question is, can we extract Free Energy for an MLP ?
I now consider the Backward phase, using the deterministic EMF RBM, as a starting point for understanding MLPs.
An earlier post discusses the EMF RBM, from the context of chemical physics. For a traditional machine learning perspective, see this thesis.
In some sense, this is kind-of obvious. And yet, I have not seen a clear presentation of the ideas in this way. I do rely upon new research, like the EMF RBM, although I also draw upon fundamental ideas from complex systems theory–something popular in my PhD studies, but which is perhaps ancient history now.
The goal is to relate RBMs, MLPs, and basic Stat Mech under single conceptual umbrella.
In the EMF approach, we see RBM inference as a sequence of deterministic annealing steps, from 1 quasi-equilibrium state to another, consisting of 2 steps for each epoch:
At the end of each epoch, we update the weights, with weight (Temperature) constraints (i.e. reset the L1 or L2 norm). BTW, it may not be obvious that weight regularization is like a Temperature control; I will address this in a later post.
(1) The so-called Forward step solves a fixed point equation (which is similar in spirit to taking n steps of Gibbs sampling). This leads to a pair of coupled recursion relations for the TAP magnetizations (or just the nodes). Suppose we take t+1 iterations. Let us ignore the second order Onsager correction, and consider the mean field updates: $\mathbf{m}_{h}^{t+1} = \sigma(\mathbf{W}\mathbf{m}_{v}^{t} + \mathbf{b})$, $\mathbf{m}_{v}^{t+1} = \sigma(\mathbf{W}^{T}\mathbf{m}_{h}^{t+1} + \mathbf{a})$.
Because these are deterministic steps, we can express the $\mathbf{m}_{h}^{t+1}$ in terms of $\mathbf{m}_{v}^{t}$ by substituting one relation into the other.
At the end of the recursion, we will have a forward pass that resembles a multi-layer MLP, but one that shares weights and biases between layers:
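A sketch of this naive mean-field recursion (Onsager term dropped, notation mine), showing how iterating the coupled updates unrolls into a tied-weight forward pass:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_forward(v0, W, a, b, n_iters=3):
    """Naive mean-field fixed-point iteration for an RBM. Each iteration
    reuses the same W, a, b -- an n-layer net with tied weights."""
    m_v = v0
    for _ in range(n_iters):
        m_h = sigmoid(b + m_v @ W)      # hidden magnetizations
        m_v = sigmoid(a + m_h @ W.T)    # visible magnetizations
    return m_v, m_h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))  # 5 visible x 3 hidden
a, b = np.zeros(5), np.zeros(3)

m_v, m_h = mean_field_forward(rng.random(5), W, a, b)
print(m_v.shape, m_h.shape)             # (5,) (3,)
```

The full EMF RBM adds the second order Onsager correction to each update; this sketch keeps only the first order (mean field) terms.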
We can now associate an n-layer MLP, with tied weights,
to an approximate (mean field) EMF RBM, with n fixed point iterations (ignoring the Onsager correction for now). Of course, an MLP is supervised, and an RBM is unsupervised, so we need to associate the RBM hidden nodes with the MLP output function at the last layer (n), prior to adding the MLP output node.
This leads naturally to the following conjecture:
The EMF RBM and the BackProp Forward and Backward steps effectively do the same thing–minimize the Free Energy
This is a work in progress
Formally, it is simple and compelling. Is it the whole story…probably not. It is merely an observation–food for thought.
So far, I have only removed the visible magnetizations to obtain the MLP layer function as a function of the original visible units. The unsupervised EMF RBM Free Energy, however, contains expressions in terms of both the hidden and visible magnetizations. To get a final expression, it is necessary to either
The result itself should not be so surprising, since it has already been pointed out by Kingma and Welling, Auto-Encoding Variational Bayes, that a Bernoulli MLP is like a variational decoder. And, of course, VAEs can be formulated with BackProp.
More importantly, it is unclear how good the EMF RBM really is. Some follow-up studies indicate that the second order expansion is not as good as, say, AIS, for estimating the partition function. I have coded a python emf_rbm.py module using the scikit-learn interface, and testing is underway. I will blog this soon.
Note that the EMF RBM relies on the Legendre Transform, which is like a convex relaxation. Early results indicate that this does degrade the RBM solution compared to traditional CD. Maybe BackProp is effectively relaxing the convexity constraint by, say, relaxing the condition that the weights are tied between layers.
Still, I hope this can provide some insight. And there are …
Free Energy is a first class concept in Statistical Mechanics. In machine learning, not always so much. It appears in much of Hinton’s work, and as a starting point for deriving methods like Variational Auto Encoders and Probabilistic Programming.
But Free Energy minimization plays an important role in non-convex optimization as well. Free energies are a Boltzmann average of the zero-Temperature Energy landscape, and, therefore, convert a non-convex surface into something at least less non-convex.
Indeed, in one of the very first papers on mean field Boltzmann Machines (1987), it is noted that
“An important property of the effective [free] energy function E'(V,0,T) is that it has a smoother landscape than E(S) due to the extra terms. Hence, the probability of getting stuck in a local minima decreases.”
Moreover, in protein folding, we have even stronger effects, which can lead to a ruggedly convex, energy landscape. This arises when the system runs out of configurational entropy (S), and energetic effects (E) dominate.
Most importantly, we want to understand, when does Deep Learning generalize well, and when does it overtrain ?
LeCun has very recently pointed out that Deep Nets fail when they run out of configuration entropy–an argument I also have made from theoretical analysis using the Random Energy Model. So it is becoming more important to understand what the actual energy landscape of a deep net is, how to separate out the entropic and energetic terms, and how to characterize the configurational entropy.
Hopefully the small insight will be useful and lead to a further understanding of Why Deep Learning Works.
The 1987 paper, A Mean Field Theory Learning Algorithm for Neural Networks, appeared
just a couple of years after Hinton’s seminal 1985 paper, “A Learning Algorithm for Boltzmann Machines“.
What I really like is how the foundations of deep learning arose from statistical physics and theoretical chemistry. My top 10 favorite take-aways are:
Happy New Year everyone!
They are basically a solved problem, and while of academic interest, not really used in complex modeling problems. They were, up to 10 years ago, used for pretraining deep supervised nets. Today, we can train very deep, supervised nets directly.
RBMs are the foundation of unsupervised deep learning–
an unsolved problem.
RBMs appear to outperform Variational Auto Encoders (VAEs) on simple data sets like the Omniglot set–a data set developed for one shot learning, and used in deep learning research.
RBM research continues in areas like semi-supervised learning with deep hybrid architectures, Temperature dependence, infinitely deep RBMs, etc.
Many of basic concepts of Deep Learning are found in RBMs.
Sometimes clients ask, “how is Physical Chemistry related to Deep Learning ?”
In this post, I am going to discuss a recent advance in RBM theory based on ideas from theoretical condensed matter physics and physical chemistry,
the Extended Mean Field Restricted Boltzmann Machine: EMF_RBM
(see: Training Restricted Boltzmann Machines via the Thouless-Anderson-Palmer Free Energy )
[Along the way, we will encounter several Nobel Laureates, including the physicists David J Thouless (2016) and Philip W. Anderson (1977), and the physical chemist Lars Onsager (1968).]
RBMs are pretty simple, and easily implemented from scratch. The original EMF_RBM is in Julia; I have ported EMF_RBM to python, in the style of the scikit-learn BernoulliRBM package.
https://github.com/charlesmartin14/emf-rbm/blob/master/EMF_RBM_Test.ipynb
We examined RBMs in the last post on Cheap Learning: Partition Functions and RBMs. I will build upon that here, within the context of statistical mechanics.
RBMs are defined by the Energy function

E(v,h) = -a^T v - b^T h - v^T W h

To train an RBM, we minimize the negative log likelihood, the difference of the clamped and (actual) Free Energies, where

F_clamped(v) = -ln sum_h exp(-E(v,h))

and

F = -ln sum_{v,h} exp(-E(v,h))

The sums range over a space of 2^(N_v + N_h) binary configurations,
which is intractable in most cases.
Training an RBM requires computing the log Free Energy; this is hard.
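To see the intractability concretely, here is a minimal numpy sketch (my own illustration, not the EMF_RBM code) that brute-forces Z for a tiny Bernoulli RBM, using the standard energy E(v,h) = -a·v - b·h - v·W·h. The double loop already has 2^(4+3) = 128 terms, and doubles with every added unit:

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, a, b):
    # standard Bernoulli RBM energy: E(v,h) = -a.v - b.h - v.W.h
    return -(a @ v + b @ h + v @ W @ h)

def partition_function(W, a, b):
    # brute-force sum over all 2^(Nv+Nh) binary configurations
    n_v, n_h = W.shape
    Z = 0.0
    for v in itertools.product([0, 1], repeat=n_v):
        for h in itertools.product([0, 1], repeat=n_h):
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), W, a, b))
    return Z

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 3))
Z = partition_function(W, np.zeros(4), np.zeros(3))
```

For a realistic 784-visible MNIST RBM this sum has 2^(784 + N_h) terms, which is why we need sampling or mean field approximations.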
When training an RBM, we don’t include label information, although a trained RBM can provide features for a down-stream classifier.
The Extended Mean Field (EMF) RBM is a straightforward application of known statistical mechanics theories.
There are, literally, thousands of papers on spin glasses.
The EMF RBM is a great example of how to operationalize spin glass theory.
The Restricted Boltzmann Machine has a very simple Energy function, which makes it very easy to factorize the partition function Z (as explained by Hugo Larochelle), to obtain the conditional probabilities

p(h_j = 1 | v) = sigmoid(b_j + sum_i W_ij v_i) ,   p(v_i = 1 | h) = sigmoid(a_i + sum_j W_ij h_j)

The conditional probabilities let us apply Gibbs Sampling, which simply alternates sampling h ~ p(h|v) and v ~ p(v|h).
In statistical mechanics, this is called a mean field theory. This means that the Free Energy (in the visible units) can be written as a simple linear average over the hidden units, where the mean field of the hidden units is h̄ = p(h|v).
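As a concrete sketch (my own notation, with v, h as 0/1 vectors and W, a, b as above), the factorized conditionals and one Gibbs step look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    # factorized conditional: p(h_j = 1 | v) = sigmoid(b_j + v . W[:, j])
    p_h = sigmoid(b + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # ... and p(v_i = 1 | h) = sigmoid(a_i + W[i, :] . h)
    p_v = sigmoid(a + h @ W.T)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h, p_h, p_v

rng = np.random.default_rng(0)
W, a, b = np.zeros((4, 3)), np.zeros(4), np.zeros(3)
v_new, h, p_h, p_v = gibbs_step(np.ones(4), W, a, b, rng)
```

Note that p_h here is exactly the mean field h̄ of the hidden units.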
At high Temp., for a spin glass, a mean field model seems very sensible because the spins (i.e. activations) become uncorrelated.
Theoreticians use mean field models like the p-spin spherical spin glass to study deep learning because of their simplicity. Computationally, we frequently need more.
How can we go beyond mean field theory ?
Onsager was awarded the 1968 Nobel Prize in Chemistry for the development of the Onsager Reciprocal Relations, sometimes called the ‘4th law of Thermodynamics’.
The Onsager relations provide the theory to treat thermodynamic systems that are in a quasi-stationary, local equilibrium.
Onsager was the first to show how to relate the correlations in the fluctuations to the linear response. And by tying a sequence of quasi-stationary systems together, we can describe an irreversible process…
..like learning. And this is exactly what we need to train an RBM.
In an RBM, the fluctuations are variations in the hidden and visible nodes.
In a BernoulliRBM, the activations can be 0 or 1, so the fluctuations of a unit with mean (magnetization) m have the Bernoulli variance m − m².
The simplest correction to the mean field Free Energy, at each step in training, is the correlation of these fluctuations:
where W is the Energy weight matrix.
Unlike normal RBMs, here we work in an Interaction Ensemble, so the hidden and visible units become hidden and visible magnetizations:
To simplify (or confuse?) the presentation here, I don’t write magnetizations (until the Appendix).
The corrections make sense under the stationarity constraints, that the Extended Mean Field RBM Free Energy F(m_v, m_h) is at a critical point

∂F/∂m_v = 0 ,  ∂F/∂m_h = 0

That is, small changes in the activations do not change the Free Energy.
We will show that we can write the Free Energy as a Taylor series in β, the inverse Temperature. Schematically,

F(β) ≈ −S + β F_MF + β² F_Onsager

where S is the Entropy, F_MF is the standard, mean field RBM Free Energy, and F_Onsager is the Onsager correction.
Given the expressions for the Free Energy, we must now evaluate it.
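As a sketch of the evaluation, here is how the three terms might be computed in numpy for Bernoulli units at β = 1. The signs and the ½ prefactor on the Onsager term follow my reading of the TAP expansion in the Gabrié–Tramel–Krzakala paper and should be checked against it:

```python
import numpy as np

def binary_entropy(m, eps=1e-12):
    # entropy of independent Bernoulli units with means m
    m = np.clip(m, eps, 1 - eps)
    return -(m * np.log(m) + (1 - m) * np.log(1 - m)).sum()

def tap_free_energy(mv, mh, W, a, b):
    # entropy of the factorized (mean field) distribution
    S = binary_entropy(mv) + binary_entropy(mh)
    # mean field (naive) energy term
    E_mf = -(a @ mv + b @ mh + mv @ W @ mh)
    # Onsager correction: couples the Bernoulli variances (m - m^2) of each layer
    onsager = -0.5 * ((mv - mv**2) @ (W**2) @ (mh - mh**2))
    return E_mf + onsager - S

# with zero weights and uniform magnetizations, F reduces to -S = -(4 + 3) ln 2
mv, mh = np.full(4, 0.5), np.full(3, 0.5)
F0 = tap_free_energy(mv, mh, np.zeros((4, 3)), np.zeros(4), np.zeros(3))
```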
The Taylor series above is a result of the TAP theory — the Thouless-Anderson-Palmer approach developed for spin glasses.
The TAP theory is outlined in the Appendix; here it is noted that
Thouless just shared the 2016 Nobel Prize in Physics (for his work in topological phase transitions)
Being a series in the inverse Temperature β, the theory applies at low β, or high Temperature. For fixed β, this also corresponds to small weights W.
Specifically, the expansion applies at Temperatures above the glass transition–a concept which I describe in a recent video blog.
Here, to implement the EMF_RBM, we set

β = 1

and, instead, apply weight decay to keep the weights W from exploding, where the weight decay penalty may be an L1 or L2 norm.
Weight Decay acts to keep the Temperature high.
Early RBM computational models were formulated using the language of statistical mechanics (see the Appendix), and so included a Temperature parameter, and were solved using techniques like simulated annealing and the (mean field) TAP equations (described below).
Adding Temperature allowed the system to ‘jump’ out of the spurious local minima. So any usable model required a non-zero Temp, and/or some scheme to avoid local minima that generalized poorly. (See: Learning Deep Architectures for AI, by Bengio)
These older approaches did not work well (then), so Hinton proposed the Contrastive Divergence (CD) algorithm. Note that researchers struggled for some years to ‘explain’ what optimization problem CD actually solves.
Moreover, recent work on Temperature Based RBMs also suggests that higher T solutions perform better, and that
“temperature is an essential parameter controlling the selectivity of the firing neurons in the hidden layer.”
Standard RBM training approximates the (unconstrained) Free Energy, F = −ln Z, in the mean field approximation, using (one or more steps of) Gibbs Sampling. This is usually implemented as Contrastive Divergence (CD), or Persistent Contrastive Divergence (PCD).
Using techniques of statistical mechanics, however, it is possible to train an RBM directly, without sampling, by solving a set of deterministic fixed point equations.
Indeed, this approach clarifies how to view an RBM as solving a (deterministic) fixed point equation.
Consider each step, at fixed weights W, as a Quasi-Stationary system, which is close to equilibrium, but where we don’t need to evaluate ln Z(v,h) exactly.
We can use the stationary conditions to derive a pair of coupled, non-linear equations
They extend the standard sigmoid (linear) activations with additional, non-linear, inter-layer interactions.
They differ significantly from (simple) Deep Learning activation functions because the activation for each layer explicitly includes information from other layers.
This extension couples the means (magnetizations) and fluctuations (variances) between layers. Higher order correlations could also be included, even to infinite order, using techniques from field theory.
We cannot solve both equations simultaneously in closed form, but we can satisfy each condition individually, letting us write a set of recursion relations
These fixed point equations converge to the stationary solution, leading to a local equilibrium. Like Gibbs Sampling, however, we only need a few iterations (say t=3 to 5). Unlike Sampling, however, the EMF RBM is deterministic.
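A minimal numpy sketch of these recursions, at second order only (the update form follows my reading of the EMF/TAP paper; the damping and update ordering in the real EMF_RBM may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def emf_fixed_point(v0, W, a, b, n_iter=5):
    # iterate the second-order (TAP) magnetization equations for a few steps
    mv = v0.copy()
    mh = sigmoid(b + v0 @ W)
    for _ in range(n_iter):
        # Onsager-corrected update for the hidden magnetizations
        mh = sigmoid(b + mv @ W - (mh - 0.5) * ((mv - mv**2) @ W**2))
        # ... and for the visible magnetizations
        mv = sigmoid(a + mh @ W.T - (mv - 0.5) * ((mh - mh**2) @ (W**2).T))
    return mv, mh

rng = np.random.default_rng(0)
mv, mh = emf_fixed_point(np.array([1.0, 0.0, 1.0, 0.0]),
                         0.1 * rng.standard_normal((4, 3)),
                         np.zeros(4), np.zeros(3))
```

At zeroth order in the Onsager term this reduces to the familiar sigmoid mean field update.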
https://github.com/charlesmartin14/emf-rbm/
If there is enough interest, I can do a pull request on sklearn to include it.
The next blog post will demonstrate the python code in action.
Most older physicists will remember the Hopfield model. Interest peaked in 1986, although interesting work continued into the late 90s (when I was a post doc).
Originally, Boltzmann machines were introduced as a way to avoid spurious local minima while including ‘hidden’ features into Hopfield Associative Memories (HAM).
Hopfield himself was a theoretical chemist, and his simple model HAMs were of great interest to theoretical chemists and physicists.
Hinton explains Hopfield nets in his on-line lectures on Deep Learning.
The Hopfield Model is a kind of spin glass, which acts like a ‘memory’ that can recognize ‘stored patterns’. It was originally developed as a quasi-stationary solution of more complex, dynamical models of neuronal firing patterns (see the Wilson-Cowan model).
Early theoretical work on HAMs studied analytic approximations to ln Z to compute their capacity (α = p/N, for p patterns on N spins), and their phase diagram. The capacity is simply the number of patterns a network of size N can memorize without getting confused.
The Hopfield model was traditionally run at T=0.
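For intuition, a minimal Hopfield sketch (my own toy code, not from any paper): Hebbian storage plus T=0 sign-flip dynamics, recovering a corrupted pattern at a load well below the breakdown capacity:

```python
import numpy as np

def hopfield_train(patterns):
    # Hebbian learning: W = (1/N) sum_mu x_mu x_mu^T, with zero diagonal
    N = patterns.shape[1]
    W = (patterns.T @ patterns) / N
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, x, n_steps=10):
    # synchronous T=0 dynamics: align each spin with its local field
    for _ in range(n_steps):
        x = np.sign(W @ x)
        x[x == 0] = 1.0
    return x

# store 2 random +/-1 patterns in a 64-spin net (load alpha = 2/64, far below ~0.14)
rng = np.random.default_rng(1)
patterns = rng.choice([-1.0, 1.0], size=(2, 64))
W = hopfield_train(patterns)

# corrupt 5 spins of the first pattern, then recall
noisy = patterns[0].copy()
noisy[:5] *= -1
recalled = hopfield_recall(W, noisy)
```

Push the number of stored patterns past roughly 0.14 N and recall degrades abruptly, as the phase diagram predicts.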
Looking at the T=0 line, at extremely low capacity, the system has stable mixed states that correspond to ‘frozen’ memories. But this is very low capacity, and generally unusable. Also, when the capacity is too large, α > 0.138 (which is really not that large), the system abruptly breaks down completely.
There is a small window of capacity, α < 0.138, with stable pattern equilibria, dominated by frozen out, spin glass states. The problem is, for any realistic system, with correlated data, the system is dominated by spurious local minima which look like low energy spin glass states.
So the Hopfield model suggested that
glass states can be useful minima, but
we want to avoid low energy (spurious) glassy states.
One can try to derive a direct mapping between Hopfield Nets and RBMs (under reasonable assumptions). Then the RBM capacity is proportional to the number of Hidden nodes. After that, the analogies stop.
The intuition about RBMs is different since (effectively) they operate at a non-zero Temperature. Additionally, it is unclear to this blogger if the proper description of deep learning is a mean field spin glass, with many useful local minima, or a strongly correlated system, which may behave very differently, and more like a funneled energy landscape.
Thouless-Anderson-Palmer Theory of Spin Glasses
The TAP theory is one of the classic analytic tools used to study spin glasses and even the Hopfield Model.
We will derive the EMF RBM method following the Thouless-Anderson-Palmer (TAP) approach to spin glasses.
On a side note, the TAP method introduces us to 2 more Nobel Laureates:
The TAP theory, published in 1977, presented a formal approach to study the thermodynamics of mean field spin glasses. In particular, the TAP theory provides an expression for the average spin ⟨S_i⟩, where tanh is like an activation function, C is a constant, and the MeanField and Onsager terms are like our terms above.
In 1977, they argued that the TAP approach would hold for all Temperatures (and external fields), although it was only proven some 25 years later by Talagrand. It is these relatively new, rigorous approaches that are cited by Deep Learning researchers like LeCun, Chaudhari, etc. But many of the cited results were first suggested using the TAP approach. In particular, the structure of the Energy Landscape has been understood by looking at the stationary points of the TAP free energy.
More importantly, the TAP approach can be operationalized, as a new RBM solver.
We start with the RBM Free Energy .
Introduce an ‘external field’ which couples to the spins, adding a linear term to the Free Energy.
Physically, this field would be an external magnetic field which drives the system out-of-equilibrium.
As is standard in statistical mechanics, we take the Legendre Transform, in terms of a set of conjugate variables. These are the magnetizations of each spin under the applied field, and describe how the spins behave outside of equilibrium.
The transform effectively defines a new interaction ensemble. We now set the external field to zero, and note
Define an interaction Free Energy to describe the interaction ensemble,
which equals the original Free Energy when the external field vanishes.
Note that because we have visible and hidden spins (or nodes), we will identify magnetizations m_v and m_h for each.
Now, recall we want to avoid the glassy phase; this means we keep the Temperature high, or β low.
We form a low order Taylor series expansion in β in the new ensemble,
which, at low order in β, we expect to be reasonably accurate even away from equilibrium, at least at high Temp.
This leads to an order-by-order expansion for the Free Energy. The first order (β) correction is the mean field term. The second order (β²) term is the Onsager correction.
Up to second order in β, we have
or
Rather than assume equilibrium, we assume that, at each step during inference, at fixed weights W, the system satisfies a quasi-stationary condition. Each step reaches a local saddle point in phase space, s.t.
Applying the stationary conditions lets us write coupled equations for the individual magnetizations that effectively define the (second order), high Temp, quasi-equilibrium states
Notice that these resemble the RBM conditional probabilities; in fact, at first order in β, they are the same.
First order in β gives a mean field theory, and,
in the late 90’s, mean field TAP theory was attempted, unsuccessfully, as an RBM solver.
At second order, the magnetizations are coupled through the Onsager corrections. To solve them, we can write down the fixed point equations, shown above.
We can include higher order corrections to the Free Energy by including more terms in the Taylor Series. This is called a Plefka expansion. The terms can be represented using diagrams
Plefka derived these terms in 1982, although it appears he only published up to the Onsager correction; a recent paper shows how to obtain all high order terms.
The Diagrammatic expansion appears not to have been fully worked out, and is only sketched above.
I can think of at least 3 ways to include these higher terms:
This is similar, in some sense, to the infinite RBM by Larochelle, which uses a resummation trick to include an infinite number of Hidden nodes.
Obviously there are lots of interesting things to try.
The current python EMF_RBM only treats binary data, just like the scikit-learn BernoulliRBM. So for, say, MNIST, we have to use the binarized MNIST.
There is some advantage to using Binarized Neural Networks on GPUs.
Still, a non-binary RBM may be useful. Tramel et al. have suggested how to use real-valued data in the EMF_RBM, although in the context of Compressed Sensing Using Generalized Boltzmann Machines.
This is the question posed by a recent article. Deep Learning seems to require knowing the Partition Function–at least in old fashioned Restricted Boltzmann Machines (RBMs).
Here, I will discuss some aspects of this paper, in the context of RBMs.
We can use RBMs for unsupervised learning, as a clustering algorithm, for pretraining larger nets, and for generating sample data. Mostly, however, RBMs are an older, toy model useful for understanding unsupervised Deep Learning.
We define an RBM with an Energy Function

E(v,h) = -a^T v - b^T h - v^T W h

and its associated Partition Function

Z = sum_{v,h} exp(-E(v,h))

The joint probability is then

p(v,h) = exp(-E(v,h)) / Z

and the probability of the visible units is computed by marginalizing over the hidden units

p(v) = sum_h exp(-E(v,h)) / Z
Note we also mean the probability of observing the data X={v}, given the weights W.
The log Likelihood is just the log of the probability, ln p(v).
We can break this into 2 parts:

ln p(v) = -F_clamped(v) + F

F = -ln Z is just the standard Free Energy.
We call F_clamped(v) = -ln sum_h exp(-E(v,h)) the clamped Free Energy,
because it is like a Free Energy, but with the visible units clamped to the data X.
The clamped Free Energy is easy to evaluate in the RBM formalism, whereas the true Free Energy F is computationally intractable.
Knowing Z is ‘like’ knowing the equilibrium distribution function, and methods like RBMs appear to approximate it in some form or another.
Training an RBM proceeds iteratively by approximating the Free Energies at each step,
and then updating W with a gradient step
RBMs are usually trained via Contrastive Divergence (CD or PCD). The Energy function, being quadratic, lets us readily factor Z using a mean field approximation, leading to simple expressions for the conditional probabilities
and the weight update rule
RBM codes may use the terminology of positive and negative phases:
Positive phase ⟨vh⟩_data : The expectation is evaluated, or clamped, on the data.
Negative phase ⟨vh⟩_model : The expectation is to be evaluated on the prior (model) distribution. We also say it is evaluated in the limit of infinite sampling, at the so-called equilibrium distribution. But we don’t take the infinite limit.
CD approximates the negative phase ⟨vh⟩_model (effectively evaluating the mean field Free Energy) by running only 1 (or more) steps of Gibbs Sampling.
So we may see
or, more generally, and in some code bases, something effectively like
Initialize the positive and negative phases
Run N iterations of:
Run 1 Step of Gibbs Sampling to get the negative phase:
sample the hiddens given the (current) visibles: h ~ p(h|v)
sample the visibles given the hiddens (above): v’ ~ p(v|h)
Calculate the weight gradient: dW = ⟨vh⟩_data − ⟨v’h’⟩_model
Apply Weight decay or other regularization (optional)
Apply a momentum (optional)
Update the Weights: W ← W + ε dW
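Putting the outline together, a minimal numpy CD-1 update might look like this (my own sketch; the learning rate and decay values are placeholders, and the momentum term is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, a, b, lr=0.05, decay=1e-4, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # positive phase: hidden probabilities clamped on the data
    ph_data = sigmoid(b + v_data @ W)
    # negative phase: one Gibbs step v -> h -> v' -> h'
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    pv_model = sigmoid(a + h @ W.T)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(b + v_model @ W)
    # gradient: <v h>_data - <v' h'>_model, averaged over the batch
    n = v_data.shape[0]
    dW = (v_data.T @ ph_data - v_model.T @ ph_model) / n
    W = W + lr * (dW - decay * W)                  # weight decay (optional)
    a = a + lr * (v_data - v_model).mean(axis=0)
    b = b + lr * (ph_data - ph_model).mean(axis=0)
    return W, a, b

rng = np.random.default_rng(0)
v_batch = (rng.random((10, 6)) < 0.5).astype(float)
W, a, b = cd1_update(v_batch, 0.01 * rng.standard_normal((6, 4)),
                     np.zeros(6), np.zeros(4), rng=rng)
```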
What is Cheap about learning ? A technical proof in the Appendix notes that
knowing the Partition function is not the same as knowing the underlying distribution.
This is because the Energy can be rescaled, or renormalized, in many different ways, without changing Z.
This is also a key idea in Statistical Mechanics.
The Partition function is a generating function; we can write all the macroscopic, observable thermodynamic quantities as partial derivatives of ln Z. And we can do this without knowing the exact distribution functions or energies–just their renormalized forms.
Of course, our W update rule is a derivative of ln Z as well.
The proof is technically straightforward, albeit a bit odd at first.
Let’s start with the visible units v. Write

p(v) = exp(-E(v)) / Z

We now introduce the hidden units h into the model, so that we have a new, joint probability distribution p(v,h),
and a new, Renormalized, partition function Z_RG.
Where RG means Renormalization Group. We have already discussed that the general RBM approach resembles the Kadanoff Variational Renormalization Group (VRG) method, circa 1975. This new paper points out a small but important technical oversight made in the ML literature, namely that
having Z_RG = Z does not imply that the marginalized probabilities match.
That is, just because we can estimate the Partition function well does not mean we know the probability distributions.
Why ? Define an arbitrary non-constant function K(v,h) and write the new joint Energy in terms of it.
K is for Kadanoff RG Transform, and ln K is the normalization.
We can now write a joint Energy with the same Partition function as our RBM, but with completely different joint probability distributions.
Notice what we are actually doing. We use the K matrix to define the RBM joint Energy function. In RBM theory, we restrict E to a quadratic form, and use a variational procedure to learn the weights W, thereby learning K.
In a VRG approach, we have the additional constraint that we restrict the form of K to satisfy constraints on its partition function, or, really, how the Energy function is normalized. Hence the name ‘Renormalization.‘ This is similar, in spirit, but not necessarily in form, to how the RBM training regularizes the weights (above).
Write the total, or renormalized, Z as
Expanding the Energy function explicitly, we have
where the Kadanoff normalization factor now appears in the denominator.
We can break the double sum into sums over v and h.
Identify Z in the numerator, which factors out, giving a very simple expression in h.
In the technical proof in the paper, the idea is that since h is just a dummy variable, we can replace h with v. We have to be careful here since this seems to apply only to the case where we have the same number of hidden and visible units–a rare case. In an earlier post on VRG, I explain more clearly how to construct an RG transform for RBMs. Still, the paper is presenting a counterargument for argument’s sake, so, following the argument in the paper, let’s say
This is like saying we constrain the Free Energy at each layer to be the same.
This is also another kind of Layer Normalization–a very popular technique in modern Deep Learning these days.
So, by construction, the renormalized and data Partition functions are identical
The goal of Renormalization Group theory is to redefine the Energy function on a different scale, while retaining the macroscopic observables.
But, and apparently this has been misstated in some ML papers and books, the marginalized probabilities can be different !
To get the marginals, let’s integrate out only the h variables. Looking above, we can write this in terms of K and its normalization, which implies that the marginal of the new joint differs from the original p(v) whenever the h-sum of K depends on v.
RBMs let us represent data using a smaller set of hidden features. This is, effectively, a Variational Renormalization Group algorithm, in which we approximate the Partition function, at each step in the RBM learning procedure, without having to learn the underlying joint probability distribution. And this is easier. Cheaper.
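The counterargument can also be checked numerically. The toy sketch below (my own construction, with a random, non-constant K(v,h)) rescales the joint so that its partition function equals Z exactly, then shows the marginal over h no longer matches the original p(v):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
V = list(itertools.product([0, 1], repeat=2))   # visible configurations
H = list(itertools.product([0, 1], repeat=2))   # hidden configurations

# original model over the visibles only
E = {v: rng.standard_normal() for v in V}
Z = sum(np.exp(-E[v]) for v in V)
p = {v: np.exp(-E[v]) / Z for v in V}

# arbitrary, non-constant K(v,h); Kbar rescales the joint so that Z' = Z exactly
K = {(v, h): np.exp(rng.standard_normal()) for v in V for h in H}
Kbar = sum(np.exp(-E[v]) * K[(v, h)] for v in V for h in H) / Z
Zp = sum(np.exp(-E[v]) * K[(v, h)] / Kbar for v in V for h in H)  # equals Z

# marginalize the new joint over h: differs from p(v), since sum_h K(v,h) varies with v
p_marg = {v: sum(np.exp(-E[v]) * K[(v, h)] / Kbar for h in H) / Zp for v in V}
```

The marginals would agree only if sum_h K(v,h) were independent of v, which is a trace-like normalization condition on K.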
In other words, Deep Learning is not Statistics. It is more like Statistical Mechanics.
And the hope is that we can learn from this old scientific field — which is foundational to chemistry and physics — to improve our deep learning models.
Shortly after this paper came out, a Comment on “Why does deep and cheap learning work so well?” argued that the proof in the Appendix is indeed wrong–as I suspected and pointed out above.
It is noted that the point of the RG theory is to preserve the Free Energy from one layer to another, and, in VRG, this is expressed as a trace condition on the Transfer operator T(v,h):

sum_h exp(T(v,h)) = 1

It is, however, technically possible to preserve the Free Energy and not preserve the trace condition. Indeed, the K above is non-constant, thereby violating the trace condition.
From this blogger’s perspective, the idea of preserving the Free Energy, via either a trace condition, or, say, by layer normalization, is the important point. And this may mean only approximately satisfying the trace condition.
In Quantum Chemistry, there is a similar requirement, referred to as a Size-Consistency and/or Size-Extensivity condition. These requirements have proven essential to obtaining highly accurate, ab initio solutions of the molecular electronic Schrödinger equation–whether implemented exactly or approximately.
And, I suspect, a similar argument, at least in spirit if not in proof, is at play in Deep Learning.
Please chime in on my YouTube Community Channel
see also: https://m.reddit.com/r/MachineLearning/comments/4zbr2k/what_is_your_opinion_why_is_the_concept_of/
Stay tuned for a video link, and a blog post to accompany this.
Comments, questions, and bug fixes are welcome.
The motivation: to understand how to build very deep networks and why they do (or don’t) work.
There are several papers that caught my eye, starting with
Unifying Distillation and Privileged Information (2015)
These papers set the foundation for looking at much larger, deeper networks such as
FractalNets are particularly interesting since they suggest that very deep networks do not need student-teacher learning, and, instead, can be self similar. (This is related to very recent work on the Statistical Physics of Deep Learning, and the Renormalization Group analogy.)
IMHO, it is not enough just to implement the code; the results have to be excellent as well. I am not impressed with the results I have seen so far, and I would like to flush out what is really going on.
The 2010 paper still appears to be 1 of the top 10 results on MNIST:
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
The idea is simple. They claim to get state-of-the-art accuracy on MNIST using a 5-layer MLP, but running a large number of epochs with just SGD, a decaying learning rate, and an augmented data set.
The key idea is that the augmented data set can provide, in practice, an infinite amount of training data. And having infinite data means that we never have to worry about overtraining because we have too many adjustable parameters, and therefore any reasonable size network will do the trick if we just run it long enough.
In other words, there is no convolution gap, no need for early stopping, or really no regularization at all.
This sounds dubious to me, but I wanted to see for myself. Also, perhaps I am missing some subtle detail. Did they clip gradients somewhere ? Is the activation function central ? Do we need to tune the learning rate decay ?
I have initial notebooks on github, and would welcome feedback and contributions, plus ideas for other papers to reproduce.
I am trying to repeat this experiment using Tensorflow and 2 kinds of augmented data sets:
(and let me say a special personal thanks to Søren Hauberg for providing this recent data set)
I would like to try other methods, such as the Keras Data Augmentation library (see below), or even the recent data generation library coming out of OpenAI.
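As a stand-in for these heavier augmentation pipelines, here is a trivial numpy sketch (my own, not AlignMNIST or InfiMNIST) that generates randomly shifted copies of each image. Real elastic deformations are much richer, but the principle of generating fresh training data each epoch is the same:

```python
import numpy as np

def augment_shifts(images, max_shift=2, rng=None):
    # generate one randomly shifted copy of each image (toy augmentation)
    if rng is None:
        rng = np.random.default_rng()
    out = np.empty_like(images)
    for i, img in enumerate(images):
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll wraps pixels around; padding would be more realistic
        out[i] = np.roll(np.roll(img, dx, axis=0), dy, axis=1)
    return out

rng = np.random.default_rng(0)
imgs = rng.random((8, 28, 28))
aug = augment_shifts(imgs, max_shift=2, rng=rng)
```

Calling this with fresh random shifts every epoch gives the network a different view of each digit each time, which is the ‘infinite data’ effect the paper relies on.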
Current results are up for
The initial results indicate that AlignMNIST is much better than InfiMNIST for this simple MLP, although I still do not see the extremely high, top-10 accuracy reported.
Furthermore, the 5-Layer InfiMNIST actually diverges after ~100 epochs. So we still need early stopping, even with an infinite amount of data.
It may be interesting to try using the Keras ImageDataGenerator class, described in this related blog on “building powerful image classification models using very little data”.
Also note that the OpenAI group has released a new paper and code for creating data used in generative adversarial networks (GANs).
I will periodically update this blog as new data comes in, and I have the time to implement these newer techniques.
Next, we will check in the log files and discuss the tensorboard results.
Comments, criticisms, and contributions are very welcome.