Hinton introduced Free Energies in his 1994 paper, *Autoencoders, Minimum Description Length and Helmholtz Free Energy*.

This paper, along with his wake-sleep algorithm, laid the foundations for modern variational learning. Free Energies appear in his RBMs and, more recently, in Variational AutoEncoders (VAEs).

Of course, Free Energies come from Chemical Physics. And this is not surprising, since Hinton’s graduate advisor was a famous theoretical chemist.

They are so important that Karl Friston has proposed *The Free Energy Principle: A Unified Brain Theory?*

(See also the Wikipedia entry and this 2013 review.)

**What are Free Energies and why do we use them in Deep Learning?**

#### The Free Energy is a Temperature Weighted Average Energy

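In (Unsupervised) Deep Learning, Energies are quadratic forms over the weights. In an RBM, one has

$$E(\mathbf{v},\mathbf{h}) = -\mathbf{v}^{T}\mathbf{W}\mathbf{h} - \mathbf{a}^{T}\mathbf{v} - \mathbf{b}^{T}\mathbf{h}$$

This is the T=0 configurational Energy, where each configuration is some pair $(\mathbf{v},\mathbf{h})$. In chemical physics, these Energies resemble an Ising model.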
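As a minimal sketch, assuming the standard binary RBM form $E(\mathbf{v},\mathbf{h}) = -\mathbf{v}^{T}\mathbf{W}\mathbf{h} - \mathbf{a}^{T}\mathbf{v} - \mathbf{b}^{T}\mathbf{h}$, the Energy of a single configuration can be computed directly. The bias names `a` and `b`, the function name, and the toy numbers are all illustrative assumptions.

```python
def rbm_energy(v, h, W, a, b):
    """Configurational Energy of one (v, h) pair in a binary RBM.

    Assumes the standard form E(v, h) = -v.W.h - a.v - b.h, where
    W[i][j] couples visible unit i to hidden unit j.
    """
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    visible_bias = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_bias = sum(b_j * h_j for b_j, h_j in zip(b, h))
    return -(interaction + visible_bias + hidden_bias)


# Toy model: 3 visible units, 2 hidden units (values are made up).
W = [[0.5, -1.0],
     [1.5, 0.0],
     [-0.5, 2.0]]
a = [0.1, 0.2, 0.3]
b = [-0.4, 0.6]

energy = rbm_energy([1, 0, 1], [0, 1], W, a, b)
print(energy)
```

Each distinct (v, h) pair gives one Energy level; the Free Energy sums over all of them.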

The Free Energy is a weighted average of all the global and local minima $E_{i}$:

$$F = -T \ln \sum_{i} e^{-E_{i}/T}$$

##### Zero Temperature Limit

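Note: as $T \rightarrow 0$, the Free Energy becomes the T=0 global Energy minimum, $F \rightarrow E_{min}$. In the limit of zero Temperature, every term in the sum becomes vanishingly small relative to the largest one,

$$e^{-(E_{i}-E_{min})/T} \rightarrow 0 \quad \text{for} \quad E_{i} > E_{min}$$

and only the largest term, the one with the largest negative Energy, survives.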
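A small numerical sketch of this limit, assuming the usual form $F = -T \ln \sum_{i} e^{-E_{i}/T}$ and a made-up list of Energy levels. Factoring out the lowest Energy (the log-sum-exp trick) keeps the exponentials from overflowing at small T.

```python
import math


def free_energy(energies, T=1.0):
    """F = -T * ln(sum_i exp(-E_i / T)), via a stable log-sum-exp."""
    e_min = min(energies)
    # Factor out the largest Boltzmann term so exp() never overflows.
    shifted = sum(math.exp(-(e - e_min) / T) for e in energies)
    return e_min - T * math.log(shifted)


# Hypothetical energy levels: one global minimum and a few local minima.
energies = [-3.0, -2.5, -1.0, 0.5]

for T in (2.0, 1.0, 0.1, 0.01):
    print(f"T={T:<5} F={free_energy(energies, T):.4f}")
```

At finite T the Free Energy sits below the lowest Energy, because every minimum contributes; as T shrinks it approaches the global minimum of -3.0.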

##### Other Notation

We may also see F written in terms of the partition function Z:

$$F = -T \ln Z, \qquad Z = \sum_{i} e^{-E_{i}/T} = \langle e^{-E/T} \rangle$$

where the brackets denote an equilibrium average, an expected value over some equilibrium probability distribution $P(x)$. (We don’t normalize with 1/N here; in principle, the sum could be infinite.)

Of course, in deep learning, we may be trying to determine the distribution $P(x)$, and/or we may approximate it with some simpler distribution $Q(x)$ during inference. (From now on, I just write P and Q for convenience.)

But there is more to Free Energy learning than just approximating a distribution.

#### The Free Energy is an average solution to a non-convex optimization problem

In a chemical system, the Free Energy averages over all global and local minima below the Temperature T, with barriers below T as well. It is the Energy available to do work.

##### Being Scale Free: T=1

For convenience, Hinton explicitly set T=1. Of course, he was doing inference, and did not know the scale of the weights **W**. Since we don’t specify the Energy scale, we learn the scale implicitly when we learn **W**. *We call this being scale-free.*

So in the T=1, scale-free case, the Free Energy implicitly averages over all Energy minima where $E_{i} < T = 1$, as we learn the weights **W**. Free Energies solve the problem of Neural Nets being non-convex by averaging over the global minima and nearby local minima.

##### Highly degenerate non-convex problems

Because Free Energies provide an average solution, they can even provide solutions to highly degenerate non-convex optimization problems.

##### When do Free Energy solutions fail?

They will fail, however, when the barriers between Energy basins are larger than the Temperature.

This can happen if the effective Temperature drops close to zero during inference. Since T=1 implicitly in inference, this corresponds to the weights **W** exploding: scaling up **W** scales up every Energy, which has the same effect as lowering T.
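To make the link between exploding weights and a low effective Temperature concrete, here is a rough sketch. It assumes the Energies scale linearly with the weights (as the RBM interaction term does when every parameter is rescaled together); the three basin Energies and the factor of 50 are made up.

```python
import math


def boltzmann(energies, T=1.0):
    """Boltzmann probabilities p_i = exp(-E_i / T) / Z."""
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / T) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]


# Hypothetical energies of three basins at the original weight scale.
energies = [-1.0, -0.8, -0.5]

scale = 50.0  # pretend the weights, and hence the energies, grew 50x
scaled = [scale * e for e in energies]

p_original = boltzmann(energies, T=1.0)      # all three basins contribute
p_exploded = boltzmann(scaled, T=1.0)        # T=1, exploded weights
p_cold = boltzmann(energies, T=1.0 / scale)  # original weights, T=1/50

print(p_original)
print(p_exploded)
print(p_cold)
```

The last two distributions agree: at T=1 with weights 50 times larger, nearly all the probability collapses into the lowest basin, exactly as if the original system had been cooled to T=1/50.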

**See: Normalization in Deep Learning**

Systems may also get trapped if the Energy barriers grow very large, as in, say, the glassy phase of a mean field spin glass, or in a supercooled liquid (the so-called Adam-Gibbs phenomenon). I will discuss this in a future post.

In either case, if the system, or solver, gets trapped in a single Energy basin, it may appear to be convex, and/or flat (the Hessian has lots of zeros). But this is probably not the optimal solution to learning when using a Free Energy method.

#### Free Energies produce Ruggedly Convex Landscapes

It is sometimes argued that Deep Learning is a non-convex optimization problem. And yet, it has been known for over 20 years that networks like CNNs don’t suffer from the problems of local minima. How can this be?

At least for unsupervised methods, it has been clear since 1987 that:

*An important property of the effective [Free] Energy function E(V,0,T) is that it has a smoother landscape than E(S) [T=0] …*

*Hence, the probability of getting stuck in a local minima decreases*

This is not, however, specifically how Hinton argued for the Helmholtz Free Energy a decade later.

#### The Hinton Argument for Free Energies

Why do we use Free Energy methods? Hinton used **the bits-back argument:**

Imagine we are encoding some training data and sending it to someone for decoding. That is, we are building an Auto-Encoder.

If we have only 1 possible encoding, we can use any vanilla encoding method and the receiver knows what to do.

*But what if we have 2 or more equally valid codes?*

*Can we save 1 bit by being a little vague ?*

##### Stochastic Complexity

Suppose we have N possible encodings $i = 1, \dots, N$, each with Energy $E_{i}$. We say the data has *stochastic complexity.*

Pick a coding with probability $p_{i}$ and send it to the receiver. The expected cost of encoding is

$$\langle E \rangle = \sum_{i} p_{i} E_{i}$$

Now the receiver must *guess* which encoding we used. The decoding cost of the receiver is

$$F = \langle E \rangle - H$$

where H is the Shannon Entropy of the random encoding

$$H = -\sum_{i} p_{i} \ln p_{i}$$

The decoding cost looks just like a Helmholtz Free Energy.
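A toy sketch of the bits-back saving, assuming the cost takes the form $F = \langle E \rangle - H$ with $\langle E \rangle = \sum_{i} p_{i} E_{i}$ and $H = -\sum_{i} p_{i} \ln p_{i}$. The two 10-bit codes and the function names are made up for illustration.

```python
import math


def expected_energy(p, energies):
    """Expected encoding cost <E> = sum_i p_i * E_i."""
    return sum(p_i * e_i for p_i, e_i in zip(p, energies))


def entropy(p):
    """Shannon Entropy H = -sum_i p_i * ln(p_i), in nats."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0.0)


def coding_free_energy(p, energies):
    """Bits-back cost F = <E> - H (the T=1 Helmholtz Free Energy)."""
    return expected_energy(p, energies) - entropy(p)


# Two equally valid codes, each costing 10 bits (expressed in nats).
cost = 10 * math.log(2)
energies = [cost, cost]

f_single = coding_free_energy([1.0, 0.0], energies)  # always use code 0
f_random = coding_free_energy([0.5, 0.5], energies)  # pick either at random

bits_saved = (f_single - f_random) / math.log(2)
print(bits_saved)
```

Being a little vague, by choosing between the two codes at random, saves one bit. The random choice here is the Boltzmann distribution at T=1, so `f_random` also equals $-\ln \sum_{i} e^{-E_{i}}$.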

Moreover, we can use a sub-optimal encoding, and the authors suggest using a Factorized (i.e. mean field) Feed Forward Net to do this.

To understand this better, we need to relate thermodynamics and inference.

#### Thermodynamics and Inference

In 1957, Jaynes formulated the MaxEnt principle which considers equilibrium thermodynamics and statistical mechanics as inference processes.

In 1995, Hinton formulated the Helmholtz Machine and showed us how to define a quasi-Free Energy.

In Thermodynamics, the Helmholtz Free Energy F(T,V,N) is an Energy that depends on Temperature instead of Entropy. We need

$$E = E(S,V,N)$$

and F is defined as

$$F(T,V,N) = E(S,V,N) - TS$$

In ML, we set T=1. Really, the Temperature equals how much the Energy changes with a change in Entropy (at fixed V and N)

$$T = \left(\frac{\partial E}{\partial S}\right)_{V,N}$$

Variables like E and S depend on the system size N. That is, $E \sim N$ and $S \sim N$ as $N \rightarrow \infty$.

*We say S and T are a conjugate pair; S is extensive, T is intensive.*

(see more on this in the Appendix)

##### Legendre Transform

The conjugate pairs are used to define Free Energies via the Legendre Transform:

Helmholtz Free Energy: $F(T) = E(S) - TS$
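As a sanity check on this relation, the sketch below takes a made-up set of discrete Energy levels, computes $F = -T \ln Z$, and compares it with $\langle E \rangle - TS$ using the Gibbs Entropy of the Boltzmann distribution. It also recovers S as the slope $-\partial F/\partial T$ by finite differences. The levels and function names are assumptions for illustration only.

```python
import math


def log_z(energies, T):
    """ln Z for Z = sum_i exp(-E_i / T), computed stably."""
    e_min = min(energies)
    return -e_min / T + math.log(
        sum(math.exp(-(e - e_min) / T) for e in energies))


def helmholtz(energies, T):
    """F(T) = -T ln Z."""
    return -T * log_z(energies, T)


def mean_energy_and_entropy(energies, T):
    """Equilibrium <E> and Gibbs Entropy S = -sum_i p_i ln p_i."""
    lz = log_z(energies, T)
    p = [math.exp(-e / T - lz) for e in energies]
    mean_e = sum(p_i * e for p_i, e in zip(p, energies))
    s = -sum(p_i * math.log(p_i) for p_i in p)
    return mean_e, s


energies = [0.0, 0.5, 1.0, 3.0]  # hypothetical energy levels
T = 1.0

E, S = mean_energy_and_entropy(energies, T)
F = helmholtz(energies, T)

# Entropy as the conjugate slope: S = -dF/dT (central finite difference).
dT = 1e-5
S_from_slope = -(helmholtz(energies, T + dT)
                 - helmholtz(energies, T - dT)) / (2 * dT)

print(F, E - T * S, S, S_from_slope)
```

The two values of F agree to rounding error, and the two values of S agree to the accuracy of the finite difference. This is the conjugate-pair relation seen from the F(T) side.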

We switch the Energy from depending on S to T, where $T = \partial E/\partial S$.

Why? In a physical system, we may know the Energy function E, but we can’t directly measure or vary the Entropy S. However, we are free to change and measure the Temperature, *the derivative of E w.r.t. S:*

$$T = \frac{\partial E}{\partial S}$$

This is a powerful and general mathematical concept.

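Say we have a *convex* function f(x,y,z), but we can’t actually vary x. But we do know the slope, w, everywhere along x

$$w = \frac{\partial f}{\partial x}$$

Then we can form the Legendre Transform, which gives g(w,y,z) as the ‘*Tangent Envelope*’ of f() along x

$$g(w,y,z) = \min_{x}\left[f(x,y,z) - wx\right]$$

or, simply

$$g(w) = f(x) - wx$$

with x evaluated where the slope of f equals w.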

Note: we have converted a convex function into a concave one. The Legendre transform is concave in the intensive variables and convex in the extensive variables.
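A quick numerical sketch of this, using the sign convention of F = E – TS, that is $g(w) = \min_{x}[f(x) - wx]$. The convex function $f(x) = x^{2}$ and the grids are arbitrary choices for illustration; for this f the transform is $g(w) = -w^{2}/4$.

```python
def legendre(f, xs, w):
    """g(w) = min over x of [f(x) - w*x], evaluated on a grid of x values."""
    return min(f(x) - w * x for x in xs)


def f(x):
    return x * x  # a simple convex function, chosen for illustration


xs = [i / 100 for i in range(-500, 501)]  # x grid on [-5, 5]
ws = [-4.0, -2.0, 0.0, 2.0, 4.0]          # slopes to evaluate

g = {w: legendre(f, xs, w) for w in ws}

for w in ws:
    print(f"w={w:+.1f}  g(w)={g[w]:+.4f}  -w^2/4={-w * w / 4:+.4f}")
```

The result curves downward in w, so the convex f has become a concave g. Many textbooks define the transform with the opposite sign, $\sup_{x}[wx - f(x)]$, which is convex instead; the thermodynamic convention is the one used here.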

Of course, the true Free Energy F is convex in the extensive variables; this is central to Thermodynamics (see Appendix). It is concave in T, but we evaluate it at constant T.

But what if the Energy function is not convex in the Entropy? Or, suppose we extract a pseudo-Entropy from sampling some data, and we want to define a free energy potential (e.g. as in protein folding). These postulates also fail in systems like those in the blog post on spin chains.

**How can we always form a convex Free Energy?** *Answer:* Take the convex hull.

##### Legendre Fenchel Transform

When a convex Free Energy cannot readily be defined as above, we can use the generalized Legendre-Fenchel Transform, which provides a convex relaxation via

$$g(w) = \inf_{x}\left[f(x) - wx\right]$$

*the Tangent Envelope, a convex relaxation*

The Legendre-Fenchel Transform can provide a Free Energy, convexified *along the direction of the internal (configurational) Entropy,* allowing the Temperature to control how many local Energy minima are sampled.
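To see the convex relaxation at work, here is a sketch on a toy one-dimensional double well (not an actual Free Energy). Applying the transform $g(w) = \min_{x}[f(x) - wx]$ and then transforming back with $\max_{w}[g(w) + wx]$ returns the convex hull of f. The function, the grids, and the names are assumptions for illustration.

```python
def double_well(x):
    """A non-convex function with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


xs = [i / 100 for i in range(-200, 201)]  # x grid on [-2, 2]
ws = [i / 10 for i in range(-300, 301)]   # slope grid on [-30, 30]

# First transform: g(w) = min_x [f(x) - w*x], concave in w.
g = {w: min(double_well(x) - w * x for x in xs) for w in ws}


def convex_envelope(x):
    """Second transform: max_w [g(w) + w*x], the Tangent Envelope of f."""
    return max(g_w + w * x for w, g_w in g.items())


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  f={double_well(x):.3f}  hull={convex_envelope(x):.3f}")
```

Between the two wells the barrier at x=0 (height 1) is flattened to 0, while outside the wells, where f is already convex, the envelope matches f. The slope grid has to cover the range of slopes of f on the x grid, which is why it extends to ±30.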

#### Practical Applications

- This is how the (TAP) Free Energy is formed for the EMF_RBM. I have written an open source version of the code in python.
- Google DeepMind released a paper in 2013 on DARN: Deep AutoRegressive Networks.
- The modeling package Edward provides Variational Inference on top of TensorFlow
- Variational AutoEncoders are available in Keras
- And it is how Jordan does Variational Inference, the topic of part II of this blog.

**Thanks again for reading and feedback is welcome.**

#### Appendix

Extra stuff I just wanted to write down…

##### Convexity in Thermodynamics and Statistical Physics

- S and V (or U and V) are the coordinates for the manifold of Equilibrium states
- The Energy U is a convex function of S and V, and the Entropy S is a concave function of U and V
- The Temperature is the slope $T = (\partial U/\partial S)_{V}$

##### Extensivity and Weight Constraints

If we assume T=1 at all times, and we assume our Deep Learning Energies are extensive, as they would be in an actual thermodynamic system, then the weight norm constraints act to enforce the size-extensivity. That is, as the system size $n \rightarrow \infty$, the Energy should grow no faster than linearly,

$$|E(n)| \le Mn$$

for some constant M, so **W** should remain bounded to prevent the Energy E(n) from growing faster than Mn. And, of course, most Deep Learning algorithms do bound **W** in some form.

##### Back to Stat Mech

where C denotes a contour integral.