on Cheap Learning: Partition Functions and RBMs

Why does deep and cheap learning work so well?

This is the question posed by a recent article.  Deep Learning seems to require knowing the Partition Function, at least in old-fashioned Restricted Boltzmann Machines (RBMs).

Here, I will discuss some aspects of this paper in the context of RBMs.

Preliminaries

We can use RBMs for unsupervised learning, as a clustering algorithm, for pretraining larger nets, and for generating sample data.   Mostly, however, RBMs are an older toy model, useful for understanding unsupervised Deep Learning.

RBMs

We define an RBM with an Energy Function

E(\mathbf{v},\mathbf{h})=-\mathbf{v}^{T}\mathbf{W}\mathbf{h}-\mathbf{v}^{T}\mathbf{a}-\mathbf{b}^{T}\mathbf{h}

and its associated Partition Function

Z(\mathbf{v},\mathbf{h})=\sum\limits_{\mathbf{v},\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}

The joint probability is then

P(\mathbf{v},\mathbf{h})=\dfrac{1}{Z(\mathbf{v},\mathbf{h})}e^{-E(\mathbf{v},\mathbf{h})}

and the probability of  the visible units is computed by marginalizing over the hidden units

p(\mathbf{v})=\dfrac{1}{Z(\mathbf{v},\mathbf{h})}\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h})}

Note that this is also the probability of observing the data X={v}, given the weights and biases

p(\mathbf{v})=p(\mathbf{v}|\mathbf{W,a,b})
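
For a tiny RBM, we can compute Z and p(v) exactly by brute-force enumeration, which makes the marginalization concrete (and shows why it is hopeless at scale, since the sum has 2^(n_v+n_h) terms).  A minimal numpy sketch; the sizes, random weights, and function names are illustrative assumptions:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 4, 3                       # tiny sizes: enumeration costs 2**(n_v + n_h) terms
W = 0.1 * rng.standard_normal((n_v, n_h))
a = 0.1 * rng.standard_normal(n_v)    # visible bias
b = 0.1 * rng.standard_normal(n_h)    # hidden bias

def energy(v, h):
    # E(v,h) = -v^T W h - v^T a - b^T h
    return -(v @ W @ h) - v @ a - b @ h

states_v = [np.array(s) for s in itertools.product([0, 1], repeat=n_v)]
states_h = [np.array(s) for s in itertools.product([0, 1], repeat=n_h)]

# partition function: sum over every joint configuration (v, h)
Z = sum(np.exp(-energy(v, h)) for v in states_v for h in states_h)

# marginal p(v): sum over the hidden configurations only
def p_v(v):
    return sum(np.exp(-energy(v, h)) for h in states_h) / Z

print(Z, sum(p_v(v) for v in states_v))   # the second number is 1.0
```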

The log Likelihood is just the log of this probability

\mathcal{L}=\ln\;p(\mathbf{v})=\ln\;\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v,h})}-\ln\;Z(\mathbf{v,h})

We can break this into 2 parts:

\mathcal{L}=-F^{c}(\mathbf{v})+F(\mathbf{v,h})

 F(\mathbf{v,h}) is just the standard Free Energy

F=-\ln\;Z(\mathbf{v,h})

We call F^{c}(\mathbf{v}) the clamped Free Energy

F^{c}(\mathbf{v})=-\ln\;\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v,h})}

because it is like a Free Energy, but with the visible units clamped to the data X.

The clamped F^{c}(\mathbf{v}) is easy to evaluate in the RBM formalism, whereas F(\mathbf{v,h}) is computationally intractable.
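
Concretely, because the hidden units are binary and conditionally independent given v, the sum over h factorizes, and the clamped Free Energy has a simple closed form.  A minimal numpy sketch under those assumptions (the function name is my own):

```python
import numpy as np

def clamped_free_energy(v, W, a, b):
    """F^c(v) = -ln sum_h e^{-E(v,h)} for binary hidden units.

    The sum over h factorizes:
        sum_h e^{-E(v,h)} = e^{a^T v} * prod_j (1 + e^{b_j + (W^T v)_j})
    """
    # np.logaddexp(0, x) = ln(1 + e^x), computed stably
    return -(v @ a) - np.sum(np.logaddexp(0.0, b + W.T @ v))
```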

Knowing the Z(\mathbf{v,h}) is ‘like’ knowing the equilibrium distribution function, and methods like RBMs appear to approximate \ln\;Z(\mathbf{v,h}) in some form or another.

RBM Training

Training an RBM proceeds iteratively by approximating the Free Energies at each step,

F(\mathbf{v,h})=\cdots

and then updating W with a gradient step

\mathbf{W}\rightarrow\mathbf{W}+\lambda\mathbf{dw}

RBMs are usually trained via Contrastive Divergence (CD or PCD).  Because the Energy function is bilinear in v and h, the conditional distributions factorize exactly, leading to simple expressions for the conditional probabilities

p(h_{i}=1|\mathbf{v})=\sigma(b_{i}+(\mathbf{W}^{T}\mathbf{v})_{i})

p(v_{i}=1|\mathbf{h})=\sigma(a_{i}+(\mathbf{W}\mathbf{h})_{i})

and the weight update rule

d\mathbf{w}=\langle \mathbf{v}\mathbf{h}^{T} \rangle_{0}-\langle\mathbf{v}\mathbf{h}^{T}\rangle_{\infty}
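
In code, the two conditionals above are one line each; the expectations in the update rule are then estimated from samples drawn with them.  A small numpy sketch, consistent with the Energy function above (function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    # p(h_i = 1 | v) = sigma(b_i + (W^T v)_i)
    return sigmoid(b + W.T @ v)

def p_v_given_h(h, W, a):
    # p(v_i = 1 | h) = sigma(a_i + (W h)_i)
    return sigmoid(a + W @ h)
```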

positive and negative phases

RBM codes may use the terminology of positive and negative phases:

positive\;\;\langle\mathbf{v,h}\rangle^{+}\rightarrow negative\;\;\langle\mathbf{v,h}\rangle^{-}

\langle\mathbf{v,h}\rangle^{+} :  The expectation \langle\mathbf{v,h}\rangle_{0} is evaluated, or clamped, on the data.

\langle\mathbf{v,h}\rangle^{-} :   The expectation \langle\mathbf{v,h}\rangle_{\infty} is to be evaluated on the model distribution p(\mathbf{X}|\mathbf{W}) .  We also say \langle\cdot\rangle_{\infty} is evaluated in the limit of infinite sampling, at the so-called equilibrium distribution.  But we don’t take the infinite limit.

CD approximates \langle\mathbf{v}\mathbf{h}^{T}\rangle_{\infty}, and thus the Free Energy term, by running only one (or a few) steps of Gibbs Sampling.

So we may see

d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle_{0}-\langle\mathbf{v}^{T}\mathbf{h}\rangle_{1}

or, more generally, in some code bases, something effectively like

d\mathbf{w}=\langle \mathbf{v}^{T}\mathbf{h} \rangle^{+}-\langle\mathbf{v}^{T}\mathbf{h}\rangle^{-}

Pseudocode is:


Initialize the positive  (\mathbf{v,h})^{+} and negative (\mathbf{v,h})^{-} states

Run N iterations of:

Run 1 Step of Gibbs Sampling to get the negative  (\mathbf{v,h})^{-} :

sample the hiddens given the (current) visibles: p(h^{-}_{i}=1|\mathbf{v})

sample the visibles given the hiddens (above): p(v^{-}_{i}=1|\mathbf{h})

Calculate the weight gradient:

dw_{i,j}=\langle v_{i}h_{j} \rangle^{+}-\langle v_{i}h_{j}\rangle^{-}

Apply Weight decay or other regularization (optional):

\mathbf{dw}\rightarrow\mathbf{dw}-\delta\mathbf{W}

Apply a momentum (optional):

\mathbf{dw}\rightarrow\mathbf{dw}+\mu\mathbf{dw}_{prev}

Update the Weights:

\mathbf{W}\rightarrow\mathbf{W}+\lambda\mathbf{dw}
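
Putting the pieces together, here is a minimal CD-1 training sketch in numpy.  It follows the pseudocode above, but the hyperparameters, initialization, and helper names are illustrative assumptions, and it omits minibatching and other practical details:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_train(X, n_h, n_iter=100, lr=0.05, decay=1e-4, momentum=0.5):
    """Train a binary RBM with 1-step Contrastive Divergence.

    X: (n_samples, n_v) array of binary visible data.
    """
    n_samples, n_v = X.shape
    W = 0.01 * rng.standard_normal((n_v, n_h))
    a = np.zeros(n_v)                     # visible bias
    b = np.zeros(n_h)                     # hidden bias
    dW_prev = np.zeros_like(W)

    for _ in range(n_iter):
        # positive phase: hidden probabilities clamped on the data
        ph_pos = sigmoid(X @ W + b)                  # (n_samples, n_h)
        pos_assoc = X.T @ ph_pos

        # one step of Gibbs sampling for the negative phase
        h_sample = (rng.random(ph_pos.shape) < ph_pos).astype(float)
        pv_neg = sigmoid(h_sample @ W.T + a)         # reconstructed visibles
        ph_neg = sigmoid(pv_neg @ W + b)
        neg_assoc = pv_neg.T @ ph_neg

        # gradient estimate  <v h^T>_0 - <v h^T>_1 , plus decay and momentum
        dW = (pos_assoc - neg_assoc) / n_samples
        dW -= decay * W
        dW += momentum * dW_prev

        W += lr * dW
        a += lr * (X - pv_neg).mean(axis=0)
        b += lr * (ph_pos - ph_neg).mean(axis=0)
        dW_prev = dW

    return W, a, b
```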


Energy Renormalizations

What is Cheap about learning?  A technical proof in the Appendix of the paper notes that

knowing the Partition function Z(\mathbf{v,h}) is not the same as knowing the underlying distribution p(\mathbf{v}) .

This is because the Energy E(\mathbf{v,h})  can be rescaled, or renormalized, in many different ways, without changing Z(\mathbf{v,h}) .

This is also a key idea in Statistical Mechanics.

The Partition function is a generating function; we can write all the macroscopic, observable thermodynamic quantities as partial derivatives of \ln\;Z .  And we can do this without knowing the exact distribution functions or energies–just their renormalized forms.

Of course, our W update rule is a derivative of \ln\;Z(\mathbf{v,h})

The proof is technically straightforward, albeit a bit odd at first.

Matching Z(v) does not imply matching p(v)

Let’s start with the visible units \mathbf{v} .  Write

p(\mathbf{v})=\dfrac{1}{Z}e^{-E(\mathbf{v})}

Z(\mathbf{v})=\sum\limits_{\mathbf{v}}e^{-E(\mathbf{v})}

We now introduce the hidden units, \mathbf{h} , into the model, so that we have a new, joint probability distribution

p(\mathbf{v,h})=\dfrac{1}{Z}e^{-E(\mathbf{v,h})}

and a new, Renormalized , partition function

Z^{RG}(\mathbf{v,h})=\sum\limits_{\mathbf{v,h}}e^{-E(\mathbf{v,h})}

where RG means Renormalization Group.  We have already discussed that the general RBM approach resembles the Kadanoff Variational Renormalization Group (VRG) method, circa 1975. This new paper points out a small but important technical oversight made in the ML literature, namely that

having Z^{RG}(\mathbf{v,h})=Z(\mathbf{v}) does not imply \tilde{p}(\mathbf{v})=\sum\limits_{\mathbf{h}}\tilde{p}(\mathbf{v},\mathbf{h})=p(\mathbf{v})

That is, just because we can estimate the Partition function well does not mean we know the probability distributions.

Why?   Define an arbitrary non-constant function K(\mathbf{v}) and write

\tilde{Z}(\mathbf{v})=\sum\limits_{\mathbf{v}}e^{-[E(\mathbf{v})+K(\mathbf{v})]}

K is for the Kadanoff RG Transform, and \ln\;\tilde{Z} is the normalization.

We can now write a joint Energy E(\mathbf{v,h}) with the same Partition function as our original Z(\mathbf{v}) , but with a completely different joint probability distribution.  Let

E(\mathbf{v,h})=E(\mathbf{v})+E(\mathbf{h})+K(\mathbf{v})+\ln\;\tilde{Z}(\mathbf{v})

Notice what we are actually doing.  We use K to define the RBM joint Energy function.  In RBM theory, we restrict E(\mathbf{v,h}) to a quadratic form and use a variational procedure to learn the weights, thereby learning K.

In a VRG approach, we have the additional constraint that the form of K must satisfy a condition on its partition function, or, really, on how the Energy function is normalized.  Hence the name ‘Renormalization.’   This is similar in spirit, but not necessarily in form, to how RBM training regularizes the weights (above).

Write the total, or renormalized, Z as

Z^{RG}(\mathbf{v,h})=Z_{tot}(\mathbf{v,h})=\sum\limits_{\mathbf{v,h}}e^{-E(\mathbf{v,h})}

Expanding the Energy function explicitly, we have

=\dfrac{1}{\tilde{Z}(\mathbf{v})}\sum\limits_{\mathbf{v,h}}e^{-[E(\mathbf{v})+E(\mathbf{h})+K(\mathbf{v})]}

where the Kadanoff normalization factor \tilde{Z} now appears in the denominator.

We can break the double sum into separate sums over \mathbf{v} and \mathbf{h}

=\dfrac{1}{\tilde{Z}(\mathbf{v})}\sum\limits_{\mathbf{v}}e^{-[E(\mathbf{v})+K(\mathbf{v})]}\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}

Identify \tilde{Z} in the numerator

=\dfrac{1}{\tilde{Z}(\mathbf{v})}\tilde{Z}(\mathbf{v})\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}

which factors out, giving a very simple expression in h

=\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}

In the technical proof in the paper, the idea is that since \mathbf{h} is just a dummy variable, we can replace \mathbf{h} with \mathbf{v}.  We have to be careful here, since this seems to apply only when we have the same number of hidden and visible units, a rare case.  In an earlier post on VRG, I explain more clearly how to construct an RG transform for RBMs.  Still, the paper is presenting a counterargument for argument’s sake, so, following the argument in the paper, let’s say

\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})} =\sum\limits_{\mathbf{v}}e^{-E(\mathbf{v})}

This is like saying we constrain the Free Energy at each layer to be the same.

\ln\;\sum\limits_{\mathbf{h}}e^{-E(\mathbf{h})}\approx\ln\;\sum\limits_{\mathbf{v}}e^{-E(\mathbf{v})}

This is also a kind of Layer Normalization, a very popular technique in modern Deep Learning.

So, by construction, the renormalized and data Partition functions are identical

Z^{RG}(\mathbf{v,h}) =Z(\mathbf{v})

The goal of Renormalization Group theory is to redefine the Energy function on a different scale, while retaining the macroscopic observables.

But, and apparently this has been misstated in some ML papers and books, the marginalized probabilities can be different!

To get the marginals, let’s integrate out only the h variables

\tilde{p}(\mathbf{v})=\dfrac{1}{Z^{RG}(\mathbf{v,h})}\sum\limits_{\mathbf{h}}e^{-E(\mathbf{v,h})}

Looking above, we can write this in terms of K(\mathbf{v}) and its normalization \tilde{Z}(\mathbf{v})

\tilde{p}(\mathbf{v})=\dfrac{1}{\tilde{Z}(\mathbf{v})}e^{-[E(\mathbf{v})+K(\mathbf{v})]}

which implies

\tilde{p}(\mathbf{v})\ne p(\mathbf{v})  
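
This is easy to check numerically.  The sketch below follows the construction above for binary v and h of the same (tiny) size, so the dummy-variable step holds exactly; the particular choices of E and K are arbitrary assumptions made only for the check:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 3
c = rng.standard_normal(n)
states = [np.array(s) for s in itertools.product([0, 1], repeat=n)]

def E(x):                 # an arbitrary energy over binary units
    return -(c @ x)

def K(x):                 # an arbitrary NON-constant function of v
    return 0.7 * x[0] - 0.3 * x[1]

Z_v = sum(np.exp(-E(v)) for v in states)
Z_tilde = sum(np.exp(-(E(v) + K(v))) for v in states)

def E_joint(v, h):        # E(v,h) = E(v) + E(h) + K(v) + ln Z~
    return E(v) + E(h) + K(v) + np.log(Z_tilde)

Z_RG = sum(np.exp(-E_joint(v, h)) for v in states for h in states)

p = np.array([np.exp(-E(v)) / Z_v for v in states])
p_tilde = np.array([sum(np.exp(-E_joint(v, h)) for h in states) / Z_RG
                    for v in states])

print(np.isclose(Z_RG, Z_v))      # True: the partition functions match
print(np.allclose(p_tilde, p))    # False: the marginals do not
```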


So what is cheap about deep learning?

RBMs let us represent data using a smaller set of hidden features.  This is, effectively, a Variational Renormalization Group algorithm, in which we approximate the Partition function at each step of the RBM learning procedure, without having to learn the underlying joint probability distribution.  And this is easier.  Cheaper.

In other words, Deep Learning is not Statistics.  It is more like Statistical Mechanics.  

And the hope is that we can learn from this old scientific field — which is foundational to chemistry and physics — to improve our deep learning models.

Post-Post Comments

Shortly after this paper came out, a Comment on “Why does deep and cheap learning work so well?” appeared, arguing that the proof in the Appendix is indeed wrong, as I suspected and pointed out above.

It is noted that the point of the RG theory is to preserve the Free Energy from one layer to another, and, in VRG, this is expressed as a trace condition on the Transfer operator

\sum\limits_{\mathbf{h}}e^{-T(\mathbf{v},\mathbf{h})}=1

where -T(\mathbf{v},\mathbf{h})=E(\mathbf{v})+K(\mathbf{v})+ln\tilde{Z}

It is, however, technically possible to preserve the Free Energy and not preserve the trace condition.  Indeed, that is the case here: K(\mathbf{v}) is non-constant, thereby violating the trace condition.

From this blogger’s perspective, the idea of preserving the Free Energy, via either a trace condition or, say, a layer normalization, is the important point.  And this may mean only approximately satisfying the trace condition.

In Quantum Chemistry, there is a similar requirement, referred to as a Size-Consistency and/or Size-Extensivity condition.  And these requirements have proven essential to obtaining highly accurate, ab initio solutions of the molecular electronic Schrodinger equation, whether implemented exactly or approximately.

And, I suspect, a similar argument, at least in spirit if not in proof, is at play in Deep Learning.

Please chime in on my YouTube Community Channel

see also: https://m.reddit.com/r/MachineLearning/comments/4zbr2k/what_is_your_opinion_why_is_the_concept_of/
