Foundations: Mean Field Boltzmann Machines 1987

A friend from grad school pointed out a great foundational paper on Boltzmann Machines. It is a 1987 paper from complex systems theory,

A Mean Field Theory Learning Algorithm for Neural Networks

published just a couple of years after Hinton’s seminal 1985 paper, “A Learning Algorithm for Boltzmann Machines”.

What I really like is how it shows that the foundations of deep learning arose from statistical physics and theoretical chemistry. My top 10 favorite takeaways are:

  • The relation between Boltzmann Machines and the nearly forgotten Hopfield Associative Memory, and why hidden nodes made a big difference in just coming up with a reasonable training algorithm.

 

  • What an actual mean field theory (MFT) is. They don’t just factor the Energy function or use a bipartite graph. They introduce continuous fields (U,V) via the delta function, and then take a saddle point approximation. Today we only see MFTs expressed as the resulting factorized models like RBMs and Sum-Product Networks; we don’t see the fields.

 

  • That an annealing schedule was not about adjusting the learning rate; it was about adjusting the temperature. And that the annealing schedule is a huge, flexible part of the model. Yes, Boltzmann Machines were originally temperature dependent.

 

  • How to derive the learning rules for neural nets using Markov chains and the principle of microscopic reversibility.

 

  • Where the tanh activation function came from. It comes from the MFT. So today, sure, we use ReLUs. But the Energy function is not arbitrary; there is a deep reason for these activation functions.

 

  • How the MFT here is also a quenched approximation. This comes up all the time in analyzing mean field spin glasses, both in the replica symmetric solutions and in replica symmetry breaking (RSB). You can’t understand the phase diagram of a spin glass without understanding this.

 

  • Why the Free Energy is smoother (i.e. closer to convex) than the highly non-convex, T=0 Energy landscape.

 

  • We do not anneal to T=0. RBMs and, I suspect, Deep Learning in general are not about traversing the T=0 Energy Landscape.

 

  • The paper presents a very simple example that anyone can code up and run in a day:  2 input nodes, 1 output node, 4 hidden nodes.

 

  • The references! Great stuff.
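
For readers who want to see the fields and the tanh in symbols, here is a summary sketch of the main equations behind the items above. I am assuming ±1 units S_i, symmetric weights T_ij with no self-connections, a temperature T, and a learning rate η; the notation is my recollection of the paper's, so check it against the original before relying on it.

```latex
\begin{align}
  E(\vec{S}) &= -\tfrac{1}{2} \sum_{i \neq j} T_{ij} S_i S_j ,
  \qquad
  Z = \sum_{\vec{S}} e^{-E(\vec{S})/T} \\
  % Saddle point of the (U, V) integral representation of Z:
  V_i &= \tanh\!\Big( \tfrac{1}{T} \sum_{j} T_{ij} V_j \Big) \\
  % Learning rule, with correlations <S_i S_j> replaced by products V_i V_j:
  \Delta T_{ij} &= \eta \left( V_i V_j \big|_{\text{clamped}} - V_i V_j \big|_{\text{free}} \right)
\end{align}
```

The second line is where the tanh activation comes from: it is the saddle point condition for the mean fields V_i, and T enters it directly, which is why the annealing schedule is a temperature schedule.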

 

This is, all in all, a fantastic paper from the statistical physics point of view.  And if there is interest, I can go through the details here.
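
As a taste of those details, here is a minimal sketch of mean field learning on the small 2-4-1 network. It is not the paper's code. I am assuming ±1 units, a fully connected symmetric weight matrix with zero diagonal, XOR as the task, inputs clamped in the free phase, and a made-up temperature schedule and learning rate. The names `settle` and `mft_step` are my own.

```python
import numpy as np

def settle(W, V, clamped, temps, sweeps=5):
    """Mean field annealing: relax free units toward V_i = tanh(sum_j W_ij V_j / T)."""
    V = V.astype(float).copy()
    free = np.flatnonzero(~clamped)
    for T in temps:                  # annealing schedule: high T -> low T, never T = 0
        for _ in range(sweeps):
            for i in free:           # asynchronous updates, one unit at a time
                V[i] = np.tanh(W[i] @ V / T)
    return V

def mft_step(W, x, y, temps, lr=0.05, rng=None):
    """One learning step; units 0-1 are inputs, 2-5 hidden, 6 output (assumed layout)."""
    rng = np.random.default_rng() if rng is None else rng
    n = W.shape[0]
    V0 = rng.uniform(-0.1, 0.1, size=n)   # small random start for the free units
    V0[:2] = x

    # Clamped phase: inputs and output are fixed, hidden units settle.
    Vc = V0.copy()
    Vc[6] = y
    clamped = np.zeros(n, dtype=bool)
    clamped[[0, 1, 6]] = True
    Vc = settle(W, Vc, clamped, temps)

    # Free phase: only inputs are fixed, hidden and output units settle.
    clamped_free = np.zeros(n, dtype=bool)
    clamped_free[:2] = True
    Vf = settle(W, V0, clamped_free, temps)

    # MFT replaces the correlations <S_i S_j> with products V_i V_j.
    dW = lr * (np.outer(Vc, Vc) - np.outer(Vf, Vf))
    np.fill_diagonal(dW, 0.0)             # keep zero self-connections
    return W + dW, Vf[6]

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(7, 7))
W = (W + W.T) / 2                          # symmetric weights
np.fill_diagonal(W, 0.0)
temps = [4.0, 2.0, 1.0, 0.5]               # hypothetical schedule
xor = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
for epoch in range(50):
    for x, y in xor:
        W, out = mft_step(W, np.array(x), y, temps, rng=rng)
```

I make no claim that these particular settings learn XOR. The point is only the structure: two deterministic settling phases at finite temperature, followed by a weight update built from products of mean fields.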

Happy New Year everyone!

 

6 comments

  1. I learned a lot from your articles. Thank you! By training, I am a theoretical chemist and I would really love to see how statistical physics relates to machine learning. Your articles are very helpful in revealing the relationship between these two areas.


  2. Happy new year! This is one of the most interesting ML blogs out there. I am a physicist specializing in complex systems. This blog is a revelation!

    As for the last paragraph: yes, more please.

