1. In the presentation, the idea is that Dark Knowledge is a way to build a network with the small capacity , but a better optimization function than can be obtained using standard SGD training


  2. Following from the comments in https://charlesmartin14.wordpress.com/2016/10/21/improving-rbms-with-physical-chemistry/#comment-1499, which are more relevant here.

    “This is the standard ‘picture’. Now this may or may not be a good model for what is going on. ” Ok, I see. Do you know of any other proposed alternatives to the standard “picture which explains how the energy landscape corresponds to overfitting”? Hm, I suppose that in the paper where “It is suggested that having too many Hidden nodes–at a fixed Temp.–leads to a glassy state prone to overtraining small data sets.”, there may a way of looking at overfitting as a glassy state, but I haven’t finished reading the paper yet.

    Hm, thinking a bit more, if the spin-glass state has minima which are quite deep, then getting stuck in them would coincide with overfitting/small trainning error. I suppose this could be a kind of overfitting that you get if you have no regularization (or if it isn’t working properly), while the no-early-stopping overfitting is related to global minima in the funnel.. Though, I think I’m now seeing why the whole thing could be more complicated..

    Anyway I think I’m starting to get how regularization is a proxy for temperature. Basically, regularization helps either remove bad minima by changing the free energy landscape, or helps escaping from them by adding fluctuations, like dropout. Actually, even dropout can be seen as changing the energy landscape I think, as adding fluctuations is like increasing temperature, which makes entropy matter more (when writting free energy as F=U-TS). Does this sound right?

    This last thought made me remember the “survival of the flattest” idea in evolution ( http://www.nature.com/nature/journal/v412/n6844/abs/412331a0.html ) . Let’s say we have a wide funnel, even if it’s not that deep, a system with high mutation rate/temperature will actually end up there over an isolated minimum even if its deeper (say corresponding to overtraining)!


    1. The primary argument about the funnel is that these learning systems are strongly correlated, and therefore not readily treated by mean field theory. Specifically, the classical idea was that strongly correlated random models lie in a different universality class. So they behave completely differently than a spin glass , and this gives rise to the convexity .


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s