MMDS 2016 Video Presentation
Charles H Martin, PhD
September 24, 2016

Why Deep Learning Works: Perspectives from Theoretical Chemistry
Thank you for your presentation. How would you define Dark Knowledge?
In the presentation, the idea is that Dark Knowledge is a way to build a network with small capacity but a better optimization function than can be obtained using standard SGD training.
Following on from the comments at https://charlesmartin14.wordpress.com/2016/10/21/improving-rbms-with-physical-chemistry/#comment-1499, which are more relevant here:
“This is the standard ‘picture’. Now this may or may not be a good model for what is going on.” Ok, I see. Do you know of any other proposed alternatives to the standard “picture which explains how the energy landscape corresponds to overfitting”? Hm, I suppose that in the paper where “It is suggested that having too many Hidden nodes–at a fixed Temp.–leads to a glassy state prone to overtraining small data sets.”, there may be a way of looking at overfitting as a glassy state, but I haven’t finished reading the paper yet.
Hm, thinking a bit more, if the spin-glass state has minima which are quite deep, then getting stuck in them would coincide with overfitting/small training error. I suppose this could be a kind of overfitting that you get if you have no regularization (or if it isn’t working properly), while the no-early-stopping overfitting is related to global minima in the funnel. Though, I think I’m now seeing why the whole thing could be more complicated.
Anyway, I think I’m starting to get how regularization is a proxy for temperature. Basically, regularization either helps remove bad minima by changing the free energy landscape, or helps escape from them by adding fluctuations, like dropout. Actually, even dropout can be seen as changing the energy landscape, I think, as adding fluctuations is like increasing temperature, which makes entropy matter more (when writing the free energy as F = U - TS). Does this sound right?
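To make the F = U - TS argument concrete, here is a rough sketch for a single basin, under assumptions that are not in the comment above: the basin is treated as a one-dimensional harmonic well with minimum energy E_i and curvature k_i (both hypothetical symbols), and units are chosen so that k_B = 1. It only illustrates the thermodynamic analogy, not how dropout actually behaves.

```latex
% Partition function of a harmonic basin E(x) = E_i + (k_i/2)(x - x_i)^2
Z_i = \int_{-\infty}^{\infty} e^{-E(x)/T}\,dx
    = e^{-E_i/T}\sqrt{\frac{2\pi T}{k_i}}

% Free energy of the basin
F_i = -T \ln Z_i = E_i - \frac{T}{2}\ln\!\left(\frac{2\pi T}{k_i}\right)
```

In this approximation, T multiplies a term that depends on the curvature k_i. At low T the depth E_i dominates, while at higher T a flatter basin (smaller k_i) gains relative to a sharper one, which is one way of reading "entropy matters more".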
This last thought made me remember the “survival of the flattest” idea in evolution (http://www.nature.com/nature/journal/v412/n6844/abs/412331a0.html). Say we have a wide funnel. Even if it’s not that deep, a system with a high mutation rate/temperature will actually end up there rather than in an isolated minimum, even if that minimum is deeper (say, corresponding to overtraining)!
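The wide-funnel-versus-deep-minimum intuition can be checked numerically on a toy landscape. The sketch below is only an illustration under stated assumptions: a one-dimensional energy made of two harmonic wells, with depths, curvatures, and positions chosen arbitrarily, and a function name `basin_occupancy` that is hypothetical. It computes the Boltzmann probability of sitting in each basin at a given temperature.

```python
import numpy as np

def basin_occupancy(temperature, n_points=100001):
    """Boltzmann occupancy of a narrow deep well vs. a wide shallow well.

    Toy assumptions (not from the post): 1D landscape, k_B = 1,
    narrow well: depth -2, curvature 100, centered at x = -2;
    wide well:   depth -1, curvature 1,   centered at x = +2.
    """
    x = np.linspace(-10.0, 14.0, n_points)
    narrow = -2.0 + 0.5 * 100.0 * (x + 2.0) ** 2
    wide = -1.0 + 0.5 * 1.0 * (x - 2.0) ** 2

    # The landscape is the lower of the two parabolas at each point;
    # a point belongs to the narrow basin where the narrow parabola is lower.
    energy = np.minimum(narrow, wide)
    in_narrow = narrow < wide

    # Shift by the minimum energy so the exponentials cannot overflow.
    weights = np.exp(-(energy - energy.min()) / temperature)

    # The grid is uniform, so the spacing cancels in the ratio.
    p_narrow = weights[in_narrow].sum() / weights.sum()
    return p_narrow, 1.0 - p_narrow

if __name__ == "__main__":
    for t in (0.1, 0.43, 2.0):
        p_narrow, p_wide = basin_occupancy(t)
        print(f"T={t:4.2f}  P(narrow, deep)={p_narrow:.3f}  P(wide, shallow)={p_wide:.3f}")
```

With these particular numbers, the harmonic approximation puts the crossover near T = 1/ln(10) ≈ 0.43: below it the deeper, narrower well holds most of the probability, and above it the wider, shallower well does. Different choices of depth and curvature move the crossover but not the qualitative picture.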
Thanks for the great questions; I cleaned up the blog post based on this. I think the older ideas about Hopfield models permeate the discussion today, and this is the problem. I have to step out for a bit, so I’ll answer quickly and then return to this later. See http://vision.ucla.edu/~pratikac/pub/chaudhari.cs269.16.pdf
The primary argument about the funnel is that these learning systems are strongly correlated, and therefore not readily treated by mean field theory. Specifically, the classical idea was that strongly correlated random models lie in a different universality class. So they behave completely differently than a spin glass, and this gives rise to the convexity.
Lovely blog you have