I agree: there is no discussion of decimation, or of the fact that these are mean-field methods.

Let me make a simple observation about free-energy-based deep nets, like RBMs and DBMs.

The idea is that, when defining the total free energy, one should also impose a constraint on the correlation operator V(h,v) such that the partition function is unchanged. That is, in an RBM or DBM, we could try to define a free energy per layer and ensure that the layer free energies are conserved. I don’t think anyone has tried this. The closest thing I know of is batch normalization (for supervised nets); I don’t know if they are the same.
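For concreteness, here is a minimal numpy sketch of the standard RBM free energy, the quantity one would try to track per layer. The layer-wise conservation constraint itself is my speculation, so the code only shows the basic per-configuration free energy:

```python
import numpy as np

def rbm_free_energy(v, W, b, c):
    """Standard RBM free energy:
    F(v) = -v.b - sum_j log(1 + exp(c_j + v.W_j)).

    v : (n_visible,) binary visible vector
    W : (n_visible, n_hidden) weight matrix
    b, c : visible / hidden bias vectors
    """
    pre = c + v @ W
    # log(1 + exp(x)) computed stably via logaddexp
    softplus = np.logaddexp(0.0, pre)
    return -v @ b - softplus.sum()

# sanity check: zero weights and biases give F = -n_hidden * log 2
v = np.zeros(4)
W = np.zeros((4, 3))
F = rbm_free_energy(v, W, np.zeros(4), np.zeros(3))
# F == -3 * log(2)
```

One could then compare such per-layer quantities across the layers of a DBM, though how to enforce their conservation during training is exactly the open question above.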

It is not clear that you need a strong definition of neighbours or locality. In physics this happens in disordered media, for instance, where you can have long-range interactions because of the disorder. In that situation locality does not mean much: if you build clusters from spins that are spatially close, they may not be interacting at all because of the disorder, so your decimation process is locally equivalent to decimating a free, non-interacting model (which is not interesting). It is probably much more interesting to cluster together degrees of freedom (dof) that are strongly interacting, even though they may be spatially far apart. Local clustering is natural in local field theory (FT), but it is not obvious that it is needed; in fact it is not, at least in some cases I know about.

You could also consider the Ising model on arbitrary graphs instead of a lattice, and you would still expect renormalization to work (it probably already exists in the physics literature; I know about the random planar maps case, and the random non-planar case should exist as well, but the keyword ‘random’ simplifies things a lot here).
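For a concrete, exactly solvable instance of decimation: summing out every other spin of the 1D Ising chain renormalizes the coupling as tanh(K') = tanh(K)^2, and the flow runs to the trivial K = 0 fixed point (no phase transition in 1D). A minimal sketch:

```python
import numpy as np

def decimate_1d_ising(K, steps=5):
    """Exact decimation for the 1D Ising chain: summing out every
    other spin gives the recursion tanh(K') = tanh(K)**2.
    Returns the flow of couplings, starting from K."""
    couplings = [K]
    for _ in range(steps):
        K = np.arctanh(np.tanh(K) ** 2)
        couplings.append(K)
    return couplings

flow = decimate_1d_ising(1.0)
# the couplings shrink monotonically toward the K = 0 fixed point
```

On an arbitrary graph there is of course no such closed-form recursion, which is where the clustering question below becomes nontrivial.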

Following my point above, I would say that you can create ‘natural’ clusters of neighbours by looking at the edge-weight magnitudes between layers. Say you have a layer i with 6 nodes and a layer i+1 with 3 nodes, and say that nodes 1, 2, 3 of layer i all have large weight magnitudes toward node 1 of layer i+1, while nodes 4, 5, 6 of layer i have only very small (in magnitude) weights to node 1 of layer i+1. Then I would say that nodes 1, 2, 3 are clustered together and decimated by node 1 of layer i+1 (because the magnitudes of their weights toward that same node are comparable). Of course, this means some dof will be counted more than once in the decimation process while others will simply be forgotten. But intuitively I feel this could still lead to a decent renormalization process, as long as the over/under-counting of dof is not too big; this means, I guess, that the weight matrix has to be neither too sparse nor too diffuse, so it relates to the choice of regularization of the NN. (Strangely enough, this makes me think of another picture in Wetterich-like renormalization.)
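A minimal numpy sketch of this clustering rule (the function name and the example matrix are mine, just to make the 6-to-3 example concrete): each layer-i node is assigned to the layer-(i+1) node toward which its weight magnitude is largest.

```python
import numpy as np

def cluster_by_weight_magnitude(W):
    """Assign each node of layer i to the layer-(i+1) node toward
    which it has the largest weight magnitude (a crude version of the
    'decimation cluster' idea; a hard argmax forgets all other edges).

    W : (n_i, n_ip1) weight matrix between layer i and layer i+1.
    Returns {layer-(i+1) node: list of clustered layer-i nodes}."""
    assignment = np.abs(W).argmax(axis=1)   # strongest outgoing weight
    return {j: np.flatnonzero(assignment == j).tolist()
            for j in range(W.shape[1])}

# the 6 -> 3 example from the comment: nodes 0-2 couple strongly to
# output node 0, nodes 3-4 to output node 1, node 5 to output node 2
W = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.0, 0.1],
              [0.7, 0.1, 0.1],
              [0.1, 0.9, 0.0],
              [0.0, 0.8, 0.1],
              [0.1, 0.0, 0.9]])
clusters = cluster_by_weight_magnitude(W)
print(clusters)
# {0: [0, 1, 2], 1: [3, 4], 2: [5]}
```

A hard argmax only forgets dof; allowing a node to join every cluster above some magnitude threshold would produce the over-counting discussed above.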

Dropout in this context might be seen as a way of averaging over different renormalization processes in order to enhance the probability of following a ‘universal flow trajectory’ that leads to an interesting fixed point.

But anyway, I still think this is only a very, very vague connection to renormalization, and these ideas should be investigated rigorously before claiming that NNs are just performing renormalization (that is why the ‘exact mapping’ paper title choice looks like pure marketing to me).

(This idea is also called topological trivialization by some spin-glass people, and has been put forth by a group at UCLA, although I did not know about this when I started these posts.)

Like a protein’s, the energy landscape would not just be single-peaked. That would be a very rigid structure with no chemical function; in other words, overtrained. Instead, the landscape probably has some deep local minima that provide flexibility and motion; in other words, generalization.

Of course, we need an effective ‘dimension’, or what chemists call a reaction coordinate, along which to drive the system. In protein folding, this is the entropy: F = -TS. But we also have to consider T: in Hopfield nets, the retrieval phase is a very low-T phase, whereas in deep nets we implicitly have T = 1. Then again, no effort is made to ensure T = 1; it is just assumed. To that end, I think there is a more subtle connection to spin glasses that has been overlooked entirely. It is becoming clear now that networks generalize well when they have an excess residual entropy.

For example, see LeCun’s very recent work on Entropy-SGD. I think LeCun is on the right track here, but because they ignore the concept of free energy, they also ignore temperature, and are trying to apply ideas from T = 0 spin glasses to a non-zero-T phenomenon.
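As a rough sketch of what Entropy-SGD does (my toy 1-D version of the idea in Chaudhari et al.; the loss, constants, and step sizes here are illustrative, not the paper’s): an inner SGLD chain samples around the current weights to estimate the gradient of a ‘local entropy’, and the outer step moves the weights toward the chain’s running mean, favoring wide minima over narrow ones.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # gradient of the toy nonconvex loss f(x) = x**4 - 2*x**2
    # (a stand-in for a minibatch gradient of the training loss)
    return 4 * x**3 - 4 * x

def entropy_sgd_step(x, eta=0.1, gamma=1.0, L=20, eps=0.1, sgld_lr=0.05):
    """One simplified Entropy-SGD step: the inner SGLD chain explores
    near x (the gamma term tethers it), and gamma * (x - mu) is the
    approximate gradient of the local entropy."""
    xp, mu = x, x
    for i in range(1, L + 1):
        noise = eps * np.sqrt(sgld_lr) * rng.normal()
        xp = xp - sgld_lr * (grad_f(xp) + gamma * (xp - x)) + noise
        mu = mu + (xp - mu) / (i + 1)      # running average of the chain
    return x - eta * gamma * (x - mu)

x = 2.0
for _ in range(200):
    x = entropy_sgd_step(x)
# x drifts from the steep region at 2.0 toward the minimum near +1
```

The point of the comment stands even in this sketch: gamma plays the role of an inverse-temperature-like scope parameter, and nothing in the algorithm pins down an actual T.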

I conjecture that this excess entropy in the network landscape is analogous to the excess configurational entropy of a supercooled glass. In supercooled glasses there is a phenomenon known as the Adam-Gibbs relation, which basically says that the energy barriers go to infinity as the temperature is lowered. This is called the entropy crisis, and it is seen even in simple spin-glass models like the Random Energy Model. (I discussed this in my online video at MMDS.) For a deep net, the effective temperature is like the norm of the weights; in other words, the traditional weight-norm regularizer. Rugged convexification arises as a way to avoid the entropy crisis. In other words, good generalization requires averaging over some number of T = 0 energy minima, thereby avoiding entropy collapse.
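To make the REM version of the entropy crisis concrete (using one common normalization, 2^N levels drawn with variance N/2; conventions vary, so check your favorite reference): the configurational entropy density vanishes at a finite energy density, and the slope there fixes the critical temperature.

```python
import numpy as np

# Random Energy Model: 2**N levels with E ~ Normal(0, N/2).
# Entropy density s(e) = log 2 - e**2; it hits zero at the
# ground-state energy density e_c, which is the entropy crisis.
def rem_entropy_density(e):
    return np.log(2) - e**2

e_c = -np.sqrt(np.log(2))                 # where s(e) vanishes
T_c = 1.0 / (2.0 * np.sqrt(np.log(2)))   # from 1/T = ds/de at e_c

# below T_c the measure condenses onto O(1) states: entropy collapse
```

In this picture, the conjecture above says a well-generalizing net sits above its T_c, keeping excess configurational entropy rather than freezing into a single minimum.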

I agree that there is confusion in directly comparing NN energy functions to spin-glass Hamiltonians, and that training a DNN is not the same as finding the minimum in the zero-temperature energy landscape of a glass. But even if we could talk about “effective free energies” as a more accurate description of the deep learning funneled landscapes, don’t you think that the whole success of DNNs relies on the fact that there is actually no need to find a global minimum?

You can train your DNN with different training subsets and it should/will find different local minima (even of your free energy!!). The trick is that all these local minima are, to a greater or lesser extent, representative of the problem/characteristic you are asking the DNN to learn. Their strength should lie in the redundancy of representations of such properties. That is precisely why, I think, people frequently get better statistical results when averaging over the predictions of separately trained NNs instead of extensively training a single one.
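A toy numpy illustration of that averaging effect (the “networks” here are just bootstrap-fit linear models, a stand-in I chose to keep the sketch self-contained; nothing about real DNN training is implied): each subset fit lands on a slightly different solution, and their average is typically closer to the truth than any single fit.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_on_subset(X, y, seed):
    """Stand-in for 'training on a subset': least-squares fit on a
    bootstrap resample of the data (hypothetical, for illustration)."""
    r = np.random.default_rng(seed)
    idx = r.integers(0, len(X), len(X))
    return np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

# synthetic data with known ground-truth coefficients
X = rng.normal(size=(200, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + rng.normal(scale=2.0, size=200)

ensemble = [train_on_subset(X, y, s) for s in range(10)]
w_avg = np.mean(ensemble, axis=0)

err_single = np.linalg.norm(ensemble[0] - true_w)
err_avg = np.linalg.norm(w_avg - true_w)
# err_avg is typically (not always) smaller than err_single:
# the subset-to-subset variance averages out, the shared signal stays
```

The redundancy argument above is exactly this variance-averaging story, lifted to the far more rugged landscape of a DNN.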

Generalization lives in the tortuous repetitions of valleys and hills at a (relatively high) energy level, and not in a deep minimum of whatever energy or free energy one could define.

Luckily, we did start “suffering” from local minima at some point in ML; that is where all the fun started!!!
