# Rethinking–or Remembering–Generalization in Neural Networks

### Rethinking Generalization: the Google paper

Last year, ICLR 2017, a very interesting paper came out of Google claiming that

Understanding Deep Learning Requires Rethinking Generalization

Which basically asks, why doesn’t VC theory apply to Deep Learning ?

In response, Michael Mahoney, of UC Berkeley, and I, published a response that says

Rethinking generalization requires revisiting old ideas [from] statistical mechanics …

In this post, I discuss part of our argument, looking at the basic ideas of Statistical Learning Theory (SLT) and how they actually compare to what we do in practice. In a future post, I will describe the arguments from Statistical Mechanics (Stat Mech), and why they  provide a better, albeit more complicated, theory.

Our paper is a formal discussion of my Quora post on the subject

#### Show me the Code ?

Of course, it would be nice to have the original code so the results could be reproduced–no such luck.  The code is owned by Google–hence the name ‘Google paper’

And, Chiyuan Zhang, one of the authors, has kindly put together a PyTorch github repo–but it is old pyTorch and does not run  There is, however, a nice Keras package that does the trick.

### Towards a Theory of Learning

In the traditional theory of supervised machine learning (i.e. VC, PAC), we try to understand what are the sufficient conditions for an algorithm to learn.  To that end, we seek some mathematical guarantees that we can learn, even in the worst possible cases.

What is the worst possible scenario ?  That we learn patterns in random noise, of course.

#### Theoretical Analysis: A Practical Example

“In theory, theory and practice are the same.  In practice, they are different.”

The Google paper analyzes the Rademacher complexity of a neural network.  What is this and why is it relevant ?  The Rademacher complexity measures how much a model fits random noise in the data.  Let me provide a practical way of thinking about this:

Imagine someone gives you $m$ training examples, labeled data $\mathbf{X},\mathbf{y}$, and asks you to build a simple model.  How can you know the data is not corrupted ?

If you can build a simple model that is pretty accurate on a hold out set, you may think you are good.  Are you sure ?  A very simple test is to just randomize all the labels, and retrain the model.  The hold out results should be no better than random, or something is very wrong.

Our practical example is a variant of this:

1. Build a simple model.  Like a linear SVM.  Then optimize.  Let’s say we get 95% training accuracy, and 90% test (generalization) accuracy, and that this is pretty good.
2. Randomize some of the labels in the training data.  Like %10.  Of course, you may overtrain now.  So optimize again.  That means adjust the regularization.  Repeat this many times (i.e. 100x). Find the worst possible training and generalization accuracies.

By regularizing the model, we expect we can decrease the model capacity, and avoid overtraining.  So what can happen ?

• easy case  If the data is really clean, and since first model (1) is pretty good, we might expect then the worst training accuracy (2) should be about ~10% less than the first. After all, how could you as generalize well if you added noise to the data ?!   So we might find that, for (2) , a training accuracy of 85%, and generalization accuracy of 90%.
• hard case  The worst training accuracy (2) remains at 95%.   And the generalization accuracy stays around 80% or so.  You randomized 10% of the labels and nothing really changed ?!  Are you just overtraining ?   Or are the labels corrupted ?
• Remember, we picked a  very simple linear SVM. it has 1 adjustable parameter —the regularization parameter.  And maybe that did not even change much.  And if you overtrained, the training error would be zero, and the generalization error much larger.
• So chances are, the data is corrupted. and you were asked to build good model using very noisy data.  Been there, and it is no fun.

Notice in both cases, the difference in training error $e_{t}$ and generalization error $e_{g}$ stayed nearly fixed or at least bounded

$e_{g}-e_{t}\leq 5\%$

• pathological case  The Google paper claims:  “Even with dropout and weight decay, Inception V3 is still able to fit [a] random training set extremely well if not perfectly.”  But the generalization accuracy tanks. (See Table 2):

$e_{g}-e_{t}\rightarrow 100\%$

It appears that some deep nets can fit any data set, no matter what the labels ?    Even with standard regularization.  But computational learning theories, like VC and PAC theory, suggests that there should be some regularization scheme that decreases the capacity of the model, and prevent overtraining.

So how the heck do Deep Nets generalize so incredibly well ?!  This is the puzzle.

To get a handle on this, in this post, I review the salient ideas of VC/PAC like theories.

#### Generalization Bounds:  How much data do we need ?

the other day, a friend asked me, how much data do I need for my AI app ?”

My answer: “it’s complicated !”

For any learning algorithm (ALG), like an SVM, Random Forest, or Deep Net, ideally, we would like to know the number of training examples necessary to learn a good model.

ML Theorists sometimes call this number the Sample Complexity. In the Statistical Mechanics, it is called the loading parameter (or just the load) $\alpha$.

Statistical learning theory (SLT), i.e. VC theory, provides a way to get the sample complexity.  Moreover, it also tells us how bad a machine learning model might be, in the worst possible case.  It states the difference between the test and training error should be characterized by simply the VC dimension $d_{VC}$.–which measures the effective capacity of the model, and the number of training examples $m$.

$test\;error\;(e_{g})\;\le\;\;training\;error\;(e_{t})\;+\;bounding\;term(d_{VC},m)$.

##### More formally

Statistics is about how frequencies converge to probabilities.  That is, if we sample N training examples from a distribution $\rho^{N}$, we seek a gaurentee that the accuracy $\delta$ can be made arbitrarily small ($\delta$) with a very high probability $\gamma\;$

$\gamma=\mathbf{Pr}[\vert e_{g}-e_{t}\vert>\delta]$

#### Some notation:

We call our model a class of functions $\mathcal{F}=\{f\}$.   We associate the empirical generalization error $e_{g}$ with the true model error (or risk) $R^{true}(f)$, which would be obtained in the proper infinite limit.  And the training error with the empirical risk $R^{emp}(f)$.  Of course, $\;e_{g}\ge e_{t}\;$; we want to know how bad the difference could possibly be, by bounding it

$e_{g}-e_{t}=R^{true}(f)-R^{emp}(f)\le bound$

Also, in our paper, we use $\delta,\gamma$, but here we will follow the more familiar $\epsilon,\delta$ notation.

I follow the MIT OCW course on VC theory, as well as chapter 10 of Engle and Van der Broeck, Statistical Mechanics of Learning (2001).

#### On using Concentration Bounds: SLT vs Stat Mech

In most problems, machine learning or otherwise, we want to predict an outcome.  Usually a good estimate is the expected value.  We use concentration bounds to give us some gaurentee on the accuracy.  That is, how close is the expected value is to the actual outcome.

SLT , however, does not try to predict an expected or typical value.  Instead, it uses the concentration bounds to explain how regularization provides a control over the capacity of a model, thereby reducing the generalization error.  The amazing result of SLT is that capacity control emerges as a first class concept, arising even in the most abstract formulation of the bound–no data, no distributions, and not even a  specific model.

In contrast, Stat Mech seeks to predict the average, typical values for a very specific model, and for the entire learning curve.  That is, all practical values of the adjustable knobs of the model, the load $\alpha$, the Temperature, etc., in the Thermodynamic limit. Stat Mech does not provide concentration bounds or other guarantees, but it can provide both a better qualitative understanding of the learning process, and even good quantitative results for sufficiently large systems.

(In principle, the 2 methods are not totally incompatible, as one could examine the behavior of a specific bound in something like an effective Thermodynamic limit. See Engle …, chapter 10)

For now, let’s see what the bounding theorems say:

#### Common Bounds:

##### Standard Function Estimation:

If we are estimating a single function $f$, which does not depend on the data, then we just need to know the number of training examples m

$R^{true}(f)-R^{emp}(f)\le\dfrac{1}{\sqrt{m}}$

The Law of Large Numbers (Central Limit Theorem) tells us then that the difference goes to zero in the limit $m\rightarrow\infty$.

This is a direct result of Hoeffding’s inequality, which is a well known bound on the confidence of the empirical frequencies converging to a probability.

$\mathbf{Pr}[R^{true}(f)-R^{emp}(f)>\delta]\le 2e^{-2\delta^{2}m}$

##### Machine Learning, finite size $\mathcal{F}$

However, if our function choices f depend on the data, as they do in machine learning, we can not use this bound.  The problem:  we can overtrain. Formally, we might find a function $f_{m}$ that optimizes training error $R^{emp}(f_{m})$ but performs very poorly on test data $R^{true}(f_{m})$, causing the bound to diverge.  As in the pathological case above.

So we consider a Uniform bound over all functions $f\in\mathcal{F}$, meaning we consider maximum (supremum) deviation.

We also now need to know the number of adjustable parameters N.  For a finite size function class, and a general (unrealizable) $N=\vert\mathcal{F}\vert$, and we have

##### the Hoeffding + Union Bound Theorem

$\underset{f\in\mathcal{F}}{\sup}\;[R^{true}(f)-R^{emp}(f)]\le\sqrt{\dfrac{log N+log \frac{1}{\delta}}{m}}$.

The is a 2-sided version of this theorem also, but the salient result is that the new $log N$ term appears because we need all N bounds to hold simultaneously. We consider the maximum deviation, but we are bounded by the worst possible case.  And not even typical worst case scenarios, but the absolute worst–causing the bound to be very loose. Still at least the deviations are bounded–in the finite case.

##### Machine Learning, infinite size $\mathcal{F}$

What happens when the function class $\mathcal{F}$ is infinite?  As for Kernel methods and Neural Networks.  How do we treat the two limits $m,N\rightarrow\infty$ ?

In VC theory, we fix m, and define distribution-independent measure, the VC growth function $S_{\mathcal{F}}(m)$, which tells us how many ways our training data $D=\{z_{1}\cdots z_{m}\}\;\;z_{i}=(\mathbf{x}_{i},y_{i})$ can be classified (or shattered) by functions in $\mathcal{F}$

$S_{\mathcal{F}}(m)=\underset{z_{1}\cdots z_{m}}{sup}\vert\mathcal{F}_{z_{1}\cdots z_{m}}\vert\;\;$,

where $\mathcal{F}_{z_{1}\cdots z_{m}}$ is the set of all ways the data can be classified by the functions $\{f\in\mathcal{F}\}$

$\mathcal{F}_{z_{1}\cdots z_{m}}=\{f(z_{1}),\cdots,f(z_{m})|f\in\mathcal{F}\}$

Using this, we can bound even infinite size classes $\vert\mathcal{F}\vert$, with a finite growth function.

WLOG, we only consider binary classification. We can treat more general models using the Rademacher complexity instead of the VC dimension–which is why the Google paper talked so much about it.

##### Growth Function Theorem (Vapnik-Chervonenkis)

For any $\delta>0$, and for any random draw of the data, we have, with probability at least $(1-\delta)$

$R^{true}(f)\le R^{emp}(f)+\sqrt{2\dfrac{log S_{\mathcal{F}}(2m)+log\frac{4}{\delta}}{m}}\;\;\forall{f\in\mathcal{F}}$.

So the simpler the function class, the smaller the true error (Risk) should be.

In fact, for a finite $\vert\mathcal{F}\vert$, we have a tighter bound than above, because $S_{\mathcal{F}}(m)\le N$.

Note that we usually see the VC bounds in terms of the VC dimension…but VC theory provides us more than bounds; it tells us why we need regularization.

#### Regularization and VC Dimension

We define the VC Dimension $d_{VC}$, which is simply the number of training examples m, for which

$S_{\mathcal{F}}(m)=2^{m}$

The VC dim, and the growth function, measure the effective size $\vert\mathcal{F}\vert$ of the class .  By effective, we mean not just the number of functions, but the geometric projection of the class onto finite samples.

If we plot the growth function, we find it has 2 regimes:

##### Growth regimes:

$m:  exponential growth

$m\ge d_{VC}\;\;$:  polynomial growth

This leads to formal bounds on infinite size function classes (by Vapnik, Sauer, etc) based on the VC dim (as we mention in our paper)

##### Lemma: Vapnik and Chervonenkis, Sauer, Shelah

Let $\mathcal{F}$ be a class of functions with finite VC dim $d_{VC}$.  Then for all $m\in\mathbb{N}$

##### VC Bounds

If $\mathcal{F}$ has VC dim $d_{VC}$, and for all $m\ge d_{VC}$, with probability at least $1-\delta$

$R^{true}(f)\le R^{emp}(f)+2\sqrt{2\dfrac{d_{VC}\;log(\frac{2em}{d_{VC}})+log\frac{4}{\delta}}{m}}\;\;\forall f\in\mathcal{F}$

Since $S_{\mathcal{F}}(m)$ bounds the risk of not generalizing, when we have more training examples than the VC dim ($m\ge d_{VC}\;\;$), the risk grows so much slower.  So to generalize better, decrease $d_{VC}$, the effective capacity  of $\mathcal{F}$.

Regularization–reducing the model complexity–should lead to better generalization.

So why does regularization not seem to work as well for Deep Learning ? (At least, that is what the Google paper suggests!)

Note: Perfectly solvable, or realizable, problems may have a tighter bound, but we ignore this special case for the present discussion.

##### The Effective VC Dimension of a Neural Network:  Vapnik and LeCun

Before we dive in Statistical Mechanics, I first mention an old paper by Vapnik, Levin and LeCun, Measuring the VC Dimension of a Learning Machine (1994) .

It is well known that the VC bounds are so loose to be of no practical use.  However, it is possible to measure an effective VC dimension–for linear classifiers. Just measure the maximal difference in the error while increasing size of the data, and fit it to a reasonable function. In fact, this effective VC dim appears to be universal in many cases.  But…[paraphrasing the last paragraph of the conclusion]…

“The extension of this work to multilayer networks faces [many] difficulties..the existing learning algorithms can not be viewed as minimizing the empirical risk over the entire set of functions implementable by the network…[because] it is likely…the search will be confined to a subset of [these] functions…The capacity of this set can be much lower than the capacity of the whole set…[and] may change with the number of observations.  This may require a theory that considers the notion of a non-constant capacity with an ‘active’ subset of functions”

So even Vapnik himself suspected, way back in 1994, that his own theory did not directly apply to Neural Networks!

And this is confirmed in recent work looking at the empirical capacity of RNNs.

And the recent Google paper says things are even weirder.  So what can we do ?

#### Statistical Mechanics:

##### The Thermodynamic Limit

We argue that the whole idea of looking at worst-case-bounds is at odds with what we actually do in practice because we take a effectively consider a different limit than just fixing m and letting N grow (or vice versa).

Very rarely would we just add more data (m) to a Deep network.  Instead, we usually increase the size of the net (N) as well, because we know that we can capture more detailed features / information from the data. So, win practice, we increase  m and N simultaneously.

In Statistical Mechanics, we also consider the join limit $m,N\rightarrow\infty$but with the ratio m/N fixed.

The 2 ideas are not completely incompatible, however.   In fact, Engle… give  nice example of applying the VC Bounds, in a Thermodynamic Limit, to a model problem.

##### Typical behaviors and Phase Behavior

In contrast to VC/PAC theories, which seek to bound the worst case of a model, Statistical Mechanics tries to describes the typical behavior of a model exactly.  But typical does not just mean the most probable.  We require that atypical cases be made arbitrarily small — in the Thermodynamic limit.

This works because the probabilities distributions of relevant thermodynamic quantities, such as the most likely average Energy,  become sharply peaked around their maximal values.

Many results are well known in the Statistical Mechanics of Learning. The analysis is significantly more complicated  but the results lead to a much richer structure that explains many phenomena in deep learning.

In particular, it is known that many bounds from statistics become either trivial or do not apply to non-smooth probability distributions, or when the variables take on discrete values. With neural networks, non-trivial behavior arises because of discontinuities (in the activation functions), leading to phase transitions (which arise in the thermodynamic limit).

##### 3 phases of learning

For a typical neural network, can identify 3 phases of the system, controlled by the load parameter $\alpha=\dfrac{m}{N}$, the amount of training data m, relative to the number of adjustable network parameters N (and ignoring other knobs)

• Memorization:  $\alpha$ very small.
• Overtraining$\alpha$ too small
• Generalization $\alpha$  large

The view from SLT contrasts with some researchers who argue that all deep nets are simply memorizing their data, and that generalization arises simply because the training data is so large it covers nearly every possible case.  This seems very naive to me, but maybe ?

Generally speaking, memorization is akin to prototype learning, where only a single examples is needed to describe each class of data.  This arises in certain simple text classification problems, which can then be solved using Convex NMF (see my earlier blog post).

##### What is Over Training ?

In SLT, over-training is a completely different phase of the system, characterized by a kind of pathological non-convexity–an infinite number of (degenerate) local minimum, separated by infinitely high barriers. This is the so-called (mean field) Spin-Glass phase.

SLT Overtraining / Spin Glass Phase has an infinite number of minima

So why would this correspond to random labellings of the data ?  Imagine we have a binary classifier, and we randomize $L$ labels.  This gives $2^{L}$ new possible labellings:

Different Randomized Labellings of the Original Training Data

We argue in our paper, that this decreases the effective load $\alpha$.  If $L\ll N$ is very small, then $\alpha$ will not change much, and we stay in the Generalization phase.  But if $L\sim\mathcal{O}(N)$ is of order N (say 10%), then $\alpha$ may decrease enough to push us into the Overtraining phase.

Each Randomized Labelling corresponds to a different Local Minima

Moreover, we now have $2^{L}$  new, possibly unsatisfiable, classifications problems. So the solutions will be nearly degenerate.  But many of these will be difficult to learn because the many of label(s) are wrong, so the solutions could have high Energy barriers — i.e. they are difficult to find, and hard to get out of.

So we postulate by randomizing a large fraction of our labels, we push the system into the Overtraining / Spin Glass phase–and this is why traditional VC style regularization can not work–it can not bring us out of this phase. At least, that’s the theory.

#### Future Work

In my next paper / blog post, I will describe how to examine these ideas in practice. We develop a method to when phase transitions arise in the training of real world neural networks.  And I will show how we can observe what I postulated 3 years ago–the Spin Glass of Minimal Frustration, and how this changes the simple picture from the Spin Glass / Statistical Mechanics of Learning.  Stay tuned !

#### Caveats and Failures: online learning ?

It would be dishonest to present statistical mechanics as a general theory of learning that resolves all problems untreatable by VC theory.  In particular, however, is the problem with theoretical analysis of online learning.

Of course, statistical mechanics only rigorously applies to the limit $N\rightarrow\infty$. Statistics requires infinite sampling$m$, but does provides results for finite $N$. But even in the infinite case, there are issues.

While online learning (ala SGD) is certainly amenable to a stat mech approach, there is a fundamental problem with saddle points that arise in the Thermodynamic limit. (See section 9.8, and also chapter 13, Engle and Van den Broeck)

It has been argued that an online algorithm may get trapped in saddle points which result from symmetries in the architecture of the network, or the underlying problem itself.  In these cases, the saddle points may act as fixed points along an unstable manifold, causing a type of  symmetry-induced paralysis. And while any finite size system may escape such saddles, in the limit $N\rightarrow\infty$, the dynamics can get trapped in an unstable manifold, and learning fails.

This is typically addressed in the context of symmetry-induced replica symmetry breaking in multilayer perceptrons, which would need be the topic of another post.

#### Comments on the Sample Complexity

Computational Learning Theory (i.e VC theory) frames a learning algorithm or model as a infinite series of growing, distribution-independent hypothesis spaces $S_{k}$, of VC dimension $d^{k}_{VC}$, such that

$S_{1}\subset S_{2}\subset\cdots$

This lets us consider typical case behavior, when the limit converges (the so-called self-averaging case).  And we can even treat typical worst case behaviors.  And since we are not limited to the over loose bounds, the analysis matches reality much closer.

And we do this for a spin glass model of deep nets, we see much richer behavior in the learning curve. And we argue, this agrees more with what we see in practice

Let’s formalize this in the context of .

##### Function Classes

Let us call class of the hypothesis, or function classes, $\mathcal{F}=\{h\}$.   Notationally, sometimes we see $\mathcal{H}=\mathcal{F}$. Here, we distinguish between our abstract, theoretical hypothesis $h$ and the actual functions the algorithm sees $f$

A hypothesis represents, say, the set of all possible weights for neural network.  Or, more completely, the set of all weights, and learning rates, and dropout rates, and even initial weight distributions. It is a function $h(\mathbf{x})$.  This may be confusing because, conceptually, we will want hypothesis with random labels–but we don’t actually do this.  Instead, we measure the Rademacher complexity, which is a mathematical construct to let us measure all possible label randomizations for a given hypothesis $h(x)$.

We also fix the function class $\mathcal{F}$, so that learning process is, conceptually, a fixed point iteration over this fixed set:

$f_{0}\rightarrow f_{t}\rightarrow f*$

and not some infinite set.  This is critical because the No Free Lunch Theorem says that the worse-case sample complexity of an infinite size model is infinite.

##### Learning Errors:  Empirical and Optimal Risk

For a given distribution $\rho=\{\mathbf{X},\mathbf{y}\}$, define the expected risk of a hypothesis as expected Loss over the actual training data

$\mathcal{E}(h)=\mathbb{E}[Loss(h(\mathbf{x},y)]$

and the optimal risk as the maximum error (infimum) we encounter for all possible hypothesis–like in our practical case (2) above.

$\mathcal{E}^{*}(\mathcal{F})=\inf_{h\in\mathcal{F}}[\mathcal{E}(h)]$

As above, we identify the training error with the data dependent expected risk, and the generalization error as the maximum risk we could ever expect, giving:

$\vert e_{t}-e_{g}\vert\rightarrow\vert\mathcal{E}(h)-\mathcal{E}^{*}(\mathcal{F})\vert$

##### Consistency of Learning:  Statistical Convergence

Recall we want the minimum N  training examples necessary to learn.  What is necessary ?

##### Sample Complexity $N(\rho, \delta, \gamma)$

Suppose we pick a set $S_{n}=[\{\mathbf{x}_{i},y_{i}\}|i=1\cdots n]$ of n labeled pairs.  We define a hypothesis in by applying our ALG on this set: $h_{n}=ALG(S_{n})$.  And assume that we drew our set from some distribution, so $S_{n}\sim\rho^{n}$

Then, for all $\epsilon,\delta> 0$, we seek a positive integer $N(\rho, \delta, \gamma)$, such that for all $n\ge N$

$Pr_{\rho^{n}}[(\mathcal{E}(h_{n})-\mathcal{E}^{*}(\mathcal{F})\ge\delta]<\gamma$

Notice that $N(\rho, \delta, \gamma)$ explicitly depends on the  distribution, accuracy, and confidence.

##### The No Free Lunch Theorem

As in the Central Limit Theorem, we  consider the limit where the number of training examples $m\rightarrow\infty$.

But the No Free Lunch Theorem says that unless we restrict on the hypothesis/function space $\mathcal{F}$,  there always exist “bad” distributions $\rho^{n}$ for which the sample complexity $N(\rho, \delta, \gamma)$ is arbitrarily large.

Of course,  in practice, we know for that very large data sets, and very high capacity models, we need to regularize our models to avoid overtraining. At least we thought we knew this.

Moreover, by restricting the complexity of $\mathcal{F}$, we expect to produce more uniformly consistent results.  And this is what we do above.

#### Retarded Learning: or the most politically incorrect paper ever

We don’t necessarily need all the machinery of microscopic Statistical Mechanics and Spin Glasses (ala Engle…) to develop SLT-like learning bounds.  There is a classic paper, Retarded Learning: Rigorous Results from Statistical Mechanics, which shows how to get similar results using a variational approach.