# When Regularization Fails

Another Holiday Blog:  Feedback and Questions are very welcome.

I have a client who is using Naive Bayes pretty successfully, and the subject has come up as to ‘why do we need fancy machine learning methods?’   A typical question I am asked is:

Why do we need Regularization ?

What happens if we just turn the regularizer off ?

Can we just set it to a very small, default value ?

After all, we can just measure what customers are doing exactly.  Why not just reshow the most popular items, ads, etc to them. Can we just run a Regression?  Or Naive Bayes ?  Or just use the raw empirical estimates?

Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?

Sure, if you are simply estimating historical frequencies,  n-gram statistics, etc, then a simple method like Naive Bayes is fine. But to predict the future..

the Reason we need Regularization is for Generalization.

After all, we don’t just want to make the same recommendations over and over.

Unless you are just in love with ad re-targeting (sic).   We want to predict on things we have never seen.

And we need to correct for presentation bias when collecting data on things we have seen.

Most importantly, we need methods that don’t break down unexpectedly.

In this post, we will see why Regularization is subtle and non-trivial even in seemingly simple linear models like the classic Tikhonov Regularization.   And something that is almost never discussed in modern classes.

We demonstrate that ‘strong’  overtraining is accompanied by a Phase Transition–and optimal solutions lies just below this.

The example is actually motivated by something I saw recently in my consulting practice–and it was not obvious at first.

Still,  please realize–we are about to discuss a pathological case hidden in the dark corners of Tikhonov Regularization–a curiosity for mathematicians.

See this Quora post on SVD in NLP for some additional motivations.

Let’s begin when Vapnik [1], Smale [2], and the other great minds of the field begin:

(Regularized) Linear Regression:

The most basic method in all statistics is linear regression

$\mathbf{X}\mathbf{w}=\mathbf{y}$.

It is attributed to Gauss, one of the greatest mathematicians ever.

One solution is to assume Gaussian noise and minimize the error.

The solution can also be constructed using the Moore-Penrose PseudoInverse.

If we multiply by  $\mathbf{X}^{\dagger}$ to the left on both sides, we obtain

$\mathbf{X}^{\dagger}\mathbf{X}\mathbf{w}=\mathbf{X}^{\dagger}\mathbf{y}$

The formal solution is

$\mathbf{\hat{w}}=(\mathbf{X}^{\dagger}\mathbf{X})^{-1}\mathbf{X}^{\dagger}\mathbf{y}$

which is only valid when we can actually invert the data covariance matrix $\mathbf{X}^{\dagger}\mathbf{X}$.

We say the problem is well-posed when $(\mathbf{X}^{\dagger}\mathbf{X})^{-1}$

1.     exists,
2.   is unique, and
3. is stable.

Otherwise we say it is singular, or ill-posed [1].

In nearly all practical methods, we would not compute the inverse directly, but use some iterative technique to solve for $\mathbf{\hat{w}}$. In nearly all practical cases, even when the matrix seems invertible, or even well posed, it may have numerical instabilities, either due to the data or the numerical solver.  Vapnik refers to these as stochastically ill posed problems [4]

Frequently XTX can not be inverted..or should not be inverted..directly.

#### Tikhonov Regularization

A  trivial solution is to simply ‘regularize’ the matrix $\mathbf{X}^{\dagger}\mathbf{X}$ by adding a small, non-zero value to the diagonal:

$(\mathbf{X}^{\dagger}\mathbf{X})^{-1}\rightarrow(\mathbf{X}^{\dagger}\mathbf{X}-\lambda\mathbf{I})^{-1}$.

Naively, this seems like it would dampen instabilities and allow for a robust numerical solution.  And it does…in most cases.

If you want to sound like you are doing some fancy math, give it a Russian name; we call this Tikhonov Regularization [4]:

$\mathbf{\hat{w}}(\lambda)=(\mathbf{X}^{\dagger}\mathbf{X}-\lambda\mathbf{I})^{-1})\mathbf{X}^{\dagger}\mathbf{y}$

This is (also) called Ridge Regression in many common packages such as scikit learn.

The parameter $\lambda$ is the regularization parameter values, usually found with Cross Validation (CV).  While a data science best practice, CV can fail !

Grid searching $\lambda$, we can obtain the infamous L-curve:

The L-curve is a log-log plot of the the norm of the regularized solution vs. the norm of the residual.  The best $\lambda$ lies somewhere along the bottom of the L, but we can’t really tell where (even with cross validation).

This challenge of regularization is absent from basic machine learning classes?

The optimal $\lambda$ is actually hard to find.   Cross Validation (CV) only works in ideal cases.  And we need a CV metric, like $R^{2}$, which also only works in ideal cases.

We might naively imagine $\lambda$ can just be very small–in effect, turning off the regularizer.  This is seen in the curve above, where the solution norm diverges.

The regularization fails in these 2 notable cases, when

1.  the model errors are correlated,  which fools simple cross validation
2.  $\lambda\rightarrow 0$ and the number of features ~ the number of training instances, which leads to a phase transition.

Today we will examine this curious, spurious behavior in case 2 (and look briefly at 1)

#### Regimes of Applications

As von Neumann said once, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk”

To understand where the regularization can fail, we need to distinguish between the cases in which Ridge Regression is commonly used.

Say we have $N_{f}$ features, and $N_{I}$ instances.  We distinguish between the High Bias and High Variance regimes.

##### High Variance: $N_{f}\gg N_{I}$

This regime is overdetermined, complicated models that are subject to overtraining.   We find multiple solutions which satisfy our constraints.

Most big data machine learning problems lie here.  We might have 100M documents and 100K words.  Or 1M images and simply all pixels as the (base) features.

Regularization lets us pick the best of solution which is the most smooth (L2 norm), or the most sparse (L1 norm), or maybe even the one with the least presentation bias (i.e. using Variance regularization to implement Counterfactural Risk Minimization

##### High Bias regime: $N_{f}\ll N_{I}$

This is the Underdetermined regime, and any solution we find, regularized or not, will most likely  generalize poorly.

When we have more instances than features, there is no solution at all (let alone a unique one)

Still, we can pick a regularizer, and the effect is similar to dimensional reduction. Tikhonov Regularization is similar truncated SVD (explained below)

Any solution we find may work, but the predictions will be strongly biased towards the training data.

In between these 2 limits is an seemingly harmless case, however, this is really a

##### Danger Zone:  $N_{f}\approx N_{I}$

This is a rare case where the number of features = the number of instances.

This does come up in practical problems, such as classifying a large number of small text phrases.  (Something I have been working on with one of my larger clients)

The number of phrases ~ the number of unique words.

This is a dangerous case … not only because it seems so simple … but because the general theory breaks down.  Fortunately, it is only pathologicial in

#### The limit of zero regularization

We examine how these different regimes behave for small values of $\lambda$

$\underset{\lambda\rightarrow 0}{\lim}\;(\mathbf{X}^{\dagger}\mathbf{X}-\lambda\mathbf{I})^{-1}\mathbf{X}^{\dagger}\mathbf{y}$

Let’s formalize and consider the how the predicted accuracy behaves.

##### The Setup:  An Analytical Formulation

We analyze our regression under Gaussian noise $\mathbf{\xi}$.  Let

$y=\mathbf{X}\mathbf{w}*+\mathbf{\xi}$

where $\mathbf{\xi}$ is a Gaussian random variable with unit mean and variance $\sigma^{2}$

$\mathbf{\xi}=N(0,\sigma^{2})$

This simple model lets us analyze predicted Accuracy analytically.

Let

$\mathbf{\hat{w}}$ be the optimal Estimate, and

$\mathbf{w^{*}}$ be the Ground Truth, and

$\mathbf{\bar{w}}=\mathbb{E}_{\xi}[\mathbf{\hat{w}}]$ be expected (mean).

We would like define, and the decompose, the Generalization Accuracy into Bias and Variance contributions.

Second, to derive the Generalization Error, we need to work out how the estimator behaves as a random variable. Frequently, in Machine Learning research, one examines the worst case scenario.  Alternatively, we use the average case.

Define the Estimation Error

$Err(\mathbf{\hat{w}})=\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{w^{*}}\Vert^{2}$

Notice that by Generalization Error, we usually we want to know how our model performs on a hold set or test point $\mathbf{x}$:

$\mathbb{E}_{\mathbf{\xi}}[(\mathbf{x^{\dagger}}(\hat{\mathbf{w}}-\mathbf{w^{*}}))^{2}]$

In the Appendix, we work out the generalization bounds for the worst case and the average case.  From here on, however, when we refer to Generalization Error, we mean the average case Estimation Error (above).

For Ridge Regression, the mean estimator is

$\mathbf{\bar{w}}=(\mathbf{X^{\dagger}X}-\lambda\mathbf{I})^{-1}\mathbf{X^{\dagger}X}\mathbf{w^{*}}$

and its variation is

$\mathbf{\hat{w}}-\mathbf{\bar{w}}=(\mathbf{X^{\dagger}X}-\lambda\mathbf{I})^{-1}\mathbf{X^{\dagger}}\mathbf{\xi}$

We can now decompose the Estimation Error into 2 terms:

$\mathbb{E}_{\xi}=\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}+\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert^{2}\;\;$,  where

$\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert$ is the Bias, and

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}$  is the Variance,

and examine each regime in the limit $\lambda\rightarrow 0$

The Bias is just the error of the average estimator:

$\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert=\lambda^{2}\Vert((\mathbf{X}^{\dagger}\mathbf{X}-\lambda\mathbf{I})\mathbf{w^{*}}\Vert^{2}$

and the Variance is the trace of the Covariance of the estimator

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\;Tr\bigg(\dfrac{\mathbf{X}^{\dagger}\mathbf{X}}{(\mathbf{X}^{\dagger}\mathbf{X}+\lambda\mathbf{I})^{2}}\bigg)$

We can examine the limiting behavior of these statistics by looking at the leading terms in a series expansion for each. Consider the Singular Value Decomposition (SVD) of $X$

$\mathbf{X}=\mathbf{USV}$.

Let $\{s_{1},s_{2},\cdots,s_{m}\}$ be the positive singular values, and $\{\mathbf{v}_{i}\}$ be the right singular vectors.

We can write the Bias as

$\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert=\sum\limits_{1}^{N_{f}}\big(\dfrac{\lambda\mathbf{v}_{i}^{\dagger}\mathbf{w}^{*}}{(s_{i}^{2}+\lambda)}\big)^{2}$

With no regularization, and the number of features exceeds the number of instances, there is a strong bias, determined by the ‘extra’ singular values

$\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert=\sum\limits_{N_{i}+1}^{N_{f}}(\mathbf{v}_{i}^{\dagger}\mathbf{w}^{*})^{2}\;\;when\;\;N_{f}\gg N_{I}$

Otherwise, the Bias vanishes.

$\lim\limits_{\lambda\rightarrow 0}\Vert\mathbf{\bar{w}}-\mathbf{w^{*}}\Vert=0\;\;when\;\;N_{f}\le N_{I}$

So the Bias appears to only matter in the underdetermined case.

(Actually, this is misleading; in the overdetermined case, we can introduce a bias when tuning the regularization parameter too high)

In contrast, the Variance

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sigma^{2}\sum\limits_{1}^{m}\dfrac{s_{i}^{2}}{(s^{2}_{i}+\lambda)^{2}}$

$\lim\limits_{\lambda\rightarrow 0}\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}=\sum\limits_{1}^{m}s_{i}^{-2}$

can diverge if the minimal singular value is small:

$\mathbb{E}_{\xi}\Vert\mathbf{\hat{w}}-\mathbf{\bar{w}}\Vert^{2}\rightarrow\infty \;\;when\;\; s_{0}\sim 0$

That is,  if the singular values decrease too fast, the variance can explode.

And this can happen in the danger zone

When $N_{f}\sim N_{i}$ the variance can be infinite!

Which also means the Central Limit Theorem (CLT) breaks down…at least how we normally use it.

Traditionally, the (classical) CLT says that the infinite sum of i.i.d. random variables $\{x_{i}\}$ converges to a Gaussian distribution–when the variance of the ${x_{i}}$ is finite.  We can, however, generalize the CLT to show that, when the variance is infinite, the sum converges to a Levy or $\alpha$-stable, power-law distribution.

These power-law distributions are quite interesting–and arise frequently in chemistry and physics during a phase transition. But first, let’s see the divergent behavior actually arise.

#### When More is Less:  Singularities

Ryota Tomioka [3] has worked some excellent examples using standard data sets

[I will work up some ipython notebook examples soon]

The Spambase dataset is a great example.  It has $N_{f}=57$ features, and $N_{i}=4601$ instances. We run a Ridge Regression, with $N_{i}\in(0,4601]$, with $\lambda=10^{-6}$.

As the data set size $N_{f}$ increases, the accuracy sharps to zero at $N_{f}=N_{f}=57$, and then increases sharply again, saturating at $latex N_{f}\sim 10^{3}$.

Many other datasets show this anomalous behavior at $N_{f}=N_{f}$, as predicted by our analysis above.

#### Ridge Regression vs Logistic Regression: Getting out of the Hole

Where Ridge Regression fails, Logistic Regression bounces back.  Below we show some examples of RR vs LR in the $N_{f}=N_{f}$ danger zone

Logistic Regression still has problems, but it no where near as pathological as RR.

The reason for this is subtle.

• In Regression, the variance goes to infinity
• In Classification, the norm $\Vert\mathbf{\hat{w}}\Vert$ goes to infinity

#### Traditional ML Theory:  Isn’t the Variance Bounded?

Traditional theory (very losely) bounds for quantities like the generalization error and (therefore) the variance.

We work out some of examples in the Appendix.

Clearly these bounds don’t work in all cases.  So what do they really tell us?  What’s the point?

These theories gives us some confidence that we can actually apply a learning algorithm–but only when the regularizer is not too small and the noise is not too large.

They essentially try to get us far away from the pathological phase transition that arises when the variance diverges.

#### Cross Validation and Correlated Errors

In practice, we would never just cross our fingers set $\lambda=0.00000001$.

We pick a hold out set and run Generalized Cross Validation (GCV) . Yet this can fail when

1.  the model errors $\xi$ are not i.i.d. but are strongly correlated
2.  we don’t know what accuracy metric to use?

Hansen [5, 7] has discussed these issue in detail…and he provides an alternative method, the L-curve, which attempt to balance the size of the regularized solution and the residuals.

For example, we can sketch the relative errors $\dfrac{\Vert w-w(\lambda)\Vert}{\Vert w\Vert_{2}}$ for the L-curve and GVC method (see 7).  Above we see that the L-curve relative errors are more Gaussian than GCV.

In other words, while most packages provide a small choice of regression metrics; as data scientists, the defaults like $R^{2}$ may not be good enough.  Even using the 2-norm may not represent the errors well. As data scientists, we may need to develop our own metrics to get a good $\lambda$ for a regression.  (or maybe just use an SVM regression).

Hansen claims that,”experiments confirm that whenever GCV finds a good regularization parameter, the corresponding solution is located at the corner of the L-curve.” [7]

This means that the optimal regularized solution lives ‘just below’ the phase transition, where the norm diverges.

#### Phase Transitions in Regularized Regression

What do we mean by a phase transition?

The simplest thing is to we plot a graph and look a steep cliff or sharp change.

Here, generalization accuracies drops off suddenly as $\dfrac{N_{f}}{N_{i}}\rightarrow 1\;\;and\;\;\alpha\rightarrow 0$

In a physical phase transition,the fluctuations (i.e. variances and norms) in the system approach infinity.

Imagine watching a pot boil

We see bubbles of all sizes, small and large.  The variation in bubble size is all over the map.  This is characteristic of a phase transition: i.e. water to gas.

When we cook, however, we frequently simmer our foods–we keep the water just below the boiling point.  This is as hot as the water can get before it changes to gas.

Likewise, it appears that in Ridge Regression, we frequently operate at a point just below the phase transition–where the solution norm explodes.  And this is quite interesting.  And, I suspect, may be important generally.

In chemical physics, we need special techniques to treat this regime, such as the Renormalization Group. Amazingly, Deep Learning / Restricted Boltzmann Machines  look very similar to the Variational Renormalization Group method.

In our next post, we will examine this idea further, by looking at the phase diagram of the traditional Hopfield Neural Networ, and the idea of Replica Symmetry Breaking in the Statistical Mechanics of Generalization. Stay tuned !

#### References

1. Vapnik and Izmailov, V -Matrix Method of Solving Statistical Inference Problems, JMLR 2015

2. Smale, On the Mathematical Foundations of Learning, 2002

3. Tomioka, Class Lectures on Ridge Regression,

# Appendix:  some math

#### Bias-Variance Decomposition

$\mathbf{y}=\mathbf{X}{w}+\mathbf{\xi}$,

when $\mathbb{E}[\mathbf{\xi}]=0$,

then $Cov(\mathbf{\xi})=\sigma^{2}\mathbf{I}$

Let

$\mathbf{\Sigma}=\dfrac{1}{N_{i}}\mathbf{X^{\dagger}X}$

be the data covariance matrix.  And , for convenience, let

$\lambda_{n}=\dfrac{\lambda}{n}$.

The Expected value of the optimal estimator, assuming Gaussian noise, is given using the Penrose Pseudo-Inverse

$\mathbb{E}\mathbf{\hat{x}}=(\mathbf{\Sigma}-\lambda_{n}\mathbf{I})^{-1}\mathbf{\Sigma}\mathbf{w^{*}}$

The Covariance is

$Cov(\mathbf{\hat{x}})=\dfrac{\sigma^{2}}{n}(\mathbf{\Sigma}-\lambda_{n}\mathbf{I})^{-1}\mathbf{\Sigma}(\mathbf{\Sigma}-\lambda_{n}\mathbf{I})^{-1}$

The regularized Covariance matrix arises so frequently that we will assign it a symbol

$\mathbf{\Sigma}_{\lambda}=(\mathbf{\Sigma}-\lambda_{n}\mathbf{I})^{-1}$

We can now write down the mean and covariance

$\mathbb{E}_{\xi}\Vert\hat{w}-\bar{w}\Vert^{2}=Tr(Cov(\mathbf{\hat{w}}))$

[more work needed here]

#### Average and Worst Case Generalization Bounds

Consider the Generalization Error for a test point $\mathbf{x}$

$\mathbb{E}_{\mathbf{\xi}}[(\mathbf{x^{\dagger}}(\hat{\mathbf{w}}-\mathbf{w^{*}}))^{2}] =\lambda^{2}_{n}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w}]^{2}+\dfrac{\sigma^{2}_{n}}{n}\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{\Sigma}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w}$

We can now obtain some very simple bounds on the Generalization Accuracy (just  like a bona-a-fide ML researcher)

##### The worst case bounds

$[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w}]^{2}\le\Vert\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}\Vert^{2}\Vert\mathbf{w^{*}}\Vert^{2}$

##### The average case bounds

$\mathbb{E}_{\mathbf{w^{*}}}[\mathbf{x^{\dagger}}\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{w}]^{2}\le\alpha^{-1}\Vert\mathbf{\Sigma}_{\lambda_{n}}^{-1}\mathbf{x}\Vert^{2}$,

where we assume the ‘covariance*’ is

$\mathbb{E}_{\mathbf{w^{*}}}[\mathbf{w^{*}w^{*\dagger}}]=\alpha^{-1}\mathbf{I}$

#### Discrete Picard Condition

We can use the SVD approach to make stronger statements about our ability to solve the inversion problem

$\mathbf{Xw}=\mathbf{y}$.

I will sketch an idea called the “Discrete Picard Condition” and provide some intuition

Write $\mathbf{X}$ as a sum over singular vectors

$\mathbf{X}=\mathbf{U\Sigma V^{\dagger}}=\sum\limits_{i=1}^{N}\mathbf{u_{i}}\sigma_{i}\mathbf{v_{i}^{\dagger}}$.

Introduce an abstract PseudoInverse,  defined by

$\mathbf{X}^{-\dagger}\mathbf{X}=\mathbf{I}$

Express the formal solution as

$\mathbf{w}=\mathbf{X}^{-\dagger}\mathbf{y}$

Imagine the SVD of the PseudoInverse is

$\mathbf{X}^{-\dagger}=\mathbf{U^{\dagger}\Sigma^{-1}V}=\sum\limits_{i=1}^{N}\mathbf{u_{i}^{\dagger}}\sigma^{-1}_{i}\mathbf{v_{i}}$.

and use it to obtain

$\mathbf{w}=\sum\limits_{i=1}^{N}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}}$.

For this expression to be meaningful, the SVD coefficients $\mathbf{u^{\dagger}_{i}b}$ must decay, on average, faster than the corresponding singular values $\sigma{i}$.

Consequently, the only coefficients that carry any information are larger than the noise level.  The small coefficients have small singular values and are dominated by noise.    As the problem becomes more difficult to solve uniquely, the singular values decay more rapidly.

If we discard the $k+1,N$ singular components,  we obtain a regularized, unsupervised solution called

##### Truncated SVD

$\mathbf{w}=\sum\limits_{i=1}^{k}\dfrac{\mathbf{u^{\dagger}_{i}y}}{\sigma_{i}}\mathbf{v_{i}}$.

which is similar in spirit to spectral clustering.

1. Charles H Martin, PhD says:

Thanks

Like

1. ys says:

wow, this post is amazing! Though I wish in practice we could just cross our fingers and set \lambda=0.00000001. Parameter tuning is so frustrating!

Like

2. Wolf Engler says:

Very thoughtful presentation. May I offer a simple insight from a Physics perspective, and with many years of work in diverse sensor and imaging problems. If solving the problem requires regularization, then your model is wrong. Serious problems often resist solution; and resistance to solution is often very stubborn. Experience shows trying Singular Value Decomposition often helps diagnose what is wrong with a problem, or a model, or particular data. Sometimes SVD works so well, I abandon regularization altogether.

Like

1. Charles H Martin, PhD says:

Thanks. This interesting case actually came up in a production system with a client when trying to apply Ridge Regression. Usually I would just use Logistic Regression or and SVM. What I have found in text classification is that the SVM regularization parameter is universal. Which is not really discussed in the ML literature.

Like

2. Charles H Martin, PhD says:

I agree…regularization seems like a hack.

Like

3. Yusuf Isik says:

Is there a typo in definitions of high variance and high bias regimes?
For high variance you are writing Nf >> NI, and for high bias Nf << NI.
I think it should be the other way around.

Like

1. Yusuf Isik says:

OK, my mistake. I think the definitions of high bias and high variance are OK. But what confuses me is your use of overdetermined and underdetermined terms. For example for high bias case (Nf<<NI), you say that this is underdetermined regime. But then you say that ‘When we have more features than instances, there is no solution at all’. I think the high bias case is overdetermined. If in Xw=y, each row is an instance and columns are for features, we have more equations than unknowns, so it should be overdetermined.

Like

2. Charles H Martin, PhD says:

there might be typo…thanks I’ll review

Like