Dimensional Reduction via an Effective Operator

Introduction

Frequently in science, simple models work well beyond their expected range of accuracy and application, and can be essential to provide intuition and insight. Consquently, great effort goes into seaching for simple models that encompass only a few relevant variables and yet provide high accuracy, low bias, and good generalizability.

For example, in my former life as a Quantum Chemist, we constructed simple, semi-empirical models to predict the color of small pigment molecules.  The models were simple in that they were linear operators and that they explicitly treated only a small subset of the complete number of possible variables (i.e. only the $\pi$ electrons).

http://pubs.acs.org/doi/abs/10.1021/jp960617w

These semi-empirical $\pi$ electron models embodied the best principals of modern modern machine learning techniques–they required only a small subset of data, generalized well to data never seen in the model, and effectively incorporated the biases of the problem in a general purpose formalism.  In fact, they are much, much simpler than solving the actual physical problem–the numerical solution of the many-electron quantum mechanical Schrödinger equation. The apparent success of semi-empirical quantum mechanical models sparked interest in our group at Chicago, and elsewhere, to develop a first principles, or ab initio, theory to explain why these models work so well. To the statistical machine learning (ML) community, this ab initio theory resembles a Kernel Dimensional Reduction method. Consquently, one wonders then if it is possible to develop a related, general purpose, Kernel Dimensional Reduction theory for machine learning, using some of the mathematical machinery of linear algebra and projection operators, and how this theory relates to existing formulations of ML.

In this post I will describe a new approach to Machine Learning and Feature Dimensional Reduction using what I call an Effective Regression Operator.

Linear Regression

Consider the standard linear model for a regression (ala Smale: http://www.ams.org/notices/200305/fea-smale.pdf )

$\mathbf{y}=\mathbf{x^{t}w}+\mathbf{e}$

where $x$ is a vector of ($n_{x}$) features (or explanatory variables), $y$ is a set of ($n_{y}$) response variables (or dependent variables, or even labels), $\mathbf{w}$ is a linear operator (i.e. a vector of weights), and $e$ is the error in our model. Furthermore, we assume only the dependent variables $y$ are subject to error, and the error is assumed to be come from normal random distribution with zero mean. Equivalently, we may say we are modeling the conditional expectation:

$E(\mathbf{y}|\mathbf{x})=\mathbf{x^{t}}\mathbf{w}+\mathbf{e}$.

The machine learning/regression problem is to find that operator w that best statisfies our model. Typically this problem is over-determined; one takes a number of measurements (m), leading to a set of linear equations:

$\mathbf{y_{1}}=\mathbf{x_{1}^{t}\mathbf{w}}+\mathbf{e}$

$\mathbf{y_{2}}=\mathbf{x_{2}^{t}\mathbf{w}}+\mathbf{e}$

$\mathbf{y_{m}}=\mathbf{x_{m}^{t}w}+\mathbf{e}$

We may group these m equations together by grouping the measurement vectors ($x_{i}$) and the response vectors ($y_{i}$) into matrices, yielding the single linear equation

$\mathbf{y}=\mathbf{Xw}+\mathbf{e}$

where y and e $(1\times m)$ matrices (or vectors), w is an $(1\times n_{x})$ matrix , and $\mathbf{X}$ is an $(n_{x}\times m)$ matrix.

The goal is to find the optimal linear operator w which satisfies this equation.

Normal Equations

To solve the problem, let us first rewrite the linear regression using the  normal equations :

$\mathbf{X}^{t}\mathbf{y}=\mathbf{X}^{t}\mathbf{Xw}+\mathbf{e}$  (1)

The common solution  (again, assuming normal errors) is the ordinary least squares (OLS) solution, found by constructing the Moore-Penrose Pseudoinverse of $\mathbf{X}$:

$\mathbf{w}=(\mathbf{X}^{t}\mathbf{X})^{-1}\mathbf{X}^{t}\mathbf{y}$

Regularization

Ridge Regression

Frequently the matrix $X^{t}X$ is ill conditioned, so we may add a constant term (an adjustable parameter $\lambda$) to the diagonal to regularize the solution, yielding:

$\mathbf{w}=(\mathbf{X}^{t}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{t}\mathbf{y}$

This is also known as Ridge regression. The regularization term $\lambda$ may be chosen using cross-validation. Equivalently (and more generally) the solution may be expressesed as a convex optimization problem:

$\mathbf{w}=min(\Vert\mathbf{X}\mathbf{w}-\mathbf{y}\Vert^{2}+\lambda\Vert\mathbf{w}\Vert^{2})$

Notice that, in general, the final model for $\mathbf{y}$ should include terms from the null space of $\mathbf{X}^{t}\mathbf{X}$

Tikhonov Regularization

We may regularize the ill-conditioned inverse operator $\mathbf{X}^{t}\mathbf{X}$ by adding an entire square matrix, as

$\mathbf{w}=(\mathbf{X}^{t}\mathbf{X}+\mathbf{\Gamma^{t}\Gamma})^{-1}\mathbf{X}^{t}\mathbf{y}$

This is called Tikhonov Regularization. We call $\mathbf{\Gamma^{t}\Gamma}$ the Tikhonov matrix, or, equivalently,  $\mathbf{\Gamma}$ the Greens Function or Regularization Operator for the problem. This leads to the standard convex optimization problem seen in much of modern machine learning

$\mathbf{w}=min(\Vert\mathbf{X}\mathbf{w}-\mathbf{y}\Vert^{2}+\Vert\Gamma\mathbf{w}\Vert^{2})$

We shall see that our Effective Operator takes the form of a  generalized Tikhonov Regularization.

Feature Partitioning: Visible and Hidden Variables

The fundamental assumption driving our approach is to explicitly partition the features into visible (P) and hidden (Q) features in order to explicitly reduce the dimensions of the problem

Let us partition our feature vector x into two subspaces (denoted P and Q), where the P subspace (the Primary space) contains features $\{x_{P}\epsilon P\}$ that are strongly correlated with the response variables, or labels,  y, and the other subspace Q (the Secondary space) contains those remaining features $\{x_{Q}\epsilon Q\}$ which are thereby weakly correlated to the labels y. That is, we choose the components $\{x_{P}\}$ such that $E(x_{P},y)=x_{P}^{t}y$ is large, and $\{x_{Q}\}$ such that $E(x_{Q},y)=x_{Q}^{t}y$ is small.

We associate P with the visible features and Q with the hidden features

For each subspace P and Q, we identify the associated projection operators $P$ and $Q$, so that

$P+Q=I$
$\mathbf{x}=P\mathbf{x}+Q\mathbf{x}$

As usual, the projections operators are idempotent:

$P^{t}P=I$   Visible Space

$Q^{t}Q=I$   Hidden Space

We denote the P and Q parts of the vectors using subscripts. For example, we write the primary features as $x_{P}=Px$ and the secondary features as $x_{Q}=Qx$. Likewise, if $M$ is a matrix, we denote the action of projection operators with (upper case) subscripts, such as $M=PM$ and $M_{PP}=PMP$. For consistency, we will never act on the left with the operators, thereby avoiding writing $xP=x_{P}$ or $PMP=M_{P}$.

We may now re-write above the normal equations (1) . First, for convenience, define

$\mathbb{X}=X^{t}X$

We may express this in block form as:

$\mathbb{X}=\begin{array}{cc} P\mathbb{X}P & P\mathbb{X}Q\\ Q\mathbb{X}P & Q\mathbb{X}Q \end{array}=\begin{array}{cc} \mathbb{X_{\mathrm{PP}}} & \mathbb{X_{\mathrm{PQ}}}\\ \mathbb{X_{\mathrm{QP}}} & \mathbb{X_{\mathrm{QQ}}} \end{array}$

Now re-write this in block form (ignoring the error vector e) as

$[\begin{array}{cc} P\mathbb{X}P & P\mathbb{X}Q\\ Q\mathbb{X}P & Q\mathbb{X}Q \end{array}]\begin{array}{c} Pw\\ Qw \end{array}= \begin{array}{c} (PX)^{t}y\\ (QX)^{t}y \end{array}$

This leads to two seperate linear equations (using subscripts from now on)

$\mathbb{X_{\mathrm{PP}}}w_{P} +\mathbb{X}_{PQ}w_{Q}= X_{P}^{t}y$   Visible Space (2)

$\mathbb{X}_{QP}w_{P}+\mathbb{X}_{QQ}w_{Q}=X_{Q}^{t}y$   Hidden Space (3)

Effective Regression Operator: $\mathbb{L^{\mathrm{eff}}}$:

As a approximation, let us assume the hidden, Q-space variables are not coupled at all with the labels $y$.  For example, if we don’t know the labels.  Then $X_{Q}^{t}y=0$ , yielding

$\mathbb{X}_{QP}w_{P}+\mathbb{X_{\mathrm{QQ}}}w_{Q}=0$

This lets us write a simple expression for the hidden variable weight vector $w_{Q}$:

$w_{Q}=-(\mathbb{X}_{QQ})^{-1}\mathbb{X_{\mathrm{QP}}}w_{P}$

Substituting  for $w_{Q}$ in equation  (2) above leads to a single, non-linear equation in the Visible P-space:

$(\mathbb{X}_{PP}-\mathbb{X}_{PQ}(\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP})w_{P}=X_{P}^{t}y$

We now define the Effective Regression Operator $\mathbb{L^{\mathrm{eff}}}$:

$\mathbb{L^{\mathrm{eff}}}w_{P}=X_{P}^{t}y$  (4a)

where

$\mathbb{L^{\mathrm{eff}}}=\mathbb{X_{\mathrm{PP}}}-\mathbb{X}_{PQ}(\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP}$ (4b)

The Effective Operator $\mathbb{L^{\mathrm{eff}}}$ acts only on the  Visible P space, and yet generates the exact solution to the orginal regression problem, subject to the Hidden-space approximation $X_{Q}^{t}y=0$. (We may relax this approximation, shown below).

Generalizing Tikhonov Regularization

The Effective Regression Operator $\mathbb{L^{\mathrm{eff}}}$ looks like a generalized Tikhonov regression, where the original regression is defined just over the P-space:

$w_{p}=(\mathbb{X}_{PP}+\Gamma^{t}\Gamma)^{-1}X_{P}^{t}y$

and where the Tikhonov matrix $\Gamma^{t}\Gamma$ is given by

$\Gamma^{t}\Gamma=\mathbb{X}_{PQ}(\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP}$

We may also write an formal expression for the Green’s Function $\Gamma$.  Let us transform $X$ to a basis such that $\mathbb{X}_{QQ}=\Sigma_{Q}$ is diagonal. We have

$\Gamma^{t}\Gamma=\mathbb{X}_{PQ}(\Sigma_{Q}^{-1})\mathbb{X}_{QP}$

$\Gamma^{t}\Gamma=\mathbb{X}_{PQ}(\Sigma_{Q})^{-\frac{1}{2}}(\Sigma_{Q})^{-\frac{1}{2}}\mathbb{X}_{QP}$

yielding

$\Gamma=(\Sigma_{Q})^{-\frac{1}{2}}\mathbb{X}_{QP}$

We now see a simple interpretation of the Effective Operator  $\mathbb{L^{\mathrm{eff}}}$ : it is a variance weighted coupling of the Visible P-space features to the Hidden Q-space features.

In a later post we will also examine some real world examples and also see the  Fock Space / Second Quantization formulation.

Appendix

Series Expansion Approximation

The term $\Gamma^{t}\Gamma$ may be hard to compute if the Q-space is large and/or $\mathbb{X}_{QQ}$ is hard to invert or perhaps ill-conditioned. Of course, we may add a regularization constant $\lambda$ to the diagonal so that $(\mathbb{X}_{QQ}+\lambda I_{Q})^{-1}$ is well behaved.

It may not be enough to simply regularize the problem. Additionally, we will most likely need to regularize $(\mathbb{L}^{\mathrm{eff}})^{-1}$ as well, so having two adjustable parameters may prove unwieldy.

We might hope to avoid a difficult matrix inversion by approximating $(\mathbb{X}_{QQ})^{-1}$ using the matrix series expansion:

$(I-Y)^{-1}=I+Y+Y^{2}+Y^{3}+$

For example, let us split $\mathbb{X}_{QQ}$ into a diagonal $\mathbb{D}$ and non-diagonal $\mathbb{N}$ matrices, such that:

$\mathbb{X}_{QQ}=(\mathbb{D_{\mathrm{QQ}}}-\mathbb{N}_{QQ})$

and

$(\mathbb{X}_{QQ})^{-1}=(\mathbb{D}_{QQ}-\mathbb{N}_{QQ}){}^{-1}$

$=D_{QQ}(I_{QQ}-(\mathbb{N}_{QQ})/(\mathbb{D}_{QQ}))^{-1}$

so that

$(\mathbb{X}_{QQ})^{-1}=\mathbb{D}_{QQ}(I_{QQ}+\mathbb{N}_{QQ}/(\mathbb{D}_{QQ})^{-1}+$

leading to simple expression to first order, with no matrix inverson necessary:

$(\mathbb{X}_{QQ})^{-1}\backsimeq\mathbb{D_{\mathrm{QQ}}}+\mathbb{N}_{QQ}+$

Exact Effective Linear Regression: the Unbiased Solution

Let us now consider the general case of effective regression, where $X_{Q}^{t}y\neq0$. Here, we derive an effective operator $\mathbb{L}^{\mathrm{eff}}$ that acts on the P-space and gives the exact solution for the P-space weights $w_{P}$

Within this formalism, one may then effectively de-noise the solution by creating approximations to $(\mathbb{X}_{QQ})^{-1}$ and/or other parts of $\mathbb{L}^{\mathrm{eff}}$.

Going back to equation (3)

$\mathbb{X}_{QP}w_{P}+\mathbb{X}_{QQ}w_{Q}=X_{Q}^{t}y$

we write an exact expression for $w_{Q}$:

$w_{Q}=(\mathbb{X}_{QQ})^{-1}(X_{Q}^{t}y-\mathbb{X_{\mathrm{QP}}}w_{P})$

Plugging this expression for $w_{Q}$ back into equation (2) this leads to a new regression problem

$\mathbb{X_{\mathrm{PP}}}w_{P}-\mathbb{X}_{PQ}(\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP}w_{P}+\mathbb{X}_{PQ}(\mathbb{X}_{QQ})^{-1}X_{Q}^{t}y=X_{P}^{t}y$

or, collecting terms, we get

$(\mathbb{X_{\mathrm{PP}}}-\mathbb{X}_{PQ}(\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP})w_{P}=X_{P}^{t}y-\mathbb{X}_{PQ}(\mathbb{X}_{QQ})^{-1}X_{Q}^{t}y$

This is significantly harder to solve because the effective equations are highly non-linear and the inverse operator $(\mathbb{X}_{QQ})^{-1}$ appears on both sides of the equation.  I’ll leave the formal and numerical solutions and real world examples for a future post. [ We will also see how this relates to the Effective Operators that arise in Quantum Chemistry and Quantum Nuclear Physics. ]

Note:  I developed most of these ideas in 2001.  Subsequently, I have learned of some related approaches in the field of Gaussian Processes and Bayesian Regression that use the techniques of Hilbert Spaces and Projection Operators, such as  Huaiyu Zhu,  Christopher K. I. Williams,  Richard Rohwer & Michal Morciniec,  Gaussian Regression and Optimal Finite Dimensional Linear Models

One day I will sit down and try to determine the detailed relationship between these works.

1. Derek says:

Interesting. How do you know that the strategy of partitioning the feature vector into hidden and visible pieces will work? It seems a bit ad hoc. On that note, can you offer any motivation for the tinkanov regularization? I know this is standard practice, but why does it work? Perhaps we can discuss more in detail sometime.

Like

1. charlesmartin14 says:

There are two basic conditions for the method: 1, that the Q-space features are only weakly correlated with the labels, compared to the P-space and 2. that the Q-space coupling operator (Q*Q) be invertible. Note that even when Q*Q is ill conditioned, it can be regularized. Many problems are dominated by a single set of features, and adding features only incrementally improves the solution. Although it is unclear to me if the simple solution can be used (4a, b), or one needs the full blown highly non-linear one. Notice that (4a,b) can be cast solved a convex optimization. The real problem with regularization is that unless the final problem is convex, it is hard to find a stable solution

On Tikhonhov Regularization: If the problem is ill-conditioned, then there may be more than 1 solution. The Tikhonhov Regularizer simply constrains the problem to select a single solution that is sparse. Other Regularizers, such as a Radial Basis Function (RBF) Kernel, will select the solution that is most smooth (sparse in ‘derivative space’). (See my earlier post)

We can discuss in detail and I am happy to blog more. Gives me something to do.

Like

1. charlesmartin14 says:

I thought of another important example–when you only have a subset of the labels. Indeed, this was the original motivation for this method, which I invented way back in 2001. I will provide an example of this

Like

2. charlesmartin14 says:
3. Pingback: Quora
4. Pingback: Quora