Eigenvalue-Independent Effective (Semantic) Operator

This post is still just a sketch of ideas…not ready for consumption

In my last post I introduced the Eigenvalue-Dependent Effective (Semantic) Operator as a Semantic Similarity metric that couples visible (P) and hidden (Q) features:

$\mathbb{X}^{eff}(\lambda)= \mathbb{X}_{PP}+\mathbb{X}_{PQ}(\lambda_{x}-\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP}$

where $\mathbb{X}$ is the complete feature-feature coupling matrix (such as a giant TFIDF matrix for all the terms, tags, and other features in a large document collection) and $\lambda$ is the dominant eigenvalue of $\mathbb{X}$.

$\mathbb{X}^{eff}(\lambda)$ is useful for dimensional reduction when the hidden dimensions are actually useful and not just noise.
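To make this concrete before removing the $\lambda$ dependence, here is a minimal numpy sketch on a hypothetical toy problem: a small random symmetric $\mathbb{X}$, with the first few rows/columns playing the role of the visible (P) features. All names and sizes are my own assumptions. It checks that the dominant eigenvalue of the full matrix is also an eigenvalue of the much smaller eigenvalue-dependent operator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small symmetric feature-feature matrix X, with the first n_p
# rows/columns as visible (P) features and the rest hidden (Q)
n_p, n_q = 3, 5
A = rng.standard_normal((n_p + n_q, n_p + n_q))
X = (A + A.T) / 2
X_pp, X_pq = X[:n_p, :n_p], X[:n_p, n_p:]
X_qp, X_qq = X[n_p:, :n_p], X[n_p:, n_p:]

lam = np.max(np.linalg.eigvalsh(X))  # dominant eigenvalue of the full X

# Eigenvalue-dependent effective operator:
#   X_eff(lam) = X_PP + X_PQ (lam I - X_QQ)^{-1} X_QP
X_eff = X_pp + X_pq @ np.linalg.solve(lam * np.eye(n_q) - X_qq, X_qp)

# Sanity check: lam is also an eigenvalue of the small X_eff(lam)
print(np.min(np.abs(np.linalg.eigvalsh(X_eff) - lam)))  # ~ 0
```

Note that $\lambda$ must be known in advance here, which is exactly the difficulty the rest of this post removes.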

It is necessary, however, to find the dominant eigenvalue of $\mathbb{X}$.  In some cases, it may be quite hard to find this eigenvalue.  In other cases, we may want to de-noise or regularize the hidden variables.  Or perhaps there is clearly more than one dominant eigenvalue?  What can we do?

We now introduce the Eigenvalue-Independent Effective (Semantic) Operator, using techniques from quantum mechanical many body perturbation theory (MBPT), which resolves all of these issues.

The Correlation Operator

Consider again the eigenvalue equation for the feature correlation matrix $\mathbb{X}$

$\left[\begin{array}{cc} P\mathbb{X}P & P\mathbb{X}Q\\ Q\mathbb{X}P & Q\mathbb{X}Q \end{array}\right]\left[\begin{array}{c} Px\\ Qx \end{array}\right]= \lambda\left[\begin{array}{c} Px\\ Qx \end{array}\right]$

We now re-write this using subscripts, and identify the expressions for the P and Q spaces

$\mathbb{X}_{PP}x_{P} +\mathbb{X}_{PQ}x_{Q}= \lambda_{x}x_{P}$   Visible Space (2)

$\mathbb{X}_{QP}x_{P}+\mathbb{X}_{QQ}x_{Q}=\lambda_{x}x_{Q}$   Hidden Space (3)

Recall that earlier we formed an explicit expression for $x_{Q}(\lambda)$ from equation (3), so that

$x_{Q}(\lambda) = (\lambda_{x}-\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP}x_{P}$

Let us define a new operator, $\mathcal{F}(P,Q)$, called the Correlation Operator, that relates our hidden features to our visible features:

$x_{Q}=\mathcal{F}(P,Q)x_{P}$

Our goal can be restated as finding a suitable Correlation Operator $\mathcal{F}(P,Q)$ that does not depend on $\lambda$ and yet generates the Effective Operator we seek.  What does this mean?  Let's plug $\mathcal{F}(P,Q)$ back into equations (2) and (3) and find a (non-linear) relationship between the equations, subject to the constraints that the Effective Operator be Hermitian and span the entire P-space (i.e. include all tags).  [Below we refer to this as a Complete Model Space of features.]

$\mathbb{X}_{PP}x_{P} +\mathbb{X}_{PQ}\mathcal{F}(P,Q)x_{P}= \lambda_{x}x_{P}$   Visible Space (4)

$\mathbb{X}_{QP}x_{P}+\mathbb{X}_{QQ}\mathcal{F}(P,Q)x_{P}=\lambda_{x}\mathcal{F}(P,Q)x_{P}$   Hidden Space (5)

Using the definition of $\mathcal{F}(P,Q)$, multiply equation (4) on the left by $\mathcal{F}(P,Q)$ to obtain

$\mathcal{F}(P,Q)\mathbb{X}_{PP}x_{P} +\mathcal{F}(P,Q)\mathbb{X}_{PQ}\mathcal{F}(P,Q)x_{P}= \lambda_{x}\mathcal{F}(P,Q)x_{P}$   Visible Space (6)

Subtracting equation (6) from equation (5) combines them into a single equation

$\mathbb{X}_{QP}x_{P}+\mathbb{X}_{QQ}\mathcal{F}(P,Q)x_{P}-\mathcal{F}(P,Q)\mathbb{X}_{PP}x_{P} -\mathcal{F}(P,Q)\mathbb{X}_{PQ}\mathcal{F}(P,Q)x_{P} =0$ (7)

Our goal is then to find an $\mathcal{F}(P,Q)$ that satisfies (7) for every eigenvector in the primary space P simultaneously.  There are many ways to do this in the chemical physics literature; here I will explain the ones I believe may have applications in Machine Learning.
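As a numerical sanity check on equation (7), the following hypothetical numpy sketch builds the exact Correlation Operator from the top $N_P$ eigenvectors of a small random symmetric $\mathbb{X}$ (a construction for illustration only, since in practice we want to avoid full diagonalization), then verifies that it satisfies (7) and that the resulting eigenvalue-independent Effective Operator reproduces the top eigenvalues from the small P space alone. The toy matrix and names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 3, 5
A = rng.standard_normal((n_p + n_q, n_p + n_q))
X = (A + A.T) / 2
X_pp, X_pq = X[:n_p, :n_p], X[:n_p, n_p:]
X_qp, X_qq = X[n_p:, :n_p], X[n_p:, n_p:]

# Take the n_p largest eigenpairs of the full X ...
evals, evecs = np.linalg.eigh(X)        # ascending order
top = evecs[:, -n_p:]                   # columns are eigenvectors
Yp, Yq = top[:n_p, :], top[n_p:, :]

# ... and build the Correlation Operator F, with x_Q = F x_P
F = Yq @ np.linalg.inv(Yp)

# F satisfies the nonlinear equation (7)
residual = X_qp + X_qq @ F - F @ X_pp - F @ X_pq @ F
print(np.max(np.abs(residual)))         # ~ 0

# The lambda-independent effective operator reproduces the
# top n_p eigenvalues of X on the small P space alone
X_eff = X_pp + X_pq @ F
print(np.sort(np.linalg.eigvals(X_eff).real))
print(evals[-n_p:])
```

Note that this `X_eff` is generally not symmetric, which is why the symmetrization step below is needed.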

In particular, we do not seek an exact Effective Operator since, usually in Machine Learning, our input data is noisy and may only have a few significant figures.  Indeed, we expect that an approximate Effective Operator will perform better than the exact solution.  Also, we may want to add a regularizer, and we will need a formalism that permits this.

Our Effective Operator will take the form

$\mathbb{X}^{eff}=\mathbb{X}_{PP}+\mathbb{X}_{PQ}\mathcal{F}(P,Q)$

but this operator may not be Hermitian, a requirement for a Machine Learning Kernel.  We can, however, in certain cases, form the associated Hermitian effective operator by taking the (averaged) sum of $\mathbb{X}^{eff}$ and its Hermitian conjugate, a process we call Symmetrizing the Operator

$\mathbb{X}^{eff}_{sym}=\frac{1}{2}\{\mathbb{X}^{eff}+(\mathbb{X}^{eff})^{*}\} =\mathbb{X}_{PP}+\frac{1}{2}\{\mathbb{X}_{PQ}\mathcal{F}(P,Q)+\mathcal{F}^{*}(P,Q)\mathbb{X}_{QP}\}$
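For real-valued data the Hermitian conjugate reduces to a transpose, so Symmetrizing the Operator is essentially a one-liner. A minimal sketch, where the function name and toy inputs are my own hypothetical choices:

```python
import numpy as np

def symmetrize_effective(X_pp, X_pq, F):
    """Form the symmetrized effective operator
         X_sym = X_PP + (X_PQ F + F^T X_PQ^T) / 2
    from a possibly non-symmetric X_eff = X_PP + X_PQ F
    (real-valued case, so conjugation is just transposition)."""
    X_eff = X_pp + X_pq @ F
    return 0.5 * (X_eff + X_eff.T)

# Hypothetical toy blocks (X_pp symmetric, as a block of a symmetric X)
rng = np.random.default_rng(0)
X_pp = rng.standard_normal((3, 3)); X_pp = (X_pp + X_pp.T) / 2
X_pq = rng.standard_normal((3, 5))
F = rng.standard_normal((5, 3))

S = symmetrize_effective(X_pp, X_pq, F)
print(np.allclose(S, S.T))   # True: a valid symmetric kernel
```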

Finally, we are not really trying to model an eigenvalue problem but rather the (semantic) feature-feature similarities.  Consequently, we will need a formalism that lets us treat features explicitly, and does not just model the eigenvalues and eigenvectors of $\mathbb{X}$.  In the language of Effective Operators, this means the Primary Space must be a Complete Model Space.

So how do we get there?  While the final result is quite simple, the road is long and technical.

Notation:  Primitive Features and Hilbert Spaces

Lower case letters (p) refer to primitive features like tags $\tau$, words, etc.

We refer to the space of features $\mathcal{F}$ as the complete set of all features

$\mathcal{F}=\{f|f\in\mathcal{F}\}$

In physics and chemistry, $\mathcal{F}$ is the Fock Space.

The spaces $\mathcal{P},\mathcal{Q}$ are Hilbert spaces, spanned by documents, instances, and general vectors $x$, $\Psi$, etc.  These vectors are linear combinations of primitive features $x=\sum w_{p}|p\rangle+\sum w_{q}|q\rangle$

Identifying Primary Features

Let us consider in more detail how we define our feature spaces $P,Q$ and the feature correlation matrix $\mathbb{X}$.  How do we know we can partition our features into $P,Q$?  We can do this when the P-space features $|p\rangle$, such as the tags $\tau$, dominate the principal eigenvectors of $\mathbb{X}$.  That is, for the dominant eigenvector

$x_{0}=\sum w_{p}|p\rangle+\sum w_{q}|q\rangle$

the p-space weights $w_{p}$ are much larger than the q-space weights $w_{q}$

$w_{p}\gg w_{q}$

More importantly, let's consider how the eigenvalues of $\mathbb{X}$ are distributed:

Here I sketch the possible eigenvalues of the feature density matrix $\mathbb{X}$; you can also think of these as the singular values arising from an SVD/LSA-like calculation.  First, note that we expect the largest eigenvalues to correspond to eigenvectors that are dominated by the P-space features (p), and we refer to these as the P-space eigenvalues.  Ideally, we might hope to identify a complete set of high-lying eigenvectors that are totally dominated by the P-space features.  That is, if we were to run SVD/LSA, we would expect the leading clusters / singular vectors to be dominated by P-space features.  Typically, however, we may observe that only the few largest eigenvalues are dominated by P-space features (and the smallest by Q-space features), whereas the mid-range eigenvectors are harder to classify.  Hence the need for a simple model.
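A hedged numpy sketch of this situation: a toy $\mathbb{X}$ with a few strong P-space features plus many weakly correlated noise features, where we measure how much of each top eigenvector's weight lies on the P-space features. The construction is entirely my own; real TFIDF matrices will of course differ:

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 3, 40

# Toy X: a few strong P-space features (e.g. tags) plus many
# weakly correlated Q-space features (noise)
B = rng.standard_normal((n_p + n_q, n_p + n_q))
X = 0.05 * (B + B.T)
X[:n_p, :n_p] += 5.0 * np.eye(n_p)

evals, evecs = np.linalg.eigh(X)   # ascending order

# Total weight of each eigenvector on the P-space features
p_weight = np.sum(evecs[:n_p, :] ** 2, axis=0)

# Walk down from the largest eigenvalue: the top n_p eigenvectors
# are P-dominated, the rest are mostly Q-space noise
for lam, w in zip(evals[::-1][:5], p_weight[::-1][:5]):
    print(f"lambda = {lam:6.2f}   P-space weight = {w:.2f}")
```

In this idealized toy the separation is clean; the interesting (and harder) real-world case is when the mid-range eigenvectors mix both spaces.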

It is precisely this distinction among features — the ability to identify a dominant class of features — that allows us to apply the Effective Operator technique.  In some cases, this will be self-evident. In others, it might seem fictitious since we decided beforehand how to construct the feature density matrix $\mathbb{X}$.  I will address various examples in a future post; for now, I would like to get to the basic mathematical result.

The Complete Model / Feature Space

Let’s consider how we might identify a model space $\mathcal{P}_{M}$ of primary features (like tags).  Since we expect the most important P-space eigenvectors $x_{p}$ to be dominated by p-space features, we have

$|x_{p}\rangle =\sum w_{p}|p\rangle + \cdots$

We do not know yet what the P-space actually is.  We might try to pick it using an SVD/LSA approach: simply take the $N$ largest eigenvectors of $\mathbb{X}$.  But we could not use this to compute a Semantic similarity $\langle \tau'|\mathbb{X}^{eff}|\tau \rangle$.  We must be able to express $\mathbb{X}^{eff}$ in a basis of the original P-space features:

$\mathbb{X}^{eff}=\sum_{p,p'}|p\rangle\mathbb{X}_{p,p'}^{eff}\langle p'|$

Since we don’t know it, we will guess.  We model the true P-space eigenfunctions by the space of functions $\mathcal{P}_{M}$ that is isomorphic to the P-space sub-block of the feature density matrix, $P^{t}\mathbb{X}P$, such that

$\mathcal{P}_{M}\sim P^{t}\mathbb{X}P$

We write the model primary and secondary space projection operators as

$P_{M}=\sum_{p}|p\rangle\langle p|$  and

$Q_{M}=\sum_{q}|q\rangle\langle q|$

We call the model space $\mathcal{P}_{M}$ a Complete Model Space because it is spanned only by the p-space features.  Notice that this space (while isomorphic to it) is quite different from the P-space we might obtain by taking the first $N_{P}$ eigenfunctions of $\mathbb{X}$, because these latter functions have non-zero Q-space feature weights.

Indeed, we are, in the most literal sense, constructing a Model for our problem, and seeking to describe the true situation with a highly accurate, de-noised, interpretable and explainable model.

The Wave Operator and the Generalized Bloch Equation

How do we relate our Model Space back to actual data clusters?  We have measured the correlations between our features and formed the matrix $\mathbb{X}$.  We wish to model the major clusters of our data, which, in our approach, is the same as modeling the top $N_{P}$ target eigenfunctions of $\mathbb{X}$

$\mathbb{X}|\Psi^{\alpha}\rangle=\lambda^{\alpha}|\Psi^{\alpha}\rangle$  where $\alpha=1...N_{P}$

We assume there exists an operator $\Omega$, called the Wave Operator, that relates our model functions $|\Psi_{M}^{\alpha}\rangle$ to our original target eigenfunctions, as

$|\Psi^{\alpha}\rangle=\Omega|\Psi_{M}^{\alpha}\rangle$  for all $\alpha=1...N_{P}$

Moreover, we seek to model the semantic similarity between the model features

We therefore need a way to relate the Wave Operator to our  Effective (Semantic Similarity) Operator; we will achieve this via the Generalized Bloch Equation

We define the Effective Operator as an operator that acts on the model space functions and generates the associated eigenvalues of the feature-feature correlation matrix $\mathbb{X}$

$\mathbb{X}^{eff}|\Psi_{M}^{\alpha}\rangle =\lambda^{\alpha}|\Psi_{M}^{\alpha}\rangle$ for all $\alpha=1...N_{P}$

If we multiply both sides of this by $\Omega$, we obtain

$\Omega \mathbb{X}^{eff}|\Psi_{M}^{\alpha}\rangle =\lambda^{\alpha}|\Psi^{\alpha}\rangle$

We can also insert $\Omega$ into our original eigenvalue problem, to give

$\mathbb{X}\Omega|\Psi_{M}^{\alpha}\rangle =\lambda^{\alpha}|\Psi^{\alpha}\rangle$

We can combine these equations as

$\Omega\mathbb{X}^{eff}|\Psi_{M}^{\alpha}\rangle =\mathbb{X}\Omega|\Psi_{M}^{\alpha}\rangle$ for all $\alpha=1...N_{P}$

Since this expression holds for all model space functions (for all $\alpha$), this leads to

The Generalized Bloch Equation

$\Omega\mathbb{X}^{eff} P =\mathbb{X}\Omega P$

But we need one more constraint or relation to obtain the complete form of our Effective Operator.  And we have it.  We apply what is called

Intermediate Normalization

We require that our model functions and our actual functions overlap, and we set this overlap to one

$\langle\Psi_{M}^{\alpha}|\Psi^{\alpha}\rangle=1$  for all $\alpha=1...N_{P}$

In operator form, this is equivalent to requiring $P\Omega P=P$.

If we multiply the Generalized Bloch Equation on the left by $P$ and use intermediate normalization in its operator form, $P\Omega P=P$, we obtain our final expression:

$\mathbb{X}^{eff}=P\mathbb{X}\Omega P$
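As a quick numerical check, here is a hypothetical numpy sketch in which the wave operator is built exactly from the top eigenvectors of a small random symmetric $\mathbb{X}$ (again for illustration only). In block form, $\Omega$ restricted to the model space maps $x_{P}$ to $(x_{P}, \mathcal{F}x_{P})$, the identity block enforcing intermediate normalization, and it satisfies the Generalized Bloch Equation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 3, 5
A = rng.standard_normal((n_p + n_q, n_p + n_q))
X = (A + A.T) / 2   # toy feature correlation matrix

# Exact correlation operator F from the top n_p eigenvectors
evals, evecs = np.linalg.eigh(X)
top = evecs[:, -n_p:]
F = top[n_p:, :] @ np.linalg.inv(top[:n_p, :])

# Wave operator restricted to the model space: Omega P maps x_P
# to (x_P, F x_P); the identity block is intermediate
# normalization, P Omega P = P
OmegaP = np.vstack([np.eye(n_p), F])

# Effective operator X_eff = P X Omega P
X_eff = X[:n_p, :] @ OmegaP

# Generalized Bloch Equation: Omega X_eff P = X Omega P
print(np.max(np.abs(OmegaP @ X_eff - X @ OmegaP)))  # ~ 0
```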

We now need a way to compute the Effective Operator…which brings us to

Many Body Perturbation Theory

As with many numerical matrix methods, to solve our Effective Operator equation, we decompose $\mathbb{X}$ into  two parts

$\mathbb{X}=\mathbb{U}+\mathbb{V}$

For example, in many methods, $\mathbb{U}$ is the diagonal part of $\mathbb{X}$.  For later purposes, we shall choose a form such that $\mathbb{U}$ may be generated directly from the individual features as a sum of matrix expressions $u_{p,p'}, u_{p,q}, u_{q,q'}$:

$\mathbb{U}=\sum_{p,p'}|p\rangle u_{p,p'}\langle p'|+\sum_{p,q}|p\rangle u_{p,q}\langle q|+\sum_{q,q'}|q\rangle u_{q,q'}\langle q'|$

We will choose a partitioning that reflects both our physical knowledge of the system and our noise. In particular, we will assume $\mathbb{V}$ carries the noise, and therefore we only include a small piece of $\mathbb{V}$.

Indeed, rather than Regularize, which is the standard Machine Learning approach, we will assume that we can treat the noise using low-order perturbation theory and one or more adjustable parameters.

We can now re-write the  Effective Operator in Partitioned Form

$\mathbb{X}^{eff}=P\mathbb{U}P+P\mathbb{V}\Omega P$

We can also directly express

$\mathbb{V}^{eff}=P\mathbb{V}\Omega P$ and

the Partitioned Form of the Generalized Bloch Equation

$(\Omega\mathbb{X}^{eff}-\mathbb{U}\Omega)P=\mathbb{V}\Omega P$

and, in terms of  $\mathbb{V}^{eff}$

$(\Omega\mathbb{U}-\mathbb{U}\Omega)P=Q(\mathbb{V}\Omega-\Omega\mathbb{V}^{eff})P$

We will now solve this equation using a low order Perturbation Theory.

Let us express $\Omega$ as a series expansion

$\Omega=I+\Omega_{1}+\Omega_{2}+\cdots$,

Since we only need a low-order, approximate solution, we will form the so-called “second order” Effective Hamiltonian, which will take the form

$\mathbb{X}^{eff}=P\mathbb{X}P+P\mathbb{V}\Omega_{1} P$

We achieve this by forming an order-by-order expansion.

Let us assume, without loss of generality, that the P-space eigenvalues of $\mathbb{X}$ are degenerate with value $\lambda_{0}$. This seems extreme; however, later we will justify it and explain how to correct for it at higher orders if necessary.  We can re-write the Generalized Bloch Equation as

$(\lambda_{0}-\mathbb{U})\Omega P=Q\mathbb{V}\Omega P-\chi P\mathbb{V}\Omega P$

where $\Omega=1+\chi$
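To preview where this is going, here is a hedged numpy sketch of the second-order Effective Operator under these assumptions: $\mathbb{U}$ is the diagonal of $\mathbb{X}$, the P-space diagonal is (nearly) degenerate at $\lambda_{0}$, and the first-order wave operator comes from the resolvent denominators $1/(\lambda_{0}-u_{qq})$. The toy matrix and all names are my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n_p, n_q = 3, 40

# Toy X: a strong, near-degenerate P-space diagonal (lambda_0 ~ 5)
# plus weak symmetric noise coupling everything together
B = rng.standard_normal((n_p + n_q, n_p + n_q))
X = 0.1 * (B + B.T)
X[:n_p, :n_p] += 5.0 * np.eye(n_p)

d = np.diag(X)
U = np.diag(d)            # unperturbed part: the diagonal of X
V = X - U                 # perturbation: the couplings / noise

lam0 = np.mean(d[:n_p])   # common (approximate) P-space value

# First-order wave operator from the resolvent denominators:
#   (Omega_1)_{qp} = V_{qp} / (lambda_0 - u_{qq})
Omega1 = V[n_p:, :n_p] / (lam0 - d[n_p:])[:, None]

# Second-order effective operator on the P space:
#   X_eff = P X P + P V Omega_1 P
X_eff2 = X[:n_p, :n_p] + V[:n_p, n_p:] @ Omega1

approx = np.sort(np.linalg.eigvalsh((X_eff2 + X_eff2.T) / 2))
exact = np.sort(np.linalg.eigvalsh(X))[-n_p:]
print(approx)   # close to the top n_p eigenvalues of X
print(exact)
```

Note that no eigenvalue of the full $\mathbb{X}$ is ever computed in forming `X_eff2`: only the diagonal and the off-diagonal couplings are used, which is the whole point of the eigenvalue-independent construction.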

We shall solve this by means of a Resolvent Operator.

Please stay tuned…this will take a while to finish….and to get the Latex formatting to work properly!

The Resolvent Operator requires its own blog post, so I will go do that first, then come back.

“The aim of language…is to communicate…to impart to others the results one has obtained…As I talk, I reveal the situation…I reveal it to myself and to others in order to change it.”   Jean-Paul Sartre