Effective Operators for Eigenvalues and Clustering

In a previous post, I introduced the Effective Regression Operator \mathbb{L}_{eff} for supervised (and semi-supervised) machine learning problems.  Here, I will introduce a different Effective Operator, inspired by quantum chemistry and quantum nuclear physics (where it is called an Effective Hamiltonian), for solving eigenvalue and/or clustering problems.

I believe this is the first time the Effective Operator / Hamiltonian has been introduced into machine learning.  (I don’t know the whole field; this is basically an idea I came up with in 2001, back in the first bubble, when I first got into machine learning.)

Before we jump in, let me provide some motivation.

One of the most popular machine learning algorithms for text clustering is Latent Semantic Analysis (LSA).  Given a collection of documents, we construct a term-document matrix X,

where the rows t^{T}_{i} are term vectors and the columns d_{j} are document vectors.

In LSA, we form the Singular Value Decomposition (SVD) of the term-document matrix, X=U\Sigma V^{t} , and then de-noise X by throwing away all but the top p=400 or so singular values.  This gives X_{p}=U\Sigma_{p} V^{t} , where \Sigma_{p} is a diagonal matrix with zeros on all except the first p entries.
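As a quick sketch of this truncation step, here is the rank-p de-noising with numpy.  A random matrix stands in for a real term-document matrix, and p is chosen small to match the toy size; the names are mine, not from the post.

```python
import numpy as np

# Toy stand-in for a term-document matrix: rows are terms, columns are documents.
rng = np.random.default_rng(0)
X = rng.random((50, 30))

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# De-noise by zeroing all but the top p singular values
# (p = 400 in the post; p = 5 here for the toy matrix).
p = 5
s_p = np.zeros_like(s)
s_p[:p] = s[:p]
X_p = U @ np.diag(s_p) @ Vt  # best rank-p approximation of X
```

By the Eckart–Young theorem, X_p is the closest rank-p matrix to X in the least-squares sense, which is why this simple truncation acts as de-noising.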

LSA uses X_{p} to compute similarities: term-document, term-term, and document-document.  Notice that if we are only interested in semantic similarity (similarity between terms), we need only consider the term-term density matrix \mathbb{X}=XX^{t} .  Indeed, we can even forgo solving the SVD problem for U and \Sigma by solving the eigenvalue problem

\mathbb{X}u_{i}=\sigma_{i}^{2}u_{i}
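This equivalence is easy to check numerically (again with a random stand-in for X): the nonzero eigenvalues of \mathbb{X}=XX^{t} are the squared singular values of X, and the eigenvectors match the left singular vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((20, 15))  # toy term-document matrix

# Left singular vectors and singular values of X ...
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# ... versus the eigendecomposition of the term-term density matrix XX^t.
XX = X @ X.T
evals = np.linalg.eigvalsh(XX)[::-1]  # eigvalsh sorts ascending; flip to descending

# The nonzero eigenvalues of XX^t are exactly the squared singular values of X.
assert np.allclose(evals[:15], s**2)
```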
As before, define the projection operators P,Q such that

P+Q=\mathbb{I},\;\;P^{2}=P,\;\;Q^{2}=Q,\;\;PQ=QP=0
To recover LSA, we construct P,Q from the singular vectors u of U .  Let

U=\sum|u\rangle\langle u|

and write

P=\sum|u_{p}\rangle\langle u_{p}|  and Q=\sum|u_{q}\rangle\langle u_{q}|

Finally, set the hidden-variable block Q\mathbb{X}Q to zero to remove the unwanted singular values and vectors.
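Here is a minimal numerical check of this construction, with a random stand-in for X and a hypothetical small p: building P and Q from the singular vectors and zeroing the Q-block recovers exactly the rank-p de-noised density matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((20, 15))
XX = X @ X.T  # term-term density matrix: XX = sum_i s_i^2 |u_i><u_i|

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# P projects onto the top-p singular vectors, Q onto the complement.
p = 4
P = U[:, :p] @ U[:, :p].T  # sum over |u_p><u_p|
Q = np.eye(20) - P         # sum over |u_q><u_q|

# Zero the hidden-variable block Q X Q.  Because the u's are eigenvectors
# of XX^t, the cross blocks P X Q and Q X P already vanish, so this leaves
# exactly the top-p part of the spectrum.
XX_denoised = XX - Q @ XX @ Q
assert np.allclose(XX_denoised, U[:, :p] @ np.diag(s[:p]**2) @ U[:, :p].T)
```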

More generally, we can define P,Q in other ways.  In LSA, we use them simply to remove noise.  Here I show we can also use them to hide variables and make a model more “explainable”.

Consider an extended LSA problem where we have two kinds of features: tags and text.  Indeed, it is very common to have labelled documents, and frequently we can use these labels to help guide a text clustering algorithm (as in Labelled LDA).

Suppose we want to compute the semantic similarity between the tags explicitly, and we want to include the text as hidden variables.

We could, for example, collect all the tags and text into a single feature set of terms, run LSA on the entire term-document matrix, and compute the tag-tag semantic similarity \langle \tau_{i}|\tau_{j}\rangle using the de-noised term-term density matrix \mathbb{X}

\langle \tau_{i}|\tau_{j}\rangle=\mathbb{X}_{i,j}

How can we include the effect of the text, the hidden variables, on the tag-tag similarity? Typically in machine learning, we apply the ‘Kernel Trick’ to form

\langle \tau_{i}|\mathbb{K}|\tau_{j}\rangle

where the Kernel \mathbb{K} includes the effects of both the visible tags and the hidden text.

What is \mathbb{K} ?  Let us assume it is a combination of the P-space block of the term-term density matrix and some other Kernel \mathbb{K}' that includes the effects of the hidden text:

\mathbb{K}=P\mathbb{X}P+\mathbb{K}'
Suppose the text were not included in the LSA calculation.  Then \mathbb{X} would only span the tag space (the P-space), and the eigenvectors of \mathbb{X} would represent “concept vectors” of a sort (albeit concepts with negative weights, as in SVD). Consider, then, the eigenvalue equations

\mathbb{X}x=\lambda_{x} x  for all x

We now construct a new, effective operator \mathbb{X}^{eff} that acts only on the tag space (the Primary Space P), and yet reproduces the exact, lowest p eigenvalues of the original operator \mathbb{X} .  (There are actually several constructions of \mathbb{X}^{eff} , which we will explore.)

Let us partition our feature space into visible (tag) features and hidden (text) features, such that any vector decomposes as

x=Px+Qx
Now write the eigenvalue equation above as

\left[\begin{array}{cc} P\mathbb{X}P & P\mathbb{X}Q\\ Q\mathbb{X}P & Q\mathbb{X}Q \end{array}\right]\left[\begin{array}{c} Px\\ Qx \end{array}\right]=\lambda_{x}\left[\begin{array}{c} Px\\ Qx \end{array}\right]

We now re-write this using subscripts and identify the expressions for the P and Q spaces

\mathbb{X}_{PP}x_{P}+\mathbb{X}_{PQ}x_{Q}=\lambda_{x}x_{P}   Visible Space (2)

\mathbb{X}_{QP}x_{P}+\mathbb{X}_{QQ}x_{Q}=\lambda_{x}x_{Q}   Hidden Space (3)

We can now find an expression for x_{Q} using (3)

x_{Q} = (\lambda_{x}-\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP}x_{P}

and plug this into (2) to obtain our Eigenvalue-Dependent Effective (Semantic) Operator

\mathbb{X}^{eff}(\lambda)=\mathbb{X}_{PP}+\mathbb{X}_{PQ}(\lambda-\mathbb{X}_{QQ})^{-1}\mathbb{X}_{QP}

We can now compute an ‘exact’, or complete, semantic similarity between our visible features (tags), one that also couples in the hidden variables (text):

\langle \tau_{i}|\mathbb{X}^{eff}(\lambda)|\tau_{j}\rangle
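The construction above can be verified numerically.  In this sketch a random symmetric matrix stands in for \mathbb{X}, with hypothetical sizes for the visible and hidden blocks; the effective operator, built from the P-block plus the folded-in Q-block, reproduces an exact eigenvalue of the full matrix while acting only on the P space.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical symmetric feature-feature matrix over 6 visible (tag)
# and 14 hidden (text) features.
n_p, n_q = 6, 14
A = rng.random((n_p + n_q, n_p + n_q))
XX = A @ A.T

# Partition into the P/Q blocks of equations (2) and (3).
X_PP = XX[:n_p, :n_p]
X_PQ = XX[:n_p, n_p:]
X_QP = XX[n_p:, :n_p]
X_QQ = XX[n_p:, n_p:]

# Take the dominant eigenpair of the full matrix.
evals, evecs = np.linalg.eigh(XX)
lam, x = evals[-1], evecs[:, -1]
x_P, x_Q = x[:n_p], x[n_p:]

# Effective operator: X_eff(lam) = X_PP + X_PQ (lam - X_QQ)^{-1} X_QP
X_eff = X_PP + X_PQ @ np.linalg.inv(lam * np.eye(n_q) - X_QQ) @ X_QP

# It acts only on the P space, yet reproduces the exact eigenvalue:
assert np.allclose(X_eff @ x_P, lam * x_P)

# And back-substituting via equation (3) recovers the hidden component x_Q:
assert np.allclose(np.linalg.inv(lam * np.eye(n_q) - X_QQ) @ X_QP @ x_P, x_Q)
```

Note that (lam - X_QQ) is invertible here because the eigenvalues of the principal sub-block X_QQ interlace those of XX, so the dominant eigenvalue lies above the Q-space spectrum.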

Notice that \mathbb{X}^{eff}(\lambda) explicitly depends on some dominant eigenvalue \lambda of the original feature-feature coupling matrix \mathbb{X} .  This is fine if we just want to explain an existing model.  But if \lambda is difficult to compute, or if we can not invert the Q-space sub-block of \mathbb{X} , then we might need other methods.  For example, how do we regularize / de-noise the operator?  How can we alleviate the eigenvalue dependence?

In our next post, I will address these issues and introduce an Eigenvalue-Independent Effective (Semantic) Operator.  Stay tuned!
