In a previous post, I introduced the Effective Regression Operator for supervised (and semi-supervised) Machine Learning problems. Here, I will introduce a different Effective Operator, inspired by Quantum Chemistry and Quantum Nuclear Physics (also called an Effective Hamiltonian), for solving eigenvalue and/or clustering problems.
I believe this is the first time that the Effective Operator / Hamiltonian has been introduced into Machine Learning (although I don’t know the whole field; it is basically an idea I came up with back in 2001, during the first bubble, when I first got into machine learning).
Before we jump in, let me provide some motivation.
One of the most popular machine learning algorithms for text clustering is Latent Semantic Analysis (LSA). Given a collection of documents, we construct a term-document matrix $\mathbf{X}$, where the rows $\mathbf{t}_{i}^{T}$ are term vectors, and the columns $\mathbf{d}_{j}$ are document vectors.
In LSA, we form the Singular Value Decomposition (SVD) of the term-document matrix,

$\mathbf{X} = \mathbf{U}\,\boldsymbol{\Sigma}\,\mathbf{V}^{T} ,$

and then de-noise $\mathbf{X}$ by throwing away all but the top $k$ or so singular values. This gives

$\mathbf{X}_{k} = \mathbf{U}\,\boldsymbol{\Sigma}_{k}\,\mathbf{V}^{T} ,$

where $\boldsymbol{\Sigma}_{k}$ is a diagonal matrix with zeros on all except the first $k$ entries.
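To make this step concrete, here is a minimal numpy sketch of the de-noising; the matrix, its dimensions, and the cutoff $k$ are placeholder values, not taken from any real corpus.

```python
import numpy as np

# Toy term-document matrix X: rows are terms, columns are documents.
# In practice X would come from a count or TF-IDF vectorizer; here it is random.
rng = np.random.default_rng(42)
X = rng.random((1000, 200))      # 1000 terms, 200 documents (placeholder sizes)
k = 50                           # number of singular values to keep

# Thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# De-noise by zeroing all but the top-k singular values
s_k = np.zeros_like(s)
s_k[:k] = s[:k]
X_k = U @ np.diag(s_k) @ Vt      # rank-k LSA approximation of X
```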
LSA uses $\mathbf{X}_{k}$ to compute similarities: term-document similarities, term-term similarities, and document-document similarities. Notice that if we are only interested in semantic similarity (similarity between terms), we need only consider the term-term density matrix $\mathbf{X}\mathbf{X}^{T}$. Indeed, we can even forgo solving the SVD problem to find $\mathbf{U}$ and $\boldsymbol{\Sigma}$ by solving the eigenvalue problem

$\mathbf{X}\mathbf{X}^{T}\,\mathbf{u}_{i} = \sigma_{i}^{2}\,\mathbf{u}_{i} .$
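Continuing the toy setup above, a short sketch of how the same information comes out of the term-term density matrix alone, without the document-side singular vectors:

```python
# Term-term density matrix; its eigenvectors are the left singular vectors
# of X, and its eigenvalues are the squared singular values.
D = X @ X.T                                 # (1000, 1000)

evals, evecs = np.linalg.eigh(D)            # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder to descending

# evecs[:, :k] spans the same subspace as U[:, :k] from the SVD above,
# and evals[:k] matches s[:k] ** 2 up to numerical error.
```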
As before, define the projection operators $\mathbf{P}$ and $\mathbf{Q}$ such that

$\mathbf{P} + \mathbf{Q} = \mathbf{I}, \quad \mathbf{P}^{2} = \mathbf{P}, \quad \mathbf{Q}^{2} = \mathbf{Q}, \quad \mathbf{P}\mathbf{Q} = 0 .$

To recover LSA, we construct the projectors using the singular vectors of $\mathbf{X}$ (equivalently, the eigenvectors of $\mathbf{X}\mathbf{X}^{T}$). Let

$\mathbf{P} = \sum_{i \le k} \mathbf{u}_{i}\mathbf{u}_{i}^{T}$ and $\mathbf{Q} = \sum_{i > k} \mathbf{u}_{i}\mathbf{u}_{i}^{T} ,$

and write the term-term operator in block form,

$\mathbf{X}\mathbf{X}^{T} = \mathbf{P}\mathbf{X}\mathbf{X}^{T}\mathbf{P} + \mathbf{P}\mathbf{X}\mathbf{X}^{T}\mathbf{Q} + \mathbf{Q}\mathbf{X}\mathbf{X}^{T}\mathbf{P} + \mathbf{Q}\mathbf{X}\mathbf{X}^{T}\mathbf{Q} .$

Finally, set the hidden variable ($\mathbf{Q}$) blocks to zero to remove the unwanted singular values and vectors.
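As a sketch of this projector construction (again on the toy matrices above), zeroing every block that touches the Q-space leaves exactly the rank-k, de-noised term-term operator:

```python
# Projectors onto the kept (P) and discarded (Q) singular directions
U_k = evecs[:, :k]
P = U_k @ U_k.T
Q = np.eye(P.shape[0]) - P

# Block-partition the term-term operator:
#   D = P D P + P D Q + Q D P + Q D Q
# Keeping only the P-P block (i.e. setting the Q blocks to zero) gives
# the same operator as the rank-k LSA reconstruction.
D_lsa = P @ D @ P
print(np.allclose(D_lsa, X_k @ X_k.T))      # True, up to numerical tolerance
```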
More generally, we can define $\mathbf{P}$ and $\mathbf{Q}$ in other ways. In LSA we use them simply to remove noise. Here I show that we can also use them to hide variables and make a model more “explainable”.
Consider an extended LSA problem, where we have 2 kinds of features: tags and text. Indeed, it is very common to have labelled documents, and frequently we can use these labels to help guide a text clustering algorithm (as in Labelled LDA).
Suppose we want to compute the semantic similarity between the tags explicitly, and we want to include the text as hidden variables.
We could, for example, collect all the tags and text into a single feature set of terms, run LSA on the entire term-document matrix, and compute the tag-tag semantic similarity using the de-noised term-document matrix $\mathbf{X}_{k}$.
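One way to picture this combined feature set, as a hypothetical sketch with invented dimensions (continuing the toy variables from above): stack a tag-document block on top of a text-document block and run the same truncated SVD on the result.

```python
# Hypothetical extended feature set: tag features stacked on top of text features.
n_docs, n_tags, n_text = 200, 30, 1000
X_tags = rng.random((n_tags, n_docs))       # tag-document block (placeholder data)
X_text = rng.random((n_text, n_docs))       # text-document block (placeholder data)
X_full = np.vstack([X_tags, X_text])        # combined feature-document matrix

# LSA on the full matrix, then read the tag-tag similarities off the
# de-noised matrix, restricted to the tag rows.
U_f, s_f, Vt_f = np.linalg.svd(X_full, full_matrices=False)
s_f_k = np.where(np.arange(s_f.size) < k, s_f, 0.0)
X_full_k = U_f @ np.diag(s_f_k) @ Vt_f
tag_tag_sim = X_full_k[:n_tags] @ X_full_k[:n_tags].T   # (n_tags, n_tags)
```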
How can we include the effect of the text, the hidden variables, on the tag-tag similarity? Typically in machine learning, we apply the ‘Kernel Trick’ to form an operator $\hat{\mathbf{L}}$, where $\hat{\mathbf{L}}$ includes both the effects of the visible tags and the hidden text.
What is $\hat{\mathbf{L}}$? Let us assume it is a linear combination of the P-space block of the term-term density matrix and some other Kernel $\mathbf{K}$ that includes the effects of the hidden text:

$\hat{\mathbf{L}} = \mathbf{P}\,\mathbf{X}\mathbf{X}^{T}\,\mathbf{P} + \mathbf{K} .$
Suppose the text was not included in the LSA calculation. Then $\hat{\mathbf{L}}$ would only span the tag space (the P-space), and the eigenvectors of $\hat{\mathbf{L}}$ would represent “concept vectors” of a sort (albeit concepts with negative weights, as in SVD). Consider, then, the eigenvalue equations of the full feature-feature coupling matrix $\mathbf{L}$ (over both tags and text),

$\mathbf{L}\,\mathbf{v}_{i} = \lambda_{i}\,\mathbf{v}_{i} \quad \text{for all } i .$
We now construct a new, effective operator, $\mathbf{L}_{\mathrm{eff}}$, that acts only on the tag-space (the Primary Space P), and yet reproduces, exactly, the dominant P eigenvalues of the original operator $\mathbf{L}$. (There are actually several constructions of $\mathbf{L}_{\mathrm{eff}}$, which we will explore.)
Let us partition our feature space into visible (tag) features and hidden (text) features, such that $\mathbf{P}$ projects onto the tags, $\mathbf{Q}$ projects onto the text, and

$\mathbf{P} + \mathbf{Q} = \mathbf{I} .$
Now write the eigenvalue equation above as

$(\mathbf{P} + \mathbf{Q})\,\mathbf{L}\,(\mathbf{P} + \mathbf{Q})\,\mathbf{v} = \lambda\,(\mathbf{P} + \mathbf{Q})\,\mathbf{v} .$
We now re-write this using subscripts ($\mathbf{L}_{PP} = \mathbf{P}\mathbf{L}\mathbf{P}$, $\mathbf{L}_{PQ} = \mathbf{P}\mathbf{L}\mathbf{Q}$, etc., and $\mathbf{v}_{P} = \mathbf{P}\mathbf{v}$, $\mathbf{v}_{Q} = \mathbf{Q}\mathbf{v}$), and identify the expressions for the P and Q space:

Visible Space: $\mathbf{L}_{PP}\,\mathbf{v}_{P} + \mathbf{L}_{PQ}\,\mathbf{v}_{Q} = \lambda\,\mathbf{v}_{P}$ (2)

Hidden Space: $\mathbf{L}_{QP}\,\mathbf{v}_{P} + \mathbf{L}_{QQ}\,\mathbf{v}_{Q} = \lambda\,\mathbf{v}_{Q}$ (3)
We can now find an expression for $\mathbf{v}_{Q}$ using (3),

$\mathbf{v}_{Q} = (\lambda\,\mathbf{I} - \mathbf{L}_{QQ})^{-1}\,\mathbf{L}_{QP}\,\mathbf{v}_{P} ,$

and plug this into (2) to obtain our Eigenvalue-Dependent Effective (Semantic) Operator,

$\mathbf{L}_{\mathrm{eff}}(\lambda) = \mathbf{L}_{PP} + \mathbf{L}_{PQ}\,(\lambda\,\mathbf{I} - \mathbf{L}_{QQ})^{-1}\,\mathbf{L}_{QP} ,$

which satisfies $\mathbf{L}_{\mathrm{eff}}(\lambda)\,\mathbf{v}_{P} = \lambda\,\mathbf{v}_{P}$ on the visible space.
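Here is a minimal numerical sketch of this construction, on a small random symmetric matrix (not real data) with an arbitrary split into visible and hidden features. We build $\mathbf{L}_{\mathrm{eff}}(\lambda)$ with an exact eigenvalue $\lambda$ of the full operator and check that the much smaller effective operator reproduces it:

```python
import numpy as np

# Small symmetric "feature-feature coupling" matrix L (placeholder data)
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
L = (A + A.T) / 2

n_p = 3                                    # first n_p features play the role of visible tags
L_PP, L_PQ = L[:n_p, :n_p], L[:n_p, n_p:]
L_QP, L_QQ = L[n_p:, :n_p], L[n_p:, n_p:]

# Take one exact eigenvalue of the full operator (here the dominant one)
lam = np.linalg.eigvalsh(L)[-1]

# Eigenvalue-dependent effective operator on the P-space:
#   L_eff(lam) = L_PP + L_PQ (lam I - L_QQ)^{-1} L_QP
L_eff = L_PP + L_PQ @ np.linalg.solve(lam * np.eye(L.shape[0] - n_p) - L_QQ, L_QP)

# lam appears (to numerical precision) among the eigenvalues of the
# much smaller effective operator.
print(np.linalg.eigvalsh(L_eff))
print(lam)
```

Note that $\mathbf{L}_{\mathrm{eff}}(\lambda)$ is only 3 by 3 here, yet it carries an exact eigenvalue of the full 8 by 8 operator; the price is the explicit dependence on $\lambda$, discussed below.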
We can now compute an ‘exact’, or complete, semantic similarity between our visible features (tags), one that also couples in the hidden variables (text), as the matrix elements $[\mathbf{L}_{\mathrm{eff}}(\lambda)]_{ij}$.
Notice that $\mathbf{L}_{\mathrm{eff}}(\lambda)$ explicitly depends on some dominant eigenvalue $\lambda$ of the original feature-feature coupling matrix $\mathbf{L}$. This is fine if we just want to explain an existing model. If $\lambda$ is difficult to compute, or if we cannot invert the Q-space sub-block of $(\lambda\,\mathbf{I} - \mathbf{L})$, then we might need other methods. For example, how do we regularize / de-noise the operator? How can we alleviate the eigenvalue dependence?
In our next post, I will address these issues and introduce an Eigenvalue-Independent Effective (Semantic) Operator. Stay tuned!