Music Recommendations and the Logistic Metric Embedding

In this post, we are going to see  how to build our own music recommender, using the Logistic Metric Embedding (LME) model developed by Joachims (of SVMLight fame). The core idea of this recommender is that people listen to songs in a specific sequence, and that certain songs sound better when they follow other songs.

Here we review the math of the LME. But first, a little history.

Introduction:  Some History of Sequence Prediction

I first began working in sequence prediction around 2001, when I developed the Effective Operator approach to Personalized Recommenders.  This was a recommender system that used a kind of Bayesian / Regularized Linear Regression.  Some years later, working with eBay, we researched the effectiveness of search relevance (a kind of sequential recommendation) using techniques including Joachims’ earlier work on Structural SVMs.  Recent work has built upon Structural SVMs for Ranking, leading to a new, simpler approach, Metric Learning to Rank.  Metric Embedding is, in general, a good thing to know about, and you can learn about it more generally from the University of Chicago course CMCS 39600: Theory of Metric Embeddings.  Joachims has subsequently also considered metric learning, and here we examine some of his recent research in metric learning for sequence prediction.  Specifically, we look in detail at the Logistic Metric Embedding (LME) model for predicting music playlists.

Sequence Prediction with Local Metric Embeddings

We would like to recommend playlists of songs.  More generally, we seek to estimate the probability of observing a new, directed sequence D , given a very large training set of directed sequences \mathcal{S}^{d}_{train} .  To do this, we define a conditional, probabilistic, vector space model that retains the local sequence structure of the sample data, and we try to learn the optimal model via regularized maximum-likelihood estimation (MLE).

The model was developed by Joachims and, following his notation, we seek to predict a (directed) sequence of states p=\{p^{1},p^{2}\cdots p^{d}\} (i.e. an n-gram sequence, a playlist of songs, etc.) from a large collection of states S=\{s_{1},s_{2}\cdots\} , where p^{i}\in S .  We model each sequence p as a Markov Chain, so that

Pr(p)=\underset{i}{\Pi}Pr(p^{i}|p^{i-1})
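To make the Markov assumption concrete, here is a minimal Python sketch (with a made-up, row-stochastic transition matrix) showing how a playlist probability factors into a product of pairwise transition probabilities; working in log space avoids numerical underflow for long playlists.

```python
import numpy as np

# Made-up toy example: 4 songs, with P[a, b] = Pr(next song = b | current song = a)
P = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.10, 0.40, 0.20],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.15, 0.30, 0.50],
])

def log_playlist_probability(playlist, P):
    """Sum log Pr(p^i | p^{i-1}) over consecutive pairs of the playlist."""
    return sum(np.log(P[a, b]) for a, b in zip(playlist[:-1], playlist[1:]))

playlist = [0, 1, 2, 3]                      # a playlist as a sequence of song indices
print(log_playlist_probability(playlist, P))
```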

We embed the sequence into the Latent space, or the d-fold Tensor space  (which is isomorphic to a (|S|\times d) dimensional Euclidean space )

\mathcal{S}^{d}=S\times S\times\cdots S\cong\mathbb{R}^{|S|\times d} , such that  p\in\mathcal{S}^{d}

The inference problem is to find Pr(p) ; it is solved using a straightforward Stochastic Gradient Descent (SGD) algorithm, restricted to a kind of nearest-neighbor sampling called the Landmark Heuristic.

Because we have a Markov Model, we will see that all we need is to write down the local log-Likelihood, defined through the conditional transition probabilities Pr(p^{i}|p^{i-1}) , and, as usual, some regularization.  Joachims ignores long-range interactions and only regularizes locally, and I will note some opportunities to address this.

Also, we note that assuming a Markov Model is actually restrictive (recall our post on Brownian motion and memory effects in chemical physics).  First, it assumes that, to predict the next song, the only songs that matter are the ones we have already listened to.  Additionally, a (first-order) Markov Model ignores long-range interdependencies: only the immediately preceding song matters.  An NMF-based recommender does not make these assumptions and can capture long-range effects.

Recent work (2013) shows how to parallelize the method on distributed-memory architectures, although we will do all of our work on a shared-memory architecture.  They even provide sample code!

Visualizing the Embedding

For example, a typical embedding is visualized as:

[Figure: visualization of a learned song embedding]

The Online LME Demo provides an interactive view of the embedding.  It also includes simple modifications for adding popularity, user bias, and even semantic tags to the model (not discussed here).

Single and Dual Point Logistic Metric Embeddings

Joachims introduces 2 different models, called the Single Point and Dual Point models.  In practice we should only care about the Dual Point model, although, curiously, in the published paper (2012), the Dual Point model does not work better than the Single Point model.  So, for now, we describe both; the code also contains both models.

By a Metric Embedding, we mean that the conditional probability Pr(p^{i}|p^{i-1}) is defined through the metric, or distance, \Delta(p^{i},p^{i-1}) in the Latent space \mathcal{S}^{d}\cong\mathbb{R}^{|S|\times d} where we have embedded our states (songs).

Pr(p^{i}|p^{i-1})\sim\exp(-\Delta(p^{i},p^{i-1})^{2})
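As a quick illustration (using the single-point flavor and a made-up 2-dimensional latent space rather than the full \mathbb{R}^{|S|} ), turning squared distances into transition probabilities is just a softmax over negative squared distances:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # made-up embedding: 5 songs in a 2-D latent space

def transition_probs(prev, X):
    """Pr(next | prev) proportional to exp(-squared distance to the previous song)."""
    logits = -np.sum((X - X[prev]) ** 2, axis=1)
    logits -= logits.max()                   # for numerical stability
    w = np.exp(logits)
    return w / w.sum()                       # the denominator plays the role of Z

print(transition_probs(prev=0, X=X))         # a distribution over all 5 songs
```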

What is the Embedding?

We associate each state s_{i}  with a vector \vec{X}(i)\in\mathbb{R}^{|S|}

s_{i}\rightarrow\vec{X}(i)

Notice Joachims writes \vec{X}(p^{i}) and \vec{X}(s_{i}) to distinguish between vectors associated with elements of a playlist and arbitrary elements of S .

In a model for, say, playlists, each vector is, initially, just a point \vec{X_{0}}(i) on an |S| -dimensional hypercube.  We then run an inference algorithm which learns the optimal vectors \vec{X_{opt}}(i) that maximize the total regularized log-likelihood L , as computed over the entire training data, and subject to the specific model chosen.
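For concreteness, here is a tiny sketch of that setup; the choice of random \{0,1\}^{|S|} hypercube corners as starting points is purely illustrative, and the inference (SGD) step is sketched later in the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_songs = 100

# Illustrative initialization: each song starts at a random corner of the {0,1}^{|S|} hypercube.
# The SGD inference step (sketched later) then moves these points to maximize the
# regularized log-likelihood over the training playlists.
X0 = rng.integers(0, 2, size=(n_songs, n_songs)).astype(float)
print(X0.shape)                              # one |S|-dimensional vector per song
```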

What is \Delta ?

In the Single Point model, \Delta_{1} is just the Euclidean distance between \vec{X}(i) and \vec{X}(i-1)

[Figure: the Single Point model]

\Delta_{1}(p^{i},p^{i-1})=\parallel\vec{X}(i)-\vec{X}(i-1)\parallel_{2}

This is just a standard Vector Space model.

In the Dual Point model, however, we associate 2 points (vectors) \left(\vec{U}(i),\vec{V}(i)\right)\in\left(\mathbb{R}^{|S|},\mathbb{R}^{|S|}\right) with each state s_{i} in a sequence.  We call these the entry \vec{U}(i) and exit \vec{V}(i) points.

s_{i}\rightarrow\left(\vec{U}(i),\vec{V}(i)\right)

[Figure: the Dual Point model]

The distance between 2 states is the asymmetric divergence

\Delta_{2}(p^{i},p^{i-1})=\parallel\vec{U}(i)-\vec{V}(i-1)\parallel_{2}

This is like a Vector Field Model, where each state is mapped to a directed vector

\vec{\Delta}(i)=\vec{V}(i)-\vec{U}(i)\in\mathbb{R}^{|S|} .

(In mathematical physics, the embedding is called a Vector Field or, more generally, a Fiber Bundle.)

The Latent metric is the Euclidean distance between the exit point of p^{i-1} and the entry point of p^{i} within a directed sequence p .

We can also construct sequence-independent vectors (\bar{U}(i),\bar{V}(i)) by simply averaging over a large sample of known sequences (i.e. song playlists).

i\rightarrow\left(\bar{U}(i),\bar{V}(i)\right)
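To make the two distances concrete, here is a small sketch (again with made-up vectors in a 2-D latent space) of \Delta_{1} for the Single Point model and \Delta_{2} for the Dual Point model; note that \Delta_{2} is asymmetric.

```python
import numpy as np

def delta1(X, i, j):
    """Single Point model: Euclidean distance between the two song vectors."""
    return np.linalg.norm(X[i] - X[j])

def delta2(U, V, i, j):
    """Dual Point model, for the transition j -> i: distance from the exit
    point of song j to the entry point of song i (asymmetric in i, j)."""
    return np.linalg.norm(U[i] - V[j])

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))                  # made-up single-point embedding
U = rng.normal(size=(5, 2))                  # made-up entry points
V = rng.normal(size=(5, 2))                  # made-up exit points

print(delta1(X, 2, 0))
print(delta2(U, V, 2, 0), delta2(U, V, 0, 2))   # generally not equal
```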

Model Inference

We want to find

Pr(p)=\underset{i}\Pi\dfrac{\exp(-\Delta(p^{i},p^{i-1})^{2})}{Z(p^{i-1})}

where  Z(p^{i}) is the point partition function, given by

Z(p^{i})=\sum_{j=1}^{|S|}\exp(-\Delta(p^{i},s_{j})^{2})

This is just the normalization of the conditional probability (above).   Note that  the  sum is over all possible states s_{j}\in S , and for numerical performance, we will restrict the sum to nearby states.  
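Here is a sketch of that restriction (with made-up entry and exit points): the same partition function is computed once over all of S and once over a crude nearest-neighbor candidate set, standing in for the Landmark Heuristic.

```python
import numpy as np

def partition_function(i, U, V, candidates):
    """Z(s_i), summed over a candidate set of next songs.

    Dual-point convention: a transition leaves from the exit point V[i] of the
    current song and arrives at the entry point U[j] of a candidate next song j."""
    sq_dists = np.sum((U[candidates] - V[i]) ** 2, axis=1)
    return float(np.sum(np.exp(-sq_dists)))

rng = np.random.default_rng(2)
U = rng.normal(size=(100, 2))                # made-up entry points
V = rng.normal(size=(100, 2))                # made-up exit points

all_songs = np.arange(100)
nearest = np.argsort(np.sum((U - V[7]) ** 2, axis=1))[:10]   # a crude neighbor set for song 7

print(partition_function(7, U, V, all_songs))   # full Z
print(partition_function(7, U, V, nearest))     # restricted Z^r over nearby states only
```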

Actually, the final model used is somewhat more complicated because it includes other features to adjust for popularity, user bias, diversity, etc.

Final LME Model

Pr(p)=\underset{i}{\Pi}Pr(p^{i}|p^{i-1})=\underset{i}{\Pi}\dfrac{\exp\left(-\alpha\Delta(p^{i},p^{i-1})^{2}+\beta b+\gamma\Delta(p^{i},p^{0})^{2}\right)}{Z(p^{i-1},p^{0},\alpha,\beta,\gamma)}

The optimal solution is found by maximizing the  log-Likelihood \log(Pr(p)) , subject to regularization to avoid over-training.

In this related post, I explain the partition function and where it comes from.

For the Single Point Model, the objective function is the log-Likelihood plus a standard 2-norm regularizer

f_{1}(\bf{X})=L_{1}(D|X)-\lambda\parallel\bf{X}\parallel^{2}_{F}

For the Dual Point Model, the objective function includes 2 kinds of regularizers

f_{2}(\bf{U},\bf{V})=L_{2}(D|U,V)-\lambda\Omega(U,V)-\nu\underset{s\in S}{\sum}\Delta_{2}(s,s)^{2}

The first regularizer is

\bf{\Omega(U,V)}=\parallel\bf{U}\parallel^{2}_{F}+\parallel\bf{V}\parallel^{2}_{F}

and ensures that the entry and exit vectors don’t get too large.  It is similar to the kind seen in traditional matrix factorization techniques.
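As a one-liner sketch (with made-up entry and exit matrices, and assuming the usual sum of squared Frobenius norms):

```python
import numpy as np

def omega(U, V):
    """Frobenius-norm regularizer on the entry and exit matrices."""
    return np.sum(U ** 2) + np.sum(V ** 2)

rng = np.random.default_rng(4)
U, V = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
print(omega(U, V))
```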

The second regularizer keeps each state’s entry and exit points close together (i.e. short \vec{\Delta}(s) vectors), and is similar in spirit to the kind of path regularizers seen in Minimum Path Basis Pursuit methods (where, ideally, we would seek to use an L1-norm for this).  This part is critical because it allows us to decompose what looks like a large, messy, non-convex, global optimization problem into a convex sum of local, albeit non-convex, optimization problems:

A Local Manifold Optimization Problem

Here is the real trick of the method

By including the dual point, local path regularizers \Delta_{2}(s_{a},s_{b})^{2} , we can represent the global log-Likelihood as a sum over local log-Likelihoods, describing regularized transitions from s_{a}\rightarrow s_{b}

We re-write the objective function f_{2} (here, Joachims confuses the notation and includes the regularizer in the log-Likelihood, so be careful) as

\sum_{a=1}^{|S|}\sum_{b\in C_{a}}T_{ab}l(s_{a},s_{b})-\Omega(U,V)

where

  • C_{a} is the collection of states close to s_{a} (i.e. all nearest neighbors, observed bigrams, etc.),
  • T_{ab} is the number of transitions from s_{a}\rightarrow s_{b} , and is precomputed from the training data, and
  • l(s_{a},s_{b}) is the dual point, local log-likelihood term for the transition s_{a}\rightarrow s_{b}

l(s_{a},s_{b})=-\Delta_{2}(s_{a},s_{b})^{2}-\log(Z_{2}(s_{a}))

  • \Omega(U,V) is the global regularizer, which is nevertheless trivial to evaluate locally (using SGD)

While the optimization problem is clearly non-convex, this decomposition allows us to ‘glue together’ a convex combination of locally non-convex but hopefully tractable optimizations, leading to what I call a manifold optimization problem.  (Indeed, from the point of view of mathematical physics, the dual point embedding model has the flavor of a non-trivial fiber bundle, and the path constraints act to minimize the curvature between the bundles; in other words, a gauge field theory.)  Usually I don’t trust non-convex methods.  Still, it is claimed that good solutions can be found within 100-200 iterations of standard SGD.  We shall see…
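To make the decomposition concrete, here is a sketch of the local log-likelihood l(s_{a},s_{b}) and the resulting decomposed objective, summed over observed transition counts T_{ab} ; the toy data, the trivial neighbor sets, and the values of \lambda and \nu are all illustrative.

```python
import numpy as np

def local_log_likelihood(a, b, U, V, neighbors_of_a):
    """l(s_a, s_b) = -Delta_2(a -> b)^2 - log Z(s_a), with Z restricted to C_a."""
    d2_ab = np.sum((U[b] - V[a]) ** 2)
    d2_all = np.sum((U[neighbors_of_a] - V[a]) ** 2, axis=1)
    return -d2_ab - np.log(np.sum(np.exp(-d2_all)))

def decomposed_objective(U, V, T, neighbors, lam=0.1, nu=0.1):
    """Sum_a Sum_{b in C_a} T_ab * l(s_a, s_b) - lam * Omega(U, V) - nu * Sum_s Delta_2(s, s)^2."""
    total = 0.0
    for a in range(T.shape[0]):
        for b in neighbors[a]:
            if T[a, b] > 0:
                total += T[a, b] * local_log_likelihood(a, b, U, V, neighbors[a])
    total -= lam * (np.sum(U ** 2) + np.sum(V ** 2))    # Omega(U, V)
    total -= nu * np.sum((U - V) ** 2)                  # path term: entry-to-exit lengths
    return total

rng = np.random.default_rng(5)
n = 8
U, V = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
T = rng.integers(0, 3, size=(n, n))                     # made-up transition counts
neighbors = [np.arange(n) for _ in range(n)]            # trivially, C_a = S
print(decomposed_objective(U, V, T, neighbors))
```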

SGD Update Equations

Joachims suggests solving the optimization with SGD, iterating through all states (songs) s_p and updating the vectors U(s_p) with

U(s_p)\leftarrow U(s_p)+\dfrac{\tau}{N}\left[\underset{b\in C_p}{\sum}T_{pb}\dfrac{\partial l(s_p,s_b)}{\partial U(s_p)}+\dfrac{\partial\Omega(U,V)}{\partial U(s_p)}\right]

and, for each s_p , updating the vector V(s_q) for each possible transition s_{p}\rightarrow s_{q}

V(s_q)\leftarrow V(s_q)+\dfrac{\tau}{N}\left[\underset{b\in C_p}{\sum}T_{pb}\dfrac{\partial l(s_p,s_b)}{\partial V(s_q)}+\dfrac{\partial\Omega(U,V)}{\partial V(s_q)}\right]

where

  • \tau is the SGD learning rate, and
  • N is the number of transitions sampled

The partials over the regularizers are simple

\dfrac{\partial\Omega(U,V)}{\partial U(s_p)}=-2\lambda U(s_{p})-2\nu\vec{\Delta}_{2}(s_{p},s_{p})

\dfrac{\partial\Omega(U,V)}{\partial V(s_q)}=2\lambda V(s_{q})-2\nu\vec{\Delta}_{2}(s_{q},s_{q})

The partial with respect to U(s_p) is written so that it samples all possible states s_l

\dfrac{\partial l(s_p,s_b)}{\partial U(s_p)}=-2\vec{\Delta}_{2}(s_p,s_b)-\dfrac{\sum_{S_l}\exp(-2\Delta_{2}(s_p,s_l)^{2})\vec{\Delta}_{2}(s_p,s_l)}{Z^{r}(s_p)}

while the partial with respect to V(s_q) is considerably simpler, since we only sample the individual transitions (from exit to entry) s_{p}\rightarrow s_{q} for each state, yielding

\dfrac{\partial l(s_p,s_q)}{\partial V(s_q)}=2\vec{\Delta}_{2}(s_p,s_q)-\dfrac{\exp(-2\Delta_{2}(s_p,s_q)^{2})\vec{\Delta}_{2}(s_p,s_q)}{Z^{r}(s_p)}

Notice that Z^{r} refers to a restricted partition function that only samples states in C_p (i.e. nearest neighbors, or what Joachims calls the Landmark Heuristic).
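As a rough, simplified sketch of such an SGD pass (not the exact schedule or the closed-form partials above), the toy code below samples observed transitions s_{p}\rightarrow s_{q} and nudges the exit point of s_p and the entry point of s_q uphill along finite-difference gradients of the regularized local log-likelihood; \tau , \lambda , \nu , the toy data, and the all-of-S neighbor sets are purely illustrative.

```python
import numpy as np

def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at the vector x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x); e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def local_objective(Vp, Uq, U, V, p, q, neighbors_p, lam=0.1, nu=0.1):
    """Regularized local log-likelihood of the transition s_p -> s_q.

    Dual-point convention: the transition runs from the exit point of s_p to the
    entry point of s_q, and Z is restricted to C_p. (For simplicity, a perturbed
    Uq is not propagated into the Z term.)"""
    d2_pq = np.sum((Uq - Vp) ** 2)
    d2_all = np.sum((U[neighbors_p] - Vp) ** 2, axis=1)
    ll = -d2_pq - np.log(np.sum(np.exp(-d2_all)))
    reg = lam * (np.sum(Uq ** 2) + np.sum(Vp ** 2)) \
        + nu * (np.sum((U[p] - Vp) ** 2) + np.sum((Uq - V[q]) ** 2))
    return ll - reg

def sgd_epoch(U, V, transitions, neighbors, tau=0.05):
    """One SGD pass over observed transitions, nudging V(s_p) and U(s_q) uphill."""
    N = len(transitions)
    for p, q in transitions:
        gV = numeric_grad(lambda x: local_objective(x, U[q], U, V, p, q, neighbors[p]), V[p])
        gU = numeric_grad(lambda x: local_objective(V[p], x, U, V, p, q, neighbors[p]), U[q])
        V[p] += (tau / N) * gV
        U[q] += (tau / N) * gU
    return U, V

rng = np.random.default_rng(6)
n = 10
U, V = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
transitions = [(rng.integers(n), rng.integers(n)) for _ in range(50)]   # observed s_p -> s_q pairs
neighbors = [np.arange(n) for _ in range(n)]                            # trivially, C_p = S
U, V = sgd_epoch(U, V, transitions, neighbors)
print(np.linalg.norm(U), np.linalg.norm(V))
```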

Possible Extensions

In a simple Netflix-style item recommender, we would simply apply some form of matrix factorization (i.e. NMF) to T_{ab} directly, and ignore the sequence structure.  Here, it is noted that we have some flexibility in computing T_{ab} , and we could both kernelize and regularize it before running the inference algorithm.
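For reference, here is a quick sketch of precomputing the transition-count matrix T_{ab} from a set of training playlists (made-up toy playlists here); this is the same object one could factorize directly, or kernelize/regularize before running the LME inference.

```python
import numpy as np

def transition_counts(playlists, n_songs):
    """T[a, b] = number of observed transitions s_a -> s_b in the training playlists."""
    T = np.zeros((n_songs, n_songs), dtype=int)
    for p in playlists:
        for a, b in zip(p[:-1], p[1:]):
            T[a, b] += 1
    return T

playlists = [[0, 2, 3, 1], [2, 3, 1, 4], [0, 1, 4, 3]]   # made-up playlists of song indices
T = transition_counts(playlists, n_songs=5)
print(T)
```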

Next Steps

Now that we have a basic review of the model, it would be fun to make some personalized playlists.  This can be done using the online LME Software. Have fun exploring!

References

See Joachims’ website on Playlist Prediction and the references listed there.

6 Comments

  1. > In our next post, we will look in detail at the LME Software and build some actuals playlists.
    Looks like that won’t happen.
    Actually LME code is a mess. The fact that you can compile it with simple make is nice, but code is very hard to read.


    1. Frequently academic code needs to be cleaned up. Turns out the guy I was working on this with decided to switch gears and do something else.
