Today I am going to look at a very important advance in one of my favorite Machine Learning algorithms, NMF (Non-Negative Matrix Factorization) .
NMF is a curious algorithm in that it allows us to do something very natural and seemingly straightforward, namely, to find clusters of data points, and yet, mathematically, it is known to be non-convex and NP-HARD . Practically, this means that NMF–and many other common clustering algorithms, such as KMeans, LDA, etc.–can suffer from convergence problems and yield unstable and/or just plain unphysical results.
Several approaches exist to improve traditional NMF , such as sparse NMF ala Hoyer , using SVD initialization (i.e. NNDSVD) , and the convex NMF (CNMF) approach suggested by Jordan et. al. 
Recently, however, it has been shown how to re-formulate the traditional NMF problem, under fairly reasonable assumptions, as a true convex optimization problem, namely a linear programming (LP) problem . Is this new? Is it revolutionary? Yes and no.
In fact, it has been known for some time that CNMF could be re-formulated as purely convex optimization, called Convex-hull non-negative matrix factorization (CHNMF). Indeed, both CNMF and CHNMF are implemented in the Python Matrix Factorization (PyMF) module. The PyMF package, however, is not very fast, and in production, I would recommend using the more popular SciKit Learn NMF, with an NNDSVD seed, over the PyMF CNMF or CHNMF implementations, unless you really, really need a (more) sparse NMF solution.
So what is so exciting about the Bittorf et. al. paper? Their ‘CNMF-LP’ problem can be solved using a very fast, shared-memory, lock-free implementation of a Stochastic Conjugate Gradient (SGD) solver, called hottopix. (This is same kind of solver applied in other very high performance ML toolkits, such as the fantastic GraphLab package) IMHO, this is HUGE. It means we can stop wasting time with half-baked approaches like Hadoop+Mahout and actually solve very large scale, applied problems with the same performance we have come to expect from our SVMs.
I will break this post into a 3-4 parts. In this first post, I review the basic re-formulation of convex-NMF as a linear program.
Standard NMF problem
Let be a nonnegative data matrix with the rows indexing the features and the columns indexing the examples. That is,
where is a column, data vector. In standard NMF, we seek to factor into components, such that
is an feature matrix, and an weight matrix, and both are non-negative matrices.
Convex NMF (NMF) Problem
In Convex NMF (Jordan et. al.), we require that the feature vectors be convex linear combinations of the original data matrix, such that
This lets us relate the feature vectors to centroids of the data, as is common in Kmeans methods. Also, this leads to very sparse solutions. Lets us write
where is an matrix, called the factorization localizing matrix. This lets express convex NMF approximation as
although in other papers we define differently, and see the more cryptic form , which is the starting point for grokking the LP formulation.
Notational Confusion: from CNMF to ‘CNMF-LP’
We make a short side-note on notation.
The Jordan et. al. paper calls the data matrix and writes the factorization as
The Bittorf et. al. paper writes the data matrix the but adds the additional pre-factorization as
The matrix is the true data matrix, and the matrix is that part which satisfies the constraints to make the problem convex. They assume is small a in formal sense in order to make their theoretical argument and form an error bound (not addressed here). They then write the factorization almost interchangeably as
and / or
Also, note that they write the matrix on the left:
and / or
(switching between C on the left or right makes it hard to compare the details in the 2 CNMF papers, so we only sketch out the ideas here, and will work through the explicit details for you in the next post when we code it up)
Exact Convex NMF and Linear Programming
Certainly there are cases where NMF is well posed. That is, the data is sparse enough and the centroids separable enough that even though the Jordan et. al. convex NMF per-se is non-convex and probably NP-HARD, any practical solution with a good initial starting value will rapidly converge to a stable and unique solution.
The Bittorf et. al. observe that, by definition, for any well-posed convex NMF, the matrix elements of the factor loading matrix form a polyhedral set , such that
To solve an exact and well posed convex NMF problem, one simply needs to find a feasible solution of the set , such that all the diagonal elements . This is readily accomplished by solving the associated linear programming problem!
Real-World Convex NMF
To solve a real-world problem, we assume that the data matrix is nearly well-posed, upto some small amount of noise. We then solve the associated LP problem, where we replace the exact constraint with the relaxed constraint . This yields
subject to the constraints
; ; ; and
where is some arbitrary vector, and is an adjustable, slack parameter.
We now form the Feature matrix with yet a second LP program,
This entire problem is now readily addressable and solving using off the self convex-optimization libraries, such as Matlab and/or the cvxopt package in Python. In our next post, we will look at some toy problems and how to solve them directly using these techniques.
Lagrangian form of Convex NMF
I don’t know about you, but I always prefer to see the Lagrangian, or dual, form of a convex optimization problem, just to make sure I understand the definition and the constraints. Additionally, eventually we will want a high performance solver that we can run on very large data sets, so it is also very practical to review these equations now.
Bittorf et. al. provide an Incremental Gradient Descent algorithm where they dualize a part of the constraints. The corresponding Lagrange multiplier problem is
with the constraint set
Since the 1-norm that appears above, this requires special techniques such as subgradients. They suggest breaking the problem into two parts, a Dual Subgradient Ascent approach, that solves this problem incrementally by
1. alternating between minimizing the Lagrangian over the constraint set to find a solution , and then
2. plugging into a subgradient step with respect to the dual variables
We will go into more details in a future post.
Conclusion: Clustering via LP
So there we have it. A formulation of a convex NMF–a workhorse algorithm of applied machine learning–as a Linear Program (i.e. CNMF-LP), which can be solved using high performance, large scale convex optimization techniques. This has the potential to make clustering of very data scale data sets as easy as it is now to run very large scale linear SVMs.
It is, IMHO, a game changer.
1. Lee and Seung (1999), Learning the parts of objects by non-negative matrix factorization
2 Donoho & Stodden (2004), When does non-negative matrix factorization give a correct decomposition into parts?
3. Hoyer (2004), Non-negative Matrix Factorization with Sparseness Constraints
4. Boutsidis & Gallopoulos (2007), SVD based initialization: A head start for nonnegative matrix factorization
5. Ding, Li, & Jordan (2006), Convex and Semi-Nonnegative Matrix Factorizations
6. Bittorf, Recht, Re, & Troppy (2013), Factoring nonnegative matrices with linear programs
7. see: Python Matrix Factorization (PyMF) module and references within