SVM+ / LUPI: Learning Using Privileged Information

Recently Facebook hired Vapnik, the father of the Support Vector Machine (SVM) and of Vapnik-Chervonenkis (VC) Statistical Learning Theory.

Facebook Hires Vapnik

Less well known, Vapnik has also pioneered methods of Transductive and Universum Learning.  Most recently, however, he has developed the theory of Learning Using Privileged Information (LUPI), also known as the SVM+ method.

And this has generated a lot of interest in the community in developing numerical implementations.

In this post, we examine this most recent contribution.

Preface

Once again, we consider the problem of trying to learn a good classifier when we have only a small set of labeled examples x_{l}\in L .

Here, we assume we have multiple views of the data; that is, different, statistically independent feature sets \phi(x),\phi^{*}(x) that describe the data.

For example, we might have a set of breast cancer images, and some holistic, textual description provided by a doctor.  Or, we might want to classify web pages, using both the text and the hyperlinks.

Multi-View Learning is a big topic in machine learning, and includes methods like Semi-Supervised Co-Training, Multiple Kernel Learning, Weighted SVMs, etc.  We might even consider Ensemble methods as a kind of Multi-View Learning.

Having any extra, independent info x^{*} about our data is very powerful.

In LUPI, also called the SVM+ method, we consider the situation where we only have extra information x^{*}_{l} about the labeled examples x_{l}\in L .

Vapnik calls \{x^{*}\} Privileged Information, and shows us how to use it.

The VC Bounds

If we are going to discuss Vapnik we need to mention the VC theory.  Here, we note an important result about soft and hard-margin SVMs:  when the training set is non-separable, the VC bounds scale as \dfrac{h}{\sqrt{L}} :

P_{test}\;\le\;\nu_{train}+O\left(\dfrac{h}{\sqrt{L}}\right)

where \nu_{train} is the training error and h is the VC dimension.

But when the training set is separable,  i.e. \nu_{train}=0 , the VC bounds scale as \dfrac{h}{L}

P_{test}\;\le\;O\left(\dfrac{h}{L}\right)

This means that, to reach the same bound, we could use, say, L=300 labeled examples instead of L=90,000 (with the \dfrac{h}{\sqrt{L}} rate we would need \sqrt{L}=300 , i.e. L=300^{2}=90,000 ), which is a huge difference.
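To make the scaling concrete, here is a small back-of-the-envelope sketch in Python; the VC dimension h=30 and the target bound of 0.1 are made up purely for illustration.

# Toy comparison of the two capacity terms (illustrative numbers only).
h = 30          # hypothetical VC dimension
target = 0.1    # desired size of the capacity term

# Separable / Oracle regime: h / L <= target  =>  L >= h / target
L_separable = h / target             # 300

# Non-separable regime: h / sqrt(L) <= target  =>  L >= (h / target)**2
L_nonseparable = (h / target) ** 2   # 90,000

print(L_separable, L_nonseparable)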

It turns out we can also achieve \dfrac{h}{L} scaling if we have an Oracle that tells us, a priori, the slack for each training example; hence the term Oracle SVM.  In effect, the Oracle tells us how much confidence we have in each label.

So if we can get the slacks \{\varepsilon_{l}\} , or, equivalently, the weights, for each instance, L can be much smaller.

[Of course, this is only true if we can sample all the relevant features.]
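As a rough illustration of the Oracle idea, here is a minimal sketch (assuming scikit-learn) that uses per-instance sample weights as a stand-in for per-instance confidences.  The data and the weights are completely made up; in LUPI proper the confidences would come from the privileged information, not from a formula like this.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(300, 10)                        # a small labeled set L
y = np.sign(X[:, 0] + 0.5 * rng.randn(300))   # noisy +/-1 labels

# Hypothetical per-instance confidences (an Oracle would supply these);
# here we simply invent them for illustration.
confidence = np.abs(X[:, 0]) / (1.0 + np.abs(X[:, 0]))

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y, sample_weight=confidence)       # weighted SVM as an Oracle stand-in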

Soft Margin SVM

In practice, we usually assume the data is noisy and non-separable. So we use a soft-margin SVM, where we can adjust the amount of slack with the C (cost) parameter.

Note that we also have the bias b , but we can usually set it to zero (still, be careful doing this!).

Let us write the soft-margin SVM optimization as

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\varepsilon_{l}\right)

subject to the constraints

y_{l}\left(\langle w,x_{l}\rangle+b\right)\ge 1-\varepsilon_{l}\;\;\forall\;l\in L,\;\;\varepsilon_{l}\ge 0
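For concreteness, here is a direct transcription of this constrained primal as a generic QP; this is only a sketch, assuming cvxpy and a made-up toy dataset, and real SVM solvers instead work on the dual or on the unconstrained form discussed below.

import numpy as np
import cvxpy as cp

rng = np.random.RandomState(0)
X = rng.randn(100, 5)                          # toy features
y = np.sign(X[:, 0] + 0.3 * rng.randn(100))    # toy +/-1 labels
n, d, C = X.shape[0], X.shape[1], 1.0

w = cp.Variable(d)
b = cp.Variable()
eps = cp.Variable(n)                           # one slack per labeled example

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(eps))
constraints = [cp.multiply(y, X @ w + b) >= 1 - eps,   # margin constraints
               eps >= 0]
cp.Problem(objective, constraints).solve()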

Notice that while we formally add slack variables \{\varepsilon_{l}\} to the max-margin optimization, they eventually vanish.  That is, we don’t estimate the slacks; we replace them with the margin violation, i.e. the Hinge Loss error

\hat{err_{l}}=\max[0,\;1-y_{l}(\langle w,x_{l}\rangle+b)]

We then solve the unconstrained soft-margin SVM optimization, which is equivalent to the constrained problem above (the hinge loss itself is a convex upper bound on the 0-1 loss), as explained in this nice presentation on LibLinear.

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\max[0,\;1-y_{l}(\langle w,x_{l}\rangle+b)]\right)

or, in simpler terms

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+C\sum_{l\in L}\hat{err_{l}}\right)
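In plain numpy, this unconstrained objective looks something like the sketch below; specialized solvers such as LibLinear minimize essentially this quantity, just far more efficiently than a naive implementation would.

import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    # 0.5 <w,w> + C * sum_l max(0, 1 - y_l (<w, x_l> + b))
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * hinge.sum()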

And this works great when we have lots of labeled data L.

What if we could actually estimate or assign all L slack variables \{\varepsilon_{l}\} ?  In principle, we could then use far fewer labeled examples.  This is the promise of LUPI.

LUPI/SVM+

In LUPI,  we use the privileged information \{x^{*}_{l}\} to learn the slack variables \{\varepsilon_{l}\} :

\varepsilon_{l}=\langle w^{*},x_{l}^{*}\rangle+b^{*}

That is, we model the slacks as linear functions of the privileged information.

Effectively, the privileged information provides a measure of confidence for each labeled instance.

To avoid overtraining, we apply a max-margin approach.  This leads to an extended SVM+ optimization problem, with only 2 adjustable regularization parameters C,\gamma (and 2 bias parameters):

\arg\min\left(\dfrac{1}{2}\langle w,w\rangle+\dfrac{\gamma}{2}\langle w^{*},w^{*}\rangle+C\sum_{l\in L}[\langle w^{*},x_{l}^{*}\rangle+b^{*} ]\right)

subject to the constraints

y_{l}\left(\langle w,x_{l}\rangle+b\right)\ge 1-[\langle w^{*},x_{l}^{*}\rangle+b^{*}],

\langle w^{*},x_{l}^{*}\rangle+b^{*}\ge 0\;\;\forall\;l\in L
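While practical implementations solve the dual with specialized SMO-style solvers (see below), for small toy problems the SVM+ primal above can be transcribed directly into a generic QP framework.  Here is a minimal sketch, assuming cvxpy and made-up toy data; the privileged features only enter through the correcting function that replaces the slacks.

import numpy as np
import cvxpy as cp

rng = np.random.RandomState(0)
n, d, d_star = 100, 5, 3
X      = rng.randn(n, d)                      # decision-space features
X_star = rng.randn(n, d_star)                 # privileged features (training only)
y      = np.sign(X[:, 0] + 0.3 * rng.randn(n))

C, gamma = 1.0, 0.1
w,  b  = cp.Variable(d), cp.Variable()
ws, bs = cp.Variable(d_star), cp.Variable()   # correcting function w*, b*

slack = X_star @ ws + bs                      # epsilon_l = <w*, x_l*> + b*
objective = cp.Minimize(0.5 * cp.sum_squares(w)
                        + 0.5 * gamma * cp.sum_squares(ws)
                        + C * cp.sum(slack))
constraints = [cp.multiply(y, X @ w + b) >= 1 - slack, slack >= 0]
cp.Problem(objective, constraints).solve()

# At test time only w and b are used; the privileged part (ws, bs) is discarded.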

Show me the code?

The SVM+/LUPI problem is a convex (quadratic) optimization problem with linear constraints, although it does require a custom solver.  The current approach is to solve the dual problem using a variant of SMO.  There are a couple of versions (below) in Matlab and Lua.  The Matlab code includes the related SVM+MTL (SVM+ MultiTask Learning).  I am unaware of an open source C or Python implementation similar to Svmlin or Universvm.

To overcome the solver complexity,  other methods, such as the Learning to Rank method, have been proposed.  We will keep an eye out for useful open source platforms that everyone can benefit from.

Summary

The SVM+/LUPI formalism demonstrates a critical point about modern machine learning research.  If we have multiple views of our data,  extra information available during training (or maybe during testing), or just several large Hilbert spaces of features, then we can frequently do better than just dumping everything into an SVM, holding our nose, and turning the crank.

References:

or how many ways can we name the same method?

Learning with Hidden Information

The SVM+ method: MIT Reading Notes

Learning with Non-Trivial Teacher

Learning to Rank Using Privileged Information

On the Theory of Learning with Privileged Information

SMO-style algorithms for learning using privileged information

Learning Using Privileged Information: SVM+ and Weighted SVM

Learning with Asymmetric Information

Comparison of SVM+ and related methods (SVM+MTL, rMTL, SVM+ regression, etc.)

Proof and Implementation of Algorithmic Realization of Learning Using Privileged Information (LUPI) Paradigm: SVM+

Videos

Learning with Privileged Information

Simmons Foundation Lecture

Software

Matlab and CVX Versions

Milde LUA Code
