Machine Learning with Missing Labels Part 2: The UniverSVM

Ever wonder what Google DeepMind is up to? They just released a paper on Semi-Supervised learning with Deep Generative Models. What is Semi Supervised Learning (SSL)? In this series of posts, we go back to basics and take a look.

Most machine learning algos require huge amounts of labeled data to achieve high accuracy. SVMs, Random Forests, and especially Deep Learning, can take advantage of massive labeled data sets.

How can we learn when we only have a few labeled examples?

In SSL, we try to improve our classifiers by taking advantage of unlabeled data.

In our last post we described an old-school approach to this problem, the Transductive SVM (TSVM), circa 2006. Here, we examine a different approach to learn from unlabeled data–Infernece with the Universum–ala Vapnik 2006.

Plus, there is code! The UniverSVM code has a TSVM and the Universum (USVM) approach.

Wait, doesn’t Deep Learning use unlabeled data in pretraining?

This is different. Deep Learning uses huge amounts of labelled data. We want to use as little labeled data as possible. To me, this is the real test of a learning theory.

In this and the next few blogs, we will examine at several variants of SVMs– USVM, WSVM, SVM+, and Semi Supervised Deep Learning methods–all that achieve high accuracy with only a small amount of labeled data. Also, we only consider text classification–images and other data sets require different methods.

Learning with Unlabeled Data

Let’s say with have a small number $N_L$ of labeled documents, and a large number $N_U$ of unlabeled ones. Can we build document classifier (i.e. a binary SVM) that generalizes well with just a little labeled data?

According to the Vapnik and Chervonenkis Statistical Learning Theory (VC-SLT), our generalization accuracy is very losely bounded by the model complexity and the number of training documents $N_L$ :

$R(\alpha)-R_{emp}(\alpha)\leq2\mathcal{R}_{m}(\mathcal{H})+\kappa\sqrt{\dfrac{ln(\frac{1}{\delta})}{N_L}}$

which, in plain English, is just

Generalization Error $\leq$ Model Capacity ( $2R_{m}$ ) $+$ Size Effects ( $\ \sim\sqrt{\dfrac{1}{N_{L}}}$ )

The VC-SLT inspired the SVM maximum margin approach (although they are not equivalent).

SLT recognizes that for any data set $\mathcal{L}$ (of size $N_{L}$ ), we may obtain multiple, equivalent models (i.e. all having the same training accuracy):

$\Gamma_{L}=\left[\mathcal{H}_{1}\sim\mathcal{H}_{2}\sim\mathcal{H}_{3}\sim\cdots\right]$

We should select the model $\mathcal{H}$ from $\Gamma_{L}$ with the smallest capacity $2R_{m}(\mathcal{H})$ ; in an SVM, this means select the largest margin.

This is easiest to visualize when the slack or admissible error = 1

Each model $\Gamma_{L}$ is a set of hyperplanes that give same labeling. The left model is optimal because it has the largest ‘SVM Capacity’, which is essentially the volume carved out by the admissible hyperplanes. Turns out, the best model also has the hyperplane with the largest margin, so

The maximun margin is a measure of the VC capacity…

but is it the best?

While the VC-SLT bound is very loose, it does suggest that we usually need a large number of labeled documents $N_L$ to obtain any reasonable production accuracy.

The SLT is, however, inherently a theory about Transductive Learning; the proof of the VC bounds requires first proving the Transduction bound (i.e. via Symmetrization). It has always been the dream of the VC-SLT program to develop a Transductive or SemiSupervised) method that can learn much faster than $\dfrac{1}{\sqrt{N_{L}}}$ .

(By Semi-Supervised, we mean that the resulting model can be applied to out-of-sample-data directly, where as Transductive learning only applies to the known, unlabeled test set. We will not distinguish between them here.)

We would hope to achieve $\dfrac{1}{\sqrt{N_{L}+N_{U}}}$ by adding in unlabeled data. When $N_U$ is very large, we win.

(Indeed, recently, Vapnik has shown it is actually possible to reduce this bound from $\dfrac{1}{\sqrt{N_{L}}}$ to $\dfrac{1}{N_{L}}$ –if we can Learn Using Privileged Information (LUPI). This means we can learn when $N_{L}=320$ whereas previously we needed $N_{L}=100,000$ ! But this actually is akin to assigning additional data or even weights to each labelled example–and this is not the same as using just unlabeled data. Perhaps we can learn the weights from the unlabeled data–but that’s for another post.) So we are left with the un-nerving statement

If the max margin is the right measure, then the TSVM should work very well…

and yet, this has proved elusive.

What is the alternative ? Suppose instead of measuring the volume traced out by the hyperplanes, our models measured the volume between the convex hulls of the 2 classes:

Notice that now, the labeling on the right is better.

This is a broader measure of the diversity of the equivalence class. In SLT, this is related to the VC Entropy (another measure of VC capacity).

The Universum approximates this volume–or rather the VC Entropy– via the Principle of Maximum Contradictions. (In fact, the Universum is not the only way to do this, as we shall see.) A clever idea–but something hard to achieve in practice. Let us compare and contrast the TSVM and the USVM to understand how to select the data sets they need and their limitations.

Transductive SVM (TSVM)

Transductive Inference requires a statistical replica $\mathcal{R}$ of the labeled data $\mathcal{L}$ :

The replica $\mathcal{R}$ is not only the same size as $\mathcal{L}$ , it has the same statistical qualities. In particular, the label means converge: $<y_U>\rightarrow<y_L>$ as $\lim N\rightarrow\infty$ . (i.e. in a well balanced binary classifier, this is zero)

In theory, we can always create a replica (or phantom sample) because we assume the labeled data itself is drawn from some common empirical process. In modern VC-SLT, we think of the symmetrization process that creates $\mathcal{R}$ as a Rademacher process — meaning that we have replicated the training data but randomized the labels.

In practice, we need to select the replica from unlabeled data–and this is hard!

We hope that by adding unlabeled data, we can find the best solution by guessing the labels of the unlabeled data and then maximizing the margin on all the data.

We can apply Transduction SVMs if we can create a large, unlabeled data set $\mathcal{U}$ that behaves like a statistical replica of $\mathcal{U}$ , albeit much, much larger. TSVMs allow us to increase the accuracy of a binary text classifier by adding in large amounts of unlabeled data.

Also, TSVMs extend the feature space, or hypothesis space $\mathcal{H}$ . This is critical for text classification since , frequently, we need to classify documents we have never seen before, and we encounter new words. This is irrelevant for image classification.

If we have a collection of consumer blogs (about finance, beauty, sports, politics, …), with some labelled documents, and large amount of unlabeled ones. We can create (1-vs-1) binary TSVM classifiers, such as finance vs beauty if we have a good way to select the unlabeled data, as described in our previous blog:

Essentially, the unlabeled documents must consist only of finance and beauty blogs, and in the same ratio as the training data.

TSVMs only work well for simple binary ( 1 vs 1 ) classification, and only when the document classes form simple clusters. They don’t work for multi-class classifier because they can not handle ( 1 vs all ) data sets.

So while TSVMs do work, not just any unlabeled data set will do. Indeed, I personally believe the key to a good TSVM is creating a well balanced unlabeled data. Or, equivalently, estimating the fraction $r$ of (+/-) examples very well. If the replica set is poor, or poorly understood, the TSVM results can be worse than just training an SVM on only the labeled data.

UniverSVM (USVM)

In 1998, and , later in 2006, Vapnik introduced a different kind of SVM–which also the allows learning from unlabeled data–but replaces the Principle of Maximum Margin with a more robust method called

the UniverSVM: the Principle of Maximum Contradictions

The idea is to add in data from classes that are significantly different from the 2 classes being separated:

To create a finance vs beauty classifier, we would add labelled and/or unlabeled documents from other categories, such as parenting, politics, sports, etc. We then want a binary classifier that not only separates the 2 labeled classes, but is also as different from the other class–the Universum.

We create 2 replicas of the Universum–one with all (+) labels, and one all (-). We then add unlabeled documents (blue circles) that lie in the margin between classes–or really in the space between the (+/-) classes:

Screen Shot 2014-09-30 at 12.08.39 AM

The best Universum examples will lie in the convex hull between the finance and beauty documents; those inside the convex hulls will most likely be ignored. Since all the u-labels are wrong, every equivalence class of hyperplanes will produce numerous contradictions on the Universum points (u):

The best model has the largest diversity on the Universum; in other words, the largest VC Entropy. Vapnik has pointed out the following interpretation of the Universum: “[When trying to classify the labeled data, try to avoid extra generalizations.]”

The best model has the Maximum # of Contradictions on the Universum.

The UniverSVM (USVM) represents a kind of prior information that we add to the problem; the difference is that instead of specifying a prior distribution, which is hard, we replace this with a more practical approach–to specify a set of specific examples.

Notice we could have just created a 3-class SVM (and theUniverSVM code provides this for direct comparison). Or we could build 2-class classifier, augmented with some clustering metric to avoid the other class--in the same spirit the S4VM method. See, for example, the recent paper on EMBLEM. But in these simple approaches, the other class must only contain other–and that’s hard.

USVMs compared to TSVMs:

In the TSVM we have to be sure we are adding documents that are in the same classes as the training data. In the USVM, we must be sure we are adding documents that do not belong to either class.

Also, with a TSVM, we need to know the fraction of (+/-) documents, whereas with the USVM we don’t. The USVM would seem to admit some data from all classes–in principle. In practice, we will do much better if it does not.

Most importantly, unlike the TSVM, the USVM optimization is convex. So adding in bad data may will not cause convergence problems as in the TSVM and thus degrade the model. At least we hope.

Also, as with the TSVM, we suspect the USVM will only work for small $N_{L}$ ; once $N_{L}$ grows above say 5% of the total documents, we may not see much improvement.

The UniverSVM Algorithm

The SVM approach, inspired by SLT, implements a data dependent, but distribution independent, regularizer that lets us select the best model $\mathcal{H}$ from a class of equivalent hypotheses $\Gamma_{L}$ . What other methods select the optimal equivalence class $\Gamma_{r}$ ?

the Principe of Maximum Power

A model $H$ consists of an equivalence class of hyperplanes, say $h\in H_{\Gamma}$ . They all have the same accuracy on the labeled data $\mathcal{L}$ .

Suppose we know a prior distribution $P(\omega)$ on the set of all possible hyperplanes. Then we can define the power p* of the optimal model as

$p*=\int_{H_{\Gamma}*}d\mathbf{h}P(\mathbf{h})$

We rarely can define $P(\mathbf{h})$ mathematically..but we can approximate it.

Let us sample $P(\mathbf{h})$ by selecting $N_{U}$ unlabeled example documents; we call this set the Universum ( $\mathcal{U}$ ).

We call the practical method the UniverSVM. The key insight of the UniverSVM is that while we can not compute this integral, we can estimate it by measuring the number of contradictions the class of hyperplanes generates on points in $\mathcal{U}$ .

The Principle of Maximum Contradictions:

We need a way to select the equivalence class $\Gamma$ with the maximum number of contradictions on $\mathcal{U}$ . As usual, we create a regularizer.

We augment the standard SVM optimization problem with a Universum regularizer $\Vert U(f(x_u))\Vert$

$\frac{1}{2}\Vert\omega\Vert_{2}^{2}+C_{l}\sum_{l\in\mathcal{L}}H(y_{l},f(x_{l},b))+C_u\Vert U(f(x_u))\Vert$

where H is the standard SVM Hinge Loss.

The regularizer $\Vert U(f(x_u))\Vert$ can also be defined through a symmetric Hinge loss, making the problem convex. The UniverSVM code also contains a non-convex variant using a ramp-loss approach. We leave the details to the academic papers.

Laplacian SVMs and the The Maximum Volume Principle

About the same time Vapnik introduced the UniverSVM, Nyogi (at the University of Chicago) introduced both the TSVM SvmLin code, as well as a new approach to Semi-Supervised learning, the Laplacian SVM (or LapSVM).

The TSVM maximizes the margin for the labelled and unlabeled data. The USVM maximizes the contradictions on the unlabeled (UniverSVM) set. In fact, both approximate a more general form of capacity–the maximum volume between the convex hulls of the data sets. And this can be approximation using Laplacian regularization! Let’s see how this all ties together.

The LapSVM optimization is

$\frac{1}{2}\Vert\omega\Vert_{2}^{2}+C_{l}\sum_{l\in\mathcal{L}}H(y_{l},f(x_{l},b))+C_{u}\Vert\mathcal{L}\Vert$

We see it is very similar to the USVM, but with a different regularizer –norm of the Graph Laplacian $\Vert\mathcal{L}\Vert$ . There are several $\mathcal{L}$ to choose from but LapSVM uses the one corresponding to the Laplace-Beltrami operator on the data manifold.

The LapSVM has recently been applied to text classification, although we dont have a C or python version of the code to test like with do with SvmLin and the UniverSVM.

LapSVMs and related Manifold Learning approaches have motivated some very recent advances in Semi-Supervised Deep Learning methods, such as the Manifold Tangent Classifier. This classifier learns the underlying manifold using contractive auto-encoder, rather than using a simple Laplacian–and this seems to work very well for images.

(We note that the popular SciKit Learn package includes Manifold Learning, but these are only the Unsupervised variants.)

We can relate the LapSVM approach to the VC-SLT, through the Principle of Maximum Volume:

the Norm of the Graph Laplacian is measure of the VC Entropy

Let us again assume we have to select the best equivalence class $\Gamma^{*}$ on our labeled data $\mathcal{L}$ from a set of hyperplanes $\mathbf{h}\in H_{\Gamma}*$ . Lacking a specific prior , we can assume a uniform distribution $P(\mathbf{h})=1$ .

We then simply need to approximate the volume

$V_{r}(\mathbf{h})=\int_{\Gamma_{r}}d\mathbf{h}$

We need a way to compute V, so we assume there exists an operator $\mathbf{Q}$ such that

$V_{r}(\mathbf{h})\equiv\dfrac{\mathbf{h^{\dagger}Qh}}{\mathbf{h^{\dagger}h}}$

To this end, Vapnik et. al. introduce “a family of transductive algorithms which implement the maximum volume principle“, called Approximate Volume Regularization (AVR) algorithms(s). They take the form

$\min Loss_{l}(\mathbf{h})+C_{u}\dfrac{\mathbf{h^{\dagger}Qh}}{\mathbf{h^{\dagger}h}}$

For many problems, $\mathbf{Q}$ can just be the Graph Laplacian

$\mathbf{Q}=\mathcal{L}$ ,

the specific form specified by the specific AVR.

If we write the general form of $\mathcal{L}$ as

$\mathcal{L}=I-D^{-\frac{1}{2}}WD^{\frac{1}{2}}$

W can be defined using the Gaussian Similarity

$W_{i,j}=exp(\dfrac{-\Vert x_{i}-x_{j}\Vert^{2})}{2\sigma^{2}}$

which has a single adjustable width parameter $\sigma$ .

A more robust Laplacian uses the Local-Scaling Similarity

$W_{i,j}=exp(\dfrac{-\Vert x_{i}-x_{j}\Vert^{2})}{2\sigma_{i}\sigma_{j}}$

which has N adjustable parameters $\sigma_{i}$ .

the Maximum Volume Principle may be a better measure of VC capacity

This Principle of Maximum Volume has been applied in a number of ways, such as in recent papers on Maximum Volume Clustering and , and this one for Outlier Detection.

Most importantly, a very recent paper proposes a MultiClass Approximate volume Regularization Algorithm (MAVR).

We have also connected to seemingly different approaches to machine learning–traditional VC theory and operator theoretic manifold learning. In a future post, we will look at some practical examples and compares how well the available open source TSVM and USVM codes on text data.

5 Comments

Pingback: Machine Learning with Missing Labels Part 3: Experiments | Machine Learning
Pingback: Spectral Clustering: A quick overview | Machine Learning
Pingback: Semi-supervised learning using UniverSVM – how to use an index file into training data | karmendykstra
Pingback: SVM+ / LUPI: Learning Using Privileged Information | Machine Learning
gideon Isaac says:

November 30, 2016 at 9:42 am

This comment is based on a few posts you had on RBS. I am confused.
Just to give an example of one stumbling block (for me), you tried to show that the function being minimized for an RBM had both an entropy and an energy component. I did not understand what ‘entropy’ meant in this context. Does it mean “disorder” is minimized? “information” is maximized? And if so, what disorder, or what information? With energy as well, did it mean that incompatibilities (such as incompatible spins in a spin glass) are minimized? If so, what incompatibilities?
And I didn’t quite understand why having vast number of hidden units makes for a more convex energy landscape. Does it have something to do with ‘degrees of freedom’ or with more ways existing of reaching a solution?

Could you suggest some topics that I should read before reading your blog – so that your blog makes more sense to me. It looks fascinating.

LikeLike