# Support Vector Machine

#### One of the most popular classifiers is the Support Vector Machine, through which the general task of classification can be explained quite well.

In the learning setting of Support Vector Machines one typically has attributes in \(\mathbb{R}^d\) and labels from two classes (with values 1 and -1). For example, the classification task could be to predict the gender of a person from his/her height, weight and some other numeric attributes.

The starting point is a training set \(M=\{(x_{i},y_{i}) \in \mathbb{R}^{d}\times\{-1,1\}, i = 1,\dots,n\}\) which is assumed to be linearly separable. This means that one can find a hyperplane in \(\mathbb{R}^d\) that splits the training set into two subsets, where every sample in the first subset has label 1 and every sample in the second subset has label -1.

Formally this means that there exist \(w \in \mathbb{R}^d\) and \(b \in \mathbb{R}\) such that for every \(x_{i}\) the inequality \(y_{i}(\langle w,x_{i} \rangle+b)>0\) holds. This is because \(S=\{x \in \mathbb{R}^d \mid \langle w,x \rangle+b=0\}\) describes a hyperplane in \(\mathbb{R}^d\), and the inequality \(y_{i}(\langle w,x_{i} \rangle+b)>0\) specifies that all \(x_{i}\) with the same label lie on one side of the hyperplane (e.g. for \(y_{i}=1\), \(\langle w,x_{i} \rangle+b\) must be greater than 0).
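This separation condition is easy to check numerically. The following minimal sketch uses a small hypothetical 2-D training set and a hand-picked candidate \((w, b)\) (both are assumptions for illustration, not data from this article):

```python
import numpy as np

# Hypothetical 2-D training set: points are labeled by the sign of x1 + x2.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A candidate hyperplane (w, b), chosen by hand for this toy data.
w = np.array([1.0, 1.0])
b = 0.0

# The separation condition: y_i * (<w, x_i> + b) > 0 for every training point.
margins = y * (X @ w + b)
print(margins)              # per-sample signed values
print(np.all(margins > 0))  # True -> (w, b) separates the training set
```

If any entry of `margins` were negative or zero, the candidate hyperplane would misclassify (or touch) that point.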

So if we could find such a hyperplane, we would already have a classifier that classifies the training set correctly, and we could hope that this is enough to generalize to unseen data points. At this point the support vector machine goes a step further and searches, among all hyperplanes that actually separate the training set, for the one(s) with the biggest margin m to their nearest data points. This can be seen in the following picture, where the red line is the hyperplane and m is the margin to the nearest data points:

#### Furthermore we rescale w and b so that not just \(y_{i}(\langle w,x_{i} \rangle+b)>0\) holds, but also \(y_{i}(\langle w,x_{i} \rangle+b)\ge 1\) for all data points (which is always possible if the first condition holds for some values of w and b). This can be seen as a trick to make the later optimization problem more tractable.

Now let \(x_{0}\) be an arbitrary point on the hyperplane and \(x_{i}\) one of the nearest points, w.l.o.g. with positive label (otherwise some signs flip).

It follows from analytical geometry that

#### \(m = \frac{\langle w, x_{i}-x_{0} \rangle}{\Vert w \Vert} = \frac{\langle w, x_{i}\rangle +b - \langle w, x_{0} \rangle -b }{\Vert w \Vert} = \frac{\langle w, x_{i}\rangle +b}{\Vert w \Vert}\ge \frac{1}{\Vert w \Vert} \\ \)
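The relation between the canonical scaling and the geometric margin can be verified numerically. This sketch reuses a hypothetical toy data set and candidate hyperplane (assumptions for illustration), rescales \((w, b)\) so the nearest point satisfies \(y_{i}(\langle w,x_{i} \rangle+b)=1\), and checks that the resulting geometric margin equals \(1/\Vert w \Vert\):

```python
import numpy as np

# Hypothetical toy data and a separating (w, b), chosen by hand.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

# Rescale (w, b) so the nearest point satisfies y_i(<w, x_i> + b) = 1
# (the canonical form assumed in the derivation above).
scale = np.min(y * (X @ w + b))
w_c, b_c = w / scale, b / scale

# The geometric margin of the nearest point is then exactly 1 / ||w_c||.
geom_margin = np.min(y * (X @ w_c + b_c)) / np.linalg.norm(w_c)
print(geom_margin)
print(np.isclose(geom_margin, 1 / np.linalg.norm(w_c)))  # True
```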

#### So in order to maximize m, we can maximize \(\frac{1}{\Vert w \Vert} \) and in order to maximize \(\frac{1}{\Vert w \Vert} \) we can minimize \(\frac{1}{2}\Vert w \Vert^2 \), which leads to the following optimization problem:

#### \(P: ~~\min_{w,b}\frac{1}{2} \Vert w \Vert^2 ~~~s.t.~~~ y_{i}(\langle w,x_{i} \rangle+b)\ge 1, ~~i = 1,\dots,n \\\)
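P is a quadratic program and can be solved directly with a general-purpose constrained optimizer. The following sketch uses `scipy.optimize.minimize` on a hypothetical four-point data set (the data and starting point are assumptions for illustration; a dedicated QP solver would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
d = X.shape[1]

# Decision variables z = (w_1, ..., w_d, b).
def objective(z):
    w = z[:d]
    return 0.5 * w @ w  # (1/2) ||w||^2

# One inequality constraint per sample: y_i(<w, x_i> + b) - 1 >= 0.
constraints = [{"type": "ineq",
                "fun": lambda z, xi=xi, yi=yi: yi * (xi @ z[:d] + z[d]) - 1.0}
               for xi, yi in zip(X, y)]

res = minimize(objective, x0=np.zeros(d + 1), constraints=constraints)
w_opt, b_opt = res.x[:d], res.x[d]
print(w_opt, b_opt)
```

For this toy data the optimum is \(w=(0.5, 0)\), \(b=0\), i.e. a vertical separating line with geometric margin \(1/\Vert w \Vert = 2\).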

#### To solve this optimization problem we can use duality theory and the Lagrangian function. In this case the Lagrangian function is

#### \(L(\alpha,w,b) = \frac{1}{2} \Vert w \Vert^2 - \sum_{i=1}^{n} \alpha_{i}(y_{i}(\langle w,x_{i} \rangle+b)-1)\\\)

#### where \(\alpha_{i}\) are the Lagrange multipliers. To solve P we can first minimize \(L(\alpha,w,b)\) with respect to w and b and afterwards maximize the resulting expression with respect to \(\alpha\). So first we find a saddle point of L with regard to w and b:

#### \(\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} \alpha_{i} y_{i} x_{i} = 0 \Leftrightarrow w = \sum_{i=1}^{n} \alpha_{i} y_{i} x_{i} \\\\\frac{\partial L}{\partial b} = - \sum_{i=1}^{n} \alpha_{i} y_{i} = 0 \)

#### Plugging these conditions into the Lagrangian function leads to

#### \( L(\alpha) = -\frac{1}{2} \langle \sum_{i=1}^{n} \alpha_{i} y_{i} x_{i}, \sum_{j=1}^{n} \alpha_{j} y_{j} x_{j} \rangle + \sum_{i=1}^{n} \alpha_{i} = \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i}\alpha_{j} y_{i} y_{j} \langle x_{i}, x_{j} \rangle \)

#### which can then be maximized over \( \alpha_{i} \) in order to solve P, subject to the additional restriction \( \alpha_{i}\ge 0 \) and the above condition \( \sum_{i=1}^{n} \alpha_{i} y_{i} = 0 \) (in the end the optimal values w* and b* depend on \( \alpha_{i} \)). This is the final optimization problem to be solved in order to get w and b (to train the SVM).
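The dual problem can also be solved with a general-purpose optimizer: minimize \(-L(\alpha)\) under the box constraint \(\alpha_i \ge 0\) and the equality constraint \(\sum_i \alpha_i y_i = 0\), then recover w from the stationarity condition. A minimal sketch on the same kind of hypothetical toy data (assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [2.0, 0.0], [-2.0, -2.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Gram matrix of dot products <x_i, x_j>, weighted by the labels.
Q = (y[:, None] * y[None, :]) * (X @ X.T)

# Maximizing L(alpha) = sum_i alpha_i - (1/2) alpha^T Q alpha
# is the same as minimizing its negative.
def neg_dual(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(neg_dual,
               x0=np.zeros(n),
               bounds=[(0, None)] * n,                  # alpha_i >= 0
               constraints=[{"type": "eq",
                             "fun": lambda a: a @ y}])  # sum_i alpha_i y_i = 0

alpha = res.x
w = (alpha * y) @ X          # stationarity: w = sum_i alpha_i y_i x_i
# b from a support vector (a point with alpha_i > 0): b = y_i - <w, x_i>
sv = np.argmax(alpha)
b = y[sv] - X[sv] @ w
print(alpha, w, b)
```

For this data the dual yields the same \(w=(0.5, 0)\), \(b=0\) as the primal, with nonzero \(\alpha_i\) only on the two support vectors \((2,0)\) and \((-2,0)\).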

For some data points \( x_{i} \) the value of \( \alpha_{i} \) will be 0, and thus these points have no influence on w (\(w = \sum_{i=1}^{n} \alpha_{i} y_{i} x_{i}\)). The points where \( \alpha_{i}>0 \) holds, and which therefore influence the parameter w, are called the "Support Vectors". They are in fact the only points that characterize the hyperplane (the parameter w).

The final classification of a new data point x, once the right w and b are found, is

#### \(h(x)=\mathrm{sign}(\langle w , x \rangle + b)=\mathrm{sign}(\langle \sum_{i=1}^{n} \alpha_{i} y_{i} x_{i} , x \rangle + b)=\mathrm{sign}(\sum_{i=1}^{n} \alpha_{i} y_{i}\langle x_{i} , x \rangle + b) \\\)
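Note that h(x) needs only the support vectors and their multipliers, not w itself. A minimal sketch of this decision function, using hypothetical values of the kind the dual problem would produce (the support vectors, labels, \(\alpha_i\) and b below are assumptions for illustration):

```python
import numpy as np

# Hypothetical quantities recovered from the dual problem.
X_sv = np.array([[2.0, 0.0], [-2.0, 0.0]])   # the support vectors
y_sv = np.array([1.0, -1.0])                 # their labels
alpha = np.array([0.125, 0.125])             # their multipliers
b = 0.0

def h(x):
    # h(x) = sign( sum_i alpha_i y_i <x_i, x> + b )
    return np.sign((alpha * y_sv) @ (X_sv @ x) + b)

print(h(np.array([3.0, 1.0])))   # 1.0
print(h(np.array([-1.0, 4.0])))  # -1.0
```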

So now we are able to classify new data points with h(x), hoping that newly classified samples follow the same pattern as the training samples. The fact that our hyperplane has the largest margin to the training-set data points is at least an argument that this classifier will generalize better to new data points than any other separating hyperplane.

Another problem arises from the possibility that our training set may not be linearly separable. To overcome this problem one can use the so-called Kernel Trick. A given data point \(x_{i}\in \mathbb{R}^d\) is transformed into a space of higher dimension (e.g. \(\mathbb{R}^l, l>d\)) via a transformation function \(\Phi(x)\). Some mathematical results show that in higher-dimensional spaces a set that is at first not linearly separable can then become separable. The special advantage in the case of our SVM optimization problem is the fact that we only need to know \(\langle x_{i}, x_{j} \rangle\) or \(\langle \Phi(x_{i}), \Phi(x_{j}) \rangle\) to carry out our optimization (look at the Lagrangian) and our classification (look at h(x)). This function \(K(x,y)=\langle \Phi(x), \Phi(y) \rangle\) is called a Kernel. In the SVM procedure we only need to know the Kernel and make sure that there **exists** a function \(\Phi(x)\) which transforms into a space where the Kernel is a dot product, rather than actually knowing the concrete form of \(\Phi(x)\). A famous Kernel is the RBF Kernel \(K(x,y)=e^{-\frac{\Vert x-y \Vert^2}{2\sigma^2}}\), which can make the whole described SVM procedure work even for highly non-linearly separable data, because it corresponds to a dot product in an infinite-dimensional space.
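The kernel identity \(K(x,y)=\langle \Phi(x), \Phi(y) \rangle\) can be checked concretely for a kernel whose feature map is known in closed form. For the RBF kernel \(\Phi\) is infinite-dimensional, so this sketch instead uses the degree-2 polynomial kernel \(K(x,y)=\langle x,y \rangle^2\) on \(\mathbb{R}^2\), whose explicit map into \(\mathbb{R}^3\) is \(\Phi(x)=(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)\) (the example points are arbitrary):

```python
import numpy as np

# Degree-2 polynomial kernel on R^2, evaluated without leaving R^2.
def K(x, y):
    return (x @ y) ** 2

# Its explicit feature map into R^3: Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def Phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(K(x, z))          # kernel evaluated directly in R^2
print(Phi(x) @ Phi(z))  # the same value, as a dot product in R^3
print(np.isclose(K(x, z), Phi(x) @ Phi(z)))  # True
```

This is exactly why the trick works: the optimization and h(x) only ever see the left-hand side, while the geometry happens implicitly in the higher-dimensional space on the right.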

