Understanding Support Vector Machines A Support Vector Machine (SVM) can be imagined as a surface that creates a boundary between points of data plotted in multidimensional that represent examples and their feature values. The goal of a SVM is to create a flat boundary called a hyperplane, which divides the space to create fairly homogeneous partitions on either side. the SVM learning combines aspects of both the instance-based nearest neighbor learning and the linear regression modeling.

The combination is extremely powerful, allowing SVMs to model highly complex relationships applications SVMs can be adapted for use with nearly any type of learning task, including both classification and numeric prediction. Many of the algorithm's key successes have come in pattern recognition. Notable applications include: Classification of microarray gene expression data in the field of bioinformatics

to identify cancer or other genetic diseases Text categorization such as identification of the language used in a document or the classification of documents by subject matter The detection of rare yet important events like combustion engine failure, security breaches, or earthquakes Classification with hyperplanes linearly separable hyperplane

the hyperplane is a flat surface in a high-dimensional space In two dimensions, the task of the SVM algorithm is to identify a line that separates the two classes there is more than one choice of dividing line between the groups of circles and squares. Three such possibilities are labeled a, b, and c. How does the algorithm choose?

Maximum Margin Hyperplane (MMH) MMH creates the greatest separation between the two classes The maximum margin will improve the chance that, in spite of random noise, the points will remain on the correct side of the boundary The support vectors (indicated by arrows in the figure that follows) are the points from each class that are the closest to the MMH; each class must have at least one support

vector, but it is possible to have more than one. Using the support vectors alone, it is possible to define the MMH The case of linearly separable data It is easiest to understand how to find the maximum margin under the assumption that the classes are linearly separable. In this case, the MMH is as far away as possible from the outer

boundaries of the two groups of data points. These outer boundaries are known as the convex hull. The MMH is then the perpendicular bisector of the shortest line between the two convex hulls. Sophisticated computer algorithms that use a technique known as quadratic optimization are capable of finding the maximum margin in this way. An alternative approach An alternative (but equivalent)

approach involves a search through the space of every possible hyperplane in order to find a set of two parallel planes that divide the points into homogeneous groups yet themselves are as far apart as possible. hyperplane

In n-dimensional space, the following equation is used w is a vector of n weights, that is, {w1, w2, ..., wn}, b is a single number known as the bias. The bias is conceptually equivalent to the intercept term in the slopeintercept form the goal of the process is to find a set of weights that specify two hyperplanes, as follows: the distance We will also require that these hyperplanes are specified such that all the points of one class fall above the first hyperplane and all the points of the

other class fall beneath the second hyperplane. This is possible so long as the data are linearly separable. Vector geometry defines the distance between these two planes as: ||w|| indicates the Euclidean norm (the distance from the origin to vector w). Because ||w|| is in the denominator, to maximize distance, we need to minimize ||w||. The task The task is typically reexpressed as a set of constraints, as follows:

finding a solution to this problem is a task best left for quadratic optimization software The case of nonlinearly separable data The solution to this problem is the use of a slack variable, which creates a soft margin that allows some points to fall on the incorrect side of the margin

cost value A cost value (denoted as C) is applied to all points that violate the constraints, rather than finding the maximum margin, the algorithm attempts to minimize the total cost The greater the cost parameter, the harder the optimization will try to achieve 100 percent separation. On the other hand, a lower cost parameter will place the emphasis on a wider overall margin. It is important to strike a balance between these two in order to create a model that generalizes well to future data.

Using kernels for non-linear spaces A key feature of SVMs is their ability to map the problem into a higher dimension space using a process known as the kernel trick. In doing so, a nonlinear relationship may suddenly appear to be quite linear. the scatterplot on the left depicts a nonlinear relationship between a weather class (sunny or snowy) and two features: latitude and longitude. The points at the center of the plot are members of the snowy class,

while the points at the margins are all sunny. altitude With the addition of this feature, the classes are now perfectly linearly separable. In the left figure, we are viewing the mountain from a bird's eye view, while in the right one, we are viewing the mountain from a distance at the ground level. Here, the trend is obvious: snowy weather is found at higher altitudes. nonlinear kernels

SVMs with nonlinear kernels add additional dimensions to the data in order to create separation in this way. Essentially, the kernel trick involves a process of constructing new features that express mathematical relationships between measured characteristics. For instance, the altitude feature can be expressed mathematically as an interaction between latitude and longitudethe closer the point is to the center of each of these scales, the greater the altitude. This allows SVM to learn concepts that were not explicitly measured in the original data.

Pros and cons Kernel functions (x), is a mapping of the data into another space. the general kernel function applies some transformation to the feature vectors xi and xj and combines them using the dot product, which takes two vectors and returns a single number. Using this form, kernel functions have been developed for many

different domains of data most commonly used kernel functions The linear kernel does not transform the data at all. it can be expressed simply as the dot product of the features: The polynomial kernel of degree d adds a simple nonlinear transformation of the data: The sigmoid kernel results in a SVM model somewhat analogous to a

neural network using a sigmoid activation function. The Gaussian RBF kernel The Gaussian RBF kernel is similar to a RBF neural network. There is no reliable rule to match a kernel to a particular learning task. The fit depends heavily on the concept to be learned as well as the amount of training data and the relationships among the features. Often, a bit of trial and error is required by training and evaluating several SVMs on a validation dataset.

in many cases, the choice of kernel is arbitrary, as the performance may vary slightly. Example performing OCR with SVMs Image processing is a difficult task for many types of machine learning algorithms. SVMs are well-suited to tackle the challenges of image data. Capable of learning complex patterns without being overly sensitive to noise able to recognize visual patterns with a high degree of accuracy.

the key weakness of SVMsthe black box model representationis less critical for image processing. Optical Character Recognition (OCR) The purpose of OCR software is to process paper-based documents by converting printed or handwritten text into an electronic form to be saved in a database. this is a difficult problem due to the many variants in handwritten style and printed fonts.

Step 1 collecting data OCR software divides the paper into a matrix such that each cell in the grid contains a single glyph, which is just a term referring to a letter, symbol, or number. Next, for each cell, the software will attempt to match the glyph to a set of all characters it recognizes. Finally, the individual characters would be combined back together into words, which optionally could be spell-checked against a dictionary in the document's language. Assumptions & Dataset Assumptions:

we have already developed the algorithm to partition the document into rectangular regions each consisting of a single character. the document contains only alphabetic characters in English. Dataset: dataset donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by W. Frey and D. J. Slate. The dataset contains 20,000 examples of 26 English alphabet capital letters as printed using 20 different randomly reshaped and distorted black and white fonts.

examples of the printed glyphs Step 2 exploring and preparing the data 16 statistical attributes The attributes: the horizontal and vertical dimensions of the glyph the proportion of black (versus white) pixels The average horizontal and vertical position of the pixels.

Presumably, differences in the concentration of black pixels across various areas of the box should provide a way to differentiate among the 26 letters of the alphabet. download the letterdata.csv file from the Packt Publishing website Reading the data into R SVM learners require all features to be numeric.

some of the ranges for these integer variables appear fairly wide -> we need to normalize or standardize the data. the R package that we will use for fitting the SVM model will perform the rescaling automatically. Create training and testing data frames

Frey and Slate have already randomized the data using the first 16,000 records (80 percent) to build the model and the next 4,000 records (20 percent) to test > letters_train <- letters[1:16000, ] > letters_test <- letters[16001:20000, ] Step 3 training a model on the data The e1071 package: the Department of Statistics at the Vienna University of Technology (TU Wien)

provides an R interface to the award winning LIBSVM library, a widely used open source SVM program written in C++. the authors' website at http://www.csie.ntu.edu.tw/~cjlin/libsvm/. the klaR package: the Department of Statistics at the Dortmund University of Technology (TU Dortmund) provides functions to work with this SVM implementation directly within R. For information on SVMlight, have a look at http://svmlight.joachims.org/.

the kernlab package Advantage: it was developed natively in R rather than C or C++, which allows it to be easily customized; none of the internals are hidden behind the scenes. Perhaps even more importantly, unlike the other options, kernlab can be used with the caret package, which allows SVM models to be trained and evaluated using a variety of automated methods. authors' paper at http://www.jstatsoft.org/v11/i09/.

The syntax the ksvm() function uses the Gaussian RBF kernel training a simple linear SVM classifier specify the linear (that is, vanilla) kernel using the vanilladot option > library(kernlab) > letter_classifier <- ksvm(letter ~ ., data = letters_train, kernel = "vanilladot")

Step 4 evaluating model performance > letter_predictions <- predict(letter_classifier, letters_test) type = "response" default was used. returns a vector containing a predicted letter for each row of values in the test data. examine how well our classifier performed

compare the predicted letter to the true letter in the testing dataset diagonal values of 144, 121, 120, 156, and 127 indicate the total number of records where the predicted letter matches the true value value of 5 in row B and column D indicates that there were five cases where the letter D was misidentified as a B. accuracy

> agreement <- letter_predictions == letters_test$letter Note that when Frey and Slate published the dataset in 1991, they reported a recognition accuracy of about 80 percent Step 5 improving model performance By using a more complex kernel function, we can map the data into a

higher dimensional space, and potentially obtain a better model fit. A popular convention is to begin with the Gaussian RBF kernel, which has been shown to perform well for many types of data. > letter_classifier_rbf <- ksvm(letter ~ ., data = letters_train, kernel = "rbfdot") > letter_predictions_rbf <- predict(letter_classifier_rbf, letters_test) compare the accuracy The End of Chapter 7