CS4442/9542b Artificial Intelligence II Prof. Olga Veksler Lecture 2 Introduction to ML Basic Linear Algebra Matlab Some slides on Linear Algebra are from Patrick Nichols Outline Introduction to Machine Learning Basic Linear Algebra Matlab Intro Intro: What is Machine Learning? How to write a computer program that automatically improves its performance through experience Machine learning is useful when it is too difficult to come up with a program to perform a desired task Make computer to learn by showing examples (most frequently with correct answers) supervised learning or learning with a teacher In practice: computer program (or function) which has a tunable parameters, tune parameters until the

desirable behavior on the examples Different Types of Learning Learning from examples: Supervised Learning: given training examples of inputs and corresponding outputs, produce the correct outputs for new inputs study in this course Unsupervised Learning: given only inputs as training, find structure in the world: e.g. discover clusters Other types, such as reinforcement learning are not covered in this course Supervised Machine Learning Training samples (or examples) x1,x2,, xn Each example xi is typically multi-dimensional xi1, xi2 ,, xid are called features, xi is often called a feature vector Example: x1 = {3,7, 35}, x2 = {5, 9, 47}, how many and which features do we take? Know desired output for each example y1, y2,yn

This learning is supervised (teacher gives desired outputs) yi are often one-dimensional Example: y1 = 1 (face), y2 = 0 (not a face) Supervised Machine Learning Two types of supervised learning: Classification (we will only do classification in this course): yi takes value in finite set, typically called a label or a class Example: yi {sunny, cloudy, raining} Regression yi continuous, typically called an output value Example: yi = temperature [-60,60] Toy Application: fish sorting classifier fis h fis h im

ag e sp ec ies salmon sorting chamber sea bass Classifier design Notice salmon tends to be shorter than sea bass Use fish length as the discriminating feature Count number of bass and salmon of each length 2 4 8 10

12 14 bass 0 1 3 8 10 5 salmon 2 5

10 5 1 0 12 Count 10 8 salmon sea bass 6 4 2 0 2 4

8 10 Length 12 14 Single Feature (length) Classifier Find the best length L threshold fish length < L classify as salmon fish length > L classify as sea bass For example, at L = 5, misclassified: 1 sea bass 16 salmon 2 4

8 10 12 14 bass 0 1 3 8 10 5 salmon

2 5 10 5 1 0 Classification error (total error) 17 = 34% 50 Single Feature (length) Classifier 12 fish classified as salmon fish classified as sea bass Co u n t

10 8 salmon sea bass 6 4 2 0 2 4 8 10 12 14 Length After searching through all possible thresholds L, the best L= 9, and still 20% of fish is misclassified

Next Step Lesson learned: Length is a poor feature alone! What to do? Try another feature Salmon tends to be lighter Try average fish lightness Single Feature (lightness) Classifier 1 2 3 4 5 bass 0 1

2 10 12 salmon 6 10 6 1 0 14 Co u n t 12 10

8 salmon sea bass 6 4 2 0 1 2 3 4 5 Lightness Now fish are classified best at lightness threshold of 3.5 with classification error of 8% Can do better by feature combining

Use both length and lightness features Feature vector [length,lightness] lightness ba ss decision boundary decision regions sa lm on length Classification error 4% lightness Even Better Decision Boundary length Decision boundary (wiggly) with 0% classification error

Test Classifier on New Data lightness The goal is for classifier to perform well on new data Test wiggly classifier on new data: 25% error length What Went Wrong? added 2 samples We always have only a limited amount of data, not all possible data We should make sure the decision boundary does not adapt too closely to the particulars of the data we have at hand, but rather grasps the big picture What Went Wrong: Overfitting Complicated boundaries overfit the data, they are too tuned to the particular training data at hand Therefore complicated boundaries tend to not generalize well to the new data

We usually refer to the new data as test data Overfitting: Extreme Example Say we have 2 classes: face and non-face images Memorize (i.e. store) all the face images For a new image, see if it is one of the stored faces if yes, output face as the classification result If no, output non-face also called rote learning problem: new face images are different from stored face examples zero error on stored data, 50% error on test (new) data Rote learning is memorization without generalization slide is modified from Y. LeCun Generalization training data test data The ability to produce correct outputs on previously unseen examples is called generalization The big question of learning theory: how to get good generalization

with a limited number of examples Intuitive idea: favor simpler classifiers William of Occam (1284-1347): entities are not to be multiplied without necessity Simpler decision boundary may not fit ideally to the training data but tends to generalize better to new data Underfitting We can also underfit data, i.e. use too simple decision boundary chosen model is not expressive enough There is no way to fit a linear decision boundary so that the training examples are well separated Training error is too high test error is, of course, also high Underfitting Overfitting underfitting just right overfitting

Sketch of Supervised Machine Learning Chose a learning machine f(x,w) w are tunable weights x is the input sample f(x,w) should output the correct class of sample x use labeled samples to tune weights w so that f(x,w) give the correct label for sample x Which function f(x,w) do we choose? has to be expressive enough to model our problem well, i.e. to avoid underfitting yet not to complicated to avoid overfitting Training and Testing There are 2 phases, training and testing Divide all labeled samples x1,x2,xn into 2 sets, training set and test set Training phase is for teaching our machine (finding optimal weights w)

Testing phase is for evaluating how well our machine works on unseen examples Training Phase Find the weights w s.t. f(xi,w) = yi as much as possible for training samples (xi, yi) as much as possible needs to be defined How do we find parameters w to ensure f(xi,w) = yi for most training samples (xi,yi) ? This step is usually done by optimization, can be quite time consuming Testing Phase The goal is to design machine which performs well on unseen examples Evaluate the performance of the trained machine f(x,w) on the test samples (unseen labeled samples) Testing the machine on unseen labeled examples lets us approximate how well it will perform in practice If testing results are poor, may have to go back to the training phase and redesign f(x,w)

Generalization and Overfitting Generalization is the ability to produce correct output on previously unseen examples In other words, low error on unseen examples Good generalization is the main goal of ML Low training error does not necessarily imply that we will have low test error we have seen that it is easy to produce f(x,w) which is perfect on training samples (rote learning) Overfitting when the machine performs well on training data but poorly on test data Classification System Design Overview Collect and label data by hand salmon sea bass salmon salmon

sea bass sea bass Split data into training and test sets Preprocess by segmenting fish from background Extract possibly discriminating features length, lightness, width, number of fins,etc. Classifier design Choose model for classifier Train classifier on training data Test classifier on test data we look at these two steps in this course Basic Linear Algebra Basic Concepts in Linear Algebra vectors and matrices products and norms vector spaces and linear transformations

Introduction to Matlab Why Linear Algebra? For each example (e.g. a fish image), we extract a set of features (e.g. length, width, color) This set of features is represented as a feature vector [length, width, color] All collected examples will be represented as collection of (feature) vectors [l1, w1 , c1 ] [l2 , w2 , c2 ] [l3 , w3 , c3 ] example 1 example 2 example 3 l1 l2 l 3 w1 w2

w3 c1 c2 c3 matrix Also, we will use linear models since they are simple and computationally tractable What is a Matrix? A matrix is a set of elements, organized into rows and columns rows feature 4 feature 3 feature 2 feature 1 columns

2 7 6 10 1 4 4 9 6 4 9 6 example 1 example 2 example 3 Basic Matrix Operations addition, subtraction, multiplication by a scalar a b e c d g f a e b f add elements h c g d h

a b c d f a e b f subtract elements h c g d h e g a b a b multiply every entry c d c d Matrix Transpose

T n by m matrix A and its m by n transpose A x11 x12 x x 21 22 A xn1 xn 2 x1m x2 m xnm x11 x21 x x 12 22

T A x1m x2 m xn1 xn 2 xnm Vectors Vector: N x 1 matrix x1 v x2 dot product and magnitude defined on vectors only x2 x2

x2 a a+b v x1 b a-b a x1 vector addition b x1 vector subtraction More on Vectors

n-dimensional row vector x x1 x2 xn x1 x 2 T x Transpose of row vector is column vector xn Vector product (or inner or dot product) x, y x y xT y x1 y1 x2 y2 xn yn xi yi i 1n More on Vectors Euclidian norm or length x 2 x i

x, x i 1 n If ||x|| =1 we say x is normalized or unit length xT y angle q between vectors x and y : cos x y inner product captures direction relationship cos 0 cos 1 y x y x T cos 1 x y 0 xy

T x y x y 0 y x xT y x y 0 More on Vectors Vectors x and y are orthonormal if they are orthogonal and ||x|| = ||y|| =1 Euclidian distance between vectors x and y x y x y i i 1 n i 2 x-y

x y Linear Dependence and Independence Vectors x1, x2,, xn are linearly dependent if there exist constants 1, 2,, n s.t. 1x1+ 2x2++nxn = 0 i 0 for at least one I Vectors x1, x2,, xn are linearly independent if 1x1+ 2x2++nxn = 0 1 = 2== n= 0 Vector Spaces and Basis The set of all n-dimensional vectors is called a vector space V A set of vectors {u1,u2,, un } are called a basis for vector space if any v in V can be written as v = 1u1+ 2u2++nun u1,u2,, un are independent implies they form a basis, and vice versa

u1,u2,, un give an orthonormal basis if 1. ui 1 i 2. ui u j i j Orthonormal Basis x, y,, z form an orthonormal basis x 1 0 0 T y 0 1 0 T z 0 0 1 T x y 0 x z 0 y z 0 Matrix Product a11 AB

an1 a12 an 2 a13 an 3 b11 a1d b21 b31 and bd 1

b1m b2 m b3m bdm cij cij = ai, bj ai is row i of A bj is column j of B # of columns of A = # of rows of B even if defined, in general AB BA Matrices

Rank of a matrix is the number of linearly independent rows (or equivalently columns) A square matrix is non-singular if its rank equal to the number of rows. If its rank is less than number of rows it is singular. Identity matrix AI=IA=A 1 0 I 0 0 0 1 0 0 0 0 1

Matrix A is symmetric if A=A T 1 2 9 5 2 9 5 7 4 8 4 3 6 8 6 4 Matrices

-1 Inverse of a square matrix A is matrix A s.t. -1 AA = I If A is singular or not square, inverse does not exist T Pseudo-inverse A is defined whenever A A is not singular (it is square) -1 T T A = (A A) A T -1 T AA =(A A) AA=I MATLAB Starting matlab

Clear + - */ ^ help arith A=[2 3;4 5] A find(A>3), colon operator * / ^ .* ./ .^ eye(n),norm(A),det(A),eig(A) max,min,std help matfun

if i== 1else end, if else if end for i=1:0.5:2 end while i == 1 end Return help lang Graphics Matrix and vector operations .m files scripts function y=square(x) help lang Flow control

==,&,|,~,xor help relop double Char Programming in Matlab Lists, vectors, matrices help elfun Data types

Relational operators Scalars, variables, basic arithmetic quit more help general Elementary functions Basic Navigation

xterm -fn 12X24 matlab help graphics help graph3d File I/O load,save fopen, fclose, fprintf, fscanf