Numerical Features

Throughout this lecture we assume that all features are numerical, i.e. feature vectors belong to $\mathbb{R}^n$. Most non-numerical features can be conveniently transformed into numerical ones. For example:
- Colors {blue, red, yellow} can be represented by {0, 1, 2} (or {−1, 0, 1}, ...).
- A black-and-white picture of $x \times y$ pixels can be encoded as a vector of $xy$ numbers that capture the shades of gray of the pixels.

Basic Problems

We consider two basic problems:
- (Binary) classification. Our goal: classify inputs into two categories.
- Function approximation (regression). Our goal: find a (hypothesized) functional dependency in the data.

Binary Classification in $\mathbb{R}^n$

Assume a set of instances $X \subseteq \mathbb{R}^n$ and an unknown categorization function $c : X \to \{0, 1\}$.

Our goal: Given a set $D$ of training examples of the form $(\mathbf{x}, c(\mathbf{x}))$ where $\mathbf{x} \in X$, construct a hypothesized categorization function $h \in H$ that is consistent with $c$ on the training examples, i.e.,
$$h(\mathbf{x}) = c(\mathbf{x}) \quad \text{for all training examples } (\mathbf{x}, c(\mathbf{x})) \in D.$$

Comments:
- In practice we often do not strictly demand $h(\mathbf{x}) = c(\mathbf{x})$ for all training examples $(\mathbf{x}, c(\mathbf{x})) \in D$ (often it is impossible).
- We are more interested in good generalization, that is, how well $h$ classifies new instances that do not belong to $D$.
- Recall that we usually evaluate the accuracy of the resulting hypothesized function $h$ on a test set.

Hypothesis Spaces

We consider two kinds of hypothesis spaces:
- Linear (affine) classifiers (this lecture).
- Classifiers based on combinations of linear and sigmoidal functions (classical neural networks) (next lecture).

Length and Scalar Product of Vectors

We consider vectors $\mathbf{x} = (x_1, \ldots, x_m) \in \mathbb{R}^m$. Typically, we use the Euclidean metric on vectors:
$$|\mathbf{x}| = \sqrt{\sum_{i=1}^m x_i^2}.$$
The distance between two vectors (points) $\mathbf{x}, \mathbf{y}$ is $|\mathbf{x} - \mathbf{y}|$.

We use the scalar product $\mathbf{x} \cdot \mathbf{y}$ of vectors $\mathbf{x} = (x_1, \ldots, x_m)$ and $\mathbf{y} = (y_1, \ldots, y_m)$ defined by
$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^m x_i y_i.$$
Recall that $\mathbf{x} \cdot \mathbf{y} = |\mathbf{x}||\mathbf{y}| \cos\theta$ where $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$. That is, $\mathbf{x} \cdot \mathbf{y}$ is the length of the projection of $\mathbf{y}$ onto $\mathbf{x}$ multiplied by $|\mathbf{x}|$. Note that $\mathbf{x} \cdot \mathbf{x} = |\mathbf{x}|^2$.

Linear Classifier – Example

[Figure: classification in the plane using a linear classifier; points labeled 0 and 1 separated by a line.] If a point is incorrectly classified, the learning algorithm turns the line (hyperplane) to improve the classification.

Linear Classifier

A linear classifier $h[\mathbf{w}]$ is determined by a vector of weights $\mathbf{w} = (w_0, w_1, \ldots, w_n) \in \mathbb{R}^{n+1}$ as follows: Given $\mathbf{x} = (x_1, \ldots, x_n) \in X \subseteq \mathbb{R}^n$,
$$h[\mathbf{w}](\mathbf{x}) := \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^n w_i x_i \ge 0 \\ 0 & \text{if } w_0 + \sum_{i=1}^n w_i x_i < 0 \end{cases}$$
More succinctly: $h[\mathbf{w}](\mathbf{x}) = \mathrm{sgn}\!\left(w_0 + \sum_{i=1}^n w_i x_i\right)$ where
$$\mathrm{sgn}(y) = \begin{cases} 1 & \text{if } y \ge 0 \\ 0 & \text{if } y < 0 \end{cases}$$

Linear Classifier – Notation

Given $\mathbf{x} = (x_1, \ldots, x_n) \in \mathbb{R}^n$ we define the augmented feature vector $\tilde{\mathbf{x}} = (x_0, x_1, \ldots, x_n)$ where $x_0 = 1$. This makes the notation for the linear classifier more succinct:
$$h[\mathbf{w}](\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \tilde{\mathbf{x}}).$$

Perceptron Learning

Given a training set
$$D = \{(\mathbf{x}_1, c(\mathbf{x}_1)), (\mathbf{x}_2, c(\mathbf{x}_2)), \ldots, (\mathbf{x}_p, c(\mathbf{x}_p))\}$$
Here $\mathbf{x}_k = (x_{k1}, \ldots, x_{kn}) \in X \subseteq \mathbb{R}^n$ and $c(\mathbf{x}_k) \in \{0, 1\}$. We write $c_k$ instead of $c(\mathbf{x}_k)$. Note that $\tilde{\mathbf{x}}_k = (x_{k0}, x_{k1}, \ldots, x_{kn})$ where $x_{k0} = 1$.

A weight vector $\mathbf{w} \in \mathbb{R}^{n+1}$ is consistent with $D$ if
$$h[\mathbf{w}](\mathbf{x}_k) = \mathrm{sgn}(\mathbf{w} \cdot \tilde{\mathbf{x}}_k) = c_k \quad \text{for all } k = 1, \ldots, p.$$
$D$ is linearly separable if there is a vector $\mathbf{w} \in \mathbb{R}^{n+1}$ which is consistent with $D$.

Our goal is to find a consistent $\mathbf{w}$, assuming that $D$ is linearly separable.
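To make the definition concrete, here is a minimal NumPy sketch of the classifier $h[\mathbf{w}](\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \tilde{\mathbf{x}})$ with the lecture's convention $\mathrm{sgn}(y) = 1$ for $y \ge 0$ and $0$ otherwise. The function names (`augment`, `h`) are illustrative, not part of the lecture.

```python
import numpy as np

def augment(x):
    """Return the augmented feature vector ~x = (1, x1, ..., xn)."""
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def h(w, x):
    """Linear classifier h[w](x) = sgn(w . ~x) with sgn(y) = 1 iff y >= 0."""
    return 1 if np.dot(w, augment(x)) >= 0 else 0

# Example: w = (w0, w1, w2) = (0, -1, 1) classifies points in the plane.
w = np.array([0.0, -1.0, 1.0])
print(h(w, (2, -1)))  # 0, since 0 - 2 - 1 = -3 < 0
print(h(w, (1, 3)))   # 1, since 0 - 1 + 3 =  2 >= 0
```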
Perceptron – Learning Algorithm

Online learning algorithm:

Idea: Cyclically go through the training examples in $D$ and adapt the weights. Whenever an example is incorrectly classified, turn the hyperplane so that the example is closer to its correct half-space.

Compute a sequence of weight vectors $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots$
- $\mathbf{w}^{(0)}$ is initialized randomly close to $\mathbf{0} = (0, \ldots, 0)$.
- In the $(t+1)$-th step, $\mathbf{w}^{(t+1)}$ is computed as follows:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \cdot \left(h[\mathbf{w}^{(t)}](\mathbf{x}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k = \mathbf{w}^{(t)} - \varepsilon \cdot \left(\mathrm{sgn}(\mathbf{w}^{(t)} \cdot \tilde{\mathbf{x}}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k$$
Here $k = (t \bmod p) + 1$, i.e. the examples are considered cyclically, and $0 < \varepsilon \le 1$ is the learning rate.

Theorem (Rosenblatt). If $D$ is linearly separable, then there is $t^*$ such that $\mathbf{w}^{(t^*)}$ is consistent with $D$.

Example

Training set: $D = \{((2, -1), 1), ((2, 1), 1), ((1, 3), 0)\}$. That is,
$\mathbf{x}_1 = (2, -1)$, $\mathbf{x}_2 = (2, 1)$, $\mathbf{x}_3 = (1, 3)$,
$\tilde{\mathbf{x}}_1 = (1, 2, -1)$, $\tilde{\mathbf{x}}_2 = (1, 2, 1)$, $\tilde{\mathbf{x}}_3 = (1, 1, 3)$,
$c_1 = 1$, $c_2 = 1$, $c_3 = 0$.
Assume that the initial vector is $\mathbf{w}^{(0)} = (0, -1, 1)$ and consider $\varepsilon = 1$.

Example: Separating by $\mathbf{w}^{(0)}$

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ in the plane together with the separating line given by $\mathbf{w}^{(0)}$.]
Denoting $\mathbf{w}^{(0)} = (w_0, w_1, w_2) = (0, -1, 1)$, the blue separating line is given by $w_0 + w_1 x_1 + w_2 x_2 = 0$. The red vector normal to the blue line is $(w_1, w_2)$. The points on the side of $(w_1, w_2)$ are assigned 1 by the classifier, the others 0. (In this case $\mathbf{x}_3$ is assigned 1 and $\mathbf{x}_1, \mathbf{x}_2$ are assigned 0, all of which is inconsistent with $c_1, c_2, c_3$.)

Example: $\mathbf{w}^{(1)}$

We have $\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1 = (0, -1, 1) \cdot (1, 2, -1) = 0 - 2 - 1 = -3$, thus $\mathrm{sgn}(\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1) = 0$ and thus $\mathrm{sgn}(\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1) - c_1 = 0 - 1 = -1$. (This means that $\mathbf{x}_1$ is not well classified, and $\mathbf{w}^{(0)}$ is not consistent with $D$.) Hence,
$$\mathbf{w}^{(1)} = \mathbf{w}^{(0)} - \left(\mathrm{sgn}(\mathbf{w}^{(0)} \cdot \tilde{\mathbf{x}}_1) - c_1\right) \cdot \tilde{\mathbf{x}}_1 = \mathbf{w}^{(0)} + \tilde{\mathbf{x}}_1 = (0, -1, 1) + (1, 2, -1) = (1, 1, 0).$$

Example

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and the separating line given by $\mathbf{w}^{(1)}$.]

Example: Separating by $\mathbf{w}^{(1)}$

We have $\mathbf{w}^{(1)} \cdot \tilde{\mathbf{x}}_2 = (1, 1, 0) \cdot (1, 2, 1) = 1 + 2 = 3$, thus $\mathrm{sgn}(\mathbf{w}^{(1)} \cdot \tilde{\mathbf{x}}_2) = 1$ and thus $\mathrm{sgn}(\mathbf{w}^{(1)} \cdot \tilde{\mathbf{x}}_2) - c_2 = 1 - 1 = 0$. (This means that $\mathbf{x}_2$ is currently well classified by $\mathbf{w}^{(1)}$. However, as we will see, $\mathbf{x}_3$ is not well classified.) Hence, $\mathbf{w}^{(2)} = \mathbf{w}^{(1)} = (1, 1, 0)$.

Example: $\mathbf{w}^{(3)}$

We have $\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3 = (1, 1, 0) \cdot (1, 1, 3) = 1 + 1 = 2$, thus $\mathrm{sgn}(\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3) = 1$ and thus $\mathrm{sgn}(\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3) - c_3 = 1 - 0 = 1$. (This means that $\mathbf{x}_3$ is not well classified, and $\mathbf{w}^{(2)}$ is not consistent with $D$.) Hence,
$$\mathbf{w}^{(3)} = \mathbf{w}^{(2)} - \left(\mathrm{sgn}(\mathbf{w}^{(2)} \cdot \tilde{\mathbf{x}}_3) - c_3\right) \cdot \tilde{\mathbf{x}}_3 = \mathbf{w}^{(2)} - \tilde{\mathbf{x}}_3 = (1, 1, 0) - (1, 1, 3) = (0, 0, -3).$$

Example: Separating by $\mathbf{w}^{(3)}$

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and the separating line given by $\mathbf{w}^{(3)}$.]

Example: $\mathbf{w}^{(4)}$

We have $\mathbf{w}^{(3)} \cdot \tilde{\mathbf{x}}_1 = (0, 0, -3) \cdot (1, 2, -1) = 3$, thus $\mathrm{sgn}(\mathbf{w}^{(3)} \cdot \tilde{\mathbf{x}}_1) = 1$ and thus $\mathrm{sgn}(\mathbf{w}^{(3)} \cdot \tilde{\mathbf{x}}_1) - c_1 = 1 - 1 = 0$. (This means that $\mathbf{x}_1$ is currently well classified by $\mathbf{w}^{(3)}$. However, as we will see, $\mathbf{x}_2$ is not.) Hence, $\mathbf{w}^{(4)} = \mathbf{w}^{(3)} = (0, 0, -3)$.

Example: $\mathbf{w}^{(5)}$

We have $\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2 = (0, 0, -3) \cdot (1, 2, 1) = -3$, thus $\mathrm{sgn}(\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2) = 0$ and thus $\mathrm{sgn}(\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2) - c_2 = 0 - 1 = -1$. (This means that $\mathbf{x}_2$ is not well classified, and $\mathbf{w}^{(4)}$ is not consistent with $D$.) Hence,
$$\mathbf{w}^{(5)} = \mathbf{w}^{(4)} - \left(\mathrm{sgn}(\mathbf{w}^{(4)} \cdot \tilde{\mathbf{x}}_2) - c_2\right) \cdot \tilde{\mathbf{x}}_2 = \mathbf{w}^{(4)} + \tilde{\mathbf{x}}_2 = (0, 0, -3) + (1, 2, 1) = (1, 2, -2).$$

Example: Separating by $\mathbf{w}^{(5)}$

[Figure: the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and the separating line given by $\mathbf{w}^{(5)}$.]

Example: The Result

The vector $\mathbf{w}^{(5)}$ is consistent with $D$:
$$\mathrm{sgn}(\mathbf{w}^{(5)} \cdot \tilde{\mathbf{x}}_1) = \mathrm{sgn}((1, 2, -2) \cdot (1, 2, -1)) = \mathrm{sgn}(7) = 1 = c_1$$
$$\mathrm{sgn}(\mathbf{w}^{(5)} \cdot \tilde{\mathbf{x}}_2) = \mathrm{sgn}((1, 2, -2) \cdot (1, 2, 1)) = \mathrm{sgn}(3) = 1 = c_2$$
$$\mathrm{sgn}(\mathbf{w}^{(5)} \cdot \tilde{\mathbf{x}}_3) = \mathrm{sgn}((1, 2, -2) \cdot (1, 1, 3)) = \mathrm{sgn}(-3) = 0 = c_3$$
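The whole run above can be reproduced with a short script. This is a minimal sketch of the online perceptron rule under the same setup as the example (ε = 1 and the fixed initial vector w(0) = (0, −1, 1)); names such as `perceptron_online` are illustrative.

```python
import numpy as np

def sgn(y):
    """sgn as defined in the lecture: 1 if y >= 0, else 0."""
    return 1 if y >= 0 else 0

def perceptron_online(X, c, w0, eps=1.0, max_steps=1000):
    """Online perceptron: cycle through the examples, update on each visit."""
    Xa = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])  # rows are ~x_k
    w = np.asarray(w0, dtype=float)
    for t in range(max_steps):
        k = t % len(X)
        w = w - eps * (sgn(w @ Xa[k]) - c[k]) * Xa[k]
        # stop as soon as w is consistent with the whole training set
        if all(sgn(w @ xk) == ck for xk, ck in zip(Xa, c)):
            return w, t + 1
    return w, max_steps

# Training set from the example: D = {((2,-1),1), ((2,1),1), ((1,3),0)}
X = [(2, -1), (2, 1), (1, 3)]
c = [1, 1, 0]
w, steps = perceptron_online(X, c, w0=(0, -1, 1))
print(w, steps)  # [ 1.  2. -2.] after 5 steps, matching w(5) above
```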
Perceptron – Learning Algorithm

Batch learning algorithm:

Compute a sequence of weight vectors $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots$
- $\mathbf{w}^{(0)}$ is initialized randomly close to $\mathbf{0} = (0, \ldots, 0)$.
- In the $(t+1)$-th step, $\mathbf{w}^{(t+1)}$ is computed as follows:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p \left(h[\mathbf{w}^{(t)}](\mathbf{x}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p \left(\mathrm{sgn}(\mathbf{w}^{(t)} \cdot \tilde{\mathbf{x}}_k) - c_k\right) \cdot \tilde{\mathbf{x}}_k$$
Here all training examples are considered in every step, and $0 < \varepsilon \le 1$ is the learning rate.

Function Approximation – Oaks in Wisconsin

[Figure: oak growth data.] This example is from How to Lie with Statistics by Darrell Huff (1954).

Function Approximation

Assume a set $X \subseteq \mathbb{R}^n$ of instances and an unknown function $f : X \to \mathbb{R}$.

Our goal: Given a set $D$ of training examples of the form $(\mathbf{x}, f(\mathbf{x}))$ where $\mathbf{x} \in X$, construct a hypothesized function $h \in H$ such that
$$h(\mathbf{x}) \approx f(\mathbf{x}) \quad \text{for all training examples } (\mathbf{x}, f(\mathbf{x})) \in D.$$
Here $\approx$ means that the values are somewhat close to each other with respect to an appropriate error function $E$. In what follows we use the least squares error defined by
$$E = \frac{1}{2} \sum_{(\mathbf{x}, f(\mathbf{x})) \in D} (f(\mathbf{x}) - h(\mathbf{x}))^2.$$
Our goal is to minimize $E$. The main reason is that this function has nice mathematical properties (as opposed, e.g., to $\sum_{(\mathbf{x}, f(\mathbf{x})) \in D} |f(\mathbf{x}) - h(\mathbf{x})|$).

Least Squares – Oaks in Wisconsin

[Figure: least-squares fit of the oak data.]

Linear Function Approximation

Given a set $D$ of training examples:
$$D = \{(\mathbf{x}_1, f(\mathbf{x}_1)), (\mathbf{x}_2, f(\mathbf{x}_2)), \ldots, (\mathbf{x}_p, f(\mathbf{x}_p))\}$$
Here $\mathbf{x}_k = (x_{k1}, \ldots, x_{kn}) \in \mathbb{R}^n$ and $f(\mathbf{x}_k) \in \mathbb{R}$. We write $f_k$ instead of $f(\mathbf{x}_k)$. Recall that $\tilde{\mathbf{x}}_k = (x_{k0}, x_{k1}, \ldots, x_{kn})$ where $x_{k0} = 1$.

Our goal: Find $\mathbf{w}$ so that $h[\mathbf{w}](\mathbf{x}) = \mathbf{w} \cdot \tilde{\mathbf{x}}$ approximates the function $f$, some of whose values are given by the training set.

Least squares error function:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{k=1}^p (\mathbf{w} \cdot \tilde{\mathbf{x}}_k - f_k)^2 = \frac{1}{2} \sum_{k=1}^p \left(\sum_{i=0}^n w_i x_{ki} - f_k\right)^2$$

Gradient of the Error Function

Consider the gradient of the error function:
$$\nabla E(\mathbf{w}) = \left(\frac{\partial E}{\partial w_0}(\mathbf{w}), \ldots, \frac{\partial E}{\partial w_n}(\mathbf{w})\right) = \sum_{k=1}^p (\mathbf{w} \cdot \tilde{\mathbf{x}}_k - f_k) \cdot \tilde{\mathbf{x}}_k$$
What is the gradient $\nabla E(\mathbf{w})$? It is a vector in $\mathbb{R}^{n+1}$ which points in the direction of the steepest ascent of $E$ (its length corresponds to the steepness). Note that here the vectors $\tilde{\mathbf{x}}_k$ are fixed parameters of $E$!

Fact. If $\nabla E(\mathbf{w}) = \mathbf{0} = (0, \ldots, 0)$, then $\mathbf{w}$ is a global minimum of $E$. This follows from the fact that $E$ is a convex paraboloid with a unique extreme, which is a minimum.

Gradient – Illustration

[Figure: illustration of the gradient on a convex paraboloid.]

Function Approximation – Learning

Gradient descent:
- Weights $\mathbf{w}^{(0)}$ are initialized randomly close to $\mathbf{0}$.
- In the $(t+1)$-th step, $\mathbf{w}^{(t+1)}$ is computed as follows:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \varepsilon \cdot \nabla E(\mathbf{w}^{(t)}) = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p (\mathbf{w}^{(t)} \cdot \tilde{\mathbf{x}}_k - f_k) \cdot \tilde{\mathbf{x}}_k = \mathbf{w}^{(t)} - \varepsilon \cdot \sum_{k=1}^p (h[\mathbf{w}^{(t)}](\mathbf{x}_k) - f_k) \cdot \tilde{\mathbf{x}}_k$$
Here $0 < \varepsilon \le 1$ is the learning rate. Note that the algorithm is almost identical to the batch perceptron algorithm!

Proposition. For sufficiently small $\varepsilon > 0$ the sequence $\mathbf{w}^{(0)}, \mathbf{w}^{(1)}, \mathbf{w}^{(2)}, \ldots$ converges (component-wise) to the global minimum of $E$.

Finding the Minimum in Dimension One

Assume $n = 1$. Then the error function $E$ is
$$E(w_0, w_1) = \frac{1}{2} \sum_{k=1}^p (w_0 + w_1 x_k - f_k)^2$$
Minimize $E$ with respect to $w_0$ and $w_1$:
$$\frac{\partial E}{\partial w_0} = 0 \;\Leftrightarrow\; w_0 = \bar{f} - w_1 \bar{x} \;\Leftrightarrow\; \bar{f} = w_0 + w_1 \bar{x}$$
where $\bar{x} = \frac{1}{p} \sum_{k=1}^p x_k$ and $\bar{f} = \frac{1}{p} \sum_{k=1}^p f_k$, and
$$\frac{\partial E}{\partial w_1} = 0 \;\Leftrightarrow\; w_1 = \frac{\frac{1}{p} \sum_{k=1}^p (f_k - \bar{f})(x_k - \bar{x})}{\frac{1}{p} \sum_{k=1}^p (x_k - \bar{x})^2}$$
i.e. $w_1 = \mathrm{cov}(f, x)/\mathrm{var}(x)$.

Finding the Minimum in Arbitrary Dimension

Let $A$ be the $p \times (n+1)$ matrix ($p$ rows, $n+1$ columns) whose $k$-th row is the vector $\tilde{\mathbf{x}}_k$. Let $\mathbf{f} = (f_1, \ldots, f_p)$ be the column vector formed by the values of $f$ in the training set. Then
$$\nabla E(\mathbf{w}) = \mathbf{0} \;\Leftrightarrow\; \mathbf{w} = (A^\top A)^{-1} A^\top \mathbf{f}$$
if $(A^\top A)^{-1}$ exists. (Then $(A^\top A)^{-1} A^\top$ is the so-called Moore–Penrose pseudoinverse of $A$.)

Normal Distribution – Reminder

Distribution of continuous random variables. Density (one-dimensional, that is, over $\mathbb{R}$):
$$p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) =: N[\mu, \sigma^2](x)$$
$\mu$ is the expected value (the mean), $\sigma^2$ is the variance.
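The two ways of minimizing $E(\mathbf{w})$ described above can be compared numerically: batch gradient descent and the closed-form solution $\mathbf{w} = (A^\top A)^{-1} A^\top \mathbf{f}$. The following is a small sketch; the data, the learning rate, and the iteration count are made up for illustration and are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1D training data: f_k is roughly 2 + 3*x_k plus noise.
x = rng.uniform(0, 5, size=20)
f = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=20)

A = np.column_stack([np.ones_like(x), x])      # k-th row is ~x_k = (1, x_k)

# Closed-form minimum: solve (A^T A) w = A^T f.
w_closed = np.linalg.solve(A.T @ A, A.T @ f)

# Batch gradient descent on E(w) = 1/2 * sum_k (w . ~x_k - f_k)^2.
w = np.zeros(2)
eps = 0.005                                    # small enough to converge here
for _ in range(20000):
    grad = A.T @ (A @ w - f)                   # = sum_k (w . ~x_k - f_k) * ~x_k
    w = w - eps * grad

print(w_closed)   # close to (2, 3)
print(w)          # converges to the same minimum
```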
Maximum Likelihood vs Least Squares

Fix a training set $D = \{(x_1, f_1), (x_2, f_2), \ldots, (x_p, f_p)\}$. Assume that each $f_k$ has been generated randomly by
$$f_k = \mathbf{w} \cdot \tilde{\mathbf{x}}_k + \varepsilon_k$$
(in dimension one, $\tilde{\mathbf{x}}_k = (1, x_k)$, so $\mathbf{w} \cdot \tilde{\mathbf{x}}_k = w_0 + w_1 x_k$), where
- $w_0, w_1$ are unknown constants,
- the $\varepsilon_k$ are normally distributed with mean 0 and an unknown variance $\sigma^2$.
Assume that $\varepsilon_1, \ldots, \varepsilon_p$ have been generated independently.

Denote by $p(f_1, \ldots, f_p \mid w_0, w_1, \sigma^2)$ the probability density according to which the values $f_1, \ldots, f_p$ were generated, assuming fixed $w_0, w_1, x_1, \ldots, x_p$. (For the interested: the independence and normality imply
$$p(f_1, \ldots, f_p \mid w_0, w_1, \sigma^2) = \prod_{k=1}^p N[w_0 + w_1 x_k, \sigma^2](f_k)$$
where $N[w_0 + w_1 x_k, \sigma^2](f_k)$ is the normal density with mean $w_0 + w_1 x_k$ and variance $\sigma^2$.)

Maximum Likelihood vs Least Squares

Our goal is to find $\mathbf{w}$ that maximizes the likelihood that the training set $D$ with the fixed values $f_1, \ldots, f_p$ has been generated:
$$L(\mathbf{w}, \sigma^2) := p(f_1, \ldots, f_p \mid \mathbf{w}, \sigma^2)$$

Theorem. $\mathbf{w}$ maximizes $L(\mathbf{w}, \sigma^2)$ for an arbitrary $\sigma^2$ iff $\mathbf{w}$ minimizes $E(\mathbf{w})$.

Note that the maximizing/minimizing $\mathbf{w}$ does not depend on $\sigma^2$. The maximizing $\sigma^2$ satisfies $\sigma^2 = \frac{1}{p} \sum_{k=1}^p (f_k - \mathbf{w} \cdot \tilde{\mathbf{x}}_k)^2$. A numerical check of this correspondence appears at the end of this section.

Comments on Linear Models

- Linear models are parametric, i.e. they have a fixed form with a small number of parameters that need to be learned from data (as opposed, e.g., to decision trees, where the structure is not fixed in advance).
- Linear models are stable, i.e. small variations in the training data have only a limited impact on the learned model (tree models typically vary more with the training data).
- Linear models are less likely to overfit the training data (low variance) but sometimes tend to underfit (high bias).
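As a numerical check of the maximum-likelihood claim above: the sketch below generates data according to the assumed model, fits $\mathbf{w}$ by least squares, estimates $\sigma^2$ from the residuals, and verifies on a few perturbed weight vectors that the log-likelihood is maximized at the least-squares solution. The data and helper names (`log_likelihood`) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data following the model: f_k = w0 + w1*x_k + eps_k, eps_k ~ N(0, sigma^2).
w_true, sigma_true = np.array([2.0, 3.0]), 0.5
x = rng.uniform(0, 5, size=50)
f = w_true[0] + w_true[1] * x + rng.normal(0, sigma_true, size=50)

A = np.column_stack([np.ones_like(x), x])          # rows are ~x_k = (1, x_k)

def log_likelihood(w, sigma2):
    """log L(w, sigma^2) = sum_k log N[w . ~x_k, sigma^2](f_k)."""
    r = f - A @ w
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - r**2 / (2 * sigma2))

# Least-squares solution and the corresponding maximum-likelihood variance estimate.
w_ls = np.linalg.solve(A.T @ A, A.T @ f)
sigma2_ml = np.mean((f - A @ w_ls) ** 2)

# The log-likelihood at w_ls is at least as large as at any perturbed w
# (checked here on a few random perturbations), as the theorem states.
for _ in range(5):
    w_other = w_ls + rng.normal(0, 0.2, size=2)
    assert log_likelihood(w_ls, sigma2_ml) >= log_likelihood(w_other, sigma2_ml)
print(w_ls, sigma2_ml)
```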