PA196: Pattern Recognition
05. Nonparametric techniques
Dr. Vlad Popovici, popovici@recetox.muni.cz
RECETOX, Masaryk University, Brno

Outline
1 Density estimation
  • Histograms
  • Parzen density estimation
2 k-nearest neighbor estimation
3 Nearest neighbor classification rule
  • k-NN decision rule
  • Refinements
  • Distances

Introduction
• let X_1, . . . , X_n be i.i.d. d-dimensional random variables
• let p(x) be their (continuous) density: p(x) ≥ 0, ∫_ℝ^d p(x) dx = 1
• the problem is to estimate p(x), i.e. to find p̂(x)
• note: a density estimate does not need to be a density itself! It can have negative values or an infinite integral...

Desirable properties:
• asymptotic unbiasedness: E[p̂(x)] → p(x) as n → ∞
• consistency, assessed through the mean squared error MSE(p̂) = E[(p̂(x) − p(x))²]
• MSE(p̂) = Var(p̂) + [bias(p̂)]²
• if MSE → 0 for all x ∈ ℝ^d, then p̂ is a pointwise consistent estimator of p in quadratic mean
• a global measure of accuracy is the mean integrated squared error (averaged over all possible samples):
  MISE = E[∫ (p̂(x) − p(x))² dx] = ∫ E[(p̂(x) − p(x))²] dx

Histograms
• the simplest density estimator: divide the range of values into N equal intervals (cells)
• p̂(x) = n_j / (Σ_{k=1}^N n_k · dx), where n_j is the number of points falling into the j-th cell, the one straddling the point x, and dx is the cell width (note that Σ_k n_k = n)
• in d dimensions: p̂(x) = n_j / (Σ_{k=1}^N n_k · dV), where dV is the volume of a cell
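As an illustration of the estimator above, here is a minimal sketch in Python/NumPy of the one-dimensional case, p̂(x) = n_j / (Σ_k n_k · dx); the function name, number of cells and the test sample are illustrative choices, not part of the lecture.

```python
import numpy as np

def histogram_density(x, data, n_cells=10):
    """Histogram density estimate: count of the cell straddling x,
    divided by (total number of points) * (cell width)."""
    lo, hi = data.min(), data.max()
    width = (hi - lo) / n_cells                          # dx: width of one cell
    counts, _ = np.histogram(data, bins=n_cells, range=(lo, hi))
    j = np.clip(int((x - lo) / width), 0, n_cells - 1)   # index of the cell containing x
    return counts[j] / (len(data) * width)               # n_j / (n * dx)

# usage: density of a standard normal sample at x = 0
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(histogram_density(0.0, sample))                    # close to 1/sqrt(2*pi) ~ 0.40
```

Evaluating the estimate on a fine grid of x values makes the cell-to-cell discontinuities mentioned below directly visible.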
Problems:
• the number of cells grows exponentially with the dimension (N^d)
• the sample size needed for a proper estimate grows super-exponentially
• discontinuities between cells

Modifications:
• data-adaptive histograms: allow the location, size and shape of the cells to adapt to the available data
• assume variable independence (naive Bayes): p(x) = Π_{i=1}^d p(x_i); each variable gets its own histogram with N cells, so only N·d cells are needed instead of N^d
• Lancaster models: assume that interactions above a certain order vanish
• Bayesian networks: p(x) = p(x_d | x_1, . . . , x_{d−1}) p(x_{d−1} | x_1, . . . , x_{d−2}) · · · p(x_2 | x_1) p(x_1)
• dependence trees: pairwise conditional probabilities

Parzen estimator (kernel methods)
• fix the volume of the cell and use the number of points falling within it to construct a density estimate
• idea: smooth the histogram with a properly selected kernel function
• the kernels are chosen to have a compact support
• the density estimate is
  p̂(x) = (1/(n h)) Σ_{i=1}^n K((x − x_i)/h),
  where K is the kernel function and h is a smoothing parameter (spread, bandwidth); a code sketch is given at the end of this section

Examples of kernel functions:
• rectangular: K(x) = 1/2 for |x| < 1, 0 otherwise
• triangular: K(x) = 1 − |x| for |x| < 1, 0 otherwise
• normal: K(x) = (1/√(2π)) exp(−x²/2)
• Bartlett-Epanechnikov: K(x) = (3/4)(1 − x²/5)/√5 for |x| < √5, 0 otherwise

[Figure: different levels of smoothing; from Webb, Statistical Pattern Recognition]

k-nearest neighbor density estimation
• the probability that a point falls into a volume V centered at x is θ = ∫_V p(z) dz
• for a small volume, θ ≈ p(x) · V
• on the other hand, θ ≈ k(x)/n, the fraction of points falling within V
• ⇒ the k-NN density estimator: p̂(x) = k(x) / (n V)
• k-NN: fix k(x)/n or, equivalently (for a given n), fix k and find the volume V centered at x that contains k points
• example: if x_k is the k-th closest point to x, then V can be taken as the sphere of radius ‖x − x_k‖
• the volume of a d-dimensional sphere of radius r is
  V = 2 r^d π^{d/2} / (d Γ(d/2)),
  where Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx (for n ∈ ℕ, Γ(n) = (n − 1)!)
• this is in contrast with the histogram, where the volume is fixed and k varies

[Figure: k-NN density estimation with k = 1; from Webb, Statistical Pattern Recognition]
[Figure: k-NN density estimation with k = 2; from Webb, Statistical Pattern Recognition]

Notes:
• the estimate produced is not a density itself (it decays as 1/|x|, leading to an infinite integral)
• it is asymptotically unbiased if lim_{n→∞} k(n) = ∞ and lim_{n→∞} k(n)/n = 0

Nearest neighbor classification rule
• k-NN can be used to estimate the class densities → apply the MAP rule to obtain a classification rule
• let there be k_i samples of class g_i among the k samples closest to x; Σ_{i=1}^m k_i = k (m is the total number of classes)
• let n_i be the total number of samples from class g_i: Σ_{i=1}^m n_i = n
• then the estimate of the class-conditional density is p̂(x | g_i) = k_i / (n_i V)
• the estimated prior is p̂(g_i) = n_i / n

k-NN decision rule
• MAP rule: assign x to g_i if p̂(g_i | x) ≥ p̂(g_j | x) for all j ≠ i
• from Bayes' theorem: assign x to g_i if
  (k_i / (n_i V)) · (n_i / n) ≥ (k_j / (n_j V)) · (n_j / n) for all j ≠ i
• k-NN decision rule: assign x to g_i if k_i ≥ k_j, ∀ j ≠ i

What about the ties? Breaking the ties:
• random assignment among the classes with the same number of neighbors
• assign to the class with the closest mean vector
• assign to the most compact class
• use a weighted distance
• etc.
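Returning to the Parzen estimate p̂(x) = (1/(n h)) Σ_i K((x − x_i)/h) introduced above, the following sketch implements it with the rectangular, triangular and normal kernels from the list of examples; the function and parameter names are illustrative.

```python
import numpy as np

def parzen_density(x, data, h=0.5, kernel="normal"):
    """Parzen (kernel) estimate: p_hat(x) = 1/(n h) * sum_i K((x - x_i) / h)."""
    u = (x - data) / h                              # scaled distances to the samples
    if kernel == "normal":
        k = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    elif kernel == "rectangular":
        k = 0.5 * (np.abs(u) < 1)
    elif kernel == "triangular":
        k = (1 - np.abs(u)) * (np.abs(u) < 1)
    else:
        raise ValueError("unknown kernel")
    return k.sum() / (len(data) * h)

# usage: the bandwidth h controls the level of smoothing (cf. the figure above)
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(parzen_density(0.0, sample, h=0.3))           # about 0.40 for a standard normal
```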
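A companion sketch for the k-NN estimates just described: the density p̂(x) = k/(nV), with V the volume of the d-dimensional sphere reaching the k-th nearest sample, and the k-NN decision rule by majority vote. The names and the toy data are illustrative, and ties are resolved simply by taking the lowest class label rather than by one of the refinements listed above.

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k=5):
    """k-NN density estimate p_hat(x) = k / (n * V); V is the volume of the
    d-dimensional sphere whose radius is the distance to the k-th nearest sample."""
    n, d = data.shape
    r = np.sort(np.linalg.norm(data - x, axis=1))[k - 1]   # distance to k-th neighbor
    volume = 2 * r**d * pi**(d / 2) / (d * gamma(d / 2))   # volume of the d-sphere
    return k / (n * volume)

def knn_classify(x, data, labels, k=5):
    """k-NN decision rule: assign x to the class g_i with k_i >= k_j for all j."""
    idx = np.argsort(np.linalg.norm(data - x, axis=1))[:k]  # the k nearest samples
    classes, counts = np.unique(labels[idx], return_counts=True)
    return classes[np.argmax(counts)]                       # ties: lowest class label

# usage: two Gaussian classes in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print(knn_density(np.zeros(2), X, k=5), knn_classify(np.array([2.5, 2.5]), X, y, k=5))
```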
Error rate of the nearest neighbor rule (Cover & Hart, 1967)
• asymptotically (n → ∞), the error rate e of the NN rule satisfies
  e* ≤ e ≤ e* (2 − m e* / (m − 1)),
  where e* is the Bayes error rate and m is the number of classes
• in particular, e* ≤ e ≤ 2 e*

Note on implementing k-NN:
• as n becomes large, finding the k nearest neighbors requires more and more computation
• various approximate algorithms exist, e.g. LAESA (linear approximating and eliminating search algorithm)
• idea: use the properties of the metric space to reduce the number of distance computations, comparing mostly against a set of identified "prototypes"

Refinements: editing techniques
Idea: remove misclassified samples to obtain homogeneous regions.
Procedure: given a set R and a classification rule η, let S be the set of samples in R misclassified by η; remove them, re-train η on R := R \ S, and so on.
Possible implementation (a sketch is given at the end of this section):
1 partition the full set into N subsets R_1, . . . , R_N
2 classify the samples in R_i with the k-NN rule, using as reference set the union of the M "next" subsets: R_{(i+1) mod N} ∪ · · · ∪ R_{(i+M) mod N}, for 1 ≤ M ≤ N − 1
3 remove the misclassified samples and re-partition
4 repeat until a predefined number of iterations removes no more samples
Notes:
• M = N − 1 is similar to cross-validation
• if N equals the number of samples, the procedure becomes leave-one-out
• the result is a set of homogeneous "clusters" of samples

Refinements: condensation
• after editing, the clusters can be "condensed"
• idea: remove the samples in the center of the clusters, which do not contribute to the decision

Distances
• the choice of distance depends on the (knowledge of the) domain: is the space isotropic? are some variables "more important"? etc.
• general Euclidean distance: d(x, z) = [(x − z)^T A (x − z)]^{1/2}
• alternative (van der Heiden & Groen, 1997, radar applications):
  d(x, z) = [(x^{(p)} − z^{(p)})^T (x^{(p)} − z^{(p)})]^{1/2},
  where x_i^{(p)} = (x_i^p − 1)/p if 0 < p ≤ 1, and x_i^{(p)} = log x_i if p = 0

What about k?
• the larger the k, the more robust the procedure; however,
• k must be smaller than the smallest class size n_i
• k can be optimized by cross-validation (see the sketch at the end of this section)
• Enas and Choi (1986) suggest k ≈ n^{2/8} or k ≈ n^{3/8}, where n is the sample size
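A sketch of the editing procedure described above, under the simplifying assumption M = 1 (the reference set for R_i is just the next subset); the function name is illustrative and class labels are assumed to be non-negative integers.

```python
import numpy as np

def multi_edit(X, y, n_subsets=5, k=3, max_iter=10):
    """Editing sketch: partition the data, classify each subset with a k-NN rule
    whose reference set is the 'next' subset (M = 1), drop the misclassified
    samples, re-partition and repeat until nothing more is removed."""
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        parts = np.array_split(rng.permutation(len(y)), n_subsets)
        keep = []
        for i, part in enumerate(parts):
            ref = parts[(i + 1) % n_subsets]                 # reference = next subset
            for j in part:
                dist = np.linalg.norm(X[ref] - X[j], axis=1)
                pred = np.bincount(y[ref][np.argsort(dist)[:k]]).argmax()
                if pred == y[j]:
                    keep.append(j)                           # keep only agreements
        if len(keep) == len(y):                              # nothing removed: stop
            break
        X, y = X[keep], y[keep]
    return X, y
```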
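Finally, a sketch tying together the last two topics: the general (weighted) Euclidean distance and a cross-validation loop for choosing k. The weighting matrix A, fold count and candidate values of k are illustrative choices; for brevity the selection loop uses the plain Euclidean distance (A = I).

```python
import numpy as np

def general_euclidean(x, z, A):
    """General Euclidean distance d(x, z) = sqrt((x - z)^T A (x - z))."""
    diff = x - z
    return np.sqrt(diff @ A @ diff)

def choose_k(X, y, candidate_ks, n_folds=5):
    """Pick k by cross-validation: the value with the highest average accuracy."""
    rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        correct = 0
        for f, test in enumerate(folds):
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            for j in test:
                dist = np.linalg.norm(X[train] - X[j], axis=1)   # A = identity here
                pred = np.bincount(y[train][np.argsort(dist)[:k]]).argmax()
                correct += (pred == y[j])
        acc = correct / len(y)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# usage: pick k for a toy two-class problem
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.repeat([0, 1], 50)
print(choose_k(X, y, candidate_ks=[1, 3, 5, 7]))
```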