IA008: Computational Logic
5. Inductive Inference

Achim Blumensath (blumens@fi.muni.cz)
Faculty of Informatics, Masaryk University, Brno

Basic Concepts

Induction: learning general facts from examples.
Induction is the process of forming a hypothesis (about a target concept or function) based on observed data.

Example
What is the next number?
0, 1, 1, 2, 3, 5, 8, …          a_n = a_{n−2} + a_{n−1}
0, 0, 0, 0, 0, 120, 720, …      a_n = n(n − 1)(n − 2)(n − 3)(n − 4)

Fundamental Problem
From a strictly logical point of view, induction is not possible: there are always several possible explanations for the observed phenomena, and there is no rational basis for choosing one over the others. Hence a hypothesis can be falsified but never verified. Consequently, we need to make additional a priori assumptions (the so-called inductive bias) regarding the target concept.

Inductive Learning Hypothesis
A hypothesis that approximates the target concept well over a sufficiently large amount of training data will also approximate it well over unobserved examples.

Occam's Razor
Use the simplest hypothesis that matches the observations. (What counts as simple depends on our formalism.)

Philosophy of Science

Scientific Method
In the 17th century, Francis Bacon, René Descartes, and Isaac Newton developed the scientific method, based on induction.

Problem of Induction
David Hume was the first to point out that inductive inferences are unprovable and always subject to falsification.

Falsifiability
Karl Popper argued that induction does not exist. Instead, science is based on conjecture and criticism. One should select the hypotheses that are the easiest to falsify.

Paradigm Shift
Thomas Kuhn viewed science as a social process. He emphasised the role of paradigms and the way they are replaced when sufficiently many observations point to problems with the current paradigm.

Machine Learning
Induction (and learning in general) works best if it is interactive:
▸ form a hypothesis based on the current data,
▸ test the hypothesis on new data,
▸ repeat.
The question therefore is not whether the hypothesis is true, but how well it predicts observations. Most practical algorithms for inductive inference use statistical methods and fall outside the scope of this course.

Hypothesis Space

Example
[Figure: a hypothesis space containing, among others, the hypotheses x ∨ y and ¬x ∨ ¬z.]

Boolean Functions
In this lecture we concentrate on learning boolean functions

    f : {0, 1}^n → {0, 1}

(which can be encoded as propositional formulae).

Example
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 | f(x̄)
 0  1  0  1  1  1  1  0  0  1  |  ✓
 1  0  1  0  0  0  0  1  1  1  |  ✗
 1  1  0  0  1  1  1  0  1  0  |  ✗
 0  0  0  0  1  0  0  0  1  0  |  ✓
 0  0  0  1  1  0  0  1  1  0  |  ✓
 0  1  1  1  0  1  1  0  1  1  |  ✗
 0  1  0  0  1  0  0  1  0  0  |  ✓

Conjunctive Hypotheses

Setting
Learning a boolean function f : {0, 1}^n → {0, 1} using as hypotheses conjunctions η = xi ∧ ⋯ ∧ ¬xk of literals.

General-to-specific ordering
η is more specific than ζ if η ⊧ ζ.

Idea
Find the most specific hypothesis.

Find-S algorithm
▸ Start with η = ⊥.
▸ Consider the next positive example b̄.
  ▸ If η(b̄) is true, continue.
  ▸ Otherwise, find the most specific ζ such that η ⊧ ζ and ζ(b̄) is true.
  ▸ Continue with η = ζ.
This algorithm computes the least conjunction, with respect to the ⊧-ordering, that covers all positive examples. If any negative example is also covered, the training data cannot be described by a conjunction.

Example
For the training data in the table above, the positive examples produce:
η0 = ⊥
η1 = ¬x1 ∧ x2 ∧ ¬x3 ∧ x4 ∧ x5 ∧ x6 ∧ x7 ∧ ¬x8 ∧ ¬x9 ∧ x10
η2 = ¬x1 ∧ ¬x3 ∧ x5 ∧ ¬x8
η3 = ¬x1 ∧ ¬x3 ∧ x5
η4 = ¬x1 ∧ ¬x3 ∧ x5
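Find-S is easy to implement directly. The following is a minimal Python sketch (mine, not part of the lecture), under the assumption that a conjunctive hypothesis is represented as a set of literals (i, b) meaning "x_{i+1} = b", with None standing for ⊥:

def holds(eta, example):
    # Evaluate a conjunction of literals on a 0/1 tuple; ⊥ (None) holds nowhere.
    return eta is not None and all(example[i] == b for i, b in eta)

def find_s(data):
    eta = None                                   # start with η = ⊥
    for example, label in data:
        if not label or holds(eta, example):     # only uncovered positives matter
            continue
        if eta is None:
            eta = set(enumerate(example))        # the full conjunction of the example
        else:
            eta = {(i, b) for i, b in eta if example[i] == b}   # drop violated literals
    return eta

# The training data from the table above (1 = positive, 0 = negative):
data = [
    ((0,1,0,1,1,1,1,0,0,1), 1), ((1,0,1,0,0,0,0,1,1,1), 0),
    ((1,1,0,0,1,1,1,0,1,0), 0), ((0,0,0,0,1,0,0,0,1,0), 1),
    ((0,0,0,1,1,0,0,1,1,0), 1), ((0,1,1,1,0,1,1,0,1,1), 0),
    ((0,1,0,0,1,0,0,1,0,0), 1),
]
print(sorted(find_s(data)))   # [(0, 0), (2, 0), (4, 1)], i.e. ¬x1 ∧ ¬x3 ∧ x5

Dropping the violated literals is exactly the "most specific ζ" step of the pseudocode: for conjunctions of literals, the minimal generalization is unique.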
Hypothesis Space

Goal
Compute all hypotheses consistent with the data.

Let D ⊆ {0, 1}^n × {0, 1} be the observed data and H the set of all hypotheses consistent with every datum in D. We compute the sets H+ and H− of maximal and minimal elements of H (with respect to the general-to-specific order ⊧).

Candidate-Elimination Algorithm
▸ Start with H+ = {⊺} and H− = {⊥}.
▸ For each positive d ∈ D:
  ▸ Delete from H+ every hypothesis η with η(d) = 0.
  ▸ Replace every η ∈ H− with η(d) = 0 by the set of all minimal ζ such that η ⊧ ζ, ζ(d) = 1, and ζ ⊧ η′ for some η′ ∈ H+.
  ▸ Remove from H− all elements that are not minimal.
▸ For each negative d ∈ D: proceed analogously with H+ and H− interchanged.

Example
x1 x2 x3 | f(x̄)
 1  1  0 |  ✓
 0  0  1 |  ✗
 1  0  0 |  ✓
 1  0  1 |  ✗

Step 0.  H− = {⊥}               H+ = {⊺}
Step 1.  H− = {x1 ∧ x2 ∧ ¬x3}   H+ = {⊺}
Step 2.  H− = {x1 ∧ x2 ∧ ¬x3}   H+ = {x1, x2, ¬x3}
Step 3.  H− = {x1 ∧ ¬x3}        H+ = {x1, ¬x3}
Step 4.  H− = {x1 ∧ ¬x3}        H+ = {¬x3}
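The whole loop fits in a short Python sketch, again mine rather than the lecture's, using the same literal-set representation as the Find-S sketch above (None for ⊥, the empty set for ⊺). For conjunctions of literals, the minimal generalization drops the violated literals, and the minimal specializations each add one literal the example violates:

def holds(h, ex):
    # ⊥ (None) holds nowhere; a literal set holds iff every literal is satisfied.
    return h is not None and all(ex[i] == b for i, b in h)

def more_specific(h, g):
    # h ⊧ g: for conjunctions, every literal of g also occurs in h; ⊥ is below all.
    return h is None or (g is not None and g <= h)

def generalize(h, ex):
    # The unique minimal ζ with h ⊧ ζ and ζ(ex) = 1.
    if h is None:
        return frozenset(enumerate(ex))
    return frozenset((i, b) for i, b in h if ex[i] == b)

def specialize(g, ex, n):
    # All minimal specializations of g excluding ex: add one violated literal.
    return [g | {(i, 1 - ex[i])} for i in range(n) if (i, ex[i]) not in g]

def candidate_elimination(data, n):
    G = [frozenset()]                    # H+: starts with ⊺ (the empty conjunction)
    S = [None]                           # H−: starts with ⊥
    for ex, label in data:
        if label:                        # positive example
            G = [g for g in G if holds(g, ex)]
            S = {h if holds(h, ex) else generalize(h, ex) for h in S}
            S = [z for z in S if any(more_specific(z, g) for g in G)]
            S = [z for z in S if not any(z2 != z and more_specific(z2, z) for z2 in S)]
        else:                            # negative example: the dual steps
            S = [h for h in S if not holds(h, ex)]
            G = {z for g in G for z in ([g] if not holds(g, ex) else specialize(g, ex, n))}
            G = [z for z in G if any(more_specific(h, z) for h in S)]
            G = [z for z in G if not any(z2 != z and more_specific(z, z2) for z2 in G)]
    return S, G

# The three-variable example from the slides:
data = [((1,1,0), 1), ((0,0,1), 0), ((1,0,0), 1), ((1,0,1), 0)]
S, G = candidate_elimination(data, 3)
print(S)   # [frozenset({(0, 1), (2, 0)})]   i.e. H− = {x1 ∧ ¬x3}
print(G)   # [frozenset({(2, 0)})]           i.e. H+ = {¬x3}

Running it reproduces the boundary sets of Step 4 above.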
Decision Trees

Organise the function to be learned as a tree.

Example
x1 x2 x3 x4 x5 x6 x7 | f(x̄)
 1  0  1  1  1  0  1 |  ✓
 0  1  0  0  0  1  1 |  ✗
 1  1  1  1  1  1  0 |  ✓
 0  0  1  0  0  1  0 |  ✓
 0  0  0  1  1  0  1 |  ✗
 1  1  0  1  1  0  0 |  ✓
 1  0  1  0  1  0  0 |  ✓

[Figure: a decision tree for these data with seven leaves.]
[Figure: a second decision tree for the same data with only four leaves.]

The order of the variables xi matters. Which one do we choose?

Ordered Binary Decision Diagrams (OBDDs)
▸ data structure to compactly represent a boolean function
▸ the arguments are ordered x1, …, xn
▸ the graph is reduced: merge isomorphic subgraphs and eliminate unneeded vertices

Example
(x1 ∧ x3) ∨ (x2 ∧ x3) ∨ ¬(x1 ∨ x2 ∨ x3)

[Figure: the corresponding reduced OBDD with four decision nodes.]
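The two reduction rules can be implemented by hash-consing. Here is a minimal Python sketch (the representation is mine: variables are numbered from 0, the integers 0 and 1 serve as the terminal nodes, and decision nodes get ids from 2). It builds the reduced OBDD for the example formula above by Shannon expansion:

unique = {}   # (var, low, high) -> node id: shares isomorphic subgraphs
nodes = {}    # node id -> (var, low, high)

def mk(var, low, high):
    # Reduction rules: eliminate a node whose two branches agree,
    # and never create two distinct nodes with the same (var, low, high).
    if low == high:
        return low
    key = (var, low, high)
    if key not in unique:
        unique[key] = len(nodes) + 2      # ids 0 and 1 are reserved for the terminals
        nodes[unique[key]] = key
    return unique[key]

def build(f, n, i=0, bits=()):
    # Shannon expansion on variable i; exponential in n, fine for small examples.
    if i == n:
        return int(f(bits))
    return mk(i, build(f, n, i + 1, bits + (0,)),
                 build(f, n, i + 1, bits + (1,)))

# The example formula from the slides:
f = lambda b: (b[0] and b[2]) or (b[1] and b[2]) or not (b[0] or b[1] or b[2])
root = build(f, 3)
print(root, nodes)   # four decision nodes: two on x3, one on x2, the x1 root

Once x1 ∨ x2 is known to hold, the function collapses to x3, which is why the two branches leading there share a single x3 node in the reduced diagram.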