An Introduction to Graph Mining An Introduction to Graph Mining Hossein Rahmani hrahmani@liacs.nl 8 December 2009 Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 1 / 49 An Introduction to Graph Mining Outline From Data Mining To Graph Mining Graph Algorithms Function Prediction in PPI Networks Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 2 / 49 An Introduction to Graph Mining From Data Mining To Graph Mining Data Mining Classification Clustering Association rule learning Graph Mining Powerful way to represent data output: expressed as graphs Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 3 / 49 An Introduction to Graph Mining Graph Mining Domains Internet Movie Database Web Data Social Networks Analysis Bio-Informatics Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 4 / 49 An Introduction to Graph Mining Internet Movie Database Movie Recommendation Community Detection Prediction: $2 million in opening weekend Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 5 / 49 An Introduction to Graph Mining Web Mining Web Content Mining Topic Prediction Web Structure Mining Community Mining Web Usage Mining Website Roadmap Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 6 / 49 An Introduction to Graph Mining Social Networks Relationships and flows between people Technologies Email, Blogs Social Networking Software like Orkut, FaceBook Questions: Who has control over what flows in the Network? Who has best visibility of what is happening in the Network? Customer Network Value Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 7 / 49 An Introduction to Graph Mining Protein-Protein Interaction(PPI) Network Nodes: Proteins Edges: Interaction among the Proteins Open Problems: Function Prediction Drug Discovery Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 8 / 49 An Introduction to Graph Mining Function Prediction in PPI Networks Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 9 / 49 An Introduction to Graph Mining Drug Discovery in PPI Networks Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 10 / 49 An Introduction to Graph Mining Outline From Data Mining To Graph Mining Graph Algorithms Some Definitions Graph Matching Graph Compression Graph based Decision Tree Function Prediction in PPI Networks Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 11 / 49 An Introduction to Graph Mining Graph Definition Graph: G = (V ,E, µ, v) V : finite set of nodes. E ⊆ V× V denotes a set of edges. µ : V → LV denotes a node labeling function. v : E → LE denotes an edge labeling function. Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2) Graph G1 is a subgraph of G2, written G1 ⊆ G2, if: V1 ⊆ V2 E1 ⊆ E2 µ1(u) = µ2(u) for all u ∈ V1. v1(u, v) = v2(u, v) for all (u, v) ∈ E1. Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 12 / 49 An Introduction to Graph Mining Graph Definition Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2), A graph isomorphism between G1 and G2 is a bijective function f : V1 → V2 satisfying µ1(u) = µ2(f(u)) for all nodes u ∈ V1. For every edge e1 = (u, v) ∈ E1, there exists an edge e2 = (f(u), f(v)) ∈ E2 such that v1(e1) = v2(e2). Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 13 / 49 An Introduction to Graph Mining Graph Definition Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2) Any bijective function f : ˆV1 → ˆV2, where ˆV1 ⊆ V1 and ˆV2 ⊆ V2 is called edit path from G1 to G2. Example: f = {u1 → v3, u2 → , ..., → v6} u1 → v3: Substitution of node u1 ∈ V1 by node v3 ∈ V2 u2 → : Deletion of node u2 ∈ V1 → v6: Insertion of v6 ∈ V2 Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 14 / 49 An Introduction to Graph Mining Frequent Subgraph Frequent subgraphs: support (subgraph) >= minimum support Usage: Graph Classification Graph Clustering Graph Indexing Detection Algorithms: Apriori-Based Approach Pattern Growth Approach Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 15 / 49 An Introduction to Graph Mining Apriori-Based Approach Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 16 / 49 An Introduction to Graph Mining Pattern Growth Approach A graph G is extended by adding new edge e. Edge e may or may not introduce a new node to G. Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 17 / 49 An Introduction to Graph Mining Graph Compression Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 18 / 49 An Introduction to Graph Mining Graph Compression Portion of DNA Easy to Analyze? Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 19 / 49 An Introduction to Graph Mining Graph Compression Si compresses the input DNA. Cluster hierarchy Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 20 / 49 An Introduction to Graph Mining Graph Based Decision Tree Decision Tree Each branch corresponds to attribute value Each leaf node assigns a classification Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 21 / 49 An Introduction to Graph Mining Graph Based Decision Tree Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 22 / 49 An Introduction to Graph Mining Graph Based Decision Tree Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 23 / 49 An Introduction to Graph Mining Graph Based Decision Tree Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 24 / 49 An Introduction to Graph Mining Outline From Data Mining To Graph Mining Graph Algorithms Function Prediction in PPI Networks Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 25 / 49 An Introduction to Graph Mining Function Prediction in PPI Networks Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 26 / 49 An Introduction to Graph Mining Formal Description of PPI Networks Represented as an undirected graph Node set V → Proteins Edge set E → Direct interaction ∀v ∈ V is described by a description d(v) ∈ D d(v) derived from the network structure No additional information, such as the protein structure is available ∀v ∈ V optionally is annotated with a label l(v) ∈ L Labels l(v) are sets of protein functions E.g., metabolism, transcription, protein synthesis and etc We assume there is a true labeling function λ that is l(v) = λ(v) where l(v) is defined Task: Find a suitable λ(v) where l(v) is not defined Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 27 / 49 An Introduction to Graph Mining Related Works 1 Transductive approaches Local: Majority Rule and its extensions Global: Global Optimization and Functional Clustering 2 Inductive approaches Local: Topological Redundancies Global: ? → Our Method Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 28 / 49 An Introduction to Graph Mining Majority Rule Local transductive method Assumption: Two Interacting proteins have something in common (e.g., same function) Predicted function: Most common function(s) among classified partners Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 29 / 49 An Introduction to Graph Mining Majority Rule Problem: Links unclassified-unclassified proteins completely neglected Solution: Global optimization methods Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 30 / 49 An Introduction to Graph Mining Functional Clustering Global transductive method Assumption: Dense regions are a sign of the common involvement in biological process Predict the function of unclassified protein based on the cluster they belong to Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 31 / 49 An Introduction to Graph Mining Our Method: A Global Description of Proteins Global inductive approach Node description N nodes in the network numbered from 1 to N Each node is described by an n-dimensional vector i’th component in the vector of node v gives the length of shortest path between v and node i Probelm: Large Graph → very high dimensional descriptions Solution: Reduce dimensionality by focusing on shortest-path distance to a few "important" nodes Feature selection problem Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 32 / 49 An Introduction to Graph Mining Important Proteins Definition: Node i is important if the shortest-path distance of some node v to node i is likely to be relevant for v’s classification Feature fi denotes the shortest-path distance to node i Anova based feature selection For each function j, let Gj be the set of all proteins that have that function j Let ¯fij be the average fi value in Gj Let var(fij ) the variance of the fi in Gj Anova (analysis of variance) based relevancy measure: Ai = Varj [¯fij ] Meanj [var(fij )] (1) A high Ai denotes a high relevance of feature fi Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 33 / 49 An Introduction to Graph Mining Important Proteins Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 34 / 49 An Introduction to Graph Mining Comparison of Learners Random Forest performs best among all the learners in 13 out of 17 cases Other 4 cases are all characterized by a very high class skew Random Forest: Best candidate for learning from this type of data Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 35 / 49 An Introduction to Graph Mining Different Number of Important Proteins We select the 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200 and 300 most important proteins. Describe each protein using its distance to the n most important proteins. Using random forest for function prediction and record precision, recall, F-measure and AUC. In 4 Datasets for 17 functions. The shape of the curves is qualitatively very similar in all cases. Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 36 / 49 An Introduction to Graph Mining Less than 10 Important Proteins Bad Bad predictive performance. They do not reach their maximum. Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 37 / 49 An Introduction to Graph Mining 10-50 Important Proteins There is usually a major improvement in all four metrics. Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 38 / 49 An Introduction to Graph Mining 50-70 Important Proteins For most of the functions, selecting 50-70 important proteins is enough to obtain good classication results. Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 39 / 49 An Introduction to Graph Mining Conclusions Graph Mining Domains Introduction to Graph Algorithms Graph Matching Graph Compression Graph based Decision Tree Protein Function prediction Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 40 / 49 An Introduction to Graph Mining Thanks! Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 41 / 49 An Introduction to Graph Mining References Cook, D. J. and Holder, L. B. 2006 Mining Graph Data. John Wiley & Sons. Rahmani, H., Blockeel, H. and Bender, A., Proc. 3rd Int. Workshop in Machine Learn. Syst. Biol. (MLSB09) 2009 85 – 94. Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 42 / 49 An Introduction to Graph Mining Transductive Learning Task: Predict the label of all the nodes Input: G = (V, E, d, l) with l a partial function Output: Complete version G = (V, E, d, l ) with l a complete function that is consistent with l l should approximates λ by optimization criterion o o expresses our assumption about λ E.g., directly connected nodes tend to have similar labels Number of {v1, v2} edges where l (v1) = l (v2) edges should be minimal Our assumption about λ is called bias of transductive learner Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 43 / 49 An Introduction to Graph Mining Inductive Learning Task: Learn a function f : D → L that maps a node description d(v) onto its label l(v) Input: G = (V, E, d, l) with l a partial function Output: f : D → L such that f(d(v)) = l(v) Note: f differs from l in that it maps D, not V, onto L It can make prediction for node v that is not in the original network, as long as d(v) is known Biases Transductive bias: Assumption expressed by optimization criterion o Description bias D Inductive bias: Choice of learning algorithm that is used to learn f from (d(v), l(v)) couples Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 44 / 49 An Introduction to Graph Mining Global Optimization Global transductive method Links unclassified/unclassified proteins also taken into account Any probable function assignment to the whole set of unclassified proteins is considered Counting number of interacting pairs with no common functions Select the function assignment with lowest value Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 45 / 49 An Introduction to Graph Mining Topological Approaches Local inductive method Node description d(v) is built based on the local neighborhood Count number of patterns (e.g., graphlet) around the proteins Make the signature vector for each protein Assumption: Proteins with high similar signature vector have same functions Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 46 / 49 An Introduction to Graph Mining Topological Approaches Some topological patterns (Number of considered patterns = 73) Orbit: One of the previous patterns Same orbit frequency → same function Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 47 / 49 An Introduction to Graph Mining Topological Approaches Local inductive method Node description d(v) is built based on the local neighborhood Count number of patterns (e.g., graphlet) around the proteins Make the signature vector for each protein Assumption: Proteins with high similar signature vector have same functions Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 48 / 49 An Introduction to Graph Mining Topological Approaches Some topological patterns (Number of considered patterns = 73) Orbit: One of the previous patterns Same orbit frequency → same function Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 49 / 49