An Introduction to Graph Mining
An Introduction to Graph Mining
Hossein Rahmani
hrahmani@liacs.nl
8 December 2009
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 1 / 49
An Introduction to Graph Mining
Outline
From Data Mining To Graph Mining
Graph Algorithms
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 2 / 49
An Introduction to Graph Mining
From Data Mining To Graph Mining
Data Mining
Classiﬁcation
Clustering
Association rule learning
Graph Mining
Powerful way to represent
data
output: expressed as
graphs
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 3 / 49
An Introduction to Graph Mining
Graph Mining Domains
Internet Movie Database
Web Data
Social Networks Analysis
Bio-Informatics
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 4 / 49
An Introduction to Graph Mining
Internet Movie Database
Movie Recommendation
Community Detection
Prediction: $2 million in opening weekend
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 5 / 49
An Introduction to Graph Mining
Web Mining
Web Content Mining
Topic Prediction
Web Structure Mining
Community Mining
Web Usage Mining
Website Roadmap
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 6 / 49
An Introduction to Graph Mining
Social Networks
Relationships and ﬂows
between people
Technologies
Email, Blogs
Social Networking Software
like Orkut, FaceBook
Questions:
Who has control over what
ﬂows in the Network?
Who has best visibility of what
is happening in the Network?
Customer Network Value
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 7 / 49
An Introduction to Graph Mining
Protein-Protein Interaction(PPI) Network
Nodes: Proteins
Edges: Interaction among the
Proteins
Open Problems:
Function Prediction
Drug Discovery
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 8 / 49
An Introduction to Graph Mining
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 9 / 49
An Introduction to Graph Mining
Drug Discovery in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 10 / 49
An Introduction to Graph Mining
Outline
From Data Mining To Graph Mining
Graph Algorithms
Some Deﬁnitions
Graph Matching
Graph Compression
Graph based Decision Tree
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 11 / 49
An Introduction to Graph Mining
Graph Deﬁnition
Graph: G = (V ,E, µ, v)
V : ﬁnite set of nodes.
E ⊆ V× V denotes a set of edges.
µ : V → LV denotes a node labeling function.
v : E → LE denotes an edge labeling function.
Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2)
Graph G1 is a subgraph of G2, written G1 ⊆ G2,
if:
V1 ⊆ V2
E1 ⊆ E2
µ1(u) = µ2(u) for all u ∈ V1.
v1(u, v) = v2(u, v) for all (u, v) ∈ E1.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 12 / 49
An Introduction to Graph Mining
Graph Deﬁnition
Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2),
A graph isomorphism between G1 and G2 is a bijective function
f : V1 → V2 satisfying
µ1(u) = µ2(f(u)) for all nodes u ∈ V1.
For every edge e1 = (u, v) ∈ E1, there exists an edge
e2 = (f(u), f(v)) ∈ E2 such that v1(e1) = v2(e2).
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 13 / 49
An Introduction to Graph Mining
Graph Deﬁnition
Let G1 = (V1,E1,µ1, v1) and G2 = (V2,E2,µ2, v2)
Any bijective function f : ˆV1 → ˆV2, where ˆV1 ⊆ V1 and ˆV2 ⊆ V2 is
called edit path from G1 to G2.
Example: f = {u1 → v3, u2 → , ..., → v6}
u1 → v3: Substitution of node u1 ∈ V1 by node v3 ∈ V2
u2 → : Deletion of node u2 ∈ V1
→ v6: Insertion of v6 ∈ V2
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 14 / 49
An Introduction to Graph Mining
Frequent Subgraph
Frequent subgraphs:
support (subgraph) >=
minimum support
Usage:
Graph Classiﬁcation
Graph Clustering
Graph Indexing
Detection Algorithms:
Apriori-Based Approach
Pattern Growth Approach
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 15 / 49
An Introduction to Graph Mining
Apriori-Based Approach
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 16 / 49
An Introduction to Graph Mining
Pattern Growth Approach
A graph G is extended by adding new edge e.
Edge e may or may not introduce a new node to G.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 17 / 49
An Introduction to Graph Mining
Graph Compression
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 18 / 49
An Introduction to Graph Mining
Graph Compression
Portion of DNA
Easy to Analyze?
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 19 / 49
An Introduction to Graph Mining
Graph Compression
Si compresses the input
DNA.
Cluster hierarchy
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 20 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Decision Tree
Each branch corresponds to attribute value
Each leaf node assigns a classiﬁcation
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 21 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 22 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 23 / 49
An Introduction to Graph Mining
Graph Based Decision Tree
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 24 / 49
An Introduction to Graph Mining
Outline
From Data Mining To Graph Mining
Graph Algorithms
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 25 / 49
An Introduction to Graph Mining
Function Prediction in PPI Networks
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 26 / 49
An Introduction to Graph Mining
Formal Description of PPI Networks
Represented as an undirected graph
Node set V → Proteins
Edge set E → Direct interaction
∀v ∈ V is described by a description d(v) ∈ D
d(v) derived from the network structure
No additional information, such as the protein structure is available
∀v ∈ V optionally is annotated with a label l(v) ∈ L
Labels l(v) are sets of protein functions
E.g., metabolism, transcription, protein synthesis and etc
We assume there is a true labeling function λ that is l(v) = λ(v)
where l(v) is deﬁned
Task: Find a suitable λ(v) where l(v) is not deﬁned
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 27 / 49
An Introduction to Graph Mining
Related Works
1 Transductive approaches
Local: Majority Rule and its extensions
Global: Global Optimization and
Functional Clustering
2 Inductive approaches
Local: Topological Redundancies
Global: ? → Our Method
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 28 / 49
An Introduction to Graph Mining
Majority Rule
Local transductive method
Assumption: Two Interacting
proteins have something in
common (e.g., same function)
Predicted function: Most
common function(s) among
classiﬁed partners
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 29 / 49
An Introduction to Graph Mining
Majority Rule
Problem: Links unclassiﬁed-unclassiﬁed proteins completely
neglected
Solution: Global optimization methods
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 30 / 49
An Introduction to Graph Mining
Functional Clustering
Global transductive method
Assumption: Dense regions are a sign of
the common involvement in biological
process
Predict the function of unclassiﬁed protein
based on the cluster they belong to
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 31 / 49
An Introduction to Graph Mining
Our Method: A Global Description of Proteins
Global inductive approach
Node description
N nodes in the network numbered from 1 to N
Each node is described by an n-dimensional
vector
i’th component in the vector of node v gives the
length of shortest path between v and node i
Probelm: Large Graph → very high dimensional
descriptions
Solution: Reduce dimensionality by focusing on
shortest-path distance to a few "important" nodes
Feature selection problem
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 32 / 49
An Introduction to Graph Mining
Important Proteins
Deﬁnition: Node i is important if the shortest-path distance of
some node v to node i is likely to be relevant for v’s classiﬁcation
Feature fi denotes the shortest-path distance to node i
Anova based feature selection
For each function j, let Gj be the set of all proteins that have that
function j
Let ¯fij be the average fi value in Gj
Let var(fij ) the variance of the fi in Gj
Anova (analysis of variance) based relevancy measure:
Ai =
Varj [¯fij ]
Meanj [var(fij )]
(1)
A high Ai denotes a high relevance of feature fi
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 33 / 49
An Introduction to Graph Mining
Important Proteins
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 34 / 49
An Introduction to Graph Mining
Comparison of Learners
Random Forest performs best among all the learners in 13 out of
17 cases
Other 4 cases are all characterized by a very high class skew
Random Forest: Best candidate for learning from this type of data
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 35 / 49
An Introduction to Graph Mining
Different Number of Important Proteins
We select the 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80,
90, 100, 200 and 300 most important proteins.
Describe each protein using its distance to the n most important
proteins.
Using random forest for function prediction and record precision,
recall, F-measure and AUC.
In 4 Datasets for 17 functions.
The shape of the curves is qualitatively very similar in all cases.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 36 / 49
An Introduction to Graph Mining
Less than 10 Important Proteins
Bad Bad predictive performance.
They do not reach their maximum.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 37 / 49
An Introduction to Graph Mining
10-50 Important Proteins
There is usually a major improvement in all four metrics.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 38 / 49
An Introduction to Graph Mining
50-70 Important Proteins
For most of the functions, selecting 50-70 important proteins is
enough to obtain good classication results.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 39 / 49
An Introduction to Graph Mining
Conclusions
Graph Mining Domains
Introduction to Graph Algorithms
Graph Matching
Graph Compression
Graph based Decision Tree
Protein Function prediction
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 40 / 49
An Introduction to Graph Mining
Thanks!
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 41 / 49
An Introduction to Graph Mining
References
Cook, D. J. and Holder, L. B. 2006 Mining Graph Data. John Wiley
& Sons.
Rahmani, H., Blockeel, H. and Bender, A., Proc. 3rd Int.
Workshop in Machine Learn. Syst. Biol. (MLSB09) 2009 85 – 94.
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 42 / 49
An Introduction to Graph Mining
Transductive Learning
Task: Predict the label of all the nodes
Input: G = (V, E, d, l) with l a partial function
Output: Complete version G = (V, E, d, l ) with l a complete
function that is consistent with l
l should approximates λ by optimization criterion o
o expresses our assumption about λ
E.g., directly connected nodes tend to have similar labels
Number of {v1, v2} edges where l (v1) = l (v2) edges should be
minimal
Our assumption about λ is called bias of transductive learner
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 43 / 49
An Introduction to Graph Mining
Inductive Learning
Task: Learn a function f : D → L that maps a node description
d(v) onto its label l(v)
Input: G = (V, E, d, l) with l a partial function
Output: f : D → L such that f(d(v)) = l(v)
Note: f differs from l in that it maps D, not V, onto L
It can make prediction for node v that is not in the original network,
as long as d(v) is known
Biases
Transductive bias: Assumption expressed by optimization criterion
o
Description bias D
Inductive bias: Choice of learning algorithm that is used to learn f
from (d(v), l(v)) couples
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 44 / 49
An Introduction to Graph Mining
Global Optimization
Global transductive method
Links unclassiﬁed/unclassiﬁed proteins
also taken into account
Any probable function assignment to the
whole set of unclassiﬁed proteins is
considered
Counting number of interacting pairs
with no common functions
Select the function assignment with
lowest value
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 45 / 49
An Introduction to Graph Mining
Topological Approaches
Local inductive method
Node description d(v) is built based on the local neighborhood
Count number of patterns (e.g., graphlet) around the proteins
Make the signature vector for each protein
Assumption: Proteins with high similar signature vector have
same functions
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 46 / 49
An Introduction to Graph Mining
Topological Approaches
Some topological patterns (Number of considered patterns = 73)
Orbit: One of the previous patterns
Same orbit frequency → same function
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 47 / 49
An Introduction to Graph Mining
Topological Approaches
Local inductive method
Node description d(v) is built based on the local neighborhood
Count number of patterns (e.g., graphlet) around the proteins
Make the signature vector for each protein
Assumption: Proteins with high similar signature vector have
same functions
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 48 / 49
An Introduction to Graph Mining
Topological Approaches
Some topological patterns (Number of considered patterns = 73)
Orbit: One of the previous patterns
Same orbit frequency → same function
Hossein Rahmani (LIACS) An Introduction to Graph Mining 8 December 2009 49 / 49